The development of increasingly capable artificial intelligence agents hinges on access to realistic training data, yet a significant bottleneck currently limits progress in the open-source community. To address this challenge, Zhangchen Xu, Adriana Meza Soria, Shawn Tan, and colleagues from MIT-IBM Watson AI Lab, alongside Radha Poovendran from the University of Washington, present Toucan, a new dataset comprising 1. 5 million tool-agentic trajectories. Toucan distinguishes itself by synthesising data from nearly 500 authentic, real-world environments, creating diverse and challenging tasks that involve genuine tool execution, unlike existing datasets often limited in complexity and realism. This achievement not only provides a substantially larger resource for training AI agents, but also demonstrably improves performance, with models trained on Toucan surpassing larger, closed-source alternatives on established benchmarks and advancing the state-of-the-art in complex task completion.

Server Configuration Complexity and Tool Selection

This research investigates the challenges of understanding complex server configurations, tool dependencies, and translating high-level requests into specific tool calls. The difficulty arises from the potential for multiple valid configurations and the need to accurately infer user intent from limited information, requiring strong domain knowledge of server administration, networking, and available tools. While alternative tools may exist, they often require different approaches or lack necessary capabilities, making the intended tool essential for completing the task. The question is well-formed, though it lacks specific details about the server environment, user preferences, or desired configuration, potentially leading to multiple interpretations.

The research highlights the importance of clear information architecture, suggesting that providing more context or constraints would improve clarity. The scenario accurately reflects the realistic tasks faced by system administrators and DevOps engineers, grounding the work in practical use cases. Verifying the answer requires access to a server and the ability to apply the configuration, confirming its functionality, though incomplete or inaccurate configurations may be difficult to verify without additional information.

Generating Diverse Agentic Trajectories with TOUCAN

Researchers developed TOUCAN, a substantial dataset of 1. 5 million tool-agentic trajectories, to overcome limitations in existing training data for artificial intelligence agents. This dataset was generated using nearly 500 real-world Context Protocols, ensuring diversity, realism, and complexity in the tasks presented to the agents. The methodology involves a pipeline that creates a broad range of tool-use queries, applies quality filtering, and generates agentic trajectories using three teacher models within two distinct agentic frameworks. Rigorous validation, employing both rule-based and LLM-based techniques, guarantees the high quality of the generated data.

To further enhance the dataset’s robustness, the team implemented three extension mechanisms, including introducing constraints to create variations on existing tasks, simulating multi-turn conversations by splitting complex tasks into sequential sub-questions, and tightening filters to focus on relevant tool calls. Detailed statistical analysis reveals TOUCAN’s comprehensive coverage of multi-server and multi-tool tasks, as well as the prevalence of multi-turn conversations. Experiments demonstrate that models fine-tuned on TOUCAN significantly outperform larger closed-source counterparts on the BFCL V3 benchmark and advance the performance frontier on the MCP-Universe Bench, improving the performance of Qwen2. 5-7B-Instruct by 3.

16%, Qwen2. 5-14B-Instruct by 7. 40%, and Qwen2. 5-32B-Instruct by 8. 72%. LLM-based quality assessment confirms that TOUCAN exhibits high question quality, scenario realism, and a balanced range of task difficulties, establishing it as a valuable resource for training and evaluating advanced AI agents.

Toucan Dataset Enables Realistic Tool-Using AI Agents

Scientists have introduced Toucan, a groundbreaking dataset comprising 1. 5 million trajectories designed to enhance the capabilities of tool-using artificial intelligence agents. This work addresses a critical need for high-quality, permissively licensed data to train more capable agentic LLMs, previously constrained by limited and often unrealistic datasets. Unlike prior efforts relying on simulated environments or restricted toolsets, Toucan leverages nearly 500 real-world Model Context Protocol (MCP) servers, providing access to over 2,000 tools and generating diverse, realistic tasks. The research team developed a pipeline that begins by generating a broad spectrum of tool-use tasks using five distinct models, followed by rigorous quality filtering to ensure relevance and difficulty.

Agentic trajectories are then generated using three teacher models, incorporating both rule-based and model-based checks to verify tool execution and response accuracy. This pipeline also includes extensions to create additional tasks targeting edge-case scenarios, interactive conversations, and multi-turn dialogues, significantly expanding the dataset’s complexity. Experiments demonstrate that models fine-tuned on Toucan surpass closed-source counterparts on the BFCL V3 benchmark, achieving superior performance in function calling accuracy across both single-turn and multi-turn scenarios. Furthermore, these models show substantial improvements on both τ-Bench and τ2-Bench, with gains in tool selection, execution fidelity, and multi-turn reasoning under dynamic user interactions. On the recent MCP-Universe benchmark, Toucan-tuned models achieve state-of-the-art performance within their parameter class, consistently outperforming leading models of comparable size, confirming Toucan’s effectiveness in enhancing LLM agentic capabilities and providing a valuable open-source resource for the field.

Toucan Dataset Advances Tool-Using Language Agents

This work introduces Toucan, a substantial dataset comprising 1. 5 million trajectories designed to enhance the training of tool-using language agent models. The team developed a comprehensive pipeline to generate this data, leveraging nearly 500 real-world Context Protocols to create diverse and challenging tasks involving actual tool execution. Results demonstrate that models fine-tuned on Toucan achieve improved performance on established benchmarks, including BFCL-V3 and MCP-Universe, and advance the current state-of-the-art in this field. The creation of Toucan represents a significant step towards building more robust and capable language agents.

Through rigorous validation techniques, the researchers ensured the quality and reliability of the generated data, addressing a critical need within the open-source community. Ablation studies confirm that each component of the data generation pipeline contributes to overall performance gains. The authors acknowledge certain limitations, including the dataset’s snapshot in time and the exclusion of servers requiring complex configurations. Future work will focus on expanding the dataset to include more servers and developing expert models capable of simulating tool responses, reducing the cost and complexity of data generation. They also plan to create a dedicated benchmark focused on web search capabilities, further advancing the development of tool-using language agents.

👉 More information
🗞 TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments
🧠 ArXiv: https://arxiv.org/abs/2510.01179

Tags:

-based validation Agentic Frameworks BFCL V3 benchmark Context Protocols LLM Agents MCP-Universe Bench multi-turn interactions Pareto frontier rule-based validation tool-agentic data

Toucan Synthesizes 1.5M Tool-Agentic Trajectories from 500 Real-World MCP Environments for Enhanced Agent Training

Server Configuration Complexity and Tool Selection

Generating Diverse Agentic Trajectories with TOUCAN

Toucan Dataset Enables Realistic Tool-Using AI Agents

Toucan Dataset Advances Tool-Using Language Agents

Rohail T.

Latest Posts by Rohail T.:

Lasers Unlock New Tools for Molecular Sensing

Light’s Polarisation Fully Controlled on a Single Chip

New Quantum Algorithms Deliver Speed-Ups Without Sacrificing Predictability