Achieving truly intelligent agents requires equipping them to interact effectively with the real world, which in turn demands robust function-calling capabilities. Runnan Fang, Shihao Cai, and Baixuan Li, along with their colleagues, address this challenge by demonstrating that the breadth of an agent’s competence correlates directly with the diversity of the environments in which it trains. Their work introduces a novel framework, AgentScaler, which automatically generates a wide range of simulated environments, systematically expanding the scenarios an agent encounters. By combining this scalable environment construction with a two-stage agent training process, the team significantly improves function-calling performance on established agentic benchmarks, a key step towards building more versatile and generally intelligent agents.
AI Agents Learning to Use Tools
Researchers are making significant progress in building artificial intelligence agents capable of effectively using external tools, such as APIs and search engines, to solve complex tasks. This work focuses on developing methods for generating high-quality training data, teaching models to learn tool use, and establishing robust benchmarks to assess agent performance. Open-source tools are also being created to facilitate the development of these intelligent agents. A major emphasis lies on data synthesis and augmentation, moving beyond reliance on human-created datasets. Scientists are now using AI agents themselves to generate training data, simulating interactions between agents, humans, and tools.
This includes creating datasets that require extended interactions with tools, rather than single-step operations. Graph-based approaches are also employed, representing the relationships among tasks, tools, and information as graphs that guide data generation. Reinforcement learning techniques, including ToolRL and iterative reinforced fine-tuning, are used to train agents to optimize tool use, with reward shaping encouraging effective strategies. Several benchmarks are crucial for evaluating progress, including τ-bench, which focuses on real-world tool-agent-user interactions, and WebWalker, which assesses web traversal capabilities.
ToolHop evaluates multi-hop tool use, while other benchmarks assess interaction with REST APIs and tool-using language models. Projects like xLAM, a family of large action models, and APIGen, an automated pipeline for generating function-calling data, are driving innovation. Other notable models include Kimi K2 and Toolformer, which learn to use tools, and RestGPT, which connects language models with real-world APIs. Current challenges include the scarcity of high-quality training data, the need for complex reasoning and planning with tools, and improving the ability of agents to generalize to new tools and tasks. Researchers are also working to improve robustness and develop more comprehensive evaluation benchmarks. Ultimately, this research aims to create AI agents that are not just language models, but active problem-solvers capable of leveraging external tools to achieve complex goals.
Function Diversity via Environmental Database Operators
Scientists have engineered a scalable framework to advance general agentic intelligence by systematically expanding the diversity of environments used to train artificial agents. Recognizing that robust function-calling capabilities require experience in varied settings, the study introduces a pipeline that pairs scalable environment construction with agent experience learning. This approach addresses limitations in existing synthetic data generation methods, which often lack realism or require significant manual intervention. The core principle is that each function call represents a read or write operation on an underlying environmental database.
Researchers implemented this by assigning each function an operator type, categorizing it as either querying the database or inducing a state transition within it. Tools were then instantiated as executable code, enabling direct programmatic interaction with these database structures, and organized into domains using community detection techniques. This allows diverse environments to be created around specific database structures, and tool sequences with sampled parameters to be generated to initialize database states. The system grounds tool executions directly on the database, ensuring verifiability at both the environment level and the tool-argument/response level.
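A minimal sketch of this idea, with hypothetical tool and field names rather than the authors’ actual code: each tool is an executable function tagged as a read or write operator over a dictionary-backed database, so every call can be verified against both its returned response and the resulting database state.

```python
from copy import deepcopy

# Toy environmental "database" for a hypothetical retail domain.
db = {"orders": {"O-1001": {"status": "pending", "items": ["mug"]}}}

# Read operator: queries the database without changing its state.
def get_order_status(database, order_id):
    return database["orders"][order_id]["status"]

# Write operator: induces a state transition in the database.
def cancel_order(database, order_id):
    database["orders"][order_id]["status"] = "cancelled"
    return {"order_id": order_id, "status": "cancelled"}

# Each tool carries an operator type, mirroring the read/write categorization.
TOOLS = {"get_order_status": ("read", get_order_status),
         "cancel_order": ("write", cancel_order)}

# Verifiability: check the tool-level response and the environment-level state.
before = deepcopy(db)
response = cancel_order(db, "O-1001")
assert response["status"] == "cancelled"                   # tool-argument/response check
assert db["orders"]["O-1001"]["status"] == "cancelled"     # environment (state) check
assert before["orders"]["O-1001"]["status"] == "pending"   # the write really changed state
```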
For agent training, researchers adopted a two-stage learning framework. Initially, agents acquire fundamental tool-calling skills across general domains through simulated human-agent interactions, collecting experiential data; strict filtering is then applied to refine this data. Subsequently, agents are further trained within target vertical domains using domain-specific scenarios, fostering smoother and more context-aligned development of agentic capabilities. Extensive experiments were conducted on benchmarks including τ-bench, τ²-Bench, and ACEBench, demonstrating the effectiveness of the pipeline.
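One way to picture the strict filtering step, as a sketch under assumed data structures rather than the paper’s implementation: a collected trajectory is kept only if replaying its logged tool calls on a fresh simulated environment reproduces both the logged responses and the expected final database state.

```python
from copy import deepcopy

class SimEnv:
    """Toy simulated environment: a dict-backed database plus named tools.
    (Hypothetical structure for illustration only.)"""
    def __init__(self, initial_state, tools):
        self.db = deepcopy(initial_state)
        self.tools = tools

    def call(self, name, **args):
        return self.tools[name](self.db, **args)

def keep_trajectory(traj, tools):
    """Strict filter: replay the logged calls and keep the trajectory only if
    every response and the final database state are reproduced exactly."""
    env = SimEnv(traj["initial_state"], tools)
    for step in traj["calls"]:
        if env.call(step["tool"], **step["args"]) != step["response"]:
            return False                           # tool-argument/response mismatch
    return env.db == traj["expected_final_state"]  # environment-level check

# Stage-1 data then becomes: [t for t in collected if keep_trajectory(t, TOOLS)]
```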
Based on this methodology, the team trained a family of AgentScaler models at 4B, 8B, and 30B-A3B scales, built upon the Qwen3 series. These models achieved state-of-the-art performance at comparable scales, with the AgentScaler-30B-A3B model achieving results on par with existing 1 trillion-parameter models despite having significantly fewer parameters. The study also provides a systematic analysis of model generalization, stability, and long-horizon tool-calling, offering key insights into the development of general agentic intelligence.
AgentScaler Broadens Function-Calling Environment Diversity
Scientists have developed a new framework, AgentScaler, to significantly enhance the function-calling capability of agentic systems, a crucial step towards deploying advanced artificial intelligence in real-world applications. The research addresses the challenge of scaling environments for training these agents, systematically broadening the range of scenarios they can handle. The team designed a pipeline that automatically constructs diverse, fully simulated environments, enabling agents to learn from a wider variety of interactions. This work centers on a novel approach to environment building, interpreting each function call as a read or write operation on an underlying database.
The team constructed over 1,000 distinct domains, each with a specific database structure, and formalized tools as Python code to interact with these databases. They began with a collection of over 30,000 APIs, refined through filtering and rewriting to include explicit input-output specifications, and then systematically exploited relationships between APIs to create complex tool compositions. Experiments demonstrate that AgentScaler models, including 4B, 8B, and 30B-A3B parameter versions built upon the Qwen3 series, achieve state-of-the-art performance on agentic benchmarks, specifically τ-bench, τ²-Bench, and ACEBench. Notably, the AgentScaler-30B-A3B model achieves performance comparable to existing 1 trillion-parameter models and leading closed-source systems, but with significantly fewer parameters. The team’s systematic analysis also provides key insights into model generalization, stability, and the challenges of long-horizon tool-calling, furthering the development of general agentic intelligence.
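To make the API-composition idea concrete, here is a small sketch with invented API names and specifications: once each API carries an explicit input-output specification, candidate multi-step compositions can be found by chaining tools whose outputs cover another tool’s inputs.

```python
# Hypothetical API specs with explicit input/output fields.
SPECS = {
    "search_flights": {"inputs": {"origin", "destination"}, "outputs": {"flight_id"}},
    "get_fare":       {"inputs": {"flight_id"},             "outputs": {"price"}},
    "book_flight":    {"inputs": {"flight_id", "price"},    "outputs": {"booking_id"}},
}

def compositions(specs):
    """Yield ordered pairs (a, b) where every output of a feeds an input of b,
    i.e. tool a can be chained directly into tool b."""
    for a, spec_a in specs.items():
        for b, spec_b in specs.items():
            if a != b and spec_a["outputs"] <= spec_b["inputs"]:
                yield a, b

print(list(compositions(SPECS)))
# [('search_flights', 'get_fare'), ('search_flights', 'book_flight'), ('get_fare', 'book_flight')]
```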
Scalable Environments Drive Agentic Intelligence Gains
This work presents a systematic pipeline for advancing general agentic intelligence through scalable environment construction and agent experience learning. Researchers developed a framework that automatically generates diverse, fully simulated environments, enabling the creation of large datasets of verifiable interactions for training language agents. This approach addresses the critical need for robust function-calling capabilities in agents intended for real-world applications, a challenge closely linked to the variety of training environments. The team implemented a two-stage agent fine-tuning strategy, first establishing fundamental tool-usage skills and then specializing agents for specific contexts.
Extensive experiments on established benchmarks demonstrate that the resulting AgentScaler models achieve state-of-the-art performance among open-source models with under one trillion parameters, and in some cases match the performance of much larger or closed-source alternatives. This highlights the importance of both scalable environments and verifiable agentic experience in developing robust and generalizable language agents. The researchers acknowledge that the current work is limited to models of up to 30 billion parameters.
👉 More information
🗞 Towards General Agentic Intelligence via Environment Scaling
🧠 ArXiv: https://arxiv.org/abs/2509.13311
