Researchers are tackling the challenge of reliably training autonomous agents with the introduction of Agent World Model (AWM), a novel pipeline for generating fully synthetic environments. Zhaoyang Wang from the University of North Carolina at Chapel Hill, Canwen Xu and Boyi Liu from Snowflake, alongside Yite Wang, Siwei Han, Zhewei Yao, and colleagues, present a system capable of creating 1,000 diverse everyday scenarios for agent training. This work represents a significant step forward because it addresses a key limitation of current agent training methods: the lack of varied and dependable environments. By utilising code-driven, database-backed simulations, AWM provides consistent state transitions and efficient agent interaction, ultimately enabling strong out-of-distribution generalisation as demonstrated across three benchmarks.
This breakthrough addresses a critical limitation in the field of artificial intelligence: the scarcity of diverse and reliable environments needed to scale agent training.
Unlike existing methods that rely on expensive real-world data or potentially unreliable LLM-simulated environments, AWM creates code-driven environments backed by databases, ensuring consistent and predictable state transitions. These environments empower agents to interact with an average of 35 tools each, facilitating complex multi-turn interactions and high-quality observation gathering.
The core innovation lies in AWM’s systematic approach to environment synthesis, mirroring established software development practices. Beginning with a high-level scenario description, the pipeline generates user requirements and a corresponding database schema. This schema then guides the creation of a robust toolset and backend code, guaranteeing a clear data model for each tool.
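The staged flow described above can be sketched in a few lines. This is a minimal illustration of the scenario → requirements → schema → toolset progression, not the paper's implementation; all function and field names here are invented stand-ins for what would be LLM-driven steps.

```python
# Hypothetical sketch of the staged synthesis flow (names are illustrative).

def generate_requirements(scenario: str) -> list[str]:
    """Stub: an LLM would expand a scenario into user requirements."""
    return [f"Users can create {scenario} records",
            f"Users can query {scenario} records"]

def design_schema(requirements: list[str]) -> dict:
    """Stub: an LLM would derive a database schema from the requirements."""
    return {"records": ["id INTEGER PRIMARY KEY", "name TEXT"]}

def derive_tools(schema: dict) -> list[str]:
    """Each table suggests CRUD tools, giving every tool a clear data model."""
    tools = []
    for table in schema:
        tools += [f"create_{table}", f"read_{table}",
                  f"update_{table}", f"delete_{table}"]
    return tools

scenario = "bookstore inventory"
tools = derive_tools(design_schema(generate_requirements(scenario)))
print(tools)  # four CRUD tools grounded in the synthesised schema
```

The point of anchoring tools to the schema is that every tool's inputs and effects are defined by concrete tables, rather than by free-form LLM behaviour.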
A unified interface, utilising the Model Context Protocol, allows agents to interact seamlessly with the environment, while automated verification code provides reliable reward signals for reinforcement learning. This automated execution and self-correction process allows for scalable environment creation.
To demonstrate the effectiveness of this resource, researchers performed large-scale reinforcement learning with multi-turn tool-use agents. The fully executable environments and accessible database states enabled the design of robust reward functions, leading to significant improvements in agent performance.
Experiments conducted on three established benchmarks reveal that agents trained exclusively within these synthetic environments exhibit strong out-of-distribution generalisation capabilities, surpassing performance achieved with benchmark-specific training. This suggests a pathway towards creating more versatile and adaptable AI agents capable of tackling real-world challenges.
Synthetic Environment Construction via Large Language Models and Database-Driven State Management
The Agent World Model (AWM) pipeline is a fully synthetic environment generation system designed to scale agent training. The research team constructed the pipeline to generate 1,000 diverse environments representing everyday scenarios, each equipped with an average of 35 tools for agents to interact with.
These environments are code-driven and utilise databases to ensure reliable and consistent state transitions, a departure from environments relying on LLM-simulated responses. The methodology begins with scenario generation, leveraging large language models to create descriptions of stateful applications such as e-commerce platforms and CRM systems.
A filtering pipeline, incorporating an LLM-based classifier and embedding-based deduplication, ensures the selection of scenarios involving core CRUD operations and maintains diversity within the 1,000 generated scenarios. This initial stage focuses on establishing a broad range of potential interaction contexts for the agents.
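The embedding-based deduplication step can be illustrated with a toy similarity check. The real pipeline would use learned text embeddings; here a bag-of-words vector and a cosine threshold stand in, and the threshold value is an arbitrary choice for the example.

```python
# Sketch of embedding-based deduplication with a bag-of-words stand-in
# for real embeddings. The 0.8 threshold is illustrative only.
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def deduplicate(scenarios: list[str], threshold: float = 0.8) -> list[str]:
    kept: list[str] = []
    for s in scenarios:
        if all(cosine(embed(s), embed(k)) < threshold for k in kept):
            kept.append(s)
    return kept

print(deduplicate([
    "an e-commerce platform for books",
    "an e-commerce platform for books and music",  # near-duplicate, filtered
    "a hospital appointment scheduler",
]))
```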
Following scenario creation, the work proceeds to task synthesis and database design. The system generates task sets for each environment, then designs a corresponding database to define the state space. Data is synthesised to populate this database, providing the initial state for agent interaction and enabling grounded feedback during reinforcement learning.
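Seeding the database gives every episode the same grounded starting snapshot, which is what makes feedback reproducible across rollouts. A minimal sketch, assuming an invented `customers` table and synthetic seed rows:

```python
# Sketch: populating a synthesised initial state so each episode starts
# from an identical database snapshot (table and rows are invented).
import sqlite3

SEED_ROWS = [(1, "alice", "gold"), (2, "bob", "silver")]

def fresh_env() -> sqlite3.Connection:
    """Build an in-memory database populated with synthetic data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, tier TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", SEED_ROWS)
    return conn

env = fresh_env()
print(env.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # 2
```

Rebuilding the environment from the seed on every reset means agent mutations from one episode can never leak into the next.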
This database-backed approach is central to ensuring consistency and reliability in the simulated environment. A key innovation lies in the implementation of code-augmented verification. The research team designed verification code for each task, allowing for reliable reward function design and objective evaluation of agent performance.
This verification process, coupled with the executable environments, facilitates multi-turn reinforcement learning for tool-use agents. The resulting AWM dataset comprises 35,062 tools and 10,000 tasks, representing the largest open-source tool-use environment set currently available.
Synthetic environment generation supports robust multi-tool agent generalisation
Researchers developed Agent World Model (AWM), a pipeline for generating fully synthetic environments, scaling to 1,000 diverse environments for autonomous agent training. These environments incorporate an average of 35 tools each, enabling agents to interact with rich toolsets and receive high-quality observations.
The core of AWM lies in code-driven environments backed by databases, ensuring more reliable and consistent state transitions than those relying on large language model simulations. The study demonstrates large-scale reinforcement learning with multi-turn tool-use agents, utilising 1,024 environment instances per step.
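The multi-turn rollout structure behind this training loop can be sketched with toy stand-ins. The policy and environment below are placeholders (the paper runs 1,024 environment instances per step; this example uses 4), so the sketch shows only the episode loop shape, not the actual RL algorithm.

```python
# Toy sketch of batched multi-turn rollouts; policy and environment
# are invented stand-ins, and batch size is shrunk from 1,024 to 4.
import random

random.seed(0)

class ToyEnv:
    def __init__(self):
        self.turns = 0
    def step(self, action: str) -> tuple[str, float, bool]:
        """Deterministic transition; reward only on task completion."""
        self.turns += 1
        done = self.turns >= 3 or action == "finish"
        reward = 1.0 if action == "finish" else 0.0
        return f"obs after {action}", reward, done

def policy(obs: str) -> str:
    """Placeholder policy: a trained agent would condition on obs."""
    return random.choice(["call_tool", "finish"])

envs = [ToyEnv() for _ in range(4)]
returns = []
for env in envs:
    obs, total, done = "start", 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    returns.append(total)
print(returns)
```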
Accessible database states facilitated the design of reliable reward functions, crucial for effective agent training. Experiments were conducted across three benchmarks, revealing that agents trained solely within these synthetic environments achieved strong out-of-distribution generalisation capabilities.
AWM synthesises environments as partially observable Markov decision processes, each comprising a state space, action space, observation space, transition function, and task-specific reward functions. The pipeline progresses through scenario synthesis, task creation, database design, interface synthesis, and verification, culminating in fully executable environments suitable for online reinforcement learning.
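The POMDP framing above can be written compactly. The symbols below follow the standard textbook definition rather than the paper's exact notation, with a deterministic transition function since state changes are code execution over the database.

```latex
% Each synthesised environment as a partially observable decision
% process with task-specific rewards: state space S, action space A,
% observation space \Omega, transition function T, and a reward
% function R_g for each task g.
E = \langle S,\, A,\, \Omega,\, T,\, \{R_g\}_{g \in G} \rangle,
\qquad T : S \times A \to S,
\qquad R_g : S \to \{0, 1\}
```

Because the state lives in a database, each binary reward $R_g$ can be computed by the task's verification code directly over the final database state.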
This approach avoids reliance on existing task sets or API documentation, mitigating potential copyright concerns. The generated environments feature database-backed state management, enforcing consistency and enabling code-augmented verification for reinforcement learning applications. To date, AWM represents the largest open-source tool-use environment set, comprising 1,000 environments, 35,062 tools, and 10,000 tasks paired with corresponding verification code. This extensive resource provides a robust platform for training and evaluating agents in complex, realistic scenarios.
Synthetic Environments Enhance Tool Use Agent Generalisation
Researchers have developed Agent World Model (AWM), a scalable pipeline for creating executable environments used to train tool-use agents. This pipeline generated 1,000 diverse environments containing an average of 35 tools each, alongside 10,000 tasks in total, facilitating large-scale reinforcement learning for multi-turn tool-use agents.
These environments are constructed in code and backed by SQL databases, ensuring reliable and consistent state transitions while enabling more efficient agent interaction than real-world environments allow. The significance of this work lies in its ability to improve out-of-distribution generalisation in agents.
Experiments across three benchmarks demonstrate that training agents exclusively within these synthetic environments yields strong performance in unseen scenarios, surpassing both training via large language model simulations and concurrent synthesis methods. The authors acknowledge limitations including constraints on computing resources, which restricted training to 526 of the 1,000 generated environments, and a focus on the Qwen3 model family at 4B, 8B, and 14B scales.
Future research directions include incorporating a self-evolving paradigm where trained agents contribute to environment synthesis, and optimising the synthesis pipeline with proactive error detection using large language models, potentially augmented by human inspection. The creation of 1,000 synthesised environments and the scalable pipeline represent a valuable resource for the research community, though caution is advised when deploying agents trained on synthetic data to real-world applications.
👉 More information
🗞 Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2602.10090
