Researchers are tackling the challenge of building interactive agents that can seamlessly use tools and hold multi-turn conversations to solve real-world problems. Jiaxuan Gao of Tsinghua University and Jiaao Chen of Eigen AI, together with Chuyi He, Wei-Chen Wang, Shusheng Xu, Hanrui Wang, and colleagues at Eigen AI, present EigenData, a novel framework that addresses the difficulties of training such agents. Their work is significant because it combines self-evolving synthetic data generation with verifier-based reinforcement learning, offering a scalable pathway to complex tool-using behaviours without costly human annotation, and achieving state-of-the-art results on benchmarks such as tau^2-bench.
EigenData synthesis and reinforcement learning for tool-using agents offer promising results
Scientists have developed a novel framework for training interactive agents that utilise tools, achieving significant performance gains in complex task completion. The research addresses the challenges of creating agents capable of multi-turn interactions with both humans and external environments, requiring precise dialogue state tracking and multi-step tool execution.
Researchers propose EigenData, a hierarchical multi-agent system that autonomously synthesises tool-grounded dialogues alongside executable checkers, improving data reliability through a closed-loop self-evolving process that refines prompts and workflows. Building upon this synthetic data, the team developed a reinforcement learning (RL) recipe that initially fine-tunes the user model before employing GRPO-style training with trajectory-level group-relative advantages and dynamic filtering.
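The "trajectory-level group-relative advantages" of GRPO-style training can be made concrete with a minimal sketch. The function name and the standard-deviation normalisation below are illustrative assumptions, not the paper's exact formulation: each trajectory in a group of rollouts sampled for the same task is scored by a checker, and its advantage is measured relative to the group.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages for one group of rollouts on the same
    task: each trajectory's reward minus the group mean, normalised
    by the group standard deviation (illustrative sketch)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # Degenerate group: every trajectory got the same reward,
        # so no trajectory carries a relative learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

Because advantages are computed within a group rather than against a learned value function, no critic model is needed, which is what makes the recipe practical on top of synthetic checker rewards.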
This approach yields consistent improvements over standard supervised fine-tuning, overcoming the noisy reward signals that user simulation often introduces into RL. Experiments on tau^2-bench showed that their best model attained a 73.0% pass rate on the Airline task and a 98.3% pass rate on the Telecom task, matching or surpassing existing frontier models.
The core innovation lies in the combination of self-evolving data generation and verifiable-reward RL, enabling the bootstrapping of complex tool-using behaviours without relying on extensive and costly human annotation. EigenData’s hierarchical structure allows for the creation of intricate tasks and coherent simulated user interactions, while the executable checkers provide robust reward signals for RL training.
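The executable per-instance checkers mentioned above are what turn synthetic dialogues into verifiable rewards: a checker inspects the final environment state and emits a binary score. The sketch below is a hedged illustration; the state keys and the flight id are invented for this example and do not reflect the paper's actual schema.

```python
def checker_flight_changed(env_state):
    """Illustrative per-instance checker: reward 1.0 only if the
    booking was moved to the target flight and the old reservation
    was cancelled; 0.0 otherwise.  State keys are assumptions."""
    reservation = env_state.get("reservation", {})
    ok = (
        reservation.get("flight_id") == "HAT123"
        and reservation.get("status") == "confirmed"
        and env_state.get("old_reservation_status") == "cancelled"
    )
    return 1.0 if ok else 0.0
```

Because the checker runs against the simulator's state rather than judging the dialogue text, the reward is robust to paraphrase and cannot be gamed by fluent but incorrect agent responses.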
This system effectively addresses the bottlenecks of scalable data acquisition and unstable user simulation, paving the way for more reliable and efficient post-training of interactive agents. The study establishes a pathway towards creating agents that can seamlessly integrate into real-world applications, handling tasks such as flight changes or service requests with a high degree of accuracy and adaptability.
By achieving state-of-the-art results with fully open-weight models, this work demonstrates the potential for democratising access to advanced AI capabilities in interactive tool use, reducing the dependence on proprietary technologies and large-scale human datasets. The research suggests a scalable solution for building intelligent agents capable of complex, multi-turn interactions in dynamic environments.
EigenData construction and reinforcement learning via Group-Relative Policy Optimisation offer promising results
Scientists developed EigenData, a hierarchical multi-agent system, to address challenges in post-training interactive tool-using agents. The research team engineered a self-evolving data agent that synthesises tool-grounded dialogues alongside executable per-instance checkers, enabling reliable generation through a closed-loop self-evolving process.
This process iteratively updates both the dialogue generation and the verification workflow, improving data quality and consistency. Building upon this synthetic data, researchers devised a reinforcement learning (RL) recipe that initially fine-tunes the user model. Subsequently, they applied Group-Relative Policy Optimisation (GRPO)-style training, utilising trajectory-level group-relative advantages and dynamic filtering to enhance training efficiency.
This approach achieves consistent improvements beyond supervised fine-tuning (SFT) alone by exploiting group-level reward comparisons rather than per-trajectory scores. Experiments employed the tau^2-bench evaluation platform, where the best model attained a 73.0% pass rate on the Airline task and a 98.3% pass rate on the Telecom task.
These results demonstrate performance matching or exceeding that of current frontier models in complex, multi-turn interactions. The study also introduced a method for generating realistic user profiles, varying experience levels from novice to expert and incorporating diverse backgrounds to create varied scenarios.
To rigorously test agent capabilities, the team defined specific testing objectives including sequential decision-making, multi-constraint verification, deception detection, and edge-case navigation. They also implemented a suite of deception tactics, such as false status claims and emotional manipulation, to assess the agent’s robustness.
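The profile and testing dimensions listed above can be pictured as a scenario sampler for data synthesis. The field names and tactic labels below paraphrase the ones mentioned in the text and are illustrative, not the paper's exact schema.

```python
import random

# Illustrative dimensions drawn from the description above; the exact
# schema used by EigenData is an assumption for this sketch.
EXPERIENCE_LEVELS = ["novice", "intermediate", "expert"]
TEST_OBJECTIVES = [
    "sequential_decision_making",
    "multi_constraint_verification",
    "deception_detection",
    "edge_case_navigation",
]
DECEPTION_TACTICS = ["false_status_claim", "emotional_manipulation", None]

def sample_scenario(rng=random):
    """Draw one simulated-user scenario for dialogue synthesis."""
    return {
        "experience": rng.choice(EXPERIENCE_LEVELS),
        "objective": rng.choice(TEST_OBJECTIVES),
        "deception": rng.choice(DECEPTION_TACTICS),  # None = honest user
    }
```

Sampling across these axes independently is one simple way to get the scenario diversity the evaluation calls for; honest users (deception is None) keep the agent from over-triggering on deception cues.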
The TaskValidationAgent parses task instructions, maps goals to required functions, and verifies preconditions through tool execution, including calls to ‘get user details’ and ‘get reservation details’. This validation methodology ensures that every generated task is completable within the simulated environment, bolstering the reliability of the training data and the resulting agent performance.
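The TaskValidationAgent's precondition check can be sketched as follows. The task fields and the dict-of-callables interface are assumptions made for this illustration; only the two tool names come from the text above.

```python
def validate_task(task, tools):
    """Sketch of a precondition check in the spirit of the
    TaskValidationAgent: execute read-only lookups and accept the
    task only if every referenced entity exists in the simulated
    environment.  `tools` maps tool names to callables (assumed)."""
    user = tools["get_user_details"](task["user_id"])
    if user is None:
        return False  # user missing: task cannot be completed
    for rid in task.get("reservation_ids", []):
        if tools["get_reservation_details"](rid) is None:
            return False  # dangling reservation reference
    return True
```

Running the lookups for real, instead of trusting the task text, is what guarantees that every surviving task is actually completable in the simulator.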
EigenData achieves superior performance on benchmark tool-use tasks via self-evolving synthetic data generation and iterative refinement
Scientists achieved significant breakthroughs in generating high-quality multi-turn tool-use data through a self-evolving data agent called EigenData. The team measured the performance of their system on tau^2-bench, a standardised evaluation platform for interactive tool-using agents. Results demonstrated that EigenData could synthesise complex dialogues and executable verification functions, improving generation reliability via a closed-loop self-evolving process.
Experiments revealed that EigenData’s hierarchical multi-agent framework significantly enhanced the quality of synthetic data. The best model reached 73.0% pass^1 on Airline tasks and 98.3% pass^1 on Telecom tasks, outperforming existing models. These precise values confirm the effectiveness of combining self-evolving data synthesis with reinforcement learning (RL) for interactive tool-use agents.
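The pass^1 figures above are the k=1 case of the pass^k metric used by the tau-bench family, where a task counts as passed only if all k independent trials succeed. A sketch of the usual unbiased estimator follows; treating this as exactly the evaluation code used in the paper would be an assumption.

```python
from math import comb

def pass_hat_k(trial_results, k):
    """pass^k in the tau-bench style: average over tasks of
    C(c, k) / C(n, k), where n is the number of trials run for the
    task and c the number that succeeded.  Assumes k <= n for every
    task.  `trial_results` maps task ids to lists of booleans."""
    total = 0.0
    for trials in trial_results.values():
        n, c = len(trials), sum(trials)
        total += comb(c, k) / comb(n, k)
    return total / len(trial_results)
```

At k=1 this reduces to the mean success rate; larger k penalises inconsistency, since a task that succeeds only sometimes contributes little.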
Experiments show that EigenData’s approach consistently improves performance beyond supervised fine-tuning (SFT) alone. By first fine-tuning the user model and then applying GRPO-style training, the researchers achieved consistent gains. The breakthrough delivers a scalable pathway for bootstrapping complex tool-using behaviours without relying on expensive human annotation.
EigenData facilitates self-improving tool use via synthetic dialogue and reinforcement learning, ultimately enhancing agent capabilities
Scientists have developed a new framework for training interactive agents that utilise tools, addressing challenges in scaling data synthesis and reinforcement learning. The research introduces EigenData, a hierarchical multi-agent system designed to generate tool-grounded dialogues alongside executable checkers, facilitating a self-evolving process to enhance data reliability.
This synthetic data is then used in a reinforcement learning process that fine-tunes user models and employs group-relative advantages with dynamic filtering, consistently improving performance beyond supervised fine-tuning. Evaluations on the tau^2-bench benchmark demonstrate the effectiveness of this approach, with the best model achieving 73.0% pass rate on the Airline task and 98.3% on Telecom, matching or surpassing existing leading models.
The findings suggest a viable pathway for creating complex tool-using agents without extensive human annotation, potentially lowering the barrier to entry for developing capable agents in areas like customer support and workflow automation. The authors acknowledge limitations related to the constrained benchmark environments used for training and evaluation, and highlight the need for careful monitoring and responsible deployment to mitigate risks associated with increased tool-use competence. Future work should incorporate stricter permissioning, auditing, and policy enforcement for tool access, alongside continued monitoring of societal impacts as these systems advance.
👉 More information
🗞 From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents
🧠 ArXiv: https://arxiv.org/abs/2601.22607
