Youtu-Agent Achieves Scalable LLM Agent Productivity with Automated Generation and Hybrid Optimisation

Large language model (LLM) agents currently demand significant manual configuration and struggle to adapt to changing circumstances, hindering their widespread application. Yuchen Shi, Yuzheng Cai, and Siqi Cai, all from TencentCloudADP, alongside Zihan Xu, Lichao Chen and Yulei Qin, present Youtu-Agent, a novel framework designed to overcome these limitations through automated generation and continuous learning. This research introduces a modular system that separates key agent components, allowing for flexible reuse and automatic construction of both workflows and more complex agent configurations. Crucially, Youtu-Agent incorporates a hybrid policy optimisation system, enabling agents to learn from experience and improve performance without constant, costly fine-tuning. Demonstrating state-of-the-art results on benchmarks like WebWalkerQA and GAIA, and achieving significant speedups in training, Youtu-Agent represents a substantial step towards scalable and adaptable LLM agents.

Scientists engineered a structured configuration system that separates execution environments, toolkits, and context management, facilitating flexible component reuse and automated agent synthesis. Two distinct generation paradigms were introduced: a Workflow mode for standard tasks and an Agent mode for complex, non-standard challenges, enabling the automatic generation of both tool code and prompting configurations. To facilitate agent learning, the study introduced a hybrid policy optimization system comprising two key modules.
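Before turning to those modules, here is a minimal sketch of what such a decoupled configuration could look like, with execution environment, toolkits, and context management kept as separate, reusable pieces. All class and field names below are illustrative assumptions, not the actual Youtu-Agent API.

```python
# Illustrative sketch of a modular agent configuration that separates
# environment, toolkits, and context management (names are assumptions,
# not the real Youtu-Agent interface).
from dataclasses import dataclass, field


@dataclass
class EnvConfig:
    kind: str = "local"            # e.g. local sandbox vs. remote container
    timeout_s: int = 60


@dataclass
class ToolkitConfig:
    name: str
    entrypoint: str                # import path of the tool implementation


@dataclass
class ContextConfig:
    max_tokens: int = 32_000
    summarise_overflow: bool = True


@dataclass
class AgentConfig:
    mode: str                      # "workflow" for standard tasks, "agent" otherwise
    env: EnvConfig = field(default_factory=EnvConfig)
    toolkits: list[ToolkitConfig] = field(default_factory=list)
    context: ContextConfig = field(default_factory=ContextConfig)


# Because the pieces are decoupled, one toolkit definition can be reused
# by both a simple workflow agent and a more complex agentic setup.
search_tool = ToolkitConfig(name="web_search", entrypoint="tools.search:run")
qa_agent = AgentConfig(mode="workflow", toolkits=[search_tool])
research_agent = AgentConfig(mode="agent", toolkits=[search_tool])
```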

The Agent Practice module allows agents to accumulate experience and enhance performance through in-context optimization, crucially without requiring parameter updates to the underlying LLM. Complementing this, the Agent RL module integrates with distributed training frameworks, enabling scalable and stable reinforcement learning for any Youtu-Agent in a large-scale, end-to-end manner. Experiments utilising the ReAct framework with a code interpreter tool demonstrated state-of-the-art performance on the WebWalkerQA benchmark, achieving a score of 71.47%, and on GAIA, reaching 72.8%, both with open-weight models. The automated generation pipeline achieved an 81% success rate in tool synthesis, while the Practice module demonstrably improved performance on the AIME 2024 and 2025 datasets by +2.7% and +5.4% respectively.
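As a rough illustration of the in-context optimization idea behind the Practice module, the sketch below accumulates lessons from past attempts into a text note that is prepended to future prompts, so the agent improves while the underlying LLM stays frozen. The `call_llm` function, the reflection prompt, and the note format are all assumptions for illustration, not the paper's actual interface.

```python
# Sketch of in-context experience accumulation: distil lessons from
# practice runs into a note reused at inference, with no weight updates.
from typing import Callable


def practice(problems: list[dict], call_llm: Callable[[str], str]) -> str:
    """Return a compact experience note distilled from practice runs.

    Each problem is a dict like {"question": ..., "answer": ...}; the
    reference answer is optional, mirroring the label-free setting.
    """
    notes = ""
    for p in problems:
        attempt = call_llm(f"{notes}\n\nSolve: {p['question']}")
        reference = p.get("answer", "not available")
        # Reflection step: fold what this attempt taught us back into the
        # running note instead of updating model parameters.
        notes = call_llm(
            "You maintain a concise list of problem-solving lessons.\n"
            f"Current lessons:\n{notes}\n\n"
            f"Question: {p['question']}\n"
            f"Attempt: {attempt}\n"
            f"Reference answer: {reference}\n"
            "Rewrite the lessons, keeping them short and general:"
        )
    return notes


# At test time the frozen model is simply conditioned on the learned note:
#   answer = call_llm(f"{notes}\n\nSolve: {new_question}")
```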

Furthermore, Agent RL training achieved a 40% speedup with consistent performance gains on 7B LLMs, enhancing coding and reasoning capabilities by up to 35% on maths benchmarks and improving search performance by 21% on general and multi-hop QA benchmarks. During the learning stage, the team randomly sampled 100 problems from the DAPO-Math-17K dataset, ran 3 epochs with a group size of 5 at a temperature of 0.7, then lowered the temperature to 0.3 for online testing. The team also implemented infrastructure optimizations, including concurrency control and RESTful API wrapping, reducing iteration time by approximately 40% and enabling scaling to 128 GPUs without timeout issues. With modifications addressing “entropy explosion”, the actor’s KL divergence and gradient norm remained stable throughout reinforcement learning, while the critic score rose steadily in line with validation accuracy, evidence of a robust execution framework and effective training for continuously improving agents.
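The sketch below makes the reported learning-stage hyperparameters explicit and shows the kind of concurrency control that keeps large-scale rollouts from timing out. Only the numeric values come from the text; the config keys, function names, the semaphore limit, and the simulated rollout body are illustrative assumptions.

```python
# Sketch of the learning-stage setup with the stated hyperparameters.
import asyncio
import random

config = {
    "dataset": "DAPO-Math-17K",
    "num_problems": 100,       # randomly sampled from the dataset
    "epochs": 3,
    "group_size": 5,           # rollouts per problem
    "learn_temperature": 0.7,  # exploration during learning
    "test_temperature": 0.3,   # sharper decoding for online testing
}


async def rollout(problem: str, temperature: float, sem: asyncio.Semaphore) -> str:
    # Concurrency control in the spirit of the reported infrastructure
    # work: bounding in-flight episodes avoids timeouts when scaling out.
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for a real agent episode
        return f"trajectory({problem!r}, T={temperature})"


async def learn(dataset: list[str]) -> list[str]:
    problems = random.sample(dataset, config["num_problems"])
    sem = asyncio.Semaphore(64)    # cap on simultaneous rollouts (assumed value)
    trajectories: list[str] = []
    for _ in range(config["epochs"]):
        tasks = [
            rollout(p, config["learn_temperature"], sem)
            for p in problems
            for _ in range(config["group_size"])
        ]
        trajectories += await asyncio.gather(*tasks)
    return trajectories


# Example: asyncio.run(learn([f"q{i}" for i in range(1000)]))
```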

Youtu-Agent Achieves State-of-the-Art Agent Performance

Scientists achieved state-of-the-art performance on the WebWalkerQA benchmark, reaching an accuracy of 71.47%, and on the GAIA benchmark, attaining 72.8% using open-weight language models. These results demonstrate the capabilities of Youtu-Agent, a novel framework designed for automated generation and continuous evolution of LLM agents. The team measured tool synthesis success rates exceeding 81% with their automated generation pipeline, indicating a high degree of reliability in creating functional agent components. Experiments revealed that the Agent Practice module improved performance on the AIME 2024 benchmark by +2.7% and on the AIME 2025 benchmark by +5.4%, utilising in-context optimization without altering model parameters.

This improvement was achieved using the DeepSeek-V3.1-Terminus model, culminating in scores of 82.7% on AIME 2024 and 73.3% on AIME 2025, even when training without ground-truth labels. The work highlights a cost-effective approach, requiring only 100 training examples and approximately $18 in learning costs, compared to alternative methods demanding over 10,000 samples and $10,000 for model fine-tuning. Tests further show that Agent RL scales as well, delivering the 40% training speedup and consistent gains on 7B models noted above.

👉 More information
🗞 Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization
🧠 ArXiv: https://arxiv.org/abs/2512.24615

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
