Current computer-use agents (CUAs) struggle with complex, long-horizon tasks because they rely on static datasets. Taofeng Xue and Chong Peng from Meituan and Mianqiu Huang from Fudan University, alongside Linsen Guo, Tiancheng Han, Haozhe Wang et al., present EvoCUA, a model that addresses this bottleneck by evolving through self-generated experience. The work introduces a self-sustaining cycle of data generation and policy optimisation, using a verifiable synthesis engine and scalable infrastructure to create and learn from tens of thousands of simulated computer interactions. With a success rate of 56.7% on the OSWorld benchmark, a new state of the art among open-source models, EvoCUA outperforms models such as OpenCUA-72B and UI-TARS-2 and points to a robust, scalable path towards capable native computer use agents.

EvoCUA learns via self-generating task evolution

Scientists have unveiled EvoCUA, a native computer use agent that represents a significant advance in multimodal artificial intelligence. Unlike existing approaches limited by static datasets, EvoCUA integrates data generation and policy optimisation into a self-sustaining evolutionary cycle, overcoming the constraints of traditional imitation learning. The team addressed the critical bottleneck in scaling computer use agents: static data cannot capture the complex causal dynamics of long-horizon tasks. To mitigate data scarcity, the researchers developed a verifiable synthesis engine that autonomously generates diverse tasks, each paired with an executable validator that ensures environmental grounding and precise supervision.

This innovative engine moves beyond simple text generation, instead analysing atomic capabilities to create self-contained task definitions, effectively eliminating ambiguity and providing deterministic feedback. To facilitate large-scale experience acquisition, the scientists designed a scalable infrastructure orchestrating tens of thousands of asynchronous sandbox rollouts, functioning as a dynamic gymnasium for continuous, on-policy optimisation. Building upon these massive trajectories, they propose an iterative evolving learning strategy that efficiently internalises experience by dynamically regulating policy updates, reinforcing successful routines and transforming failure trajectories into rich supervision through error analysis and self-correction. This mechanism allows the agent to focus intensely on boundary tasks, mirroring human learning dynamics and accelerating progress.
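The pairing of each generated task with an executable validator can be illustrated with a minimal sketch. Everything named here (`EnvState`, `SyntheticTask`, `make_rename_task`) is a hypothetical stand-in rather than part of EvoCUA; the point is only that success is decided by code that inspects the environment, which yields a deterministic pass/fail signal.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-in for a sandbox's observable state (file name -> contents).
EnvState = Dict[str, str]


@dataclass(frozen=True)
class SyntheticTask:
    instruction: str                       # natural-language goal shown to the agent
    validator: Callable[[EnvState], bool]  # executable check on the final state


def make_rename_task(old: str, new: str) -> SyntheticTask:
    """Build a self-contained file-rename task together with its validator."""
    def validator(state: EnvState) -> bool:
        # Success is defined on the environment, not on the agent's text output.
        return new in state and old not in state

    return SyntheticTask(f"Rename {old} to {new}", validator)


task = make_rename_task("draft.txt", "final.txt")
print(task.validator({"final.txt": "hello"}))  # True
print(task.validator({"draft.txt": "hello"}))  # False
```

Because the validator is generated alongside the instruction, the same object serves as both the task definition and its reward function in this sketch.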

Empirical evaluations on the OSWorld benchmark show that EvoCUA achieves a success rate of 56.7%, establishing a new open-source state of the art in native computer use. Notably, EvoCUA significantly outperforms the previous best open-source model, OpenCUA-72B, which achieved a 45.0% success rate, and surpasses leading closed-weights models such as UI-TARS-2 (53.1%). The study reveals that this evolving paradigm, driven by learning from experience, yields consistent performance gains across foundation models of varying scales, establishing a robust and scalable path for advancing native agent capabilities. Crucially, the research marks a shift from data scaling via static traces to experience scaling via massive interactive rollouts, which provide a richer supervisory signal than static text. This work opens new avenues for developing generalist agents capable of mastering graphical user interfaces and emulating human-computer interaction with increased reliability and efficiency, bringing artificial general intelligence closer to reality. The team's three contributions (verifiable synthesis, scalable infrastructure, and evolutionary optimisation) collectively address the core challenges of building capable computer use agents.

Method

Scientists developed EvoCUA, a native computer use model, to overcome the limitations of static data scaling in multimodal learning. Unlike traditional imitation learning approaches, this work integrates data generation and policy optimisation within a self-sustaining evolutionary cycle, enabling continuous improvement. To address data scarcity, the team engineered a verifiable synthesis engine that autonomously generates diverse computer tasks alongside executable validators, producing challenging and realistic scenarios. The team also designed a scalable infrastructure that orchestrates tens of thousands of asynchronous sandbox rollouts, facilitating large-scale experience acquisition.
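As a rough illustration of asynchronous rollouts under a concurrency cap, the sketch below uses Python's standard `asyncio`. The sandbox is simulated: `run_rollout`, `collect`, the 5% per-step success chance, and the cap of 8 are all assumptions made for the demo, and only the 50-step budget comes from the article.

```python
import asyncio
import random
from typing import List, Tuple

MAX_STEPS = 50       # the 50-step interaction budget cited in the article
MAX_CONCURRENCY = 8  # assumption for the demo; the real system runs far more sandboxes


async def run_rollout(task_id: int, sem: asyncio.Semaphore) -> Tuple[int, bool, int]:
    """Simulate one sandbox episode; returns (task_id, success, steps_used)."""
    rng = random.Random(task_id)     # deterministic per task for the demo
    async with sem:                  # cap the number of live sandboxes
        for step in range(1, MAX_STEPS + 1):
            await asyncio.sleep(0)   # placeholder for one environment step
            if rng.random() < 0.05:  # placeholder for "validator passed"
                return task_id, True, step
        return task_id, False, MAX_STEPS


async def collect(n_tasks: int) -> List[Tuple[int, bool, int]]:
    """Launch all rollouts at once and gather results in task order."""
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    return await asyncio.gather(*(run_rollout(i, sem) for i in range(n_tasks)))


results = asyncio.run(collect(100))
succeeded = sum(ok for _, ok, _ in results)
print(f"{succeeded} of {len(results)} simulated rollouts succeeded")
```

The semaphore is the design choice worth noting: all episodes are scheduled up front, but only a bounded number hold a sandbox at any moment.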

The team uses these trajectories to implement an iterative evolving learning strategy that dynamically regulates policy updates by pinpointing capability boundaries. Successful routines are reinforced, while failure trajectories are transformed into rich supervision through detailed error analysis and self-correction, a process that actively refines the model's understanding. The approach was applied as post-training to foundation models of varying scale: Qwen3-VL-Thinking (8B, 32B) and OpenCUA (7B, 32B, 72B). Evaluations on the OSWorld benchmark show that EvoCUA achieves a success rate of 56.7%, establishing a new state of the art for open-source models.
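One simple way to picture "pinpointing capability boundaries" is to estimate a per-task success rate from repeated rollouts and keep the tasks that are neither mastered nor hopeless. The sketch below does exactly that; the function name `boundary_tasks`, the 0.2 and 0.8 thresholds, and the task names are illustrative assumptions, not the selection rule used by the authors.

```python
from typing import Dict, List


def boundary_tasks(outcomes: Dict[str, List[bool]],
                   low: float = 0.2, high: float = 0.8) -> List[str]:
    """Return task ids whose empirical success rate lies in [low, high]."""
    selected = []
    for task_id, results in outcomes.items():
        if not results:
            continue  # no rollouts yet, nothing to estimate
        rate = sum(results) / len(results)
        if low <= rate <= high:
            selected.append(task_id)
    return selected


outcomes = {
    "open_file":   [True, True, True, True],      # mastered
    "pivot_table": [True, False, False, True],    # boundary
    "macro_debug": [False, False, False, False],  # currently out of reach
}
print(boundary_tasks(outcomes))  # ['pivot_table']
```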

Notably, EvoCUA outperforms OpenCUA-72B (45.0%) by 11.7 percentage points and surpasses the leading closed-weights model, UI-TARS-2 (53.1%), by 3.6 percentage points. The team constrained all models to a 50-step interaction budget, revealing EvoCUA's superior execution precision compared with baselines that typically require 100 steps. The research also highlights the generalisability of the approach: the evolving paradigm, driven by learning from experience, consistently yields performance gains across foundation models of different scales. EvoCUA-8B achieved a 46.1% success rate, exceeding the performance of the 72B-parameter OpenCUA-72B and demonstrating the efficiency of the data synthesis and reinforcement learning strategies. These results indicate that the methodology unlocks greater potential from foundation models, establishing a robust and scalable path for advancing native computer use capabilities.
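The margins quoted above are differences between success rates, so they are percentage points rather than relative percentages. A short check using only the figures reported in the article makes the distinction concrete; the relative figures are derived here and do not appear in the source.

```python
# OSWorld success rates (%) as reported in the article.
scores = {"EvoCUA": 56.7, "OpenCUA-72B": 45.0, "UI-TARS-2": 53.1}

for baseline in ("OpenCUA-72B", "UI-TARS-2"):
    gap = scores["EvoCUA"] - scores[baseline]  # absolute gap in percentage points
    relative = gap / scores[baseline] * 100    # gap relative to the baseline score
    print(f"{baseline}: +{gap:.1f} points ({relative:.1f}% relative)")

# OpenCUA-72B: +11.7 points (26.0% relative)
# UI-TARS-2: +3.6 points (6.8% relative)
```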

EvoCUA learns via self-generated validated tasks

Scientists have developed EvoCUA, a novel native computer use agent, representing a significant advancement in multimodal artificial intelligence. This work addresses the limitations of current models reliant on static datasets by integrating data generation and policy optimisation into a self-sustaining evolutionary cycle. To overcome data scarcity, the team developed a verifiable synthesis engine capable of autonomously generating diverse tasks, each coupled with an executable validator, ensuring strict environmental grounding. This “Generation-as-Validation” approach delivers precise, deterministic supervision signals, eliminating ambiguity inherent in natural language rewards.

Experiments utilising a scalable infrastructure orchestrating tens of thousands of asynchronous sandbox rollouts enabled large-scale experience acquisition. The infrastructure functions as a dynamic gymnasium, providing real-time feedback and state transitions crucial for on-policy optimisation. Building on these massive trajectories, researchers proposed an iterative evolving learning strategy that efficiently internalises this experience, dynamically regulating policy updates by identifying capability boundaries. This mechanism reinforces successful routines while transforming failure trajectories into rich supervision through error analysis and self-correction, a process mirroring human learning dynamics.

Empirical evaluations on the OSWorld benchmark show that EvoCUA achieves a success rate of 56.7%, establishing a new open-source state of the art. Notably, EvoCUA significantly outperforms the previous best open-source model, OpenCUA-72B, which achieved a 45.0% success rate, and surpasses leading closed-weights models such as UI-TARS-2 (53.1%). Measurements confirm that the evolving paradigm driven by learning from experience yields consistent performance gains across foundation models of varying scales, establishing a robust and scalable path for advancing native agent capabilities. Further data shows that performance steadily increased with model scale, moving from 26.6% with a Seed-1.8 model to 62.9% with EvoCUA-32B. The team recorded consistent gains across both open-weight and closed-weight models, highlighting the generalisability of the approach. The result is a substantial improvement in computer use agent performance, paving the way for more reliable and adaptable artificial intelligence systems.

EvoCUA significantly surpasses OpenCUA on the OSWorld benchmark

Scientists have developed EvoCUA, a new native computer use model that overcomes limitations in existing systems reliant on static data scaling. This model integrates data generation and policy optimisation within a self-sustaining evolutionary cycle, addressing the challenges of capturing complex causal dynamics in long-horizon computer tasks. To overcome data scarcity, researchers created a verifiable synthesis engine to autonomously generate diverse tasks alongside executable validators, and a scalable infrastructure was designed to manage tens of thousands of asynchronous sandbox rollouts. Empirical evaluations on the OSWorld benchmark demonstrate EvoCUA achieves a 56.7% success rate, establishing a new state-of-the-art for open-source models.

Notably, EvoCUA outperforms OpenCUA-72B (45.0%) and surpasses leading closed-weights models such as UI-TARS-2 (53.1%). Ablation studies confirm that each component of the evolutionary cycle (unified action space, cold-start training, rejection fine-tuning, and iterative training) contributes significant and consistent performance gains. The authors acknowledge a performance decline in variants based on the Qwen3-VL-Thinking foundation model, attribute it to differences in general data distribution, and plan to address it with an upgraded dataset. Future research will focus on incorporating this improved dataset to further enhance model generalisation and capabilities.
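Of the ablated components, rejection fine-tuning is the easiest to sketch in isolation: sample several rollouts per task, discard those the validator rejects, and fine-tune on what remains. The filter below is a minimal sketch under stated assumptions. The `Trajectory` type, the `rejection_filter` name, and the choice to keep only the shortest passing trajectory per task are illustrative and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    task_id: str
    actions: List[str]
    passed: bool  # verdict of the task's executable validator


def rejection_filter(rollouts: List[Trajectory], max_steps: int = 50) -> List[Trajectory]:
    """Keep validated trajectories within budget, shortest one per task."""
    best: Dict[str, Trajectory] = {}
    for traj in rollouts:
        if not traj.passed or len(traj.actions) > max_steps:
            continue  # rejected: failed validation or exceeded the step budget
        current = best.get(traj.task_id)
        if current is None or len(traj.actions) < len(current.actions):
            best[traj.task_id] = traj
    return list(best.values())


rollouts = [
    Trajectory("rename_file", ["click", "type", "enter"], passed=True),
    Trajectory("rename_file", ["click", "enter"], passed=True),
    Trajectory("export_pdf", ["click"] * 3, passed=False),
]
kept = rejection_filter(rollouts)
print([(t.task_id, len(t.actions)) for t in kept])  # [('rename_file', 2)]
```

The surviving trajectories would then serve as supervised fine-tuning data; the failed `export_pdf` attempt is the kind of sample the article says is instead routed to error analysis and self-correction.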

👉 More information
🗞 EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience
🧠 ArXiv: https://arxiv.org/abs/2601.15876



Rohail T.
