The quest to create truly intelligent operating systems takes a significant step forward with ColorAgent, a new system developed by Ning Li, Qiqiang Lin, Zheng Wu, and colleagues. This research introduces an operating system agent capable of sustained, reliable interaction with a device, moving beyond simple automation towards a genuinely collaborative user experience. ColorAgent achieves this through a combination of advanced reinforcement learning and a novel multi-agent framework, ensuring both consistency and adaptability in complex environments. The team demonstrates the system’s effectiveness on standard benchmarks, achieving state-of-the-art success rates and paving the way for operating systems that proactively understand and respond to individual user needs.
LLMs, Reinforcement Learning and GUI Automation
This document presents a comprehensive overview of recent research concerning GUI automation and mobile agents, categorizing key findings and outlining prominent trends in the field. The research focuses on leveraging Large Language Models (LLMs) to understand user intent and control mobile devices, often integrating these models with reinforcement learning (RL) techniques to train agents for effective interaction. A dominant approach involves using RL, particularly Proximal Policy Optimization (PPO), to train agents to interact with graphical user interfaces. Researchers are actively investigating methods for evaluating and improving the quality of agent-generated action sequences, employing discriminators to assess task completion, action validity, and redundancy.
Several studies explore the use of human demonstrations to bootstrap learning or provide feedback to agents, while others utilize graph-structured frameworks to represent GUI state and task structure. The development of foundation models capable of handling a wide range of GUI tasks and agents capable of self-reflection and error diagnosis are also prominent areas of investigation. Recent work has produced several foundation models and generalist agents, including OS-ATLAS, OpenCUA, and AndroidLab, designed for broad applicability in GUI automation. Many studies implicitly integrate LLMs to enhance agent capabilities, while others, like VeriOS and Gem, explicitly focus on LLM-based interaction and proactive engagement.
Further research concentrates on refining RL techniques, such as multi-turn learning and multi-agent collaboration, to improve performance and robustness. Key actions within the GUI action space include clicking, long-pressing, swiping, typing, and system button presses. The field is rapidly evolving, with a surge of recent work building upon established RL and human-computer interaction research. Scaling agents to handle complex, long-horizon tasks remains a significant challenge, requiring more sophisticated planning and reasoning capabilities. Robustness and generalization are also critical, as agents must be able to handle variations in GUI layouts and user behavior. Explainability and trust are paramount, as users need to understand the reasoning behind agent actions. The combination of LLMs and RL offers a promising path forward, leveraging the strengths of both approaches to create intelligent and adaptable operating system agents.
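The action space described above (clicking, long-pressing, swiping, typing, and system button presses) can be made concrete with a small sketch. The class and field names below are illustrative assumptions, not ColorAgent's actual interface; they show the kind of structured action payload a GUI agent might emit for a device executor.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class GUIAction(Enum):
    """The core action types named in the text."""
    CLICK = "click"
    LONG_PRESS = "long_press"
    SWIPE = "swipe"
    TYPE = "type"
    SYSTEM_BUTTON = "system_button"

@dataclass
class ActionCommand:
    """A single agent action with optional coordinates, text, or button name."""
    action: GUIAction
    x: Optional[int] = None
    y: Optional[int] = None
    text: Optional[str] = None
    button: Optional[str] = None

    def to_dict(self) -> dict:
        # Serialize to the kind of payload a device executor might consume.
        payload = {"action": self.action.value}
        if self.x is not None and self.y is not None:
            payload["coordinate"] = [self.x, self.y]
        if self.text is not None:
            payload["text"] = self.text
        if self.button is not None:
            payload["button"] = self.button
        return payload

# A tap at screen coordinates, then typing a query into the focused field
tap = ActionCommand(GUIAction.CLICK, x=540, y=1200)
type_query = ActionCommand(GUIAction.TYPE, text="weather tomorrow")
```

Keeping actions as typed records rather than free-form strings makes it straightforward for downstream discriminators to check action validity.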
Self-Evolving Training for Mobile Interaction Agents
Researchers engineered ColorAgent, an operating system agent designed for robust and long-lasting interaction with mobile devices, employing a novel self-evolving training cycle to overcome limitations of static datasets. This cycle systematically generates, evaluates, and refines interaction data, continuously enhancing the model’s performance and enabling it to move beyond the constraints of pre-defined training examples. The process begins with domain experts crafting a foundational set of high-quality queries, ensuring practical relevance, which are then expanded upon using a large language model to create a diverse range of interactions. The core of the training methodology involves an iterative three-stage process: trajectory rollout, trajectory filtering, and fine-tuning.
During trajectory rollout, the model interacts with both virtual and physical devices, utilizing the generated queries to produce step-by-step interaction trajectories, repeating each query multiple times with varied conditions to reflect real-world user behavior. A stringent filtering module then evaluates trajectory quality, employing specialized discriminators to assess task completion, action validity, and reasoning coherence, retaining only high-quality data for further refinement. Incorrect trajectories identified by these discriminators undergo manual evaluation and correction, with these corrected examples integrated into the fine-tuning dataset to provide targeted training signals. Analysis revealed that limitations in generalization, reflection, and consistency hindered robust deployment in real-world mobile environments. This insight motivated the development of a multi-agent framework, designed to address these shortcomings by enabling dynamic adaptation to minor UI variations and facilitating the incorporation of external knowledge, allowing the agent to learn from past experiences and evolve its strategy over time. The team achieved state-of-the-art success rates on the AndroidWorld and AndroidLab benchmarks, demonstrating the effectiveness of this innovative methodology.
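The rollout, filter, and fine-tune cycle described above can be sketched as a simple loop. This is a minimal illustration under assumed interfaces: the `rollout`, `correct`, and `fine_tune` callables and the `Trajectory` record are hypothetical stand-ins, and the discriminators are placeholders for the paper's checks on task completion, action validity, and reasoning coherence.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    """One step-by-step interaction trace for a query (sketch)."""
    query: str
    steps: List[dict]
    passed: bool = True

def passes_all(traj: Trajectory, discriminators: List[Callable[[Trajectory], bool]]) -> bool:
    """A trajectory is kept only if every discriminator accepts it."""
    return all(check(traj) for check in discriminators)

def self_evolving_round(queries, rollout, discriminators, correct, fine_tune):
    """One iteration of the rollout -> filter -> fine-tune cycle (sketch)."""
    accepted, rejected = [], []
    for q in queries:
        for traj in rollout(q):  # each query is rolled out, possibly under varied conditions
            (accepted if passes_all(traj, discriminators) else rejected).append(traj)
    # Rejected trajectories are reviewed and corrected (manually, in the paper),
    # then folded back into the fine-tuning set as targeted training signal.
    corrected = [correct(t) for t in rejected]
    fine_tune(accepted + corrected)
    return accepted, corrected

# Toy illustration: one query succeeds, one fails and is corrected
def _rollout(q):
    return [Trajectory(q, steps=[], passed=(q != "open settings"))]

def _correct(t):
    t.passed = True
    return t

trained = []
accepted, corrected = self_evolving_round(
    ["check weather", "open settings"],
    rollout=_rollout,
    discriminators=[lambda t: t.passed],
    correct=_correct,
    fine_tune=trained.extend,
)
```

The key property of the loop is that no trajectory reaches fine-tuning without either passing every discriminator or being explicitly corrected, which is what lets the cycle improve on a static dataset over successive rounds.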
Android Agent Learns Complex Long-Horizon Tasks
The development of ColorAgent represents a significant breakthrough in operating system interaction, establishing a new standard for intelligent OS agents capable of robust, long-horizon interactions and personalized user engagement. Researchers achieved state-of-the-art performance on dynamic Android benchmarks, attaining a success rate of 77.2% on AndroidWorld and 50.7% on AndroidLab, demonstrating a substantial advancement in autonomous task execution. This performance is built upon a two-stage training paradigm, beginning with step-wise reinforcement learning and progressing to self-evolving training, which addresses key challenges like perception ambiguity and action grounding.
The team meticulously constructed a training dataset by leveraging seven public GUI interaction datasets, ensuring data quality through splitting, augmentation, filtering, and enhancement. This approach allows the model to learn generalizable GUI interaction patterns and adapt to complex environments. Experiments on MobileIAR and VeriOS-Bench further demonstrate ColorAgent’s capabilities, achieving performance scores of 58.66% and 68.98% respectively, outperforming all baseline models in human-agent interaction scenarios.
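The four preparation stages named above (splitting, augmentation, filtering, and enhancement) compose naturally into a pipeline. The sketch below is an assumption about how such stages might be wired together, with toy stand-in functions; the paper does not specify these interfaces.

```python
from typing import Callable, Iterable, List

def build_training_set(
    datasets: Iterable[List[dict]],
    split: Callable[[List[dict]], List[dict]],
    augment: Callable[[dict], List[dict]],
    keep: Callable[[dict], bool],
    enhance: Callable[[dict], dict],
) -> List[dict]:
    """Compose the four preparation stages (sketch; stage functions are assumed)."""
    examples = []
    for ds in datasets:
        for ex in split(ds):              # splitting: carve raw data into training units
            for variant in augment(ex):   # augmentation: add varied phrasings / conditions
                if keep(variant):         # filtering: drop low-quality samples
                    examples.append(enhance(variant))  # enhancement: e.g. attach reasoning
    return examples

# Toy stand-ins to show the data flow through the stages
corpus = build_training_set(
    datasets=[[{"q": "tap send", "ok": True}, {"q": "noisy", "ok": False}]],
    split=lambda ds: ds,
    augment=lambda ex: [ex, {**ex, "q": ex["q"].upper()}],
    keep=lambda ex: ex["ok"],
    enhance=lambda ex: {**ex, "reason": "stub"},
)
```

Ordering matters here: filtering after augmentation means each generated variant is quality-checked on its own, rather than inheriting a pass from its source example.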
Central to ColorAgent’s success is its ability to learn when to trust its environment and when to actively seek clarification from the user, fostering a collaborative partnership rather than a simple automation tool. The system utilizes a large language model to identify steps with multiple valid actions, expanding annotations and enabling the model to navigate interface redundancy and diverse user habits. This innovative approach moves beyond rigid annotation paradigms, aligning the model with the complexities of real-world GUI interactions and paving the way for more intuitive and effective operating system agents.
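The annotation-expansion idea above (using a language model to recognize that a step may have several valid actions) can be sketched as follows. The `judge` callable stands in for the LLM call that decides whether an alternative action also fulfills the step's intent; its interface, and all names in this snippet, are illustrative assumptions.

```python
from typing import Callable, Dict, List

def expand_annotations(
    step: Dict,
    candidates: List[Dict],
    judge: Callable[[Dict, Dict], bool],
) -> Dict:
    """Turn a single-label step into a multi-label one (sketch).

    The original annotated action is always kept; each candidate is added
    only if the judge deems it an equally valid way to achieve the intent.
    """
    valid = [step["action"]]
    for cand in candidates:
        if cand != step["action"] and judge(step, cand):
            valid.append(cand)
    return {**step, "valid_actions": valid}

# Toy judge: opening an app via its icon or via search are both acceptable,
# but an unrelated swipe is not.
step = {"intent": "open the clock app",
        "action": {"type": "click", "target": "clock_icon"}}
alts = [
    {"type": "click", "target": "search_then_clock"},
    {"type": "swipe", "target": "notification_shade"},
]
expanded = expand_annotations(step, alts, judge=lambda s, a: a["type"] == "click")
```

Training against `valid_actions` rather than a single gold action is what lets the model tolerate interface redundancy and differing user habits instead of being penalized for any deviation from one annotated path.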
Proactive Agent Learns and Collaborates with Users
ColorAgent represents a significant advance in the development of operating system agents, demonstrating both robust interaction with dynamic environments and a capacity for personalized, proactive engagement with users. By combining step-wise reinforcement learning with a carefully designed multi-agent framework, researchers have achieved a system capable of sustaining long-horizon interactions, successfully completing tasks within the operating system while adapting to changing conditions. Furthermore, the agent’s ability to recognize user intent moves it beyond simple automation, positioning it as a collaborative partner.
👉 More information
🗞 ColorAgent: Building A Robust, Personalized, and Interactive OS Agent
🧠 ArXiv: https://arxiv.org/abs/2510.19386
