OS-Marathon Achieves Robust Agent Benchmarking across 242 Long-Horizon Repetitive Tasks

Scientists are tackling the challenge of automating long, repetitive digital workflows common in tasks like expense report processing and data entry. Jing Wu, Daphne Barretto (Microsoft), and Yiye Chen (Georgia Institute of Technology), alongside colleagues including Nicholas Gydé, Yanan Jian, and Yuhang He, recognised a critical lack of standardised testing for Computer-Use Agents (CUAs) designed for these scenarios. To address this, they have created OS-Marathon, a benchmark comprising 242 tasks across two domains, allowing rigorous evaluation of current state-of-the-art agents. Significantly, the team also developed a remarkably efficient teaching method, using just a few examples, that enables agents to learn a workflow and then handle much larger, previously unseen datasets, demonstrating a pathway towards truly scalable automation.

OS-Marathon: a challenging benchmark for long-horizon agent evaluation

OS-Marathon targets tasks that mirror common professional workflows, such as expense report processing and student grade entry, which present a significant challenge due to their extended duration and structured, recurring sub-workflows. The researchers designed the benchmark specifically to address this gap, offering a standardised platform for assessing long-horizon performance. Their accompanying demonstration approach also sidesteps the difficulty of feeding lengthy workflows directly into current CUAs, whose context length restrictions make that impractical. Experiments reveal inherent difficulties for existing agents, including logical incoherence in task ordering, hallucinations during action planning, and a struggle to maintain consistency across repetitive sub-workflows.
The team discovered that agents frequently execute tasks illogically or attempt actions without grounding them in the current workflow state, leading to failures. Their proposed demonstration strategy provides dual-level instruction, guiding agents in both global planning (orchestrating the repetitive loop) and sub-workflow execution (mastering the fundamental logic of each step). By abstracting workflows into key steps, the method enables state-of-the-art agents to adapt efficiently to larger, unseen data collections. The work establishes a formal definition for long-horizon, repetitive CUA tasks and introduces a benchmark spanning expense reporting and transcript processing domains, utilising seven distinct execution environments.
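The paper's exact demonstration format is not reproduced here, but the dual-level idea can be sketched in a few lines of Python. The class names, fields, and the build_prompt helper below are hypothetical illustrations of how a condensed demonstration might pair a global plan (the repetitive loop) with an abstracted sub-workflow recipe that the agent replays for each data item.

```python
from dataclasses import dataclass

@dataclass
class SubWorkflowDemo:
    """Abstracted recipe for one repetitive unit (hypothetical structure)."""
    name: str             # e.g. "enter_one_receipt"
    key_steps: list[str]  # high-level steps, not raw UI actions

@dataclass
class CondensedDemonstration:
    """Dual-level instruction: a global plan plus one sub-workflow recipe."""
    global_plan: list[str]         # orchestrates the repetitive loop
    sub_workflow: SubWorkflowDemo  # logic learned once, reused per item

def build_prompt(demo: CondensedDemonstration, num_items: int) -> str:
    """Render the demonstration as agent guidance (illustrative only)."""
    lines = ["Global plan:"]
    lines += [f"  {i + 1}. {step}" for i, step in enumerate(demo.global_plan)]
    lines.append(f"Sub-workflow '{demo.sub_workflow.name}' (repeat for all {num_items} items):")
    lines += [f"  - {step}" for step in demo.sub_workflow.key_steps]
    return "\n".join(lines)

demo = CondensedDemonstration(
    global_plan=[
        "Open the expense report system",
        "For each receipt document, run the sub-workflow below",
        "Submit the completed report",
    ],
    sub_workflow=SubWorkflowDemo(
        name="enter_one_receipt",
        key_steps=[
            "Extract vendor, date, and amount from the receipt",
            "Fill the corresponding form fields",
            "Save the entry and verify it appears in the list",
        ],
    ),
)
print(build_prompt(demo, num_items=30))
```

Because the sub-workflow is abstracted into key steps rather than raw UI actions, such a demonstration stays short regardless of how many items the larger, unseen collection contains.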

Furthermore, the researchers evaluated leading CUAs on OS-Marathon, revealing three primary failure modes: logical incoherence, hallucination, and inconsistency. Agents often struggled with complex workflow structures and with maintaining consistency across repetitive sub-workflows, highlighting the need for improved long-horizon reasoning capabilities. The project website, accessible at https://os-marathon.github.io/, provides further details and resources for the research community.

OS-Marathon Benchmark for Long-Horizon Task Evaluation

The research team specifically targeted a gap in existing benchmarks, which largely focus on short-horizon tasks and neglect the challenges presented by extended, iterative workflows common in professional settings. Experiments employed two primary domains, an expense report system and a GPA calculator, each representing a realistic, data-intensive workflow requiring repetitive sub-processes. These domains were chosen to reflect tasks that are tedious for humans but ideally suited for automation via CUAs, owing to their structured and recurring nature. The few-example demonstration approach enables agents to generalise and execute similar workflows on larger, previously unseen data collections, addressing a key limitation of traditional training methods.

Researchers meticulously designed tasks within each domain, varying horizon length and document complexity to enable fine-grained evaluation of agent performance across multiple difficulty levels. They harnessed fully functional web-based systems and local spreadsheet applications as execution environments, creating a diverse and realistic testing ground for the agents. The team observed three primary failure modes in leading CUAs confronted with the OS-Marathon tasks: logical incoherence in task ordering, hallucination during action planning, and failure to ground actions in the current sub-workflow state. For instance, agents frequently attempted to populate system fields without first extracting the relevant data from source documents, leading to errors. This work introduces a standardised benchmark, OS-Marathon, specifically tailored to evaluate CUA performance in long-horizon, repetitive execution scenarios, comprising 242 tasks across two domains and seven distinct execution environments.
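The released benchmark defines its own task schema; as a rough sketch of how a 242-task suite spanning two domains and seven environments might be represented programmatically, the record below uses hypothetical field names and an invented example task, not the authors' actual format.

```python
from dataclasses import dataclass
from enum import Enum

class Domain(str, Enum):
    EXPENSE = "expense_report"
    TRANSCRIPT = "transcript_processing"

@dataclass
class MarathonTask:
    """Illustrative task record; field names are assumptions, not the released schema."""
    task_id: str
    domain: Domain
    environment: str             # one of the seven execution environments
    difficulty_level: int        # controls horizon length and document complexity
    num_sub_workflows: int       # N: how many repetitive units the agent must complete
    source_documents: list[str]  # e.g. paths to receipt or transcript PDFs

example = MarathonTask(
    task_id="expense-L2-017",          # invented identifier
    domain=Domain.EXPENSE,
    environment="web_expense_system",  # placeholder environment name
    difficulty_level=2,
    num_sub_workflows=12,
    source_documents=["receipts/batch_02.pdf"],
)
print(example)
```

Varying the difficulty level and the number of sub-workflows per record is one straightforward way to express the horizon-length and document-complexity axes described above.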

OS-Marathon stresses long-run agent performance with graded difficulty levels

Experiments revealed that challenges predominantly arise from the volume of data instances and the complexity of processing each individual instance, particularly when dealing with multi-page PDFs and dense document layouts. In the Expense domain, Levels 1 and 2 concentrate on fundamental capabilities, while Levels 3 and 4 simulate realistic scenarios with increased receipt volumes, challenging agents to maintain context over longer execution horizons. Similarly, the Transcript domain features three levels determined by course count and layout complexity, progressing from single-page, single-column PDFs to multi-page documents with variable layouts. The workload scales with difficulty, increasing from tens of courses at the lower levels to hundreds in the most advanced tiers.
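The precise level definitions live in the paper; the sketch below only restates the scaling described above as a configuration table, with qualitative descriptions standing in for the benchmark's actual counts.

```python
# Qualitative difficulty ladder restated from the text above; no counts from
# the paper are reproduced here.
EXPENSE_LEVELS = {
    1: "fundamental capabilities, small receipt volume",
    2: "fundamental capabilities, denser documents",
    3: "realistic scenario, increased receipt volume",
    4: "realistic scenario, longest execution horizon",
}

# Only the endpoints of the Transcript ladder are described in the text;
# the middle entry is an interpolation.
TRANSCRIPT_LEVELS = {
    1: "single-page, single-column PDFs (tens of courses)",
    2: "multi-page PDFs with more courses",
    3: "multi-page documents with variable layouts (hundreds of courses)",
}
```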

Synthetic receipts were generated with Large Language Models (LLMs) and rendered with templates to create coherent timelines. The Transcript domain comprises 52 real tasks and 30 synthetic transcript tasks, leveraging pre-built templates and synthesised student profiles. Results demonstrate the effectiveness of this task construction strategy in creating a diverse and challenging benchmark for CUA evaluation. To move beyond simple binary success rates, the researchers introduced Sub-Workflow Accuracy (SWA), a metric quantifying agent performance over extended action sequences. SWA is calculated as the number of correctly executed sub-workflows divided by the total number of sub-workflows (n/N), providing a fine-grained measurement of an agent's reliability in long-horizon tasks. The work also delivers a method to construct a condensed demonstration from only a few examples, enabling agents to execute similar workflows effectively on larger, unseen data collections.
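SWA itself is straightforward to compute once each sub-workflow has been judged correct or incorrect; here is a minimal sketch, assuming the per-sub-workflow verdicts are already available (the judging logic is not shown and would depend on the benchmark's own checkers).

```python
def sub_workflow_accuracy(verdicts: list[bool]) -> float:
    """Sub-Workflow Accuracy (SWA) = n / N, where n is the number of correctly
    executed sub-workflows and N is the total number in the task."""
    if not verdicts:
        raise ValueError("a task must contain at least one sub-workflow")
    return sum(verdicts) / len(verdicts)

# Example: an agent completes 9 of 12 repetitive sub-workflows correctly.
per_item_correct = [True] * 9 + [False] * 3
print(f"SWA = {sub_workflow_accuracy(per_item_correct):.2f}")  # SWA = 0.75
```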

OS-Marathon tests agents on long-horizon repetitive execution

Extensive experimentation revealed the inherent difficulties these long-horizon tasks present for current state-of-the-art agents, with many failing to achieve success even on simpler levels. However, the application of the proposed demonstration method, particularly when combined with the AgentS2.5 framework and GPT-5, significantly improved performance, demonstrating its effectiveness in facilitating agent learning. The authors acknowledge limitations in the scope of the benchmark, currently focusing on Levels 1 and 2 tasks, and the computational expense of full-scale evaluation. Future research will focus on extending OS-Marathon to include Levels 3 and 4, representing more complex challenges for CUAs.

The team also intends to explore methods for further reducing the cost of demonstration creation and improving the generalisability of agents across diverse workflows. These findings highlight the importance of dedicated benchmarks for evaluating long-horizon agent capabilities and suggest that focused demonstration techniques can substantially enhance performance in repetitive, structured tasks. This work contributes to the advancement of practical, reliable CUAs for automating tedious workflows in professional settings.

👉 More information
🗞 OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks
🧠 ArXiv: https://arxiv.org/abs/2601.20650

Rohail T.

As a quantum scientist exploring the frontiers of physics and technology, my work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
