Long Horizon Execution in LLMs Shows Exponential Gains from Scaling, Even at 100% Single-Turn Accuracy

The question of whether continually increasing the size of large language models (LLMs) delivers diminishing returns remains central to artificial intelligence research, and a team led by Akshit Sinha from the University of Cambridge, Arvindh Arun from the Institute for AI at the University of Stuttgart, and Shashwat Goel from the Max Planck Institute for Intelligent Systems now presents compelling evidence that this is not necessarily the case. The researchers demonstrate that even small gains in single-step accuracy can compound into exponential improvements in the length of tasks an LLM successfully completes, suggesting that scaling model size continues to offer substantial benefits for complex, multi-step problems. Their work reveals that failures in longer tasks stem not from a lack of reasoning ability but from errors in execution, and they identify a curious ‘self-conditioning’ effect whereby LLMs become more prone to mistakes when the context includes their own previous errors. By isolating and measuring execution capability, and by showing that more recent “thinking” models avoid this self-conditioning, the team highlights the potential of scaling both model size and sequential test-time compute to overcome limitations in long-horizon tasks.

Horizon Length Limits Arithmetic Reliability in LLMs

This study comprehensively investigates how reliably large language models (LLMs) perform arithmetic tasks as complexity increases, focusing on accuracy decline with longer tasks, the impact of model size, and the benefits of chain-of-thought reasoning. The authors propose a theoretical framework linking task length to per-step accuracy, supported by experimental evidence, and carefully analyze potential error sources, including formatting failures and error accumulation. The core research question centers on how well LLMs execute multi-step arithmetic tasks as task length increases, and what factors influence this reliability. Models were tasked with multi-step addition problems, and performance was measured by task accuracy, horizon length, effective task length, and format-following failures, comparing several models from the Gemma and Qwen3 families with and without chain-of-thought reasoning.
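To make the setup concrete, here is a minimal sketch of how such an evaluation can be framed, assuming a simple running-sum formulation of the addition task; the function names and scoring choices are illustrative, not taken from the paper's code.

```python
import random

def make_addition_task(num_steps, low=1, high=99, seed=0):
    """Generate a multi-step addition task: at each turn the model must
    add the next operand to the running sum from the previous turn."""
    rng = random.Random(seed)
    operands = [rng.randint(low, high) for _ in range(num_steps)]
    running_sums, total = [], 0
    for x in operands:
        total += x
        running_sums.append(total)
    return operands, running_sums

def score_task(model_outputs, running_sums):
    """Return per-step accuracy, horizon length (steps completed before the
    first mistake), and task accuracy (1 only if every step is correct)."""
    horizon = 0
    for predicted, expected in zip(model_outputs, running_sums):
        if predicted != expected:
            break
        horizon += 1
    per_step = sum(p == e for p, e in zip(model_outputs, running_sums)) / len(running_sums)
    task_acc = int(horizon == len(running_sums))
    return per_step, horizon, task_acc
```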

The authors established a theoretical relationship between task length and per-step accuracy, demonstrating that even small improvements in accuracy can significantly extend reliable task length, and that such improvements become increasingly impactful as models approach near-perfect performance. Larger models consistently exhibit higher per-step accuracy and longer reliable task lengths, while chain-of-thought reasoning improves performance but is susceptible to error propagation. The paper presents extensive experimental results relating model size, reasoning approach, and task accuracy, analyzing error types including format failures, arithmetic mistakes, and reasoning errors, and a sensitivity analysis explores how changes in per-step accuracy affect achievable task length. Together, this provides a rigorous analysis of the factors influencing the reliability of LLMs in multi-step reasoning tasks: model scale, per-step accuracy, and error propagation.
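The sensitivity claim can be made concrete with a short calculation. If each step succeeds with a constant, independent per-step accuracy p (the assumption described later in this article) and the task counts as solved only when every step is correct, the horizon sustainable at a 50% success rate grows roughly like 1/(1 − p), so gains near perfect accuracy are disproportionately valuable. A small sketch:

```python
import math

def horizon_at_success_rate(p, s=0.5):
    """Longest task length H for which task accuracy p**H stays >= s,
    assuming a constant, independent per-step accuracy p."""
    return math.floor(math.log(s) / math.log(p))

for p in (0.90, 0.95, 0.99, 0.995, 0.999):
    print(f"per-step accuracy {p:.3f} -> ~{horizon_at_success_rate(p)} reliable steps")
# e.g. 0.99 -> ~68 steps, 0.999 -> ~692 steps: under one point of per-step
# accuracy buys roughly ten times the horizon.
```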

Long Tasks Reveal Scaling Benefits in LLMs

This study investigated how scaling large language models (LLMs) impacts their ability to complete increasingly complex tasks, focusing on the length of tasks successfully completed rather than single-step accuracy. Researchers observed that small gains in single-step accuracy can exponentially improve the length of a task an LLM can successfully navigate, and they isolated the models’ execution capability by providing both the necessary knowledge and a pre-defined plan. Experiments revealed that larger models could correctly execute significantly more steps, even when smaller models achieved 100% accuracy on individual steps. A key methodological innovation involved systematically introducing errors into the models’ execution histories to analyze a phenomenon termed “self-conditioning.”

The team discovered that per-step accuracy degrades as the number of steps increases, not solely due to context length limitations, but because models become more prone to errors when exposed to their own prior mistakes. Researchers carefully constructed task sequences, monitoring performance as models processed increasingly long histories of their own outputs. To quantify the relationship between step accuracy and task length, the team developed a mathematical proposition demonstrating that, assuming constant per-step accuracy, the achievable task success rate decays exponentially with the horizon length. They derived an equation predicting the horizon length at which a model will achieve a desired success rate, given a constant per-step accuracy, revealing that even minor gains in accuracy can lead to substantial increases in the length of tasks successfully completed. The team further defined Effective Task Length as the number of turns at which Task Accuracy drops to 0.5, providing a metric for quantifying the practical limits of model performance.
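Written out, the proposition takes a simple closed form; the notation below is an illustrative reading of the description above rather than the paper's exact statement.

```latex
\begin{align*}
  \mathrm{TaskAccuracy}(H) &= p^{H}
    && \text{(constant per-step accuracy } p\text{)} \\
  H(p, s) &= \frac{\ln s}{\ln p}
    && \text{(horizon length at target success rate } s\text{)} \\
  H_{0.5} &= \frac{\ln 0.5}{\ln p} \;\approx\; \frac{\ln 2}{1 - p}
    && \text{(Effective Task Length, for } p \text{ near } 1\text{)}
\end{align*}
```

The 1/(1 − p) scaling is what makes small accuracy gains so valuable: moving per-step accuracy from 99% to 99.9% lengthens the effective task length by roughly a factor of ten.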

Larger Models Excel at Extended Reasoning

This work demonstrates that scaling the size of large language models (LLMs) delivers exponential gains in their ability to execute long tasks, even when single-step accuracy improvements diminish. Researchers isolated the execution capability of LLMs by providing both necessary knowledge and a pre-defined plan, revealing that larger models can successfully execute significantly more turns. The study identified that the per-step accuracy of LLMs actually degrades as the task progresses, unlike human performance, and is not simply due to compounding errors. This degradation is a self-conditioning effect, where models become more prone to mistakes when conditioned on their own previous errors, confirmed by tests showing that increasing the error rate in the model’s history sharply reduces subsequent step accuracy.
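One way to picture the self-conditioning test is to re-run a step with a controlled fraction of the model's earlier answers corrupted in the prompt, then measure how accuracy on the next step changes with the injected error rate. The sketch below is hypothetical: `query_model` and the prompt format are placeholders, not the authors' harness.

```python
import random

def build_history(operands, running_sums, error_rate, rng):
    """Format an execution history, corrupting a fraction of the past
    answers to mimic the model's own mistakes appearing in its context."""
    lines = []
    for x, s in zip(operands, running_sums):
        if rng.random() < error_rate:
            s = s + rng.choice([-1, 1]) * rng.randint(1, 9)  # injected mistake
        lines.append(f"Add {x}. Running total: {s}")
    return "\n".join(lines)

def next_step_accuracy(query_model, operands, running_sums, next_operand,
                       error_rate, trials=50, seed=0):
    """Estimate accuracy on the next step, scored against the true running
    sum, as a function of how many errors appear in the shown history."""
    rng = random.Random(seed)
    target = str(running_sums[-1] + next_operand)
    correct = 0
    for _ in range(trials):
        prompt = build_history(operands, running_sums, error_rate, rng)
        prompt += f"\nAdd {next_operand}. Running total:"
        answer = query_model(prompt)  # placeholder for an actual LLM call
        correct += int(answer.strip() == target)
    return correct / trials
```

Sweeping the injected error rate upward and plotting the resulting next-step accuracy yields the kind of curve the self-conditioning analysis describes: if accuracy falls as injected errors rise, the degradation cannot be blamed on context length alone.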

Recent “thinking” models circumvent this self-conditioning, demonstrating a substantial improvement in long-task execution. Sequential test-time compute dramatically increases the length of tasks a model can complete in a single turn: for example, the DeepSeek V3 model, when augmented with thinking capabilities, can execute 200 steps, compared with failing after only two steps without this augmentation. Benchmarking frontier thinking models, the GPT-5 “Horizon” model achieved over 1,000 steps, significantly surpassing Claude-4-Sonnet at 432 steps, suggesting that continued investment in scaling compute and developing thinking models remains worthwhile.

Long Tasks Reveal Self-Conditioning in LLMs

This research demonstrates that increases in the size of large language models (LLMs) yield substantial gains in their ability to successfully complete lengthy tasks, even when smaller models already achieve near-perfect accuracy on individual steps. The team found that while LLMs can achieve high accuracy on individual steps, performance degrades as the number of steps increases.

👉 More information
🗞 The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
🧠 ArXiv: https://arxiv.org/abs/2509.09677

