Long Horizon Execution in LLMs Shows Exponential Gains from Scaling, Even at 100% Single-Turn Accuracy

The question of whether continually increasing the size of large language models (LLMs) delivers diminishing returns remains central to artificial intelligence research, and a team led by Akshit Sinha from the University of Cambridge, Arvindh Arun from the Institute for AI at the University of Stuttgart, and Shashwat Goel from the Max Planck Institute for Intelligent Systems now presents compelling evidence that this is not necessarily the case. The researchers demonstrate that even small gains in single-step accuracy can compound into exponential improvements in the length of tasks an LLM successfully completes, suggesting that scaling model size continues to offer substantial benefits for complex, multi-step problems. Their work reveals that failures in longer tasks stem not from a lack of reasoning ability but from errors in execution, and they identify a curious ‘self-conditioning’ effect whereby LLMs become more prone to mistakes when the context includes their own previous errors. By isolating and measuring execution capability, and by showing that more recent “thinking” models avoid this self-conditioning, the team highlights the potential of scaling both model size and sequential test-time compute to overcome limitations in long-horizon tasks.

Horizon Length Limits Arithmetic Reliability in LLMs

This study comprehensively investigates how reliably large language models (LLMs) perform arithmetic tasks as complexity increases, focusing on accuracy decline with longer tasks, the impact of model size, and the benefits of chain-of-thought reasoning. The authors propose a theoretical framework linking task length to per-step accuracy, supported by experimental evidence, and carefully analyze potential error sources, including formatting failures and error accumulation. The core research question centers on how well LLMs execute multi-step arithmetic tasks as task length increases, and what factors influence this reliability. Models were tasked with multi-step addition problems, and performance was measured by task accuracy, horizon length, effective task length, and format-following failures, comparing several models from the Gemma and Qwen3 families with and without chain-of-thought reasoning.
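To make the setup concrete, here is a minimal sketch of how such an evaluation can be framed, assuming a simple running-sum formulation of the addition task; the function names and scoring choices are illustrative, not taken from the paper's code.

```python
import random

def make_addition_task(num_steps, low=1, high=99, seed=0):
    """Generate a multi-step addition task: at each turn the model must
    add the next operand to the running sum from the previous turn."""
    rng = random.Random(seed)
    operands = [rng.randint(low, high) for _ in range(num_steps)]
    running_sums, total = [], 0
    for x in operands:
        total += x
        running_sums.append(total)
    return operands, running_sums

def score_task(model_outputs, running_sums):
    """Return per-step accuracy, horizon length (steps completed before the
    first mistake), and task accuracy (1 only if every step is correct)."""
    horizon = 0
    for predicted, expected in zip(model_outputs, running_sums):
        if predicted != expected:
            break
        horizon += 1
    per_step = sum(p == e for p, e in zip(model_outputs, running_sums)) / len(running_sums)
    task_acc = int(horizon == len(running_sums))
    return per_step, horizon, task_acc
```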

The authors established a theoretical relationship between task length and per-step accuracy, demonstrating that even small improvements in accuracy can significantly extend reliable task length, and that such improvements become increasingly impactful as models approach near-perfect performance. Larger models consistently exhibit higher per-step accuracy and longer reliable task lengths, while chain-of-thought reasoning improves performance but is susceptible to error propagation. The paper presents extensive experimental results relating model size, reasoning approach, and task accuracy, analyzing error types including format failures, arithmetic mistakes, and reasoning errors, and a sensitivity analysis explores how changes in per-step accuracy affect achievable task length. Together, this provides a rigorous analysis of the factors influencing the reliability of LLMs in multi-step reasoning tasks: model scale, per-step accuracy, and error propagation.
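The sensitivity claim can be made concrete with a short calculation. If each step succeeds with a constant, independent per-step accuracy p (the assumption described later in this article) and the task counts as solved only when every step is correct, the horizon sustainable at a 50% success rate grows roughly like 1/(1 − p), so gains near perfect accuracy are disproportionately valuable. A small sketch:

```python
import math

def horizon_at_success_rate(p, s=0.5):
    """Longest task length H for which task accuracy p**H stays >= s,
    assuming a constant, independent per-step accuracy p."""
    return math.floor(math.log(s) / math.log(p))

for p in (0.90, 0.95, 0.99, 0.995, 0.999):
    print(f"per-step accuracy {p:.3f} -> ~{horizon_at_success_rate(p)} reliable steps")
# e.g. 0.99 -> ~68 steps, 0.999 -> ~692 steps: under one point of per-step
# accuracy buys roughly ten times the horizon.
```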

Long Tasks Reveal Scaling Benefits in LLMs

This study investigated how scaling large language models (LLMs) impacts their ability to complete increasingly complex tasks, focusing on the length of tasks successfully completed rather than single-step accuracy. Researchers observed that small gains in single-step accuracy can exponentially improve the length of a task an LLM can successfully navigate, and they isolated the models’ execution capability by providing both the necessary knowledge and a pre-defined plan. Experiments revealed that larger models could correctly execute significantly more steps, even when smaller models achieved 100% accuracy on individual steps. A key methodological innovation involved systematically introducing errors into the models’ execution histories to analyze a phenomenon termed “self-conditioning.”

The team discovered that per-step accuracy degrades as the number of steps increases, not solely due to context length limitations, but because models become more prone to errors when exposed to their own prior mistakes. Researchers carefully constructed task sequences, monitoring performance as models processed increasingly long histories of their own outputs. To quantify the relationship between step accuracy and task length, the team developed a mathematical proposition demonstrating that, assuming constant per-step accuracy, the achievable task success rate decays exponentially with the horizon length. They derived an equation predicting the horizon length at which a model will achieve a desired success rate, given a constant per-step accuracy, revealing that even minor gains in accuracy can lead to substantial increases in the length of tasks successfully completed. The team further defined Effective Task Length as the number of turns at which Task Accuracy drops to 0.5, providing a metric for quantifying the practical limits of model performance.
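Written out, the proposition takes a simple closed form; the notation below is an illustrative reading of the description above rather than the paper's exact statement.

```latex
\begin{align*}
  \mathrm{TaskAccuracy}(H) &= p^{H}
    && \text{(constant per-step accuracy } p\text{)} \\
  H(p, s) &= \frac{\ln s}{\ln p}
    && \text{(horizon length at target success rate } s\text{)} \\
  H_{0.5} &= \frac{\ln 0.5}{\ln p} \;\approx\; \frac{\ln 2}{1 - p}
    && \text{(Effective Task Length, for } p \text{ near } 1\text{)}
\end{align*}
```

The 1/(1 − p) scaling is what makes small accuracy gains so valuable: moving per-step accuracy from 99% to 99.9% lengthens the effective task length by roughly a factor of ten.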

Larger Models Excel at Extended Reasoning

This work demonstrates that scaling the size of large language models (LLMs) delivers exponential gains in their ability to execute long tasks, even when single-step accuracy improvements diminish. Researchers isolated the execution capability of LLMs by providing both necessary knowledge and a pre-defined plan, revealing that larger models can successfully execute significantly more turns. The study identified that the per-step accuracy of LLMs actually degrades as the task progresses, unlike human performance, and is not simply due to compounding errors. This degradation is a self-conditioning effect, where models become more prone to mistakes when conditioned on their own previous errors, confirmed by tests showing that increasing the error rate in the model’s history sharply reduces subsequent step accuracy.
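One way to picture the self-conditioning test is to re-run a step with a controlled fraction of the model's earlier answers corrupted in the prompt, then measure how accuracy on the next step changes with the injected error rate. The sketch below is hypothetical: `query_model` and the prompt format are placeholders, not the authors' harness.

```python
import random

def build_history(operands, running_sums, error_rate, rng):
    """Format an execution history, corrupting a fraction of the past
    answers to mimic the model's own mistakes appearing in its context."""
    lines = []
    for x, s in zip(operands, running_sums):
        if rng.random() < error_rate:
            s = s + rng.choice([-1, 1]) * rng.randint(1, 9)  # injected mistake
        lines.append(f"Add {x}. Running total: {s}")
    return "\n".join(lines)

def next_step_accuracy(query_model, operands, running_sums, next_operand,
                       error_rate, trials=50, seed=0):
    """Estimate accuracy on the next step, scored against the true running
    sum, as a function of how many errors appear in the shown history."""
    rng = random.Random(seed)
    target = str(running_sums[-1] + next_operand)
    correct = 0
    for _ in range(trials):
        prompt = build_history(operands, running_sums, error_rate, rng)
        prompt += f"\nAdd {next_operand}. Running total:"
        answer = query_model(prompt)  # placeholder for an actual LLM call
        correct += int(answer.strip() == target)
    return correct / trials
```

Sweeping the injected error rate upward and plotting the resulting next-step accuracy yields the kind of curve the self-conditioning analysis describes: if accuracy falls as injected errors rise, the degradation cannot be blamed on context length alone.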

Recent “thinking” models circumvent this self-conditioning, demonstrating a substantial improvement in long-task execution. Sequential test-time compute dramatically increases the length of tasks a model can complete in a single turn: for example, the DeepSeek V3 model, when augmented with thinking capabilities, can execute 200 steps, compared with failing after only two steps without this augmentation. Benchmarking frontier thinking models, the GPT-5 “Horizon” model achieved over 1,000 steps, significantly surpassing Claude-4-Sonnet at 432 steps, suggesting that continued investment in scaling compute and developing thinking models remains worthwhile.

Long Tasks Reveal Self-Conditioning in LLMs

This research demonstrates that increases in the size of large language models (LLMs) yield substantial gains in their ability to successfully complete lengthy tasks, even when smaller models already achieve near-perfect accuracy on individual steps. The team found that while LLMs can achieve high accuracy on individual steps, performance degrades as the number of steps increases.

👉 More information
🗞 The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
🧠 ArXiv: https://arxiv.org/abs/2509.09677

