Researchers have long assumed that increasing the diversity of training data is key to improving the performance of large language models. Dawid J. Kopiczko of the University of Technology Nuremberg, Sagar Vaze of Mistral AI, and Tijmen Blankevoort of NVIDIA, working with colleagues at the University of Technology Nuremberg, challenge this notion in new research demonstrating the surprising benefits of data repetition during supervised fine-tuning. Their findings reveal that, under fixed computational constraints, training on a smaller, repeatedly presented dataset can significantly outperform training on a much larger dataset seen only once. Specifically, using the Olmo3-7B model, the team achieved improvements of 12-26 percentage points on the reasoning benchmarks AIME’24/25 and GPQA by training for 128 epochs on just 400 samples, compared to a single epoch on 51,200 samples. This work not only offers a more efficient and cost-effective approach to fine-tuning for reasoning tasks, but also introduces a striking observation: full memorisation of a dataset can coincide with improved generalisation, posing a new question for the machine learning community to explore.
New research demonstrates that, for fine-tuning language models, a surprising strategy yields significant gains: training on a comparatively small dataset for a substantially larger number of epochs outperforms training on a much larger dataset for a single epoch, provided the total computational effort remains constant.
This counterintuitive finding challenges conventional wisdom and suggests a more efficient pathway for refining these powerful artificial intelligence systems. The work centres on supervised fine-tuning (SFT), an essential post-training step for enhancing the reasoning capabilities of language models.
Researchers discovered that repeatedly exposing a language model to a curated dataset, even a limited one, dramatically improves its performance. Specifically, when training the Olmo3-7B model, 128 epochs on just 400 samples resulted in a performance increase of 12-26 percentage points compared to training on 51,200 samples for only one epoch.
This improvement was observed across benchmarks designed to test complex reasoning skills, such as AIME’24/25 and GPQA. The study reveals that training token accuracy (how reliably the model predicts the next token in a sequence) serves as a crucial indicator of when this repetition-based approach reaches its peak effectiveness.
The team found that performance gains plateau when the model achieves near-perfect memorization of the training data. This saturation point, consistently observed across various settings, suggests that full memorization coincides with improved generalisation. Importantly, this intensive training regime did not lead to catastrophic forgetting, a phenomenon where the model loses previously learned knowledge.
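The saturation signal described above can be sketched in a few lines. This is an illustrative reading of the finding, not the authors' exact criterion: the helper names and the 0.99 threshold below are our assumptions.

```python
# Sketch: training token accuracy as a signal that epoch scaling has saturated.
# Positions with the ignore label (prompt tokens) are excluded from the count.
IGNORE_INDEX = -100  # common convention for masked positions (assumption)

def token_accuracy(predictions: list[int], labels: list[int],
                   ignore: int = IGNORE_INDEX) -> float:
    """Fraction of non-masked positions where the model's top prediction
    matches the target token."""
    scored = [(p, l) for p, l in zip(predictions, labels) if l != ignore]
    correct = sum(p == l for p, l in scored)
    return correct / len(scored)

def saturated(accuracy: float, threshold: float = 0.99) -> bool:
    """Hypothetical stopping rule: stop adding epochs once training token
    accuracy is near-perfect, i.e. the data is essentially memorised."""
    return accuracy >= threshold
```

In this reading, one would monitor `token_accuracy` on the training set across epochs and stop scaling epochs once `saturated` returns true, rather than training to a fixed epoch count.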
These findings offer a practical approach to reasoning SFT, where scaling epochs with token accuracy can replace expensive and undirected data scaling. The research poses a new question for the machine learning community: understanding why repeated exposure to a limited dataset can enhance a language model’s ability to generalise and reason effectively.
Repeated exposure to limited data enhances language model accuracy beyond large single-epoch datasets
Olmo3-7B trained for 128 epochs on just 400 samples achieved a performance increase of 12-26 percentage points compared to training on 51,200 samples for only one epoch, maintaining the same total computational effort. This result demonstrates a counterintuitive finding: repetition in supervised fine-tuning benefits language model performance more than simply increasing the volume of training data.
The research establishes that, under a fixed update budget, prioritising more epochs with a smaller dataset consistently outperforms a single epoch with a much larger dataset. Specifically, with an update budget of 51,200, Olmo3-7B trained for 32 epochs on 1,600 samples reached an average accuracy of 39% across benchmarks, a substantial improvement over the 17% accuracy achieved by training on 51,200 samples for a single epoch.
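Because the paper trains with one update per sample (batch size one), a fixed update budget makes the epochs-versus-samples trade-off simple arithmetic: epochs × samples = total updates. A minimal sketch of the experimental grid this implies (the function name is ours, not from the paper):

```python
# Sketch: enumerate (samples, epochs) configurations under a fixed update
# budget, assuming one optimiser update per sample per epoch.
UPDATE_BUDGET = 51_200  # total optimisation steps, as reported in the paper

def epoch_grid(budget: int, sizes: list[int]) -> list[tuple[int, int]]:
    """Return (samples, epochs) pairs that exactly exhaust the budget."""
    return [(n, budget // n) for n in sizes if budget % n == 0]

configs = epoch_grid(UPDATE_BUDGET, [400, 1_600, 6_400, 25_600, 51_200])
# 400 samples -> 128 epochs, 1,600 samples -> 32 epochs, ...,
# 51,200 samples -> 1 epoch: every configuration costs the same compute.
```

Comparing points along this grid is what lets the study attribute performance differences to repetition rather than to extra compute.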
This pattern of improved performance with repeated exposure to a curated dataset was observed consistently across different benchmarks and models, with top performances clustered towards the higher end of an epochs-versus-samples pyramid. Gains from additional epochs typically plateaued around 32 to 64 epochs, indicating a point of saturation beyond which further repetition yielded limited benefit.
Further investigation using math-focused datasets distilled from both Qwen3-0.6B and Qwen3-8B teacher models confirmed the persistence of this repetition advantage. Even when using a weaker teacher model, increasing the update budget from 6,400 to 25,600 did not negate the benefits of epoch scaling, though performance did degrade slightly. Notably, training on incorrect reasoning traces did not harm performance, and in some cases even matched or exceeded results from training on correct examples, with peak performance on GPQA and AIME’24 achieved with these negative (incorrect) trajectories.
Computational Framework and Experimental Setup for Supervised Fine-tuning
A bfloat16 floating-point format and Unsloth optimised kernels underpinned the computational infrastructure for this work, enabling efficient training of large language models. The models studied, Qwen3-4B, Qwen3-8B, and Olmo3-7B, were loaded in this format as pretrained checkpoints without prior instruction tuning, establishing a consistent baseline for supervised fine-tuning (SFT) dynamics.
An 8-bit Adam optimizer, coupled with a cosine learning rate schedule, was implemented to refine model parameters during training. To isolate the impact of data repetition, a carefully constructed experimental grid was employed, varying both dataset size and the number of training epochs while maintaining a fixed update budget.
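The cosine schedule mentioned here can be sketched in pure Python. This is a generic cosine decay (warmup omitted) to show the shape of the schedule, not the authors' exact hyperparameters:

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float,
              min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    progress = step / max(total_steps, 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# The learning rate starts at its peak and smoothly decays to min_lr
# over the course of training, regardless of how the update budget is
# split between dataset size and epochs.
```

In practice the schedule would be driven by the total update count (51,200 in the main experiments), so the decay profile is identical across all grid configurations.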
Nested training splits were created, ranging from 200 to 51,200 samples, ensuring each smaller split was a subset of the next larger one, and a validation set of 1,000 samples was held out for analysis. This approach allowed for a direct comparison of configurations with equivalent total optimisation steps, effectively decoupling the effects of data scale from the benefits of repeated exposure.
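Nesting the splits amounts to taking prefixes of a single shuffled ordering, so each smaller split is contained in every larger one. A minimal sketch (the helper name and seed are ours):

```python
import random

def nested_splits(dataset: list, sizes: list[int],
                  seed: int = 0) -> dict[int, list]:
    """Build nested training splits: shuffle once, then take prefixes,
    so each smaller split is a subset of the next larger one."""
    rng = random.Random(seed)
    shuffled = dataset[:]
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sorted(sizes)}

data = list(range(60_000))  # stand-in for the SFT sample pool
splits = nested_splits(data, [200, 400, 1_600, 51_200])
assert set(splits[200]) <= set(splits[400])  # the nesting property
```

The nesting guarantees that a larger-dataset run differs from a smaller-dataset run only by the added samples, not by a different random draw.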
Training proceeded with a batch size of one, a configuration supported by recent findings indicating comparable or superior per-token performance with smaller batch sizes. Input prompts were masked, and cross-entropy loss was calculated solely on response tokens, focusing the optimisation process on generating accurate and coherent continuations.
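Response-only loss is typically implemented by masking the prompt positions in the label sequence. The sketch below uses the widespread convention (e.g. in Hugging Face Transformers) of an ignore index of -100; the paper's exact implementation may differ:

```python
# Sketch: mask prompt tokens so cross-entropy is computed only on the
# response. Labels set to IGNORE_INDEX contribute nothing to the loss.
IGNORE_INDEX = -100

def build_labels(prompt_ids: list[int],
                 response_ids: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate prompt and response token ids, masking the prompt
    portion of the label sequence."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids[:]
    return input_ids, labels

ids, labels = build_labels([1, 2, 3], [4, 5])
# ids    -> [1, 2, 3, 4, 5]
# labels -> [-100, -100, -100, 4, 5]
```

With labels built this way, gradient signal comes exclusively from how well the model continues the response, which is what "cross-entropy loss calculated solely on response tokens" refers to.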
A learning rate sweep was conducted using a 1-epoch, 51,200-sample configuration to identify the optimal learning rate for each model, which was then consistently applied across all subsequent training runs. Each configuration was executed on a single H100 94GB GPU for a maximum duration of 24 hours.
The Bigger Picture
Scientists have long assumed that in machine learning, more data invariably equates to better performance. Recent work challenges this deeply held belief, demonstrating that for fine-tuning large language models, a surprising principle applies: smaller, carefully curated datasets, repeatedly processed, can significantly outperform massive datasets used only once.
This isn’t merely a marginal gain: the research reveals performance increases of between 12 and 26 percentage points when training focused on repetition rather than scale. The difficulty lies in efficiently harnessing the power of these models. Simply throwing more data at the problem has become the standard approach, but it is computationally expensive and doesn’t necessarily yield proportional improvements.
This new finding suggests a shift in focus, towards optimising how data is used, not just how much. The implications are considerable, potentially lowering the barrier to entry for researchers and developers lacking access to vast computational resources. However, the study hinges on the concept of ‘memorisation’ and identifying the point at which repetition yields diminishing returns.
Accurately gauging this saturation point is crucial, and further research is needed to refine the metrics used to determine it. Moreover, the work focuses on reasoning tasks; whether this repetition advantage extends to other areas of language model application remains an open question. Looking ahead, we might see a move away from indiscriminate data collection towards more targeted dataset curation, coupled with adaptive training regimes that prioritise repeated exposure and intelligent stopping criteria. This could herald a new era of efficient and accessible machine learning.
👉 More information
🗞 Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning
🧠 ArXiv: https://arxiv.org/abs/2602.11149
