Reinforcement Learning on Pre-Training Data Enables LLM Capability Beyond Supervised Scaling

The limitations of continually scaling large language models with supervised learning alone are becoming increasingly apparent, as the availability of high-quality training data fails to keep pace with growing computational power. To overcome this challenge, Siheng Li, Kejiao Li, Zenan Xu, and colleagues introduce Reinforcement Learning on Pre-Training data (RLPT), a novel approach that allows language models to learn and improve directly from existing text. Unlike current methods that rely on human feedback to guide learning, RLPT extracts reward signals from the pre-training data itself, encouraging the model to predict subsequent text segments and refine its reasoning abilities. This technique not only improves performance across a range of benchmarks, including MMLU, GPQA-Diamond, and KOR-Bench, but also exhibits promising scaling behaviour, suggesting that even greater gains are possible with increased computing resources, and provides a strong foundation for further advances in reinforcement learning for language models.

Large Language Models and Reinforcement Learning

Current research overwhelmingly focuses on large language models, exploring their training, scaling, evaluation, and improvement, with a significant portion investigating the use of reinforcement learning to enhance reasoning, safety, and instruction following. Researchers are actively exploring how to effectively apply reinforcement learning in this context, concentrating on improving the ability of large language models to perform complex reasoning tasks, including mathematical problems, logical inference, and general knowledge reasoning. Considerable effort is dedicated to creating and improving benchmarks for evaluating large language model performance, especially in areas like reasoning, mathematics, and general knowledge, with a strong push for more robust and challenging evaluations. Research also explores how large language model performance scales with model size, data size, and compute, and how to optimise training and inference for efficiency.

Several papers address aligning large language models with human values and ensuring their safety, while some researchers are exploring building large language models that can act as agents, planning and executing tasks in an environment. Key concepts and techniques referenced include Reinforcement Learning from Human Feedback, Proximal Policy Optimization, and Chain-of-Thought prompting. Researchers are also investigating scaling laws, the empirical relationships between model size, data size, and performance, and exploring new architectures such as Mamba. Negative reinforcement, self-teaching, and data-constrained language models are also being explored. Specific benchmarks mentioned include MMLU, AMC/AIME, and GPQA, alongside models such as Kimi K2, DeepSeekMath, Qwen3, and Hunyuan-Turbos. Current trends and concerns include a push for more robust evaluation, challenges in aligning large language models with human values, and questions about the limits of scaling, with growing interest in methods that achieve good performance with limited data.

Reasoning Improvement via Reinforcement Learning on Data

Scientists have developed a new training paradigm, Reinforcement Learning on Pre-Training data (RLPT), to optimise large language models and overcome limitations in scaling computational resources and data. This work moves beyond traditional supervised learning approaches by enabling large language models to autonomously explore reasoning trajectories and learn from existing pre-training data through reinforcement learning, eliminating the need for human annotation, a significant constraint in existing reinforcement learning frameworks. The core of RLPT involves a next-segment reasoning objective, where the model predicts subsequent text segments based on preceding context, and rewards are assigned based on the semantic consistency between predicted and actual text. Researchers implemented two tasks, Autoregressive Segment Reasoning and Middle Segment Reasoning, to simultaneously optimise both text generation and in-context understanding abilities.
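To make the two tasks concrete, the sketch below shows one plausible way to derive Autoregressive Segment Reasoning (ASR) and Middle Segment Reasoning (MSR) examples from raw pre-training text. The sentence-level segmentation, prompt format, and function names are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative construction of ASR and MSR examples from raw text.
# Segmentation granularity, prompts, and names are assumptions for this sketch.

def split_into_segments(text: str, seg_len: int = 2) -> list[str]:
    """Naively segment a document into chunks of `seg_len` sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [". ".join(sentences[i:i + seg_len]) + "."
            for i in range(0, len(sentences), seg_len)]

def make_asr_example(segments: list[str], k: int) -> dict:
    """Autoregressive Segment Reasoning: predict segment k from the prefix."""
    return {"prompt": " ".join(segments[:k]), "target": segments[k]}

def make_msr_example(segments: list[str], k: int) -> dict:
    """Middle Segment Reasoning: predict segment k from both sides."""
    return {
        "prompt": (" ".join(segments[:k])
                   + " [MISSING SEGMENT] "
                   + " ".join(segments[k + 1:])),
        "target": segments[k],
    }

doc = ("Reinforcement learning can reuse pre-training text. "
       "The model predicts the next segment from context. "
       "A reward scores semantic consistency with the reference. "
       "Training then optimises the policy with that reward.")
segs = split_into_segments(doc)
print(make_asr_example(segs, 1))
print(make_msr_example(segs, 1))
```

In this reading, ASR exercises open-ended continuation from a prefix, while MSR forces the model to reconcile constraints from both sides of the gap, which is why the two tasks together target both generation and in-context understanding.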

Extensive experiments across general-domain and mathematical reasoning benchmarks demonstrate substantial improvements in large language model performance, with Qwen3-4B-Base yielding absolute gains of 3.0, 5.1, 8.1, 6.0, 6.6, and 5.3 on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. Further analysis reveals favourable scaling behaviour, with downstream performance following a predictable scaling law as training compute increases, suggesting potential for continued gains. The team also demonstrated that RLPT strengthens the reasoning capabilities of large language models, providing a solid foundation for other reinforcement learning techniques such as reinforcement learning with verifiable rewards (RLVR), and yielding additional performance improvements on AIME24 and AIME25. These results confirm that RLPT not only enhances large language model performance but also extends reasoning boundaries and improves overall capabilities.
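The reported scaling trend can be made concrete with a small curve-fitting sketch. The paper states that downstream performance follows a predictable scaling law in training compute; the saturating power-law form and the synthetic data points below are illustrative assumptions, not numbers or a functional form taken from the paper.

```python
# Hedged sketch: fitting a saturating power law, score(C) = a - b * C**(-c),
# to hypothetical (training compute, benchmark score) pairs. The functional
# form and all data here are assumptions made for illustration.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(compute, a, b, c):
    """Performance approaches ceiling `a` as training compute grows."""
    return a - b * np.power(compute, -c)

# Synthetic training-compute (arbitrary units) vs. benchmark-score points.
compute = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
score = np.array([55.0, 58.5, 61.0, 62.8, 64.0, 64.8])

params, _ = curve_fit(scaling_law, compute, score, p0=[70.0, 15.0, 0.5])
a, b, c = params
print(f"fitted ceiling a={a:.1f}, b={b:.1f}, exponent c={c:.2f}")
print(f"extrapolated score at 128x compute: {scaling_law(128.0, a, b, c):.1f}")
```

A fit of this kind is how one would turn the observed trend into a forecast of gains from additional compute, which is the sense in which the authors describe the scaling behaviour as predictable.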

Reinforcement Learning Scales with Pre-Training Data

This work introduces Reinforcement Learning on Pre-Training data, or RLPT, a novel training approach that enhances the capabilities of large language models. Unlike conventional methods that rely heavily on supervised learning or human-provided feedback, RLPT uses a self-supervised objective: the model predicts subsequent text segments, and reward signals are generated directly from the pre-training data. This eliminates the need for human annotation and allows reinforcement learning to be scaled effectively over vast quantities of unlabelled text. Extensive experimentation demonstrates that RLPT significantly improves performance on a range of benchmarks, including general knowledge and mathematical reasoning tasks, with models achieving substantial gains across multiple evaluations, indicating a robust and generalisable improvement in reasoning skills. Furthermore, the results suggest that performance continues to improve with increased computational resources, highlighting the potential for even greater advances. The authors acknowledge that while RLPT offers a strong foundation, further research is needed to fully explore its capabilities and optimise its performance, and they note its potential to enhance existing reinforcement learning strategies, such as those using verifiable rewards.
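A minimal sketch of the reward step follows, assuming a sentence-embedding similarity check as a stand-in for whatever semantic-consistency judge the authors actually use. The paper describes rewarding agreement between predicted and reference segments; the specific scorer, model name, and threshold below are assumptions.

```python
# Hedged sketch of RLPT's reward signal: score the semantic consistency
# between a predicted segment and the reference next segment. Cosine
# similarity of sentence embeddings (and the 0.7 threshold) is an
# illustrative assumption, not the authors' exact judge.
from sentence_transformers import SentenceTransformer, util

scorer = SentenceTransformer("all-MiniLM-L6-v2")

def segment_reward(predicted: str, reference: str,
                   threshold: float = 0.7) -> float:
    """Return 1.0 if the prediction is semantically consistent with the
    reference segment, else 0.0; a binary reward keeps RL training simple."""
    emb = scorer.encode([predicted, reference], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()
    return 1.0 if similarity >= threshold else 0.0

print(segment_reward(
    "Rewards come from the pre-training text itself, with no human labels.",
    "The reward signal is derived directly from pre-training data rather "
    "than from human annotation.",
))
```

Because the reference segment is simply the next span of the original document, every stretch of pre-training text yields a labelled reward target for free, which is what lets this form of reinforcement learning scale with unlabelled data.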

👉 More information
🗞 Reinforcement Learning on Pre-Training Data
🧠 ArXiv: https://arxiv.org/abs/2509.19249

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
