Reinforcement Learning Improves Reasoning in Large Language Models via Guidance.

Reinforcement learning with verifiable rewards enables large language models to solve novel reasoning problems, with gains arising from both self-distillation and genuine capability gain across model scales. A new training algorithm, Guide, incorporates adaptive hints during training and improves generalisation, achieving up to a 4% performance increase on mathematical benchmarks.

The capacity of artificial intelligence systems to exhibit reasoning abilities remains a central challenge in the field, with current models often struggling to generalise beyond their training data. Recent research focuses on reinforcement learning with verifiable rewards (RLVR) as a method to enhance these capabilities, observing that performance improvements stem from both compressing existing knowledge and acquiring entirely new skills. A team from Scale AI, comprising Vaskar Nath, Elaine Lau, Anisha Gunjal, Manasi Sharma, Nikhil Baharte, and Sean Hendryx, investigates this phenomenon in detail, demonstrating that self-distillation plays a key role in enabling models to solve previously intractable problems. Their work, titled “Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models”, introduces a novel online training algorithm, termed Guide, which strategically incorporates contextual hints during training and then reduces reliance on those hints to improve generalisation across mathematical, scientific, and coding tasks. The researchers demonstrate improvements of up to 4 percentage points on established benchmarks using models ranging in size from 0.5 billion to 72 billion parameters.

Recent research demonstrates substantial improvements in the reasoning capabilities of large language models (LLMs) through a novel reinforcement learning framework, moving beyond simple performance gains to a genuine increase in problem-solving ability. This advancement signifies a model’s capacity to solve previously intractable problems in a single attempt, a crucial step towards more adaptable and robust artificial intelligence systems. The study focuses on enhancing LLMs’ ability to tackle complex reasoning tasks across diverse domains, including mathematics, science, and code.

The core of this work lies in Guide-GRPO, an online training algorithm that builds upon Group Relative Policy Optimisation (GRPO), a reinforcement learning method in which the model samples a group of candidate solutions for each problem and is updated according to how each candidate’s verifiable reward compares with the rest of its group. Guide-GRPO extends this by incorporating targeted guidance, strategically applied to problems where initial attempts consistently fail, thereby expanding the model’s problem-solving repertoire and fostering a deeper understanding of underlying principles. This prioritises the development of genuine reasoning skills, contrasting with methods focused solely on maximising performance on existing datasets.
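A minimal sketch of how such guided sampling might look in practice is shown below. The helper callables, the hint format, and the group size are illustrative assumptions, not the authors’ implementation.

```python
from typing import Callable, List, Tuple

def guided_rollouts(
    generate: Callable[[str], str],   # samples one completion for a prompt
    verify: Callable[[str], float],   # verifiable reward: 1.0 if correct, else 0.0
    prompt: str,
    hint: str,
    group_size: int = 8,
) -> Tuple[List[str], List[float], bool]:
    """Sample a group of solutions, falling back to a hinted prompt only when
    every unguided attempt fails. Names and hint format are illustrative."""
    # First pass: ordinary group sampling, as in standard GRPO.
    completions = [generate(prompt) for _ in range(group_size)]
    rewards = [verify(c) for c in completions]

    # Guidance is applied only when the whole group is wrong, i.e. the model
    # cannot currently solve this problem on its own.
    if max(rewards) == 0.0:
        hinted_prompt = f"{prompt}\n\nHint: {hint}"
        completions = [generate(hinted_prompt) for _ in range(group_size)]
        rewards = [verify(c) for c in completions]
        return completions, rewards, True

    return completions, rewards, False
```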

Researchers employed importance weighting, which reweights solutions generated with a hint according to how probable they are in the original, hint-free context, so that updates remain aligned with the conditions the model faces at test time. They also filtered out prompts where all solution attempts were uniformly correct or uniformly incorrect, concentrating training on more informative scenarios. This targeted approach ensures the model receives useful feedback and learns from its mistakes rather than simply memorising patterns, and its ability to identify and focus on challenging problems strengthens the learning process.
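Read mechanically, those two ideas might look something like the sketch below: groups whose rewards are all identical carry no group-relative signal and are dropped, and hinted completions are reweighted by the ratio of their probability without the hint to their probability with it. The tensor shapes, the clipping value, and the function name are assumptions for illustration, not the paper’s exact loss.

```python
from typing import Optional
import torch

def grpo_advantages_with_guidance(
    rewards: torch.Tensor,         # [group_size] verifiable rewards for one prompt
    logp_no_hint: torch.Tensor,    # [group_size] sequence log-probs without the hint
    logp_with_hint: torch.Tensor,  # [group_size] sequence log-probs with the hint
    used_hint: bool,
) -> Optional[torch.Tensor]:
    """Group-relative advantages with an off-policy correction for hinted samples.

    Illustrative sketch of prompt filtering and importance weighting,
    not the authors' exact objective.
    """
    # Prompt filtering: if every attempt is correct or every attempt is wrong,
    # the group-relative advantage is zero everywhere, so skip this prompt.
    if rewards.min() == rewards.max():
        return None

    # Standard GRPO-style advantage: normalise rewards within the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Importance weighting: hinted completions were sampled from the policy
    # conditioned on the hint, but the hint-free context is what is optimised,
    # so reweight by pi(y | x) / pi(y | x, hint), clipped for stability.
    if used_hint:
        weights = torch.exp(logp_no_hint - logp_with_hint).clamp(max=10.0)
        advantages = advantages * weights

    return advantages
```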

Careful ablation studies, where individual components of Guide-GRPO are systematically removed to assess their impact, provide a detailed understanding of its functionality and identify the key factors contributing to its success. These studies reveal that the adaptive guidance mechanism and the refined reinforcement learning framework are both crucial for achieving significant improvements in reasoning performance. Evaluating each component in isolation also gives researchers insight into the algorithm’s inner workings and highlights areas for further optimisation.

Theoretical analysis supports the algorithm’s learning efficiency, providing a formal justification for its effectiveness and demonstrating its potential for scalability. This analysis establishes a theoretical foundation for the observed improvements in reasoning performance and provides a framework for designing even more effective algorithms in the future.
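The analysis itself is not reproduced here, but the quantity at issue can be sketched. Under the assumptions used in the sketches above (hinted completions sampled from the policy conditioned on a hint h, updates targeted at the hint-free context), one plausible form of the corrected update is the standard importance-sampled policy gradient:

```latex
% Illustrative importance-weighted policy gradient for hinted rollouts;
% a sketch consistent with the description above, not the paper's exact statement.
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, h)}
    \left[
      \frac{\pi_\theta(y \mid x)}{\pi_\theta(y \mid x, h)}
      \, \hat{A}(x, y) \,
      \nabla_\theta \log \pi_\theta(y \mid x)
    \right]
```

Here the advantage term plays the role of the group-relative advantage from the earlier sketch, and the probability ratio shrinks the contribution of completions the model would have been unlikely to produce without the hint.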

The findings suggest that providing targeted guidance during training, coupled with a refined reinforcement learning framework, represents a promising avenue for enhancing the reasoning abilities of LLMs and unlocking their full potential. The ability to achieve significant improvements in reasoning performance across different model sizes and problem types demonstrates the robustness and versatility of the proposed approach.

Researchers highlight the importance of self-supervised learning as a complementary approach to reinforcement learning, enabling LLMs to acquire a broad range of knowledge and skills from unlabeled data. This combination of self-supervised and reinforcement learning allows LLMs to leverage both the vast amount of available data and the targeted feedback provided by the reinforcement learning framework.

Future research directions include exploring the use of more sophisticated guidance mechanisms, such as providing explanations or justifications for the correct answers, and developing algorithms that can automatically generate high-quality training data. These advancements will further enhance the effectiveness of reinforcement learning and unlock the full potential of LLMs for solving complex reasoning tasks.

Researchers are actively exploring the application of Guide-GRPO to a wide range of real-world problems, including scientific discovery, medical diagnosis, and financial modeling. These applications demonstrate the potential of LLMs to solve complex problems and contribute to a wide range of fields.

The team plans to release the code and data used in this research to the public, enabling other researchers to build upon their work and accelerate the development of AI. This commitment to open science fosters collaboration and innovation, and ensures that the benefits of AI are widely shared.

👉 More information
🗞 Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models
🧠 DOI: https://doi.org/10.48550/arXiv.2506.13923
