Efficiently training large language models to reason effectively is a persistent challenge, and new research into exploration strategies from Runpeng Dai, Linfeng Song, and Haolin Liu offers a novel reinforcement learning approach. Current methods often suffer from poor exploration, which leads to limited improvement and predictable outputs. This team introduces Curiosity-Driven Exploration, a framework that harnesses the model’s own intrinsic curiosity to guide the learning process. By formalising curiosity through signals derived from both the model’s response generation and its evaluation of outcomes, the researchers construct an exploration bonus that encourages diverse and accurate responses, yielding approximately a three-point improvement over standard methods on challenging reasoning benchmarks. This work not only advances the performance of language models but also identifies a critical calibration issue within reinforcement learning, offering valuable insight into common failure modes and paving the way for more robust and reliable artificial intelligence.
Linear MDPs and Multi-Head Critics for Reinforcement Learning
This research focuses on improving how reinforcement learning agents explore their environment, particularly in complex tasks where rewards are sparse or delayed. The scientists developed a method for estimating an exploration bonus, a signal that encourages the agent to try new things, within the framework of linear Markov Decision Processes. The approach combines bootstrapping, a resampling technique, with a multi-head critic, which uses multiple value estimators to quantify uncertainty; the disagreement among heads yields a more accurate assessment of the bonus and guides the agent towards more effective exploration. The team provides theoretical guarantees, proving that the bonus estimate converges to the true value under specific conditions, which validates the method and establishes its potential for broader application. The work builds on established concepts in machine learning and statistics, simplifying the learning problem by representing rewards and transitions linearly and using ridge regression to prevent overfitting. By carefully combining these techniques, the team addresses the uncertainty in the exploration bonus, leading to more robust and efficient reinforcement learning.
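A minimal sketch of this idea, under the assumption that state-action pairs are represented by feature vectors in a linear MDP: each critic head is a ridge regression fit to a bootstrap resample of the observed data, and the spread of the heads' predictions at a query point serves as the exploration bonus. Function names such as `ridge_fit` and `multi_head_bonus` are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit(Phi, targets, lam=1.0):
    """Ridge-regression value head: w = (Phi^T Phi + lam * I)^(-1) Phi^T y."""
    d = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(d)
    return np.linalg.solve(A, Phi.T @ targets)

def multi_head_bonus(Phi, targets, phi_query, n_heads=8, lam=1.0):
    """Exploration bonus: std of predictions from bootstrapped ridge heads.

    High disagreement among heads signals high uncertainty about the value
    at phi_query, so the agent is encouraged to explore there.
    """
    n = Phi.shape[0]
    preds = []
    for _ in range(n_heads):
        idx = rng.integers(0, n, size=n)            # bootstrap resample
        w = ridge_fit(Phi[idx], targets[idx], lam)  # fit one value head
        preds.append(phi_query @ w)
    return float(np.std(preds))

# Toy usage: 200 visited state-action features in R^16 and a novel query point.
Phi = rng.normal(size=(200, 16))
targets = Phi @ rng.normal(size=16) + 0.1 * rng.normal(size=200)
print(multi_head_bonus(Phi, targets, phi_query=rng.normal(size=16)))
```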
Curiosity Drives Enhanced Reasoning in Language Models
This research presents a breakthrough in enhancing the reasoning abilities of large language models through a novel reinforcement learning framework called Curiosity-Driven Exploration. Scientists addressed the common problem of premature convergence and loss of diversity in LLMs by leveraging intrinsic curiosity to guide the exploration process. The work formalizes curiosity using signals from both the actor and critic components of the reinforcement learning system: the actor’s curiosity is measured by the perplexity of its generated responses, while the critic’s curiosity is captured by the variance of value estimates across a multi-head architecture.
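To make the two signals concrete, here is a minimal sketch, assuming access to per-token log-probabilities from the actor and per-head value estimates from the critic; the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def actor_perplexity_bonus(token_logprobs):
    """Actor-side curiosity: perplexity of a sampled response,
    i.e. exp of the average negative log-likelihood of its tokens."""
    return float(np.exp(-np.mean(token_logprobs)))

def critic_variance_bonus(head_values):
    """Critic-side curiosity: dispersion of value estimates across
    the heads of a multi-head critic for the same prompt/response."""
    return float(np.std(head_values))

# Toy usage with made-up numbers.
token_logprobs = np.log([0.9, 0.4, 0.7, 0.2])  # per-token probabilities of one response
head_values = [0.55, 0.40, 0.72, 0.61]          # predictions from four critic heads
print(actor_perplexity_bonus(token_logprobs), critic_variance_bonus(head_values))
```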
Experiments demonstrate that incorporating these curiosity signals as exploration bonuses within the reinforcement learning framework yields significant improvements on mathematical reasoning benchmarks, with approximately a three-point gain on the AIME benchmarks. Detailed analysis reveals that the actor-wise bonus penalizes overconfident errors and promotes diversity among correct responses, while the critic-wise bonus aligns with established count-based exploration techniques. Further investigation identified a calibration collapse mechanism within standard reinforcement learning for LLMs, shedding light on common failure modes. The team also showed that the standard deviation across multi-head critics is a consistent estimator of a pseudo-count exploration bonus, and validated this connection empirically.
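One plausible way to fold such signals into training is to add them as weighted bonuses on top of the verifiable outcome reward, with coefficients that decay over training; the coefficients and decay schedule below are illustrative assumptions, not values from the paper.

```python
def shaped_reward(verifiable_reward, actor_bonus, critic_bonus,
                  beta_actor=0.1, beta_critic=0.1, step=0, decay=1e-3):
    """Verifiable (outcome) reward plus decaying curiosity bonuses,
    so exploration pressure fades as training progresses."""
    scale = 1.0 / (1.0 + decay * step)
    return verifiable_reward + scale * (beta_actor * actor_bonus
                                        + beta_critic * critic_bonus)

# Toy usage: a correct answer (reward 1.0) with modest curiosity signals.
print(shaped_reward(1.0, actor_bonus=1.8, critic_bonus=0.12, step=500))
```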
Curiosity Drives Improved Mathematical Reasoning Performance
Curiosity-Driven Exploration represents a significant advance in reinforcement learning with verifiable rewards, a technique used to improve the reasoning capabilities of large language models. Researchers developed a framework that incorporates intrinsic curiosity, derived from both the model’s actor and critic components, to guide the learning process. By using perplexity and value variance as exploration bonuses, the method encourages more diverse and rigorous reasoning, addressing the common problem of premature convergence and loss of diversity in standard reinforcement learning approaches. Empirical results demonstrate that this approach achieves approximately a three-point improvement over existing methods on challenging mathematical reasoning benchmarks.
Furthermore, analysis revealed a calibration collapse mechanism within reinforcement learning with verifiable rewards, offering new insight into the causes of hallucination in large language models. The team hypothesizes that this collapse stems from a reward design that prioritizes correct final outcomes over the quality of intermediate reasoning steps, and they demonstrate that incorporating alternative reward structures, such as a perplexity-based bonus, can improve performance. This work establishes a lightweight and effective method for enhancing agent learning and sheds light on fundamental challenges in language model reasoning.
👉 More information
🗞 CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models
🧠 ArXiv: https://arxiv.org/abs/2509.09675
