The challenge of balancing exploration and exploitation lies at the heart of reinforcement learning, and recent work on reinforcement learning with verifiable rewards uses this framework to improve the reasoning abilities of large language models. Peter Chen from Columbia, alongside Xiaopeng Li, Ziniu Li, and colleagues, investigates this trade-off, revealing surprising insights into how seemingly counterintuitive techniques can boost performance. The team demonstrates that both discouraging exploration and discouraging exploitation, through methods such as entropy minimisation and the introduction of spurious rewards, can paradoxically improve reasoning capabilities. Their research clarifies that the clipping bias introduced under spurious reward schemes reduces uncertainty in the model’s outputs, and that this effect is more consequential than simply minimising entropy alone. These findings explain why spurious rewards can be beneficial and offer principles for training more effective reasoning systems.
Hypergeometric and Binomial Distribution Analysis
This document presents a rigorous mathematical analysis of the random variables that arise in sampling-based optimisation algorithms, using properties of the hypergeometric and binomial distributions to prove guarantees about algorithm behaviour. The work investigates how the variance and expected value of these random variables change under specific conditions. The core of the analysis is a detailed proof concerning the variance of a random variable, denoted ∆, and how that variance behaves when comparing two binomial random variables, f and g.
The analysis demonstrates that the variance of ∆ is smaller when f exceeds g than when g exceeds f. The proof expresses ∆ in terms of f, g, and other variables, and leverages properties of both the binomial and hypergeometric distributions. To carry out the comparison, the researchers introduce a hypergeometric random variable Z and compute the conditional variance of ∆ given Z, showing how it depends on the value of Z; combining these conditional variances establishes the claimed inequality between the two cases.
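This summary does not reproduce the exact construction of ∆, so the identities below are only the standard distributional facts that an argument of this shape relies on: the binomial and hypergeometric variances, and the law of total variance used when conditioning on Z.

```latex
% Standard identities only; the paper's precise definition of \Delta in terms
% of f, g, and Z is not reproduced in this summary.
\[
  f \sim \mathrm{Bin}(n, p) \;\Rightarrow\; \operatorname{Var}(f) = n\,p\,(1 - p),
\]
\[
  Z \sim \mathrm{Hypergeom}(N, K, n) \;\Rightarrow\;
  \operatorname{Var}(Z) = n\,\frac{K}{N}\Bigl(1 - \frac{K}{N}\Bigr)\frac{N - n}{N - 1},
\]
\[
  \operatorname{Var}(\Delta) = \mathbb{E}\bigl[\operatorname{Var}(\Delta \mid Z)\bigr]
    + \operatorname{Var}\bigl(\mathbb{E}[\Delta \mid Z]\bigr).
\]
```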
Ultimately, this work provides a theoretical foundation for optimising algorithms by understanding how variance is affected by specific conditions. The research highlights the crucial role of the hypergeometric distribution in modelling sampling and comparison processes. By analysing conditional variance, scientists derive precise results that can inform algorithm design and improve performance.
Spurious Rewards and Policy Optimisation in RLVR
This research pioneers a detailed investigation into how clipping, policy entropy, and spurious rewards interact within reinforcement learning with verifiable rewards (RLVR), a technique for enhancing the reasoning capabilities of large language models. The study employs Group Relative Policy Optimization (GRPO) as its core reinforcement learning method, benefiting from its computational efficiency and reduced memory requirements. Researchers designed experiments within a Markov decision process framework, focusing on outcome-level rewards verified only at the completion of extended sequences, a departure from traditional reward structures. To rigorously assess the impact of spurious rewards, the team implemented a controlled experimental setup across multiple language model families, including Qwen-Math, Llama, and QwQ, encompassing models of 7, 8, and 32 billion parameters, and both base and distilled variants.
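For readers unfamiliar with GRPO, the sketch below illustrates its two core ingredients as described above: group-relative advantages computed from outcome-level rewards, and a clipped surrogate objective. It is a simplified, sequence-level sketch under assumed defaults (group size of four, ε = 0.2), not the implementation used in the study.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalise each completion's outcome reward
    by the mean and standard deviation of its group (one prompt, G samples)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped objective used by GRPO: the probability ratio is
    clipped to [1 - eps, 1 + eps], capping how far one update can move."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()  # quantity to maximise

# Toy usage: one prompt, G = 4 sampled completions, rewards verified only
# at the end of each sequence (outcome-level rewards).
rewards = [1.0, 0.0, 0.0, 1.0]
adv = grpo_advantages(rewards)
logp_old = np.array([-2.0, -1.5, -3.0, -2.5])   # sequence log-probs, old policy
logp_new = np.array([-1.8, -1.6, -3.2, -2.0])   # sequence log-probs, new policy
print(clipped_surrogate(logp_new, logp_old, adv))
```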
They systematically introduced random, misaligned rewards to explore their effects on model performance, carefully contrasting the results with those obtained using accurate, verifiable rewards. The study quantifies the influence of clipping bias, the systematic effect by which large changes in token probabilities are suppressed during policy updates, and its relationship to policy entropy, a measure of the randomness or determinism of the model’s actions. The researchers developed a novel one-step policy-entropy shift formulation to precisely capture the connection between clipping and policy entropy, demonstrating that clipping systematically reduces entropy and drives the policy toward more deterministic, higher-confidence outputs. This analysis revealed that while clipping alone does not constitute a meaningful learning signal under spurious rewards, its effect on reducing entropy is crucial for the observed performance gains. The team’s findings overturn the prevailing view that improvements under spurious rewards are limited to potentially contaminated Qwen-Math models, demonstrating robust gains across diverse model families and sizes, and revealing a nuanced exploration-exploitation dynamic unique to RLVR.
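The numerical sketch below illustrates the clipping effect just described, under an assumed ε of 0.2: once a sample's probability ratio leaves the clip range in the direction favoured by its advantage, the clipped term goes flat and the gradient through that sample vanishes. It is an illustration of the mechanism only, not the paper's one-step policy-entropy shift formulation.

```python
import numpy as np

def clipped_term(ratio, advantage, eps=0.2):
    """Per-sample contribution to the clipped surrogate objective."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - eps, 1 + eps) * advantage)

ratios = np.array([0.7, 0.95, 1.0, 1.1, 1.4])   # new/old probability ratios
advantage = 1.0                                  # a positively rewarded sample

for r in ratios:
    # Finite-difference check: does increasing the ratio still move the objective?
    grad = (clipped_term(r + 1e-4, advantage) - clipped_term(r, advantage)) / 1e-4
    print(f"ratio={r:.2f}  term={clipped_term(r, advantage):.3f}  d(term)/d(ratio)~{grad:.2f}")

# Ratios above 1 + eps contribute no gradient for positive advantages: further
# upweighting of already-favoured tokens is cut off, which is the kind of
# suppression of large probability moves that the study calls clipping bias.
```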
Entropy Reduction Drives Language Model Improvement
Scientists have achieved a deeper understanding of how large language models learn through reinforcement learning with verifiable rewards (RLVR), a technique used to improve mathematical reasoning. This work investigates the seemingly paradoxical observation that both discouraging exploration and discouraging exploitation can improve performance, a dynamic with no clear counterpart in classical reinforcement learning. Experiments demonstrate that clipping bias under spurious rewards does not directly improve performance as a learning signal; instead, it systematically reduces policy entropy. This reduction in entropy drives the language model toward more deterministic and confident outputs, effectively mimicking the effects of entropy minimisation.
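As a concrete picture of what falling policy entropy means, the sketch below computes the entropy of a toy next-token distribution and shows how sharpening it (here simply by lowering a softmax temperature) reduces entropy while raising the probability of the top token. This is only an illustration of "more deterministic and confident outputs", not the measurement procedure used in the study.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.0])      # toy logits over four tokens

for t in (1.0, 0.5, 0.25):                   # lower temperature = sharper policy
    p = softmax(logits, temperature=t)
    print(f"T={t:<4}  entropy={entropy(p):.3f}  top-token prob={p.max():.3f}")

# As entropy falls, probability mass concentrates on the preferred token:
# the more deterministic, higher-confidence behaviour described above.
```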
Further analysis across multiple language model families, Qwen-Math, Llama, and QwQ, with sizes ranging from 7 billion to 32 billion parameters, reveals that improvements from spurious rewards are robust and not limited to specific models or datasets, overturning previous assumptions that the gains were confined to potentially contaminated models. Crucially, the team demonstrated that these gains do not come from clipping bias acting as a direct learning signal, but from its systematic effect on policy entropy. The research establishes a novel one-step policy-entropy shift formulation, providing a deterministic link between clipping and policy entropy and clarifying the mechanisms behind spurious-reward benefits in RLVR. These findings reveal a nuanced exploration-exploitation dynamic unique to RLVR, offering principles for more effective training and improved reasoning in language models.
Spurious Rewards Enhance Language Model Confidence
This research clarifies the mechanisms underlying reinforcement learning with verifiable rewards, a technique used to enhance the reasoning capabilities of large language models. Scientists investigated how the balance between exploration and exploitation affects performance, focusing on the seemingly counterintuitive benefits of discouraging both. Results demonstrate that spurious rewards, despite being unrelated to correct answers, can improve performance in stronger language models by reducing policy entropy, essentially making the model more confident in its responses. This effect stems from clipping bias, which regulates the model’s confidence, and is distinct from contamination of the training data.
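To make the contrast between verifiable and spurious rewards concrete, here is a hypothetical pair of reward functions of the kind such experiments compare: one checks the model's final answer against a verified solution, the other returns a random reward carrying no information about correctness. The study's actual reward schemes are not reproduced here.

```python
import random

def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Outcome-level reward: 1 if the final answer matches the verified solution."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def spurious_reward(model_answer: str, ground_truth: str, p: float = 0.5) -> float:
    """Random reward, independent of correctness: a spurious training signal."""
    return 1.0 if random.random() < p else 0.0

# Both functions share a scale and interface, so the training loop is unchanged;
# only the information content of the reward signal differs.
print(verifiable_reward("42", "42"), spurious_reward("42", "41"))
```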
The study confirms that random rewards can indeed enhance model performance, but this benefit is contingent on the strength of the language model itself, with stronger models realising gains while weaker models become unstable. Researchers acknowledge that the observed benefits are not universal and depend on the specific characteristics of the language model being trained. Future work should focus on further disentangling the complex interplay between exploration and exploitation to refine alignment dynamics and improve the effectiveness of reinforcement learning techniques for large language models.
👉 More information
🗞 Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
🧠 ArXiv: https://arxiv.org/abs/2512.16912
