RLVR Provably Learns Off the Principals, Localizing Updates to Preferred Regions Under a KL Constraint

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) consistently enhance the reasoning capabilities of large language models, yet the mechanism behind this improvement remains surprisingly elusive. Hanqing Zhu, Zhenyu Zhang, and Hanxian Huang, alongside DiJia Su, Zechun Liu, and Jiawei Zhao, investigate this phenomenon and show that the apparent sparsity of parameter updates is not the full story. Their work reveals a persistent bias in how RLVR optimizes models: updates consistently favor specific regions of parameter space, achieving gains with minimal disruption to the model’s core structure. The study provides the first detailed, parameter-level characterization of RLVR’s learning dynamics, draws a clear distinction between its optimization regime and that of traditional supervised fine-tuning, and paves the way for more effective, geometry-aware learning algorithms designed specifically for reinforcement learning.

RLVR Preserves LLM Internal Representations

This research investigates how large language models (LLMs) learn during reinforcement learning (RL) fine-tuning, particularly under Reinforcement Learning with Verifiable Rewards (RLVR). Contrary to expectations, the findings demonstrate that RL primarily adjusts the strength of existing connections rather than drastically reshaping the LLM’s internal structure. The LLM’s core knowledge and underlying structure remain remarkably stable throughout the learning process, with the distribution of weight importance staying consistent. The team also found that updates during RL consistently target a specific subset of parameters, regardless of the dataset or algorithm used, suggesting a strongly model-conditioned optimization process.
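One way to make this claim measurable is to compare update masks across independent runs. The sketch below is an assumed workflow rather than the authors’ tooling; `base`, `run_a`, and `run_b` are hypothetical stand-ins for the pretrained and fine-tuned checkpoints, and the threshold `tol` is an illustrative choice.

```python
import torch

def update_mask(base: dict, tuned: dict, tol: float = 0.0) -> dict:
    """Boolean mask of parameters whose value moved by more than `tol`."""
    return {k: (tuned[k] - base[k]).abs() > tol for k in base}

def mask_overlap(mask_a: dict, mask_b: dict) -> float:
    """Intersection-over-union of two update masks across all tensors."""
    inter = sum((mask_a[k] & mask_b[k]).sum().item() for k in mask_a)
    union = sum((mask_a[k] | mask_b[k]).sum().item() for k in mask_a)
    return inter / max(union, 1)

# Toy usage: two "runs" that touch the same 30% of coordinates give overlap near 1.
base = {"w": torch.zeros(1000)}
shared_sites = torch.rand(1000) < 0.3  # the model-preferred region, shared by both runs
run_a = {"w": base["w"] + shared_sites * torch.randn(1000)}
run_b = {"w": base["w"] + shared_sites * torch.randn(1000)}
print(mask_overlap(update_mask(base, run_a), update_mask(base, run_b)))
```

A high overlap between runs trained with different data or algorithms is the kind of evidence behind the claim that the bias is a property of the model rather than of the training set.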

The primary mechanism of learning appears to be the rescaling of existing features: adjusting the magnitudes of weights rather than learning entirely new representations. Analysis of the model’s principal components confirms that these components are largely preserved during RL, further supporting the idea that the core structure of knowledge remains intact. The KL-constrained objective used in RLVR, which regularizes the learning process, appears to contribute to this stability. The research employed spectral analysis, principal component analysis, and update-mask analysis to examine changes in the LLM’s internal representations and to identify patterns in the weight updates.
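As an illustration of the spectral analysis just mentioned, the sketch below (not the authors’ code; `W_base` and `W_tuned` are placeholders for one layer’s weight matrix taken from two checkpoints) compares the singular-value spectrum and the top-k principal subspace before and after fine-tuning.

```python
import torch

def spectral_drift(W_base: torch.Tensor, W_tuned: torch.Tensor, k: int = 32):
    """Return (relative singular-value drift, top-k left-subspace overlap)."""
    U0, S0, _ = torch.linalg.svd(W_base.float(), full_matrices=False)
    U1, S1, _ = torch.linalg.svd(W_tuned.float(), full_matrices=False)

    # How far the singular-value spectrum moved, relative to its size.
    sv_drift = (torch.norm(S1 - S0) / torch.norm(S0)).item()

    # Mean squared cosine of the principal angles between the top-k left
    # singular subspaces: 1.0 = identical subspace, 0.0 = orthogonal.
    cosines = torch.linalg.svdvals(U0[:, :k].T @ U1[:, :k])
    overlap = (cosines ** 2).mean().item()
    return sv_drift, overlap

# Toy usage: a small additive update leaves both measures nearly untouched.
W_base = torch.randn(1024, 1024)
W_tuned = W_base + 1e-3 * torch.randn(1024, 1024)
print(spectral_drift(W_base, W_tuned))  # drift near 0, overlap near 1
```

A stable spectrum together with near-unit subspace overlap is what "minimal rotation of principal subspaces" refers to in the results below.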

These methods, combined with visualization techniques, provided a comprehensive picture of the learning process across several LLMs, including DS-Distill-Qwen-1.5B, Qwen3-14B, and Llama-3.1-8B. The results consistently show that optimization targets the same subset of parameters, maintaining a stable spectrum and minimal rotation of the principal subspaces. The models achieve significant performance gains on a variety of tasks after RL fine-tuning, and masking strategies can further improve performance and reduce computational cost. These findings challenge the conventional wisdom that RL drastically reshapes internal representations. The researchers uncovered a persistent, model-conditioned optimization bias: updates consistently localize to preferred parameter regions and remain stable across different algorithms and datasets, suggesting that the model itself guides the learning process rather than simply responding to the training data. To explain this phenomenon, they developed a Three-Gate Theory.

The first gate, the KL Anchor, enforces a KL-constrained update at each on-policy step, regulating the learning process. The second gate, Model Geometry, steers updates off the principal directions in weight space and toward low-curvature subspaces of the pretrained model; it is this gate that makes the bias model-conditioned, since it arises from the model’s existing structure. The third gate, Precision, acts as a filter that attenuates micro-updates in non-preferred regions, producing the appearance of sparsity. The researchers validated this theory with a comprehensive experimental suite, showing that RLVR learns off the principal directions and operates in a regime distinct from supervised fine-tuning (SFT).
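The KL Anchor can be made concrete with a small sketch of a KL-regularized policy-gradient objective. This is a generic illustration, not the paper’s exact loss: the tensor names and shapes, the coefficient `beta`, and the use of the k3 KL estimator are all assumptions.

```python
import torch

def kl_anchored_pg_loss(logp_new, logp_old, logp_ref, advantage, beta=0.05):
    """Policy-gradient loss with a KL penalty toward a frozen reference policy.

    logp_new / logp_old / logp_ref: (batch, seq) log-probs of the sampled tokens
    under the current, sampling, and reference policies; advantage: (batch,).
    """
    ratio = torch.exp(logp_new - logp_old)               # per-token importance ratio
    pg_term = -(ratio * advantage.unsqueeze(-1)).mean()  # maximize expected verifiable reward
    log_r = logp_ref - logp_new
    kl_term = (torch.exp(log_r) - log_r - 1.0).mean()    # k3 estimate of KL(new || ref)
    return pg_term + beta * kl_term                      # the KL term anchors every update

# Toy usage with random stand-ins shaped (batch=4, seq=16):
lp = lambda: -torch.rand(4, 16)
print(kl_anchored_pg_loss(lp(), lp(), lp(), advantage=torch.randn(4)).item())
```

However the penalty is implemented, the effect the theory points to is the same: each on-policy step is kept close to the anchor distribution, which limits how much a single update can change the policy’s outputs.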

The team demonstrated that RLVR preserves the pretrained spectral structure with minimal distortion, whereas SFT significantly distorts it, and that RLVR avoids updating principal weights while parameter-efficient SFT methods target exactly those weights. Experiments with function-preserving orthogonal rotations, which abolish the overlap in update locality, confirmed that RLVR’s behavior depends on the pretrained geometry, supporting the notion of a model-conditioned optimization bias. Together, these results resolve the apparent paradox of strong performance from seemingly sparse updates: the bias concentrates learning into a stable subset of parameters, and it does so consistently across algorithms and datasets, indicating that it is an inherent property of the model itself.
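The rotation experiments rest on the fact that an orthogonal reparameterization can change the weight basis without changing the model’s function. The toy sketch below illustrates the idea only for a two-layer linear map; the construction the paper applies to full transformer layers is necessarily more involved.

```python
import torch

torch.manual_seed(0)
d = 64
W1 = torch.randn(d, d, dtype=torch.float64)
W2 = torch.randn(d, d, dtype=torch.float64)

# Random orthogonal matrix via QR decomposition.
R, _ = torch.linalg.qr(torch.randn(d, d, dtype=torch.float64))

# Rotated reparameterization of the same linear map: W2 @ W1 == (W2 @ R.T) @ (R @ W1).
W1_rot, W2_rot = R @ W1, W2 @ R.T

x = torch.randn(d, dtype=torch.float64)
y_orig = W2 @ (W1 @ x)
y_rot = W2_rot @ (W1_rot @ x)
print(torch.allclose(y_orig, y_rot))  # True: same function, different weight basis
```

Because the function is unchanged, a purely data-driven explanation would predict the same update locality before and after the rotation; the fact that the overlap is abolished under rotation points instead to the pretrained geometry.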

These findings challenge the direct application of SFT-era parameter-efficient fine-tuning (PEFT) methods to RLVR: sparse fine-tuning significantly degrades performance, while updating only the non-principal weights closely tracks the dense RLVR trajectory. An analysis of LoRA variants likewise shows that advanced techniques designed for SFT are misaligned with RLVR’s optimization geometry, pointing to the need for RLVR-native learning algorithms.
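As a rough illustration of what "learning off the principals" can mean at the level of a single weight matrix, the sketch below (an assumed analysis, not the authors’ code) measures how much of an update’s energy falls inside the rank-k principal subspaces of the pretrained matrix; `k` and the toy construction of the update are arbitrary choices.

```python
import torch

def fraction_on_principal(W: torch.Tensor, dW: torch.Tensor, k: int = 16) -> float:
    """Fraction of the update's energy inside the rank-k principal subspaces of W."""
    W, dW = W.float(), dW.float()
    U, _, Vh = torch.linalg.svd(W, full_matrices=False)
    Uk, Vk = U[:, :k], Vh[:k, :].T
    dW_principal = Uk @ (Uk.T @ dW @ Vk) @ Vk.T  # project dW onto the top-k subspaces
    return ((dW_principal.norm() / dW.norm()) ** 2).item()

# Toy usage: an update built from low-energy singular directions of W is "off-principal".
W = torch.randn(256, 256)
U, S, Vh = torch.linalg.svd(W)
dW_off = U[:, -16:] @ torch.diag(torch.randn(16)) @ Vh[-16:, :]
print(fraction_on_principal(W, dW_off, k=16))  # close to 0.0
```

A PEFT method that concentrates its capacity on the top singular directions would score high on this measure, which is one way to see why SFT-era recipes can be misaligned with RLVR’s preference for off-principal updates.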

👉 More information
🗞 The Path Not Taken: RLVR Provably Learns Off the Principals
🧠 ArXiv: https://arxiv.org/abs/2511.08567

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
