Researchers restore out-of-distribution performance by 8.97% to 19.66% with reinforcement learning fine-tuning

Large language models increasingly rely on post-training methods to adapt to specific tasks, with supervised fine-tuning and reinforcement learning fine-tuning being prominent approaches, yet a clear understanding of how these methods impact performance remains elusive. Hangzhan Jin from Polytechnique Montréal, Sicheng Lv and Sifan Wu from Mila, along with Mohammad Hamdaqa from Polytechnique Montréal, investigate how these two stages reshape a language model’s underlying representation and its ability to generalise to unseen data. Their research demonstrates that reinforcement learning can often restore performance lost during supervised fine-tuning, but that this recovery is limited when the initial stage causes significant overfitting, indicating that reinforcement learning primarily corrects directional shifts in the model’s internal representation rather than discovering entirely new solutions. This work introduces a novel spectrum-based diagnostic tool and reveals that inexpensive techniques, such as restoring a small number of key parameters or layers, can recover much of the lost performance before embarking on computationally expensive reinforcement learning.

Rotation dominates fine-tuning of large models

This research explores why fine-tuning large language models (LLMs) appears to prioritize rotating weight matrices over changing their singular values. The optimization process naturally favors updates that act as rotations: these leave the singular values, and hence the overall gain of each linear map, unchanged, whereas rescaling those magnitudes risks disrupting learned representations. Rotations also incur a lower cost in standard loss functions and regularization terms, and are numerically more stable than changes to singular values. The analysis considers a simplified section within a Transformer block and examines how a single gradient descent step affects the weight matrices.
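To make the setup concrete, here is a minimal sketch of the kind of first-order argument involved; the notation is generic (W for a weight matrix, Ω for a skew-symmetric generator) and is not taken verbatim from the paper:

```latex
% SVD of a weight matrix and an infinitesimal rotational update.
\[
W = U \Sigma V^\top,
\qquad
W' = e^{\epsilon \Omega} W \approx (I + \epsilon \Omega) W,
\qquad
\Omega^\top = -\Omega .
\]
% Because e^{\epsilon \Omega} is orthogonal, W' = (e^{\epsilon \Omega} U)\,\Sigma\,V^\top
% has exactly the same singular values as W: only the singular vectors rotate.
% A direct change of the spectrum, W'' = U (\Sigma + \epsilon D) V^\top, instead
% alters the layer's gain and shifts the weight-decay term at first order:
\[
\lVert W'' \rVert_F^2 = \sum_i (\sigma_i + \epsilon d_i)^2
  = \lVert W \rVert_F^2 + 2\epsilon \sum_i \sigma_i d_i + O(\epsilon^2).
\]
```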

By applying an update rule built from infinitesimal rotations, the research demonstrates that the cost of these rotational changes is significantly lower than that of direct alterations to the singular values. This suggests that the optimization process gravitates towards solutions in which weight matrices rotate rather than undergo substantial changes in magnitude. The finding explains why fine-tuning often leaves singular values stable while rotating the weight matrices significantly, and suggests that the optimization landscape is shaped by this preference. The research provides a theoretical justification for regularization techniques that encourage orthogonal transformations, and hints that the ability of LLMs to generalize may be linked to the preservation of underlying representations through rotations.
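As a quick empirical counterpart, the NumPy sketch below applies a pure rotation to a toy weight matrix and confirms that the singular values are untouched while the singular directions move. The function name and the 256×256 toy matrix are illustrative choices, not the authors' tooling:

```python
import numpy as np

def spectral_drift(w_base: np.ndarray, w_new: np.ndarray, k: int = 16):
    """Compare two weight matrices in SVD terms.

    Returns (relative drift of the top-k singular values,
             principal angles in degrees between the top-k left singular subspaces).
    """
    u0, s0, _ = np.linalg.svd(w_base, full_matrices=False)
    u1, s1, _ = np.linalg.svd(w_new, full_matrices=False)

    sigma_drift = np.abs(s1[:k] - s0[:k]) / (s0[:k] + 1e-12)

    # Principal angles: 0 deg means identical directions, 90 deg means orthogonal.
    cosines = np.clip(
        np.linalg.svd(u0[:, :k].T @ u1[:, :k], compute_uv=False), -1.0, 1.0
    )
    angles_deg = np.degrees(np.arccos(cosines))
    return sigma_drift, angles_deg

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)) / 16.0

# Build an orthogonal matrix biased towards the identity and rotate the weights with it.
q, _ = np.linalg.qr(np.eye(256) + 0.02 * rng.standard_normal((256, 256)))
w_rotated = q @ w

drift, angles = spectral_drift(w, w_rotated)
print(f"max singular-value drift: {drift.max():.2e}")    # ~0 (machine precision)
print(f"mean principal angle:     {angles.mean():.2f} deg")  # > 0: directions rotated
```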

The research builds upon the idea of an orthogonal-rotation gauge, meaning the optimization process is constrained to move along a path of orthogonal transformations. This constraint reduces the cost of weight updates and helps preserve the underlying representations. The analysis is mathematically rigorous and has practical implications for improving fine-tuning techniques. In conclusion, this analysis provides compelling evidence that rotation plays a dominant role in the fine-tuning of LLMs, offering a theoretical foundation for understanding the phenomenon, with implications for improving model performance and generalization ability.
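One simple way to encode this preference explicitly during fine-tuning is to penalize drift of the singular values while leaving rotations free. The PyTorch sketch below is an illustrative construction consistent with that idea, not a regularizer taken from the paper, and `lambda_rot` is a hypothetical coefficient:

```python
import torch

def spectrum_drift_penalty(w_base: torch.Tensor, w_ft: torch.Tensor) -> torch.Tensor:
    """Penalty that is zero for any purely rotational update of w_base.

    Orthogonal (rotation-like) updates leave singular values unchanged, so this
    term nudges optimization towards the 'orthogonal-rotation gauge' described
    above while leaving singular-vector directions unconstrained.
    """
    s_base = torch.linalg.svdvals(w_base).detach()  # reference spectrum, no gradient
    s_ft = torch.linalg.svdvals(w_ft)               # differentiable w.r.t. w_ft
    return torch.sum((s_ft - s_base) ** 2)

# Usage sketch inside a training step, applied to matching 2-D parameters:
# loss = task_loss + lambda_rot * sum(
#     spectrum_drift_penalty(p0, p) for p0, p in zip(base_params_2d, model_params_2d)
# )
```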

Supervised and Reinforcement Fine-tuning Impact Generalization

Researchers investigated how supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RL-FT) reshape large language models (LLMs) and their ability to perform on unseen tasks, focusing on out-of-distribution (OOD) generalization. They performed full-parameter SFT and RL-FT on both Llama-3.2-11B and Qwen-2.5-7B models, meticulously tracking both in-distribution and OOD performance using a card-game benchmark to probe arithmetic reasoning and generalization ability. Experiments revealed that OOD generalization initially peaks during SFT but subsequently declines as training continues, even while performance on the training data improves, demonstrating a trade-off between memorization and generalization.

The team then demonstrated that the subsequent RL-FT stage substantially mitigates this OOD decay, restoring up to 85% of the OOD performance lost during SFT for Llama-3.2-11B and up to 99% for Qwen-2.5-7B, all while maintaining strong performance on the training data. However, prolonged SFT beyond a certain threshold results in lost ability that RL-FT can only partially recover, suggesting that RL primarily restores competencies rather than discovering fundamentally new solutions. To understand how these changes occur, researchers analyzed the singular value decomposition of the models’ parameter matrices, revealing that shifts in the direction of singular vectors have a much larger impact on performance than changes in the singular values themselves.

These shifts concentrate on directions linked to the largest and smallest singular values, leaving the bulk of the model’s intrinsic capacity unchanged. This demonstrates that RL primarily counteracts SFT-induced directional drift. Furthermore, restoring the directions of singular vectors corresponding to the top 20% of singular values or the first 25% of layers recovers 70 to 80% of a model’s OOD performance, highlighting the effectiveness of low-rank and shallow recovery techniques as inexpensive alternatives to costly RL fine-tuning.
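A minimal sketch of such a low-rank direction restore in PyTorch is shown below; the 20% threshold and the idea of splicing base-checkpoint singular directions back into a fine-tuned matrix follow the description above, but the function itself is an illustrative reconstruction rather than the authors' released code:

```python
import torch

@torch.no_grad()
def restore_top_directions(w_base: torch.Tensor,
                           w_ft: torch.Tensor,
                           frac: float = 0.20) -> torch.Tensor:
    """Reset the top singular *directions* of a fine-tuned matrix to the base ones.

    The head of the spectrum (top `frac` of singular values) is rebuilt with the
    base model's singular vectors but the fine-tuned singular values; the tail is
    left exactly as fine-tuning produced it.
    """
    u0, _, v0h = torch.linalg.svd(w_base, full_matrices=False)
    u1, s1, v1h = torch.linalg.svd(w_ft, full_matrices=False)

    k = max(1, int(frac * s1.numel()))
    head = u0[:, :k] @ torch.diag(s1[:k]) @ v0h[:k, :]  # base directions, fine-tuned magnitudes
    tail = u1[:, k:] @ torch.diag(s1[k:]) @ v1h[k:, :]  # untouched fine-tuned remainder
    return head + tail

# Usage sketch over matching 2-D parameters of two checkpoints:
# for p_ft, p_base in zip(ft_model.parameters(), base_model.parameters()):
#     if p_ft.ndim == 2:
#         p_ft.copy_(restore_top_directions(p_base, p_ft))
```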

Reinforcement Learning Restores Lost Generalization Ability

Researchers have demonstrated that reinforcement learning (RL) primarily restores capabilities lost during supervised fine-tuning (SFT) of large language models, rather than creating fundamentally new abilities. This finding reconciles previous observations of improved out-of-distribution (OOD) performance with a mechanistic understanding of how these models learn. Experiments using both the Llama-11B and Qwen-7B models, assessed on a challenging arithmetic reasoning benchmark, reveal a consistent pattern of performance evolution during post-training. Initially, OOD generalization peaks early in SFT but then declines as training continues, even while performance on the training data improves.

Crucially, the subsequent RL stage substantially mitigates this OOD decay, restoring up to 85% of lost performance for Llama-11B and 99% for Qwen-7B, all while maintaining strong performance on the training data. However, this restoration has limits; prolonged SFT beyond a certain threshold results in a loss of ability that RL cannot fully recover. Detailed spectral analysis of the model’s weight matrices reveals that the model’s intrinsic capacity remains largely unchanged throughout training. Instead, degradation and subsequent recovery of OOD performance correlate almost entirely with rotations of singular vectors at the extremes of the spectrum, while the singular values themselves remain relatively stable.

This indicates that directional shifts in these key components, rather than changes in magnitude, govern the model’s performance. Researchers found that restoring the directions of singular vectors corresponding to the top 20% of singular values, or the first 25% of layers, recovers 70-80% of a model’s OOD performance. These findings highlight inexpensive recovery methods, such as low-rank UV merging and shallow-layer resets, that practitioners can employ before resorting to costly RL fine-tuning.
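A similarly small sketch of a shallow-layer reset in the Hugging Face Transformers style follows; the 25% figure comes from the passage above, while the `model.model.layers` attribute path and the idea of copying whole early blocks from the base checkpoint are assumptions made for illustration:

```python
import torch

@torch.no_grad()
def reset_shallow_layers(ft_model, base_model, frac: float = 0.25):
    """Copy the first `frac` of transformer blocks from the base checkpoint
    back into the fine-tuned model, leaving the deeper blocks untouched.

    Assumes a LLaMA/Qwen-style layout with blocks at `model.model.layers`;
    adjust the attribute path for other architectures.
    """
    ft_blocks = ft_model.model.layers
    base_blocks = base_model.model.layers
    n_reset = max(1, int(frac * len(ft_blocks)))

    for i in range(n_reset):
        ft_blocks[i].load_state_dict(base_blocks[i].state_dict())
    return ft_model
```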

Directional Shifts Restore Generalisation Ability

The research investigates how two common techniques, supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RL-FT), affect the performance of large language models, particularly their ability to generalise to new, unseen data. The findings demonstrate that RL-FT can often restore performance lost during SFT, but this recovery is limited when SFT induces significant overfitting and a substantial shift in the model’s internal representation.

👉 More information
🗞 RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs
🧠 ArXiv: https://arxiv.org/abs/2508.16546

Quantum News

As the Official Quantum Dog (or hound), my role is to dig out the latest nuggets of quantum goodness. There is so much happening right now in the field of technology, whether AI or the march of robots. But Quantum occupies a special space. Quite literally a special space. A Hilbert space, in fact, haha! Here I try to provide some of the news that might be considered breaking news in the Quantum Computing space.

Latest Posts by Quantum News:

Toyota & ORCA Achieve 80% Compute Time Reduction Using Quantum Reservoir Computing (January 14, 2026)

GlobalFoundries Acquires Synopsys’ Processor IP to Accelerate Physical AI (January 14, 2026)

Fujitsu & Toyota Systems Accelerate Automotive Design 20x with Quantum-Inspired AI (January 14, 2026)