The instability often encountered when fine-tuning large language models with reinforcement learning stems from a surprising source: the way computers represent numbers. Penghui Qi, alongside Zichen Liu from Sea AI Lab and the National University of Singapore and Xiangxin Zhou from Sea AI Lab, demonstrates that the mismatch between the policy used during training and the one used at inference arises from rounding errors introduced by the commonly used BF16 format. Their research offers a simple yet powerful solution: reverting to the older FP16 format effectively eliminates the inconsistency, leading to more stable and faster learning. The change requires minimal code adjustments and no architectural modifications, yet it consistently improves performance across a range of tasks and algorithms, prompting a re-evaluation of precision choices in reinforcement learning.
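To make the precision argument concrete, the short PyTorch sketch below (not taken from the paper) compares the rounding error that BF16 and FP16 introduce when the same FP32 values are cast down; the tensor and the printed values are purely illustrative.

```python
import torch

# Illustrative sketch, not the paper's code: BF16 keeps ~7 mantissa bits while
# FP16 keeps ~10, so casting the same activations to each format introduces
# rounding errors of different magnitudes. The paper's argument is that such
# per-step rounding differences drive the gap between training and inference
# policies; the tensor here is synthetic.
x = torch.randn(1024, dtype=torch.float32)

bf16_err = (x.to(torch.bfloat16).float() - x).abs().max().item()
fp16_err = (x.to(torch.float16).float() - x).abs().max().item()

# For unit-scale values the BF16 error is typically ~8x larger
# (BF16 has 3 fewer mantissa bits than FP16).
print(f"max BF16 rounding error: {bf16_err:.2e}")
print(f"max FP16 rounding error: {fp16_err:.2e}")
```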
Floating-Point Precision Stabilises Language Model Learning
This study identifies floating-point precision as the root cause of the discrepancy between training and inference policies that destabilizes reinforcement learning fine-tuning of large language models, and proposes a surprisingly simple fix. The researchers found that the widely adopted BFloat16 format introduces rounding errors that accumulate during training, driving the learning and deployment policies apart. To address this, the team systematically investigated switching to FP16 throughout the reinforcement learning process. Experiments covered diverse settings, including algorithms such as GRPO, GSPO, TIS, MIS, and PG, and model families such as R1D, Qwen, and OctoThinker.
Rigorous testing across two independent frameworks, VeRL and Oat, confirmed the robustness of the findings. The results show that switching to FP16 eliminates the need for the complex algorithmic workarounds previously proposed to correct the training-inference mismatch, returning reinforcement learning to its simplest form: a straightforward importance-weighted policy gradient without extra corrective machinery. It also closes the deployment gap, ensuring that the parameters produced by training are the ones actually optimized for real-world use.
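As an illustration of what that simplest form looks like, the sketch below shows a generic token-level importance-weighted policy-gradient loss. It is a hedged reconstruction under standard RL conventions, not the authors' implementation; the function and argument names are hypothetical.

```python
import torch

def importance_weighted_pg_loss(train_logprobs: torch.Tensor,
                                rollout_logprobs: torch.Tensor,
                                advantages: torch.Tensor,
                                max_ratio: float = 10.0) -> torch.Tensor:
    """Minimal sketch of an importance-weighted policy gradient (illustrative).
    Rollouts are sampled from the inference policy, so each token's REINFORCE
    term is reweighted by pi_train / pi_rollout. When both engines run in FP16
    the two policies nearly coincide, the ratio stays close to 1, and this
    reduces to a plain policy gradient; `max_ratio` only guards rare outliers.
    All tensors share the same shape (num_tokens,)."""
    ratio = torch.exp(train_logprobs - rollout_logprobs).detach().clamp(max=max_ratio)
    # Surrogate whose gradient is E[ ratio * advantage * grad log pi_train ].
    return -(ratio * advantages * train_logprobs).mean()
```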
FP16 Precision Stabilizes Language Model Reinforcement Learning
Researchers have achieved a significant breakthrough in reinforcement learning (RL) fine-tuning of large language models (LLMs) by identifying and resolving a key source of instability. The team discovered that the widely adopted BF16 precision standard introduces rounding errors that create a mismatch between training and inference policies, hindering optimization. Extensive testing across diverse settings, including algorithms such as GRPO, GSPO, and PG, and model families like R1D, Qwen, and OctoThinker, consistently showed that FP16 delivers improved performance. For example, in a sanity-check GRPO run, FP16 reached a reward of approximately 0.9 after 2,000 training steps, while BF16 plateaued around 0.7. Further analysis with the OctoThinker model under GRPO showed FP16 reaching a reward of approximately 0.65 at 1,000 training steps, while BF16 remained below 0.5. These results, validated across the VeRL and Oat frameworks, eliminate the need for complex algorithmic workarounds and close the deployment gap, offering a simpler and more robust approach to RL fine-tuning.
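One way to see the gap these curves reflect is to compare, token by token, the log-probabilities that the training engine and the rollout engine assign to the same sampled outputs. The sketch below is a hypothetical diagnostic along those lines, not a metric taken from the paper.

```python
import torch

def mismatch_metrics(train_logprobs: torch.Tensor,
                     rollout_logprobs: torch.Tensor) -> dict:
    """Hypothetical diagnostic (not from the paper): compare per-token
    log-probabilities from the training engine and the rollout engine for the
    same sampled tokens. Under BF16 these tend to drift apart during training;
    under FP16 they should stay nearly identical."""
    log_ratio = train_logprobs - rollout_logprobs
    return {
        "mean_abs_log_ratio": log_ratio.abs().mean().item(),
        # Monte-Carlo estimate of KL(rollout || train) on the sampled tokens.
        "approx_kl": (-log_ratio).mean().item(),
    }
```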
FP16 Precision Stabilizes Language Model Fine-tuning
This research demonstrates that a simple change in numerical precision significantly improves the stability and performance of reinforcement learning fine-tuning for large language models. Scientists discovered that the widely used BF16 format introduces rounding errors that create inconsistencies between the training and inference stages of these models, leading to unstable optimization. By reverting to FP16 precision, the team achieved more stable training, faster convergence, and stronger overall performance. The findings reveal that existing algorithmic corrections designed to address this training-inference mismatch often fall short, exhibiting instability or slow convergence when used with BF16. This research offers a simpler and more robust approach to RL fine-tuning.
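For readers wondering what "a simple change in numerical precision" amounts to in practice, the following minimal sketch assumes a plain PyTorch mixed-precision loop rather than the VeRL or Oat setups used in the paper: the autocast dtype switches from BF16 to FP16, and a GradScaler is added because FP16 has a narrower dynamic range. The `model`, `optimizer`, `batch`, and `loss_fn` objects are placeholders.

```python
import torch

# Minimal sketch of an FP16 training step under standard PyTorch AMP;
# not the configuration used in the paper's frameworks.
scaler = torch.cuda.amp.GradScaler()

def fp16_train_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):  # previously torch.bfloat16
        loss = loss_fn(model, batch)
    scaler.scale(loss).backward()  # loss scaling prevents FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```

In a full RL pipeline the rollout engine would also need to run in FP16 so that both sides compute with the same numerics; the exact setting depends on the serving framework used.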
👉 More information
🗞 Defeating the Training-Inference Mismatch via FP16
🧠 ArXiv: https://arxiv.org/abs/2510.26788
