Jet-RL Achieves 41% Faster FP8 Reinforcement Learning with Unified Precision Flow

Researchers are tackling a major bottleneck in reinforcement learning (RL): the computational cost of training large language models. Haocheng Xi, Charlie Ruan, and Peiyuan Liao, alongside Yujun Lin, Han Cai, and colleagues from NVIDIA, have identified a critical flaw in current approaches to accelerate RL using FP8 precision: the mismatch between training and rollout phases causes instability and accuracy loss. Their new framework, Jet-RL, addresses this by unifying FP8 precision throughout both training and rollout, dramatically reducing numerical discrepancies and enabling significantly faster, more stable learning. Experiments show up to a 41% speedup in training and a 16% end-to-end improvement without sacrificing accuracy.

The team achieved this by tackling the computational inefficiencies inherent in traditional RL, where the rollout phase often consumes over 70% of total training time. This work presents the first comprehensive study of FP8 RL training, revealing that the commonly used BF16-training + FP8-rollout strategy suffers from severe instability and accuracy collapse, particularly with long-horizon rollouts and challenging tasks.

The study finds that these failures stem from a numerical mismatch between training and inference caused by the off-policy nature of the approach, a discrepancy that accumulates during extended reasoning sequences. Motivated by these observations, the researchers propose Jet-RL, which adopts a unified FP8 precision flow for both training and rollout, thereby minimizing these numerical discrepancies and eliminating the need for inefficient inter-step calibration. Extensive experiments validate the effectiveness of Jet-RL, demonstrating up to 33% speedup in the rollout phase, up to 41% speedup in the training phase, and a 16% end-to-end speedup over BF16 training. Crucially, Jet-RL maintains stable convergence across all settings and incurs negligible accuracy degradation, a significant improvement over existing methods.
The innovation lies in establishing a truly on-policy FP8 training paradigm, ensuring robustness and adaptability across diverse training configurations. By employing a mixed per-group and per-block quantization scheme alongside state-of-the-art FP8 GEMM kernels, Jet-RL unlocks substantial acceleration for end-to-end RL training, paving the way for more efficient and powerful LLMs capable of tackling increasingly complex reasoning tasks. This research establishes a foundation for future advancements in LLM training, potentially enabling the development of AI systems with enhanced problem-solving abilities and broader applications.
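To make the quantization scheme more concrete, the sketch below simulates the two granularities mentioned above in PyTorch: per-group scaling along the feature dimension (as is common for activations) and per-block scaling over 2D weight tiles. The group size of 128 and the 128x128 block size are illustrative assumptions rather than confirmed details of the authors' implementation, and a recent PyTorch build with the float8_e4m3fn dtype is assumed.

```python
import torch

FP8_MAX = 448.0  # largest finite magnitude of torch.float8_e4m3fn


def quantize_per_group(x: torch.Tensor, group_size: int = 128):
    """Per-group quantization along the last dimension (typical for activations)."""
    groups = x.reshape(-1, group_size)
    scale = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (groups / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q.reshape(x.shape), scale.reshape(*x.shape[:-1], -1)


def quantize_per_block(w: torch.Tensor, block: int = 128):
    """Per-block (2D tile) quantization, typical for weight matrices."""
    n, k = w.shape
    tiles = w.reshape(n // block, block, k // block, block)
    scale = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (tiles / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q.reshape(n, k), scale.reshape(n // block, k // block)


if __name__ == "__main__":
    act, wgt = torch.randn(4, 1024), torch.randn(1024, 1024)
    aq, a_scale = quantize_per_group(act)   # one scale per 128-value group
    wq, w_scale = quantize_per_block(wgt)   # one scale per 128x128 tile
    print(aq.dtype, a_scale.shape, wq.dtype, w_scale.shape)
```

Finer-grained scales like these keep quantization error local to a group or tile, which is why such schemes tolerate the outliers that per-tensor FP8 scaling tends to clip.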

Stability and optimisation are crucial for FP8 reinforcement learning

Scientists identified a critical bottleneck in reinforcement learning (RL) training of large language models (LLMs): the rollout phase, which consumes over 70% of total training time. To address this, researchers undertook a comprehensive study of FP8 RL training, challenging the prevailing BF16-training + FP8-rollout strategy. Experiments revealed that this common approach suffers from training instability and accuracy collapse, particularly with long-horizon rollouts and complex tasks. The study pinpointed the root cause as numerical mismatch between training and inference arising from the off-policy nature of the method.

Motivated by these findings, the team engineered Jet-RL, a novel FP8 RL training framework designed for robust and stable optimisation. Crucially, Jet-RL adopts a unified FP8 precision flow for both training and rollout, minimising numerical discrepancies and eliminating the need for inefficient inter-step calibration. The researchers implemented this by converting all calculations (actor updates, policy evaluation, and rollout generation) to FP8 format, ensuring consistency throughout the entire training pipeline. This innovative approach contrasts sharply with existing methods that maintain BF16 precision during training while quantising to FP8 only for rollouts.
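As a hypothetical illustration of this unified flow (a minimal sketch, not the authors' pipeline), the toy example below routes both rollout sampling and the training-side log-prob computation through one and the same FP8-simulated forward pass, so the importance ratios come out exactly 1. The model shapes and the quantization helper are invented for exposition, and a PyTorch build with float8 dtypes is assumed.

```python
import torch
import torch.nn.functional as F

FP8_MAX = 448.0  # max finite magnitude of float8_e4m3fn


def fp8_quant_dequant(x: torch.Tensor) -> torch.Tensor:
    """Simulated FP8 cast: quantize to float8_e4m3fn, then back to float."""
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX
    return (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn).float() * scale


def policy_logits(hidden: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """One shared forward path: both rollout and training call this."""
    return fp8_quant_dequant(hidden) @ fp8_quant_dequant(weight).t()


torch.manual_seed(0)
hidden = torch.randn(16, 256)          # toy "hidden states" for 16 tokens
weight = torch.randn(1000, 256)        # toy output projection (vocab of 1000)

# Rollout phase: sample actions from the FP8 policy.
rollout_logp = F.log_softmax(policy_logits(hidden, weight), dim=-1)
actions = torch.distributions.Categorical(logits=rollout_logp).sample()

# Training phase: recompute log-probs with the *same* FP8 forward pass.
train_logp = F.log_softmax(policy_logits(hidden, weight), dim=-1)
ratio = (train_logp - rollout_logp).gather(-1, actions[:, None]).exp()
print("importance ratios:", ratio.squeeze()[:5])  # all exactly 1.0 -> on-policy
```

Because training and rollout share one numerical path, the policy that generated the data is bit-for-bit the policy being updated, which is the "truly on-policy" property the paper emphasises.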

Figure 2 illustrates how the rollout phase dominates latency, exceeding 75% of total time for rollouts longer than 8k tokens, a bottleneck Jet-RL effectively alleviates. Figure 3 highlights the failure of the BF16-train-FP8-rollout method with increasing rollout length, while Jet-RL maintains performance, demonstrating the effectiveness of the unified precision flow. This work pioneers a new direction in efficient RL training, enabling the development of more powerful and resource-efficient LLMs.

Jet-RL accelerates reinforcement learning with FP8 precision, enabling faster and more stable training

Scientists have achieved a 33% speedup in the rollout phase of reinforcement learning (RL) training by employing a novel framework called Jet-RL. This breakthrough addresses a significant bottleneck in training large language models (LLMs), where the rollout phase traditionally consumes over 70% of total training time. The research team demonstrated that conventional BF16-training combined with FP8-rollout strategies suffer from instability and accuracy collapse, particularly with long-horizon rollouts and complex tasks. Analysis revealed that this stems from numerical mismatches between training and inference due to the off-policy nature of the approach.
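The toy experiment below illustrates, under simplified assumptions, why this matters: when the rollout path quantizes to FP8 while the training path stays in higher precision, per-token log-probabilities differ slightly, and the sequence-level gap grows with rollout length. The random weights, shapes, and greedy token choice are illustrative only and not the paper's measurement setup.

```python
import torch
import torch.nn.functional as F

FP8_MAX = 448.0


def fp8_sim(x: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize through float8_e4m3fn to mimic an FP8 forward pass."""
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX
    return (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn).float() * scale


torch.manual_seed(0)
weight = torch.randn(1000, 256)  # toy output projection

for length in [256, 1024, 4096, 8192]:
    hidden = torch.randn(length, 256)
    # "Rollout" path (FP8) vs "training" path (BF16-like storage, float matmul).
    logp_fp8 = F.log_softmax(fp8_sim(hidden) @ fp8_sim(weight).t(), dim=-1)
    logp_bf16 = F.log_softmax(hidden.bfloat16().float() @ weight.bfloat16().float().t(), dim=-1)
    tokens = logp_fp8.argmax(dim=-1, keepdim=True)        # tokens the rollout would pick
    per_token = (logp_bf16 - logp_fp8).gather(-1, tokens).squeeze(-1)
    print(f"length {length:5d}: sequence log-prob gap = {per_token.sum().item():+.2f}")
```

The per-token discrepancies are tiny, but summed over thousands of generated tokens they make the training objective see a noticeably different policy than the one that produced the data, which is the off-policy drift described above.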

To overcome these limitations, researchers developed Jet-RL, an FP8 RL training framework that utilizes a unified FP8 precision flow for both training and rollout. This innovative approach minimizes numerical discrepancies and eliminates the need for inefficient inter-step calibration, resulting in remarkably stable and robust RL optimization. Experiments confirm that Jet-RL delivers up to a 41% speedup in the training phase itself, culminating in a 16% end-to-end speedup compared to standard BF16 training. Measurements show that the method maintains stable convergence across all tested settings while incurring only negligible accuracy degradation.

The team designed the framework to use identical quantization precision for both training and inference, effectively resolving policy mismatch and streamlining the optimization process. Jet-RL adopts a mixed per-group and per-block quantization scheme, leveraging state-of-the-art FP8 GEMM kernels to accelerate end-to-end RL training. Comprehensive experiments across diverse models, datasets, and rollout configurations validate the effectiveness of Jet-RL, successfully stabilizing training and minimizing divergence between training and rollouts. Specifically, tests with a 32B model achieved a 1.33x speedup in the rollout phase, while an 8B model saw a 1.41x speedup in the training phase and a 1.16x end-to-end speedup.
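For intuition on how the scaling factors interact with the FP8 GEMM kernels mentioned above, the reference sketch below uses per-tensor scales for brevity (the paper's scheme applies its per-group and per-block scales at finer granularity inside the kernel): the low-precision matmul result is rescaled by the product of the input scales, which is the semantics a fused FP8 kernel implements with higher-precision accumulation. All names and shapes are illustrative assumptions.

```python
import torch

FP8_MAX = 448.0


def quantize(x: torch.Tensor):
    """Per-tensor FP8 quantization: returns the FP8 tensor and its scale."""
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX
    return (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn), scale


torch.manual_seed(0)
a = torch.randn(64, 256)    # activations
b = torch.randn(128, 256)   # weight (out_features x in_features)

aq, sa = quantize(a)
bq, sb = quantize(b)

# Reference semantics of a scaled FP8 GEMM: multiply the quantized operands,
# then rescale once by the product of the scales to recover the output.
y_fp8 = (aq.float() @ bq.float().t()) * (sa * sb)
y_ref = a @ b.t()
print("max abs error vs full precision:", (y_fp8 - y_ref).abs().max().item())
```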

Compared to BF16-train-FP8-rollout methods that typically incur over 5% performance degradation, Jet-RL reduces this to approximately 1%. These findings confirm that Jet-RL provides a robust solution for efficient low-precision RL training, enabling significant acceleration without compromising performance. The work identifies that the common BF16-train-FP8-rollout paradigm leads to training instability and accuracy collapse under long-rollout generation and challenging tasks.

Jet-RL resolves precision mismatches and boosts speed significantly

Scientists have demonstrated significant instability and accuracy collapse when employing a common reinforcement learning (RL) training strategy involving BF16 precision for training and FP8 precision for rollouts, particularly with long-horizon tasks and complex challenges. Their analysis reveals that this performance degradation stems from numerical mismatches between the training and inference processes inherent in this off-policy approach. To address this, researchers introduced Jet-RL, a novel FP8 RL framework utilising unified FP8 precision for both training and rollout, effectively minimising these discrepancies and eliminating the need for costly inter-step calibration. Extensive experimentation confirmed Jet-RL’s effectiveness, achieving speedups of up to 33% in the rollout phase, 41% in training, and 16% end-to-end compared to BF16 training, all while maintaining stable convergence and negligible accuracy loss.

The authors acknowledge that their findings are based on specific large language models and tasks, and further research is needed to explore the generalizability of Jet-RL across diverse architectures and problem domains. Future work could investigate adaptive precision schemes or explore combinations with other acceleration techniques to further optimise RL training pipelines. This work establishes a robust and efficient method for FP8 RL training, paving the way for more scalable and resource-conscious development of intelligent systems.

👉 More information
🗞 Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow
🧠 ArXiv: https://arxiv.org/abs/2601.14243

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Catmaster Achieves Faster Heterogeneous Catalysis Research Using LLM-Driven Workflows
January 23, 2026

Mixture of Experts Vision Transformer Achieves High-Fidelity Surface Code Decoding
January 23, 2026

Microscopic Origin Achieves Clear Formulation of Orbital Magnetization in Chiral Superconductors
January 23, 2026