Microscaling FP4 Quantization: MR-GPTQ Achieves 6x Speedup, Bridging Promise and Performance Gaps

Recent advances in hardware acceleration offer the potential to revolutionize large language model (LLM) inference through microscaling 4-bit floating-point formats, but realizing these benefits in practice remains a significant challenge. Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, and colleagues demonstrate that existing methods struggle to fully exploit formats like MXFP4 and NVFP4 due to inherent limitations in their design. This research presents the first comprehensive study of these formats, revealing that the small group size of NVFP4 hinders outlier mitigation, while the power-of-two scale of MXFP4 introduces substantial accuracy errors. To address these issues, the team introduces Micro-Rotated-GPTQ (MR-GPTQ), a novel quantization algorithm specifically tailored to the unique properties of FP4, achieving significant layer-wise speedups of up to 3.6x on NVIDIA B200 and 6x on RTX5090 while matching or exceeding the accuracy of current state-of-the-art methods and substantially improving MXFP4 performance. This work establishes that, with format-specialized techniques like MR-GPTQ, FP4 formats can unlock a new level of accuracy and performance for LLM inference.

Llama-3-8B Performance With Weight Transformations

Researchers evaluated the performance of the Llama-3-8B language model when different mathematical transformations were applied to its weights, testing the Discrete Cosine Transform, the Discrete Sine Transform, and specialized scaling methods to determine whether they could maintain accuracy while reducing the model's size. The study focused on two 4-bit weight formats, NVFP4 and MXFP4, comparing their performance against a full-precision (16-bit) baseline and measuring accuracy across a range of downstream tasks. Key observations reveal that full precision consistently achieves the highest scores, but compressed 4-bit formats can approach similar levels of accuracy. The GPTQ quantization method consistently performed well, often nearing full precision and surpassing other transformations, while the effectiveness of other transformations varied with the chosen format, with MXFP4 generally yielding better scores than NVFP4 in these experiments.
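To make the idea of a weight transformation concrete, here is a minimal sketch (not the authors' implementation) of applying an orthonormal block-wise Hadamard rotation to contiguous groups of weights before quantization. The function names and the group size of 16 are illustrative assumptions; the point is that an orthogonal rotation spreads large outlier weights across their group, which tends to make the group easier to quantize with a single shared scale.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester-construction Hadamard matrix of size n (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

def rotate_groups(W: np.ndarray, group: int = 16) -> np.ndarray:
    """Apply an orthonormal Hadamard rotation to each contiguous group of `group`
    weights along the input dimension; being orthogonal, it can be undone exactly
    or folded into a matching activation rotation."""
    H = hadamard(group)
    rows, cols = W.shape
    Wg = W.reshape(rows, cols // group, group)
    return (Wg @ H).reshape(rows, cols)

# Toy usage: an injected outlier's magnitude shrinks once it is spread over its group.
W = np.random.randn(4, 32).astype(np.float32)
W[0, 5] = 25.0
print(np.abs(W[0, :16]).max(), np.abs(rotate_groups(W)[0, :16]).max())
```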

FP4 Quantization with Micro-Rotated GPTQ Algorithm

This study investigates recently developed 4-bit floating-point formats, MXFP4 and NVFP4, for accelerating large language model inference. The researchers found that standard quantization techniques struggle to fully exploit these formats: traditional outlier-mitigation techniques are hampered by NVFP4's small group size, while MXFP4 suffers accuracy degradation from its power-of-two scale quantization. To overcome these limitations, they engineered Micro-Rotated-GPTQ (MR-GPTQ), a variant of the GPTQ quantization algorithm tailored to the unique properties of FP4 formats. MR-GPTQ employs block-wise Hadamard transforms to normalize weights and activations, combines them with optimized activation re-ordering and format-specific scale-search strategies, and is supported by newly developed high-performance GPU kernels that fuse the rotations into the weights and compute the activation rotations online with negligible overhead, integrating seamlessly with NVIDIA Blackwell GPUs.
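For intuition about why the two formats behave differently, the following sketch fake-quantizes weights to FP4 (E2M1 magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}) using either an MXFP4-style power-of-two scale over groups of 32 or an NVFP4-style finer-grained scale over groups of 16. This is a simplified illustration under stated assumptions: the NVFP4 group scale is approximated in float32 rather than FP8, the scale selection is a naive max-based rule rather than the paper's scale search, and the function names are hypothetical, not from the authors' code.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def quantize_fp4_group(x: np.ndarray, power_of_two_scale: bool) -> np.ndarray:
    """Fake-quantize one group of values to FP4 with a shared scale.
    power_of_two_scale=True mimics an MXFP4-style E8M0 scale; False mimics an
    NVFP4-style finer-grained scale (approximated here in float32)."""
    amax = np.abs(x).max() + 1e-12
    scale = amax / FP4_GRID[-1]                 # map the largest magnitude onto 6
    if power_of_two_scale:
        scale = 2.0 ** np.ceil(np.log2(scale))  # round the scale up to a power of two
    # Round each scaled value to the nearest representable FP4 magnitude.
    idx = np.argmin(np.abs(np.abs(x / scale)[:, None] - FP4_GRID), axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

def quantize_weights(W: np.ndarray, group: int, power_of_two_scale: bool) -> np.ndarray:
    out = np.empty_like(W)
    for r in range(W.shape[0]):
        for c in range(0, W.shape[1], group):
            out[r, c:c+group] = quantize_fp4_group(W[r, c:c+group], power_of_two_scale)
    return out

W = np.random.randn(8, 64)
mx = quantize_weights(W, group=32, power_of_two_scale=True)    # MXFP4-style
nv = quantize_weights(W, group=16, power_of_two_scale=False)   # NVFP4-style
print("MXFP4-style MSE:", np.mean((W - mx) ** 2))
print("NVFP4-style MSE:", np.mean((W - nv) ** 2))
```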

Results show substantial layer-wise speedups on NVIDIA B200 and RTX5090 GPUs, translating into significant improvements in inference speed, and the method recovers up to 98-99% of baseline FP16 accuracy for large models in both formats. The team's analysis revealed that naive MXFP4 quantization induces major accuracy drops while NVFP4 exhibits less significant loss, and that the distribution of representable values in these formats changes quantization dynamics, requiring specialized approaches to mitigate outliers and preserve accuracy. To support MR-GPTQ, the team developed QuTLASS, a suite of high-performance GPU kernels that implement the micro-rotations with minimal overhead. Extensive evaluation demonstrates that the method matches or surpasses the accuracy of state-of-the-art techniques, significantly enhancing the performance of MXFP4 and bringing it closer to that of NVFP4. The results indicate that while FP4 formats do not automatically provide improvements over INT4, format-specialized methods like MR-GPTQ can unlock new possibilities for balancing accuracy and performance in large language models.
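As a small sketch of the rotation-fusion idea mentioned above: because a block-diagonal Hadamard rotation is orthogonal, it can be folded into the weights offline while activations are rotated online, leaving the layer's output mathematically unchanged. The helper names below are hypothetical and this is not the QuTLASS API; it only illustrates the identity that makes fusion possible.

```python
import numpy as np

def blockdiag_hadamard(dim: int, group: int) -> np.ndarray:
    """Block-diagonal rotation: one independent orthonormal Hadamard block per group of channels."""
    H = np.array([[1.0]])
    while H.shape[0] < group:
        H = np.block([[H, H], [H, -H]])
    H /= np.sqrt(group)
    R = np.zeros((dim, dim))
    for i in range(dim // group):
        R[i*group:(i+1)*group, i*group:(i+1)*group] = H
    return R

dim, group = 64, 16
W = np.random.randn(128, dim)          # weight matrix (out_features x in_features)
x = np.random.randn(dim)               # one activation vector
R = blockdiag_hadamard(dim, group)

W_rot = W @ R                          # folded into the weights once, offline
x_rot = R.T @ x                        # cheap block-wise transform applied online
print(np.allclose(W @ x, W_rot @ x_rot))   # True: the layer output is unchanged
```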

👉 More information
🗞 Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
🧠 ArXiv: https://arxiv.org/abs/2509.23202

Rohail T.


I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
