Four Over Six NVFP4 Quantization Achieves Improved Accuracy with Adaptive Block Scaling for Machine Learning

The increasing size of modern artificial intelligence models drives demand for efficient numerical formats, and low-precision formats like NVFP4 offer significant speed and memory benefits. However, training and running models accurately in such formats proves challenging, often leading to instability and performance loss. Jack Cook, Junxian Guo, and Guangxuan Xiao of the Massachusetts Institute of Technology, together with Yujun Lin and Song Han of MIT and NVIDIA, now present a solution called Four Over Six, a refinement of the NVFP4 quantization algorithm. Their method evaluates multiple scaling options for each data block to represent values more accurately, improving performance, preventing training divergence, and delivering significant gains when pre-training large language models. This advance promises greater efficiency in both the training and deployment of increasingly complex artificial intelligence systems.

Mitigating Performance Loss in 4-bit LLMs

Researchers are tackling the challenge of reducing the computational demands of large language models (LLMs) through quantization, a technique that lowers the precision of numerical values. While reducing precision to 4-bit formats like NVFP4 offers significant speed and memory benefits, it often leads to a decline in model performance. This research focuses on methods to minimize that loss and maintain accuracy. The team developed a technique called 2D block scaling, which divides the model’s weight matrix into two-dimensional blocks and assigns a unique scale factor to each block, preserving the matrix structure during training.

This approach enhances training stability and prevents significant performance degradation. To further refine the representation of these weight blocks, the team introduced a technique called 4/6, which chooses between two candidate scale factors for each block so that near-maximal values are represented more accurately. Experiments with Llama 3 and Qwen3 models ranging from 1 to 70 billion parameters demonstrated that 4/6 consistently improves performance, as measured by perplexity on the WikiText-2 dataset. The choice of block size, either 1×16 or 16×16, also affects performance and requires careful consideration during implementation. Combining 2D block scaling with the 4/6 technique effectively mitigates the performance loss associated with 4-bit quantization, enabling more efficient and accessible LLMs.
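
To make this concrete, here is a minimal NumPy sketch of per-block scaling, not the authors' implementation: it partitions a weight matrix into blocks of a chosen shape (1×16 or 16×16), maps each block's largest magnitude onto the FP4 maximum of 6, rounds the scaled values to the FP4 (E2M1) grid, and rescales. Real NVFP4 kernels store 4-bit elements and compact scale factors; this float-only simulation and its function names are illustrative assumptions.

```python
import numpy as np

# Positive magnitudes representable by FP4 (E2M1); NVFP4 elements use this grid.
FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-FP4[::-1], FP4])  # signed grid

def fp4_round(x):
    """Round every element to the nearest representable FP4 value."""
    return GRID[np.abs(x[..., None] - GRID).argmin(axis=-1)]

def block_scale_quantize(w, block_shape=(16, 16), fp4_max=6.0):
    """Simulate block-scaled FP4 quantization of a 2D weight matrix.

    block_shape=(1, 16) mimics row-wise micro-blocks; (16, 16) is the
    2D block-scaling variant described in the article.
    """
    rows, cols = w.shape
    br, bc = block_shape
    out = np.empty_like(w, dtype=np.float64)
    for i in range(0, rows, br):
        for j in range(0, cols, bc):
            block = w[i:i + br, j:j + bc]
            amax = np.abs(block).max()
            scale = amax / fp4_max if amax > 0 else 1.0  # one scale per block
            out[i:i + br, j:j + bc] = fp4_round(block / scale) * scale
    return out

w = np.random.randn(64, 64)
for shape in [(1, 16), (16, 16)]:
    w_q = block_scale_quantize(w, block_shape=shape)
    print(shape, "MSE:", np.mean((w - w_q) ** 2))
```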

Novel Quantization Mitigates Training Instability

Researchers have developed a new quantization technique, Four Over Six (4/6), to address challenges in training large language models with low-precision numerical formats like NVFP4. Formats such as NVFP4 offer benefits in speed and memory, but realizing those benefits requires all matrix-multiplication operands to be quantized, which often leads to training instability and performance degradation. The team identified that standard NVFP4 quantization concentrates errors on near-maximal values within data blocks, that is, values close to each block’s largest magnitude. To address this, the study modifies the NVFP4 algorithm to evaluate two potential scale factors for each block of values, scaling some blocks so their maximum maps to 4 and others to 6.
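
A short sketch of that selection step, reconstructed from the description above rather than from the authors' GPU code, might look as follows: for each block, both candidate scales (largest magnitude mapped to 6, the standard choice, or to 4) are tried, the block is reconstructed under each, and the candidate with the lower mean squared error is kept.

```python
import numpy as np

FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-FP4[::-1], FP4])  # signed FP4 (E2M1) grid

def fp4_round(x):
    """Round every element to the nearest representable FP4 value."""
    return GRID[np.abs(x[..., None] - GRID).argmin(axis=-1)]

def four_over_six_block(block):
    """Quantize one block, keeping the better of the two candidate scales."""
    amax = np.abs(block).max()
    if amax == 0:
        return block.copy()
    best, best_err = None, np.inf
    for target in (6.0, 4.0):  # 6 = standard NVFP4 scaling, 4 = the alternative
        scale = amax / target
        recon = fp4_round(block / scale) * scale
        err = np.mean((block - recon) ** 2)
        if err < best_err:
            best, best_err = recon, err
    return best

block = np.random.randn(16)
print("MSE:", np.mean((block - four_over_six_block(block)) ** 2))
```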

This optimizes the representation of near-maximal values and prevents divergence in several training cases, bringing training loss significantly closer to that achieved with BF16 precision. Designed for efficient implementation on NVIDIA Blackwell GPUs, 4/6 improves downstream accuracy and integrates easily into existing post-training quantization methods. Experiments with transformer and hybrid model architectures show that scaling a block to a maximum value of 4 can reduce its mean squared quantization error to 0, compared with 4.33 under standard NVFP4 quantization, making 4/6 a lightweight way to improve numerical accuracy while delivering both speed gains and natively quantized models.
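
The 0 versus 4.33 figures refer to an example in the paper that is not reproduced here, but a made-up block illustrates the same effect: if every value in a block is exactly representable once the maximum is mapped to 4, the error vanishes, whereas the standard scale-to-6 choice pushes a near-maximal value into the coarse gap between the FP4 points 4 and 6.

```python
# Hypothetical block (not the paper's example); every value lies on the
# FP4 grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}, and the block maximum is 4.
block = [1.0, 2.0, 3.0, 4.0]

# Candidate A: map the maximum to 4 -> scale = 1.0.
# Scaled values [1, 2, 3, 4] are all exactly representable, so the error is 0.
mse_scale_4 = 0.0

# Candidate B: map the maximum to 6 -> scale = 4/6.
# Scaled values are [1.5, 3.0, 4.5, 6.0]; 4.5 falls between the FP4 points
# 4 and 6, rounds to 4, and dequantizes to 4 * (4/6) ≈ 2.667 instead of 3.
recon_scale_6 = [1.0, 2.0, 4 * (4 / 6), 4.0]
mse_scale_6 = sum((a - b) ** 2 for a, b in zip(block, recon_scale_6)) / len(block)

print(mse_scale_4, round(mse_scale_6, 4))  # 0.0 vs roughly 0.0278 -> choose 4
```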

Selective Scaling Boosts Low-Precision Accuracy

Researchers have developed a method called Four Over Six (4/6) to improve the accuracy of low-precision numerical formats, specifically NVFP4, which are increasingly used to accelerate computations due to their speed and memory efficiency. Current NVFP4 quantization methods can lead to performance degradation because they uniformly scale all data blocks, potentially losing accuracy in representing near-maximal values. The team discovered that scaling some blocks to a smaller range, while maintaining a larger range for others, can significantly improve representation of these critical values. Experiments revealed that near-maximal values within data blocks are primarily responsible for performance loss during quantization, as these values are often inaccurately represented by standard FP4 formats.
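
The spacing of the FP4 grid makes this failure mode easy to see. The snippet below simply lists the positive E2M1 magnitudes, the grid generally attributed to FP4 elements in NVFP4, and the gaps between neighbours; the widest gap, from 4 to 6, is exactly where near-maximal values land after standard scaling.

```python
# Positive magnitudes representable by FP4 (E2M1).
fp4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# Spacing between neighbouring representable values: it grows from 0.5 near
# zero to 2.0 at the top of the range, so values scaled to sit between 4 and 6
# (i.e. near-maximal values) incur the largest rounding error.
gaps = [b - a for a, b in zip(fp4, fp4[1:])]
print(gaps)  # [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 2.0]
```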

By adaptively scaling blocks, using a scale of 6 for some and 4 for others, the team could better preserve the accuracy of these near-maximal values. Testing 4/6 on Llama and Qwen language models with the WikiText-2 and C4 datasets showed that simply scaling all blocks to 4 degrades performance relative to standard NVFP4 quantization. However, selecting the scale per block, either 4 or 6, based on mean squared quantization error consistently delivered better results than either uniform scheme across models and datasets, demonstrating the effectiveness of the adaptive approach.
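
As a rough, self-contained illustration of the comparison described here, the sketch below quantizes a random matrix under three policies: every 1×16 block scaled to 6 (standard NVFP4), every block scaled to 4, and a per-block choice driven by mean squared error. The data is synthetic Gaussian noise, so the printed numbers only demonstrate the selection rule, not the paper's perplexity results; the block shape and helper names are assumptions.

```python
import numpy as np

FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-FP4[::-1], FP4])

def fp4_round(x):
    """Round every element to the nearest representable FP4 value."""
    return GRID[np.abs(x[..., None] - GRID).argmin(axis=-1)]

def quantize_blocks(w, target):
    """Quantize 1x16 blocks, mapping each block's max magnitude to `target`."""
    blocks = w.reshape(-1, 16)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / target
    scales[scales == 0] = 1.0
    return fp4_round(blocks / scales) * scales

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
blocks = w.reshape(-1, 16)

q6 = quantize_blocks(w, 6.0)  # uniform scale-to-6 (standard NVFP4)
q4 = quantize_blocks(w, 4.0)  # uniform scale-to-4
err6 = ((blocks - q6) ** 2).mean(axis=1)
err4 = ((blocks - q4) ** 2).mean(axis=1)

# Adaptive 4/6: per block, keep whichever candidate has the lower MSE.
adaptive = np.where((err4 < err6)[:, None], q4, q6)

print("MSE, all blocks scaled to 6:", err6.mean())
print("MSE, all blocks scaled to 4:", err4.mean())
print("MSE, adaptive 4/6 selection:", ((blocks - adaptive) ** 2).mean())
```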

The team reports word perplexity values of 35.09 and 20.48 on the WikiText-2 dataset, and 66.32 and 37. on the C4 dataset. Evaluating multiple scale factors for each block of values during quantization enhances accuracy, particularly for near-maximal values, which are prone to error in low-precision formats.

Results demonstrate that Four Over Six mitigates divergence during pre-training across various model architectures and sizes, bringing training loss closer to that achieved with higher-precision formats. Furthermore, the method integrates effectively with existing post-training quantization techniques, consistently improving performance across a range of tasks. This work paves the way for more efficient training and deployment of large models using low-precision numerical formats, offering a promising path toward reduced computational costs and broader accessibility in artificial intelligence.

👉 More information
🗞 Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
🧠 ArXiv: https://arxiv.org/abs/2512.02010

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
