The increasing demand for computational power in artificial intelligence drives innovation in numerical precision, and a team led by Angelika Schwarz, Anton Anders, and Cole Brower from NVIDIA Corporation addresses this challenge with a new approach to matrix multiplication. Their work focuses on leveraging the efficiency of low-precision hardware, such as Tensor Cores, while guaranteeing the accuracy traditionally associated with double-precision calculations. The researchers developed Automatic Dynamic Precision (ADP), a system that intelligently manages numerical decomposition to emulate double-precision results, and importantly, does so without requiring communication between the host CPU and the GPU. Validated through rigorous testing, ADP consistently maintains high fidelity and achieves significant speedups, up to 2.3x on Blackwell GB200 hardware, demonstrating that low-precision accelerators can deliver both performance and reliability for demanding scientific workloads.
Low Precision Matrix Multiplication for HPC
Modern high-performance computing relies heavily on matrix multiplication, but achieving both peak performance and accuracy with limited-precision arithmetic, such as INT8 or FP8, presents a significant challenge. Researchers are therefore developing methods that preserve accuracy while exploiting lower-precision arithmetic, which is crucial for fully utilizing modern hardware such as GPUs with Tensor Cores. This work is essential for a wide range of scientific applications, including climate modeling, computational fluid dynamics, and machine learning. A central technique is the Ozaki decomposition, which transforms floating-point matrix multiplication into a sequence of integer operations.
This allows the bulk of the computation to be performed in faster integer arithmetic, with corrections applied to maintain accuracy. Leveraging integer arithmetic, such as INT8 and INT4, is a core strategy, as integer operations are generally faster and more energy-efficient on many hardware platforms. Precision refinement techniques, including iterative refinement and mixed-precision arithmetic, further improve the accuracy of lower-precision calculations. Researchers are also utilizing Tensor Cores, specialized matrix multiplication units found in modern GPUs, to achieve high throughput. New data formats, like FP8, are designed to improve the efficiency of deep learning computations.
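To make the Ozaki-style idea above concrete, here is a minimal NumPy sketch: each FP64 operand is scaled by a power of two per row or column, broken into signed 8-bit slices of roughly six mantissa bits each, the pairwise slice products are computed exactly in integer arithmetic (the job integer Tensor Cores perform in hardware), and the scaled partial products are summed in FP64. The function names, the slice width, and the term-skipping rule are illustrative choices, not the paper's implementation.

```python
import numpy as np

def slice_int8(M, num_slices, axis):
    """Split an FP64 matrix into signed int8 slices (Ozaki-style).

    A power-of-two scale is shared per row (axis=1, left operand) or per
    column (axis=0, right operand) so that each slice captures the next
    ~6 mantissa bits:  M ~= scale * sum_k slices[k] / 64**(k+1).
    """
    amax = np.max(np.abs(M), axis=axis, keepdims=True)
    amax = np.where(amax == 0.0, 1.0, amax)
    scale = 2.0 ** np.ceil(np.log2(amax))        # |M / scale| <= 1
    R = M / scale
    slices = []
    for _ in range(num_slices):
        S = np.rint(R * 64.0)                    # next chunk, |S| <= 64 fits int8
        slices.append(S.astype(np.int8))
        R = (R - S / 64.0) * 64.0                # residual feeds the next slice
    return slices, scale

def ozaki_gemm(A, B, num_slices=8):
    """Emulate C = A @ B by accumulating exact int8 x int8 -> int32 products.

    Every pairwise slice product is exact in integer arithmetic; FP64
    rounding only occurs when the scaled partial products are summed.
    """
    SA, sa = slice_int8(A, num_slices, axis=1)   # row-wise scaling of A
    SB, sb = slice_int8(B, num_slices, axis=0)   # column-wise scaling of B
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(num_slices):
        for j in range(num_slices):
            if i + j >= num_slices:              # low-order cross terms are negligible
                continue
            P = SA[i].astype(np.int32) @ SB[j].astype(np.int32)
            C += P * (1.0 / 64.0) ** (i + j + 2)
    return C * sa * sb                           # undo the row/column scales

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((128, 256))
    B = rng.standard_normal((256, 96))
    err = np.abs(ozaki_gemm(A, B) - A @ B).max()
    print(f"max abs deviation from FP64 GEMM: {err:.2e}")
```

Fewer slices mean fewer integer GEMM calls but a larger truncation error; choosing that slice count automatically is precisely the role of the estimator discussed later in this piece.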
Taken together, these approaches achieve speeds comparable to or better than traditional floating-point matrix multiplication while maintaining high accuracy, demonstrating the viability of lower-precision arithmetic for matrix multiplication when appropriate correction techniques are employed. The design of numerical algorithms should be closely aligned with the characteristics of the underlying hardware, and continued optimization is crucial. The NVIDIA team's work addresses a critical gap in this landscape: GPUs increasingly favor low-precision formats like FP16 and FP8, yet maintaining double-precision accuracy remains challenging. The core of their innovation is the Exponent Span Capacity, a hardware-independent method for conservatively estimating the necessary decomposition parameters, or slices, required to achieve FP64-level accuracy. To optimize computational efficiency, the researchers also introduced an unsigned integer slicing scheme that improves upon existing Ozaki-style decompositions.
Traditional methods store each slice as a signed 8-bit integer, wasting capacity by redundantly storing sign bits and requiring more slices than necessary. Instead, the team encoded the sign only in the leading slice, representing subsequent slices as unsigned 8-bit integers, reducing the slice count and minimizing computational waste (a sketch of the encoding follows below). This approach leverages the native support for mixed signed-unsigned arithmetic in NVIDIA's integer Tensor Cores, enabling a more efficient use of the hardware for high-precision emulation. The team also developed the Exponent Span Capacity estimator, which determines the number of slices needed to reach the desired accuracy during matrix multiplication.
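The following sketch illustrates the mixed-sign encoding described above, using a two's-complement-style fixed-point split: the leading byte is a signed int8 that carries the sign, and every trailing byte is an unsigned uint8 contributing a full 8 bits. In a GEMM built on these slices, leading-by-leading products map to s8×s8, leading-by-trailing to s8×u8, and trailing-by-trailing to u8×u8, matching the mixed signed-unsigned modes of the integer Tensor Cores. The helper names and the truncation-based rounding are assumptions for illustration; the paper's exact encoding may differ.

```python
import numpy as np

def encode_mixed_sign(R, num_slices):
    """Encode scaled values |R| < 1 as one signed leading byte plus
    unsigned trailing bytes (a two's-complement fixed-point split).

    Every trailing slice carries a full 8 bits instead of sacrificing
    one bit to a redundant sign, so fewer slices cover the mantissa.
    """
    frac_bits = 8 * num_slices - 1                  # fixed-point fraction width
    fixed = np.trunc(R * 2.0 ** frac_bits).astype(np.int64)
    lead = (fixed >> (8 * (num_slices - 1))).astype(np.int8)   # signed, carries the sign
    trail = [((fixed >> (8 * (num_slices - 1 - k))) & 0xFF).astype(np.uint8)
             for k in range(1, num_slices)]                    # unsigned magnitude bytes
    return lead, trail

def decode_mixed_sign(lead, trail, num_slices):
    """Rebuild the fixed-point value: the lead byte is weighted as signed,
    trailing bytes as unsigned, just as mixed s8/u8 products would weight them."""
    acc = lead.astype(np.float64) * 2.0 ** (8 * (num_slices - 1))
    for k, byte in enumerate(trail, start=1):
        acc += byte.astype(np.float64) * 2.0 ** (8 * (num_slices - 1 - k))
    return acc / 2.0 ** (8 * num_slices - 1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    R = rng.uniform(-1.0, 1.0, size=1000) * 0.999   # scaled operand, |R| < 1
    lead, trail = encode_mixed_sign(R, num_slices=4)
    back = decode_mixed_sign(lead, trail, num_slices=4)
    print(f"max round-trip error: {np.abs(back - R).max():.2e}")   # ~2**-31 for 4 slices
```

Compared with all-signed slices, each trailing slice here contributes one extra mantissa bit, which is where the reduction in slice count comes from.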
Validated against recently proposed BLAS grading tests, the approach consistently preserves FP64 fidelity while incurring less than 10% runtime overhead, achieving speedups of up to 2.3x and 13.2x on Blackwell GB200 and RTX Pro 6000 Blackwell Server Edition GPUs, respectively. The work addresses a growing disparity between the hardware shift towards low-precision formats, such as FP16, FP8, and block-scaled FP4, and the continued need for FP64 accuracy in high-performance scientific computing. The unsigned integer slicing scheme improves the efficiency of Ozaki-style decompositions by maximizing mantissa utilization and reducing computational waste. At the core of the Automatic Dynamic Precision (ADP) framework is the Exponent Span Capacity (ESC), a novel estimator that conservatively determines the number of slices required to achieve FP64 accuracy based on the characteristics of the input data.
This eliminates the need for users to manually specify decomposition parameters, ensuring both safety and efficiency. The ADP framework integrates ESC with exception handling and runtime heuristics, seamlessly falling back to native FP64 when necessary, guaranteeing accuracy. Rigorous validation using recently introduced BLAS grading tests demonstrates that ADP consistently preserves FP64 fidelity even with challenging inputs, incurring less than 10% runtime overhead. On NVIDIA Blackwell GB200 GPUs, the approach achieves up to a 2.3x speedup over native FP64 GEMM, while on RTX Pro 6000 Blackwell Server Edition GPUs, it delivers an even more substantial 13.2x performance improvement. These results demonstrate that low-precision accelerators, when combined with ADP, can provide a practical and production-ready foundation for high-fidelity and high-performance scientific computing workloads.
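A rough picture of that dispatch logic: inspect the operands, derive a slice count from their exponent spread, and fall back to native FP64 GEMM whenever the inputs contain non-finite values or the estimated slice count would make emulation unprofitable. The `exponent_span` helper and the slice-count rule below are hypothetical stand-ins for the ESC estimator, shown only to convey the decision structure, and `ozaki_gemm` refers to the earlier sketch; the real ADP makes these decisions without any CPU-GPU communication, whereas this host-side sketch only mirrors the control flow.

```python
import numpy as np

def exponent_span(M):
    """Spread of binary exponents among the nonzero entries of M
    (a crude stand-in for the statistics an ESC-style estimator inspects)."""
    nz = np.abs(M[M != 0.0])
    if nz.size == 0:
        return 0
    e = np.frexp(nz)[1]                              # binary exponents
    return int(e.max() - e.min())

def adp_gemm(A, B, max_slices=12, bits_per_slice=6):
    """Dispatch sketch: choose a slice count from the inputs, or fall back
    to native FP64 when emulation would be unsafe or unprofitable.

    The rule below (cover the 53-bit FP64 mantissa plus the exponent
    spread, at ~6 bits per slice as in the earlier sketch) is a
    hypothetical heuristic; the paper's ESC estimator is more refined.
    """
    if not (np.all(np.isfinite(A)) and np.all(np.isfinite(B))):
        return A @ B                                 # exception path: Inf/NaN -> native FP64
    span = max(exponent_span(A), exponent_span(B))
    needed_bits = 53 + span
    num_slices = -(-needed_bits // bits_per_slice)   # ceiling division
    if num_slices > max_slices:
        return A @ B                                 # too many slices to beat native FP64
    return ozaki_gemm(A, B, num_slices=num_slices)   # emulated path (sketch above)
```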
👉 More information
🗞 Guaranteed DGEMM Accuracy While Using Reduced Precision Tensor Cores Through Extensions of the Ozaki Scheme
🧠 ArXiv: https://arxiv.org/abs/2511.13778
