Chinese Remainder Theorem Emulation Achieves 4.4x and 6.5x Speedups for Matrix Multiplication on Low-Precision Hardware

The increasing demand for computational power drives innovation in how computers perform fundamental operations such as matrix multiplication, and researchers continually seek ways to accelerate these processes. Yuki Uchino, Qianxiang Ma, and Toshiyuki Imamura from the RIKEN Center for Computational Science, and Katsuhisa Ozaki from the Shibaura Institute of Technology, alongside Patrick Lars Gutsche from Ecole Normale Superieure de Lyon, have developed new methods to efficiently perform complex matrix multiplication using low-precision hardware. Building on previous work, the team proposes a high-performance emulation technique based on the Chinese Remainder Theorem, which allows single- and double-precision complex calculations to be performed on INT8 matrix engines. Results demonstrate significant speedups of up to 5.6x and 6.5x over standard routines on modern GPUs, and importantly, the approach offers a flexible trade-off between speed and accuracy, potentially establishing it as a versatile algorithm for a broad spectrum of applications.

Modern computing architectures increasingly feature low-precision matrix multiplication units that deliver substantially higher throughput than their high-precision counterparts. Recognizing this trend, scientists have focused on emulating high-precision matrix multiplication using low-precision hardware, a pursuit gaining significant momentum within the high-performance computing community. Building upon the Ozaki-II scheme, researchers are developing innovative techniques to accelerate computations.

Ozaki Scheme Extension for Tensor Cores

This research details a method for achieving high-performance, accurate double-precision matrix multiplication (DGEMM) using reduced-precision hardware. The approach leverages integer arithmetic and tensor cores, specialized hardware units designed to accelerate matrix operations. Scientists extended the Ozaki scheme, a technique for emulating floating-point matrix multiplication using integer modular arithmetic, to make effective use of modern hardware. The core challenge lies in maintaining accuracy while exploiting the performance benefits of reduced precision. The scheme works by representing floating-point values as integers modulo carefully chosen moduli, performing the matrix multiplication in integer arithmetic so that no rounding errors occur, and recovering the exact result via the Chinese Remainder Theorem.
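
As a rough illustration of this residue-and-reconstruction idea, the following minimal NumPy sketch (a hypothetical example, not the authors' implementation; the `crt_matmul` helper, the chosen primes, and the matrix sizes are all illustrative assumptions) multiplies integer matrices exactly by computing the product modulo several small primes and recombining the per-prime results with the Chinese Remainder Theorem:

```python
import numpy as np
from math import prod

def crt_matmul(A_int, B_int, primes):
    """Exact integer matrix product via residue arithmetic + CRT (illustrative only)."""
    M = prod(primes)  # combined modulus; must exceed twice the largest |entry of A @ B|
    residues = []
    for p in primes:
        # Each per-prime product only involves residues smaller than p,
        # mimicking a low-precision matrix engine with wider accumulation.
        Ap = (A_int % p).astype(np.int64)
        Bp = (B_int % p).astype(np.int64)
        residues.append((Ap @ Bp) % p)

    # Chinese Remainder Theorem: combine the residue products so that
    # C ≡ C_p (mod p) holds simultaneously for every prime p.
    C = np.zeros(residues[0].shape, dtype=object)
    for p, Cp in zip(primes, residues):
        Mp = M // p
        inv = pow(Mp % p, -1, p)              # modular inverse of M/p modulo p
        C = (C + Cp.astype(object) * (Mp * inv)) % M

    # Map results back to a range centered on zero so negative entries are recovered.
    C = np.where(C > M // 2, C - M, C)
    return C.astype(np.int64)

# Tiny usage example: the emulated product matches the exact integer product.
rng = np.random.default_rng(0)
A = rng.integers(-100, 100, size=(4, 4))
B = rng.integers(-100, 100, size=(4, 4))
primes = [251, 241, 239, 233]                 # product ≈ 3.4e9 comfortably bounds |A @ B|
assert np.array_equal(crt_matmul(A, B, primes), A @ B)
```

Because each per-prime product involves only small residues, it is exactly the kind of work a low-precision integer matrix engine can perform, while the reconstruction step recovers the full-precision result.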

Researchers adapted this scheme to efficiently utilize tensor cores, refining it for modern hardware, controlling the errors introduced during integer modular arithmetic, and optimizing it for various platforms including NVIDIA GPUs, AMD GPUs, and ARM processors. The key findings demonstrate that the extended Ozaki scheme can achieve DGEMM with guaranteed accuracy even when using reduced-precision hardware. The scheme significantly improves performance compared to traditional floating-point DGEMM, particularly on hardware with limited native high-precision floating-point support, and is adaptable to various hardware platforms. The research also investigates reduced-precision formats such as INT8 and FP8, their impact on accuracy and performance, and integration with existing linear algebra libraries like cuBLAS and rocBLAS.

Efficient Low-Precision Complex Matrix Multiplication Achieved

Scientists have developed a novel method for efficiently emulating high-precision complex and real matrix multiplication using low-precision hardware, specifically INT8 matrix engines. Building upon the Ozaki-II scheme, this work introduces techniques that significantly accelerate computations while maintaining, and even improving, accuracy. The team achieved substantial speedups, ranging from 4.0x to 6.5x, over standard cuBLAS routines for single- and double-precision complex matrix multiplication on a B200 GPU, for sufficiently large problem sizes.

These gains represent a considerable advancement in computational efficiency for demanding matrix operations. The core of this breakthrough lies in a refined implementation of the Ozaki-II scheme, which converts high-precision matrices into integer representations, performs multiplication, and then converts the result back to floating-point format. Researchers optimized this process by carefully managing the scaling and conversion steps, ensuring minimal loss of precision. A key innovation involves a symmetric modulo operation that allows for more effective use of the available precision, enhancing the numerical robustness of the reconstruction step and leading to more accurate results. Experiments reveal that the proposed methods not only surpass vendor-optimized libraries like cuBLAS and hipBLAS, but also outperform the cuBLAS Ozaki-I-based ZGEMM emulation. Performance models closely align with measured results across various configurations, confirming the reliability and predictability of the approach, and demonstrating particular effectiveness for large matrices.
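
To see why a symmetric modulo is attractive on signed integer hardware, consider the small hypothetical sketch below (the `symmetric_mod` helper and the modulus 255 are illustrative assumptions, not the paper's parameters): mapping residues into a range centered on zero roughly halves their magnitude relative to the standard range [0, p), so they fit a signed INT8 word.

```python
import numpy as np

def symmetric_mod(x, p):
    """Map integers to residues centered around zero (roughly [-p/2, p/2])."""
    r = np.mod(x, p)                     # standard residues in [0, p)
    return np.where(r > p // 2, r - p, r)

# With p = 255, symmetric residues lie in [-127, 127] and fit signed INT8,
# whereas standard residues in [0, 254] would overflow that range.
x = np.array([-300, -1, 0, 130, 254, 300])
print(symmetric_mod(x, 255))             # [ -45   -1    0 -125   -1   45]
```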

Low-Precision GPU Matrix Multiplication Speedups

This study presents new methods for emulating complex matrix multiplication using low-precision arithmetic on modern GPUs. Building upon the Ozaki-II scheme, researchers developed techniques that achieve significant speedups, between 4.0x and 6.5x, compared to standard complex matrix multiplication routines available in cuBLAS, particularly for large problem sizes. The approach leverages INT8 matrix engines, benefiting from the efficiency of low-precision operations without requiring complex exponent handling.
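
For context, one standard way to reduce a complex matrix product to real ones, so that a real-valued low-precision emulation can be reused for the real sub-products, is the three-multiplication (Karatsuba-style) decomposition sketched below; this is a generic NumPy illustration under that assumption, not necessarily the decomposition the authors use.

```python
import numpy as np

def complex_gemm_3m(A, B):
    """Complex matrix product built from three real matrix products (3M trick).

    Any exact real-GEMM emulation (for example a CRT-based INT8 scheme)
    could be substituted for the three real multiplications below.
    """
    Ar, Ai = A.real, A.imag
    Br, Bi = B.real, B.imag
    P1 = Ar @ Br                          # real x real
    P2 = Ai @ Bi                          # imag x imag
    P3 = (Ar + Ai) @ (Br + Bi)            # combined product
    return (P1 - P2) + 1j * (P3 - P1 - P2)

# Usage: the decomposition reproduces the ordinary complex product.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
assert np.allclose(complex_gemm_3m(A, B), A @ B)
```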

The developed methods not only demonstrate superior performance but also offer a trade-off between speed and accuracy, allowing operation at higher speeds when lower precision is acceptable or delivering increased accuracy with a modest increase in computation time. This flexibility suggests potential for widespread adoption as a default algorithm across diverse applications. A portable library supporting both AMD and NVIDIA GPUs was also created, demonstrating the versatility of the approach. Researchers acknowledge that a primary limitation of emulation-based methods is the substantial working memory they require, a challenge that remains open both for emulation techniques and for high-performance computing applications more broadly. Future work will compare these methods with the BF16x9 CGEMM emulation algorithm available on newer NVIDIA GPUs and extend support to hardware architectures with reduced INT8 capabilities; reducing memory overhead is a further important direction for ongoing research.

👉 More information
🗞 Emulation of Complex Matrix Multiplication based on the Chinese Remainder Theorem
🧠 ArXiv: https://arxiv.org/abs/2512.08321

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
