The increasing demand for computational power drives innovation in how computers perform fundamental operations such as matrix multiplication, and researchers continually seek ways to accelerate these processes. Yuki Uchino, Qianxiang Ma, and Toshiyuki Imamura from the RIKEN Center for Computational Science, Katsuhisa Ozaki from the Shibaura Institute of Technology, and Patrick Lars Gutsche from the École Normale Supérieure de Lyon have developed new methods to efficiently perform complex matrix multiplication on low-precision hardware. Building on previous work, the team proposes a high-performance emulation technique based on the Chinese Remainder Theorem, which allows single- and double-precision complex calculations to be performed on INT8 matrix engines. Results demonstrate significant speedups, up to 5.6x and 6.5x faster than standard routines on modern GPUs, and importantly, the approach offers a flexible trade-off between speed and accuracy, potentially establishing it as a versatile algorithm for a broad spectrum of applications.
Modern computing architectures increasingly feature low-precision matrix multiplication units that deliver substantially higher throughput than their high-precision counterparts. Recognizing this trend, scientists have focused on emulating high-precision matrix multiplication using low-precision hardware, a pursuit gaining significant momentum within the high-performance computing community. Building upon the Ozaki-II scheme, researchers are developing innovative techniques to accelerate computations.
Ozaki Scheme Extension for Tensor Cores
This research details a method for achieving high-performance, accurate double-precision matrix multiplication (DGEMM) using reduced-precision hardware. The approach leverages integer arithmetic and tensor cores, specialized hardware units designed to accelerate matrix operations. Scientists extended the Ozaki scheme, a technique for emulating floating-point matrix multiplication using integer modular arithmetic, to effectively utilize modern hardware. The core challenge lies in maintaining accuracy while exploiting the performance benefits of reduced precision. The Ozaki scheme functions by representing floating-point numbers as integers modulo carefully chosen primes, then performing the matrix multiplication in integer arithmetic, which is free of rounding error.
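The modular-arithmetic idea can be sketched in a few lines of NumPy: multiply integer matrices modulo several coprime primes, then recover the exact product via the Chinese Remainder Theorem. The moduli and function name below are illustrative choices for this sketch, not the authors' implementation.

```python
import numpy as np

def crt_matmul(A, B, moduli=(257, 263, 269)):
    """Exact integer matmul reconstructed via the Chinese Remainder Theorem.

    Each residue product uses only small-magnitude integers (the kind a
    low-precision matrix engine could handle); CRT combines the residues
    into the exact product. Illustrative sketch, not the paper's code.
    """
    M = 1
    for p in moduli:
        M *= p                              # combined modulus
    C = np.zeros((A.shape[0], B.shape[1]), dtype=object)
    for p in moduli:
        Mi = M // p
        yi = pow(Mi, -1, p)                 # modular inverse of Mi mod p
        Cp = ((A % p) @ (B % p)) % p        # residue product, small ints only
        C = (C + Cp.astype(object) * Mi * yi) % M
    # Map from [0, M) back to the signed range [-M/2, M/2)
    return np.where(C >= M // 2, C - M, C)
```

As long as every entry of the true product lies within the combined modulus range, the reconstruction is exact, with no rounding at any step.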
Researchers adapted this scheme to efficiently utilize tensor cores, refining it for modern hardware, controlling errors introduced during integer modular arithmetic, and optimizing it for various platforms including NVIDIA GPUs, AMD GPUs, and ARM processors. The key findings demonstrate that the extended Ozaki scheme can achieve DGEMM with guaranteed accuracy, even when using reduced-precision hardware. This scheme significantly improves performance compared to traditional floating-point DGEMM, particularly on hardware with limited floating-point precision, and is adaptable to various hardware platforms. The research also investigates reduced-precision formats, such as INT8 and FP8, and their impact on accuracy and performance, integrating with existing linear algebra libraries like cuBLAS and rocBLAS.
Efficient Low-Precision Complex Matrix Multiplication Achieved
Scientists have developed a novel method for efficiently emulating high-precision complex and real matrix multiplication using low-precision hardware, specifically INT8 matrix engines. Building upon the Ozaki-II scheme, this work introduces techniques that significantly accelerate computations while maintaining, and even improving, accuracy. The team achieved substantial speedups, ranging from 4.0x to 6.5x, over standard cuBLAS routines for single- and double-precision complex matrix multiplication on a B200 GPU, for sufficiently large problem sizes.
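A complex matrix product reduces to real matrix products, which is what lets a real-valued emulation scheme serve complex GEMM as well. The sketch below uses the standard four-multiplication identity; the paper may use a different decomposition, and the function name is ours.

```python
import numpy as np

def complex_matmul_via_real(A, B):
    """Compute a complex matmul from four real matmuls.

    A real-GEMM emulation backend (such as an integer-based scheme) can
    then be applied to each of the four real products. Standard identity,
    shown for illustration only.
    """
    Ar, Ai = A.real, A.imag
    Br, Bi = B.real, B.imag
    # (Ar + i*Ai)(Br + i*Bi) = (Ar@Br - Ai@Bi) + i*(Ar@Bi + Ai@Br)
    return (Ar @ Br - Ai @ Bi) + 1j * (Ar @ Bi + Ai @ Br)
```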
These gains represent a considerable advancement in computational efficiency for demanding matrix operations. The core of this breakthrough lies in a refined implementation of the Ozaki-II scheme, which converts high-precision matrices into integer representations, performs multiplication, and then converts the result back to floating-point format. Researchers optimized this process by carefully managing the scaling and conversion steps, ensuring minimal loss of precision. A key innovation involves a symmetric modulo operation that allows for more effective use of the available precision, enhancing the numerical robustness of the reconstruction step and leading to more accurate results. Experiments reveal that the proposed methods not only surpass vendor-optimized libraries like cuBLAS and hipBLAS, but also outperform the cuBLAS Ozaki-I-based ZGEMM emulation. Performance models closely align with measured results across various configurations, confirming the reliability and predictability of the approach, and demonstrating particular effectiveness for large matrices.
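The symmetric modulo idea can be illustrated directly: instead of the standard residues in [0, p), values are mapped to [-(p-1)/2, (p-1)/2], halving the worst-case magnitude so that products of residues occupy far less of the accumulator's dynamic range. The function name below is ours, not the paper's API.

```python
import numpy as np

def symmetric_mod(x, p):
    """Map values to the symmetric residue system [-(p-1)//2, (p-1)//2].

    Same congruence class as the standard residue, but with at most half
    the magnitude, leaving more headroom during reconstruction.
    Illustrative sketch only.
    """
    r = np.asarray(x) % p              # standard residue in [0, p)
    return np.where(r > p // 2, r - p, r)
```

For example, with p = 251 the value 200 becomes -51: the same residue class modulo 251, but with much smaller magnitude.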
Low-Precision GPU Matrix Multiplication Speedups
This study presents new methods for emulating complex matrix multiplication using low-precision arithmetic on modern GPUs. Building upon the Ozaki-II scheme, researchers developed techniques that achieve significant speedups, between 4.0x and 6.5x, compared to standard complex matrix multiplication routines available in cuBLAS, particularly for large problem sizes. The approach leverages INT8 matrix engines, benefiting from the efficiency of low-precision operations without requiring complex exponent handling.
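The way wide integers map onto an INT8 engine can be sketched as a slice decomposition: each entry is split into small base-2^7 digits, digit matrices are multiplied pairwise (these are the products an INT8 tensor core would compute with INT32 accumulation), and the results are recombined with power-of-two weights. Plain int64 arithmetic stands in for the hardware here; the two-digit split and function name are this sketch's assumptions, not the paper's code.

```python
import numpy as np

def int8_slice_matmul(A, B, bits=7):
    """Emulate a wide integer matmul from INT8-sized slice products.

    Assumes |entries| < 2**(2*bits) so two base-2**bits digits suffice.
    On real hardware each digit matrix would be stored as INT8 and each
    slice product accumulated in INT32. Illustrative sketch only.
    """
    base = 1 << bits

    def split(X):
        lo = X % base                  # low digit in [0, base)
        hi = (X - lo) // base          # high digit (signed)
        return [hi, lo]                # weights base**1, base**0

    A_sl, B_sl = split(A), split(B)
    C = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for i, Ai in enumerate(A_sl):
        for j, Bj in enumerate(B_sl):
            # Weight each slice-pair product by its combined digit position
            C += base ** ((1 - i) + (1 - j)) * (Ai @ Bj)
    return C
```

With s slices per operand, s^2 slice products are needed, which is why these schemes trade a controllable number of low-precision products for accuracy.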
The developed methods not only demonstrate superior performance but also offer a trade-off between speed and accuracy: they can run faster when lower precision is acceptable, or deliver increased accuracy at a modest cost in computation time. This flexibility suggests potential for widespread adoption as a default algorithm across diverse applications. A portable library supporting both AMD and NVIDIA GPUs was also created, demonstrating the versatility of the approach. Researchers acknowledge that a primary limitation of emulation-based methods is the substantial working memory they require, a challenge that remains open for both emulation techniques and high-performance computing applications more broadly. Future work will compare these methods with the BF16x9 CGEMM emulation algorithm available on newer NVIDIA GPUs and extend support to hardware architectures with reduced INT8 capabilities; reducing memory overhead is another important direction for ongoing research.
👉 More information
🗞 Emulation of Complex Matrix Multiplication based on the Chinese Remainder Theorem
🧠 ArXiv: https://arxiv.org/abs/2512.08321
