Matrix multiplication forms the bedrock of many demanding computations in fields like artificial intelligence and high-performance computing, and optimising this operation is crucial for continued progress. Qiao Zhang from Saint Louis University, Rabab Alomairy from both Massachusetts Institute of Technology and King Abdullah University of Science and Technology, and Dali Wang from Oak Ridge National Laboratory, alongside colleagues, present a new approach to matrix multiplication that takes full advantage of modern computer hardware. Their research introduces a flexible system capable of performing calculations with varying levels of precision, adapting to the strengths of different processors and significantly improving both speed and energy efficiency. By carefully balancing workloads across diverse architectures, including leading systems such as Fugaku, the NVIDIA A100 DGX, and Frontier, this work demonstrates a pathway towards more powerful and sustainable computing for a wide range of applications.
Artificial intelligence (AI) increasingly demands high-performance computing, and the emergence of hardware optimised for low-precision arithmetic necessitates a re-evaluation of numerical algorithms. This research introduces an adaptive mixed-precision GEMM framework that supports different precision formats at a fine-grained tile level, offering a flexible approach to numerical computation. The core idea is to combine lower-precision arithmetic, such as FP16, with higher precision, such as FP32, to accelerate calculations without significantly sacrificing accuracy. To implement and evaluate this approach, the team leverages PaRSEC, a task-based runtime system that balances workloads across diverse architectures and provides the flexibility needed to manage calculations executing at varying precision levels. Performance is optimised through a carefully selected mixed-precision strategy, efficient task scheduling, and minimised data movement and redistribution. The PaRSEC framework allows the implementation to scale to extreme-scale computing environments and adapt to different hardware architectures, and the research also considers exploiting data sparsity within the matrices to further improve performance. This work demonstrates the potential of mixed-precision computing, combined with a sophisticated runtime system, to significantly accelerate GEMM operations, with implications for scientific computing, machine learning, and data analytics.
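To make the tile-centric idea concrete, here is a minimal NumPy sketch of a mixed-precision tiled GEMM. This is our own simplification, not the authors' PaRSEC-based implementation: the function name, the tile size, and the magnitude-based per-tile precision rule are all illustrative assumptions. Each tile-by-tile partial product runs in either FP16 or FP32, and all partial products are accumulated in FP32.

```python
import numpy as np

def mixed_precision_gemm(A, B, tile=128, fp16_threshold=1.0):
    """Illustrative tile-centric mixed-precision GEMM (a sketch, not the
    paper's implementation). A tile product is computed in FP16 when both
    input tiles are small in magnitude (a hypothetical selection rule),
    and in FP32 otherwise; accumulation always happens in FP32."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                a = A[i:i + tile, p:p + tile]
                b = B[p:p + tile, j:j + tile]
                if max(np.abs(a).max(), np.abs(b).max()) < fp16_threshold:
                    # Low-magnitude tiles: cheap FP16 multiply, FP32 result.
                    part = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)
                else:
                    # Otherwise keep this tile product in FP32.
                    part = a.astype(np.float32) @ b.astype(np.float32)
                C[i:i + tile, j:j + tile] += part
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((512, 512)), rng.standard_normal((512, 512))
C = mixed_precision_gemm(A, B)
rel_err = np.linalg.norm(C - A @ B) / np.linalg.norm(A @ B)
print(f"relative error vs. FP64 reference: {rel_err:.2e}")
```

In the actual framework, the per-tile precision decisions and the tile products are expressed as tasks scheduled by PaRSEC across CPUs and GPUs, rather than executed by a serial triple loop as above.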
Dynamic Precision Scaling Boosts Matrix Multiplication
General Matrix Multiplication (GEMM) is a fundamental operation in many high-performance computing and artificial intelligence applications, and researchers have developed a new framework to significantly improve its efficiency. This framework intelligently adapts the precision used for calculations at a very fine level, that of individual tiles within the larger matrix, allowing for a balance between speed and accuracy. By leveraging hardware increasingly optimised for lower-precision arithmetic, the framework achieves substantial gains in performance and energy efficiency. Experiments demonstrate that increasing the proportion of single-precision calculations generally leads to improved performance, effectively doubling speed on some architectures, as the simple model below illustrates.
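The "doubling" effect can be sanity-checked with an Amdahl-style cost model. This model is our own back-of-envelope assumption, not a formula from the paper: it supposes a fraction of tile products run in single precision at roughly twice the double-precision rate, with the remainder at the double-precision rate.

```python
def modelled_speedup(fp32_fraction, fp32_rate=2.0):
    """Amdahl-style estimate (an illustrative assumption, not from the
    paper): a fraction `fp32_fraction` of tile products runs at
    `fp32_rate` times FP64 throughput, the rest at 1x FP64 throughput."""
    f = fp32_fraction
    return 1.0 / ((1.0 - f) + f / fp32_rate)

for f in (0.0, 0.5, 0.9, 1.0):
    print(f"FP32 fraction {f:.1f}: ~{modelled_speedup(f):.2f}x speedup")
```

Under these assumptions the speedup approaches 2x only as nearly all tiles move to single precision, which is consistent with the observation that performance improves as the proportion of single-precision work grows.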
This is achieved through careful workload balancing managed by the PaRSEC runtime system, which efficiently distributes tasks across different processing units and manages the complexities introduced by varying precision levels. Performance gains are substantial across a range of hardware, including ARM-based supercomputers like Fugaku and GPU-accelerated systems like Guyot and Frontier. On Fugaku, the framework achieves 242.2 Tflop/s on 64 nodes with 94.6% parallel efficiency when using primarily single precision, starting from a baseline of 4.0 Tflop/s on a single node. Similarly, on Frontier, performance scales from 181.3 Tflop/s on a single node to an impressive 11,310.3 Tflop/s across 64 nodes, maintaining 97.5% parallel efficiency. Notably, the framework exhibits near-linear scalability, meaning that adding more processing units consistently increases performance. These results demonstrate the significant potential of adaptive mixed-precision techniques to unlock optimal computational efficiency in diverse computing environments, paving the way for faster and more energy-efficient applications in fields like scientific simulation and machine learning.
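The quoted efficiency figures follow directly from the throughput numbers above: parallel efficiency is the achieved multi-node throughput divided by the ideal of the single-node throughput times the node count.

```python
def parallel_efficiency(multi_node_tflops, nodes, single_node_tflops):
    """Parallel efficiency = achieved throughput / ideal linear scaling."""
    return multi_node_tflops / (nodes * single_node_tflops)

# Figures quoted in the article:
print(f"Frontier: {parallel_efficiency(11310.3, 64, 181.3):.1%}")  # ~97.5%
print(f"Fugaku:   {parallel_efficiency(242.2, 64, 4.0):.1%}")      # ~94.6%
```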
Adaptive Precision for Efficient Matrix Multiplication
This research introduces a novel adaptive mixed-precision framework for General Matrix Multiplication (GEMM), a fundamental operation in high-performance computing and artificial intelligence. The method dynamically adjusts precision levels at a fine-grained tile level within the computation, improving computational efficiency by exploiting the benefits of low-precision arithmetic without compromising the accuracy of results. By utilising the PaRSEC runtime system, the framework effectively balances workloads across different computer architectures, demonstrating good scalability on systems featuring ARM CPUs, GPUs, and advanced supercomputers. The team’s approach builds upon existing work in mixed-precision computation and task-based runtime systems, offering a distinct contribution through its tile-centric adaptation of precision. Results show the framework’s ability to leverage hardware optimised for low-precision calculations, potentially leading to significant reductions in computation time and energy consumption.
👉 More information
🗞 Leveraging Hardware-Aware Computation in Mixed-Precision Matrix Multiply: A Tile-Centric Approach
🧠 ArXiv: https://arxiv.org/abs/2508.14848
