The increasing diversity of high-performance computing hardware, coupled with the speed gains offered by modern GPUs when using lower precision calculations, drives the need for algorithms that perform well across different systems. Sreeram Venkat from the Oden Institute at The University of Texas at Austin, Kasia Świrydowicz and Noah Wolfe from Advanced Micro Devices, Inc., and colleagues address this challenge by developing a performance-portable and mixed-precision framework for solving problems involving block-triangular Toeplitz matrices. Their work focuses on FFTMatvec, an application that efficiently computes matrix-vector products, and enables it to run seamlessly on a range of GPUs without code changes. By integrating performance optimizations into the open-source rocBLAS library and employing a dynamic mixed-precision approach, the team achieves excellent performance and scalability, demonstrating the application’s ability to run on over two thousand GPUs on a leading supercomputer.
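The FFT trick behind such matrix-vector products can be sketched in a few lines. A minimal illustration (assuming numpy and, for simplicity, scalar rather than block entries — not the authors' implementation): a lower-triangular Toeplitz matrix embedded in a circulant of twice the size turns the matvec into elementwise products in frequency space, reducing the cost from O(n²) to O(n log n).

```python
import numpy as np

def toeplitz_matvec_fft(c, x):
    """Multiply the lower-triangular Toeplitz matrix whose first
    column is c by the vector x, via circulant embedding and the FFT."""
    n = len(c)
    # Embed the triangular Toeplitz matrix in a 2n x 2n circulant
    # whose first column is [c, zeros(n)]; circular convolution of
    # the padded vectors then equals the linear convolution we need.
    c_pad = np.concatenate([c, np.zeros(n)])
    x_pad = np.concatenate([x, np.zeros(n)])
    y = np.fft.ifft(np.fft.fft(c_pad) * np.fft.fft(x_pad))
    return y[:n].real

# Dense reference for checking: T[i, j] = c[i - j] for i >= j.
rng = np.random.default_rng(0)
n = 64
c, x = rng.standard_normal(n), rng.standard_normal(n)
T = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1):
        T[i, j] = c[i - j]
print(np.allclose(toeplitz_matvec_fft(c, x), T @ x))  # True
```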
AMD GPU Linear Algebra Performance Evaluation
This document details a comprehensive evaluation of computational artifacts for scientific software, focusing on fft_matvec and rocBLAS. The evaluation provides independent verification of results presented in a related scientific publication, outlining the full process for reproducing the findings: hardware and software prerequisites, building the software, running tests, and analyzing the resulting data. The fft_matvec evaluation centers on mixed-precision performance optimization, demonstrating performance gains across varying precision settings and problem sizes.
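A precision/size sweep of the kind such an evaluation performs can be mimicked on a CPU. A toy sketch (assuming numpy; the `bench` helper and the chosen sizes are illustrative, not the artifact's actual harness):

```python
import time
import numpy as np

def bench(dtype, n, reps=5):
    """Average the wall time of an n x n matvec in the given
    precision (a toy analogue of a precision/size sweep)."""
    A = np.ones((n, n), dtype=dtype)
    x = np.ones(n, dtype=dtype)
    t0 = time.perf_counter()
    for _ in range(reps):
        A @ x
    return (time.perf_counter() - t0) / reps

# Sweep two precisions over three problem sizes.
results = {
    (str(np.dtype(dt)), n): bench(dt, n)
    for dt in (np.float32, np.float64)
    for n in (256, 512, 1024)
}
for (prec, n), t in results.items():
    print(f"{prec:8s} n={n:5d}  {t * 1e6:8.1f} us")
```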
The rocBLAS evaluation verifies an optimization to the sgemv function, using specific commit references and a YAML configuration file to keep the benchmark tests consistent and reproducible. The document's clearly explained steps and fully specified parameters provide the clarity needed for independent verification, and the pinned commits and configuration file simplify the process, reflecting a strong commitment to transparency in artifact evaluation.
HIP and Dynamic Mixed-Precision for Portability
Researchers addressed the challenge of running complex scientific simulations on diverse high-performance computing hardware by developing a framework centered around the HIP programming model and dynamic mixed-precision algorithms. Recognizing the increasing diversity of hardware and the performance benefits of lower-precision computations, they sought a method to maintain a single codebase while achieving portability and speed, leveraging HIP to translate existing CUDA code. A key innovation lies in applying dynamic mixed-precision techniques to the FFTMatvec application, intelligently adjusting precision to accelerate computations while maintaining desired accuracy. This involved Pareto front analysis to determine the optimal balance between precision and performance, adapting to different hardware and simulation requirements, and building upon existing mixed-precision methods like iterative refinement.
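The Pareto front selection can be illustrated with a small sketch (the configuration names, runtimes, and errors below are hypothetical, not the authors' measurements): keep only those precision settings that no other setting beats in both runtime and error.

```python
def pareto_front(configs):
    """Return the configurations not dominated in (time, error):
    a config survives if no other config is at least as fast AND
    at least as accurate, with one of the two strictly better."""
    front = []
    for c in configs:
        dominated = any(
            o["time"] <= c["time"] and o["error"] <= c["error"]
            and (o["time"] < c["time"] or o["error"] < c["error"])
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front

# Hypothetical precision settings with runtime (s) and relative error.
configs = [
    {"name": "fp64",       "time": 1.00, "error": 1e-15},
    {"name": "fp32",       "time": 0.55, "error": 1e-7},
    {"name": "fp16-accum", "time": 0.40, "error": 1e-3},
    {"name": "fp32-slow",  "time": 0.90, "error": 1e-7},  # dominated by fp32
]
print([c["name"] for c in pareto_front(configs)])
# → ['fp64', 'fp32', 'fp16-accum']
```

A solver can then pick the cheapest member of the front whose error stays within the simulation's tolerance.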
The framework integrates seamlessly with the rocBLAS library, incorporating performance optimizations directly into the underlying mathematical routines without application code changes. The team demonstrated scalability by running FFTMatvec on up to 2,048 GPUs on the OLCF Frontier supercomputer, showcasing its ability to handle extremely large and complex simulations. This combination of HIP-based portability and dynamic mixed-precision optimization represents a powerful new approach to scientific computing.
Dynamic Precision Enables Cross-GPU Computing
Researchers developed a new approach to enhance the performance of scientific computing applications on modern GPUs, addressing both speed and hardware diversity. Recognizing the benefits of lower-precision calculations, they adapted an existing scientific application to exploit this capability without sacrificing accuracy, dynamically adjusting precision to balance speed against reliable results. The method enabled the application, originally written for NVIDIA hardware, to run seamlessly on AMD GPUs with excellent performance and without extensive code refactoring; because the HIP framework mirrors the CUDA paradigm, existing codebases can be readily adapted to AMD hardware. The results demonstrate substantial performance gains from mixed-precision computing, which uses lower precision for intermediate calculations and higher precision for final results. The framework was scaled to 2,048 GPUs on a leading supercomputer, demonstrating its potential for tackling extremely demanding scientific problems and offering a practical solution for maximizing performance on diverse high-performance computing hardware.
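One way to picture dynamic precision adjustment is a cheap accuracy check with a double-precision fallback. A toy sketch (assuming numpy; the single-row spot check is an illustrative heuristic, not the paper's method):

```python
import numpy as np

def matvec_dynamic(A, x, tol=1e-6):
    """Illustrative dynamic-precision matvec: attempt the product in
    single precision, estimate the error from one row recomputed in
    double precision, and fall back to full fp64 if it exceeds tol."""
    y32 = (A.astype(np.float32) @ x.astype(np.float32)).astype(np.float64)
    # Spot-check one row in full precision to estimate the error.
    i = len(x) // 2
    ref = A[i] @ x
    if abs(y32[i] - ref) <= tol * max(1.0, abs(ref)):
        return y32   # low precision was accurate enough
    return A @ x     # fall back to double precision

rng = np.random.default_rng(1)
A = rng.standard_normal((128, 128))
x = rng.standard_normal(128)
y = matvec_dynamic(A, x)
print(np.linalg.norm(y - A @ x) / np.linalg.norm(A @ x) < 1e-5)  # True
```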
Dynamic Precision FFTMatvec on AMD GPUs
This work presents a performance-portable, dynamically mixed-precision framework applied to FFTMatvec, a high-performance computing application that efficiently computes matrix-vector products. Researchers adapted FFTMatvec, originally written for NVIDIA CUDA GPUs, to run seamlessly on AMD GPUs using the hipify source translation tools, without changes to the application’s core code, integrating optimizations directly into the open-source rocBLAS library. The team also demonstrated a dynamic mixed-precision strategy, determining optimal configurations through Pareto front analysis to balance computational speed against acceptable error tolerances. Benchmarking on AMD Instinct MI250X, MI300X, and MI355X GPUs, and scaling to 2,048 GPUs on the Frontier supercomputer, revealed a speedup of approximately 30% at 640 GPUs compared to double-precision calculations.
👉 More information
🗞 Mixed-Precision Performance Portability of FFT-Based GPU-Accelerated Algorithms for Block-Triangular Toeplitz Matrices
🧠 ArXiv: https://arxiv.org/abs/2508.10202
