Symmetric linear solves underpin crucial calculations in fields as diverse as climate modelling, engineering, and machine learning, often relying on computationally intensive routines such as Cholesky decomposition. Vicki Carrica, Rabab Alomairy, and Evelyne Ringoot, all from the Massachusetts Institute of Technology, alongside Alan Edelman, have developed a novel, portable mixed-precision solver designed to accelerate these calculations on Matrix Processing Units (MXUs). Their research introduces an algorithm that utilises a hierarchical recursive approach, exposing parallelism within Cholesky decomposition to maximise computational throughput while controlling numerical precision. By strategically applying low-precision arithmetic to off-diagonal blocks and maintaining high precision on diagonal blocks, the team achieves significant speedups on both NVIDIA H200 and AMD MI300X GPUs: up to a 27x speedup in symmetric rank-k updates and a 5x overall speedup for Cholesky decomposition compared to existing methods, while delivering far better accuracy than a pure low-precision implementation.
Summary of the Research Paper: “Accelerating Linear Solve with Mixed Precision Nested Recursive Subdivision on AI Hardware”.
This paper details a novel approach to accelerating the solution of linear systems using a combination of mixed-precision arithmetic, nested recursive subdivision, and optimisation for modern AI hardware (GPUs). The core idea is to leverage the strengths of different precision levels and a recursive algorithm to improve performance and reduce the memory footprint.
The algorithm takes a hierarchical recursive approach to Cholesky decomposition, exposing parallelism to maximise computational throughput while keeping numerical precision under explicit control. The methodological innovation is a custom recursive data structure that strategically assigns low-precision FP16 arithmetic to large off-diagonal blocks while maintaining high precision on diagonal blocks, ensuring numerical stability throughout the computation. This hierarchical recursion also increases the granularity of GEMM operations, optimising hardware utilisation on GPUs and enabling fine-grained control over numerical precision.
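To make the data-structure idea concrete, here is a minimal Python/NumPy sketch of such a recursive tile; the names, the cutoff, and the exact layout are illustrative assumptions rather than the authors' implementation. Diagonal sub-blocks are kept in FP64 and recursively subdivided, while each off-diagonal block is stored in FP16.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class RecursiveTile:
    """Hypothetical recursive tile: high-precision diagonal children, FP16 off-diagonal block."""
    diag_a: "RecursiveTile | np.ndarray"   # top-left block (kept in FP64)
    diag_b: "RecursiveTile | np.ndarray"   # bottom-right block (kept in FP64)
    off_diag: np.ndarray                   # bottom-left block, demoted to FP16

def build_tile(A: np.ndarray, cutoff: int = 256):
    """Recursively split a symmetric matrix, demoting off-diagonal blocks to FP16."""
    n = A.shape[0]
    if n <= cutoff:
        return A.astype(np.float64)        # leaf diagonal block stays in high precision
    h = n // 2
    return RecursiveTile(
        diag_a=build_tile(A[:h, :h], cutoff),
        diag_b=build_tile(A[h:, h:], cutoff),
        off_diag=A[h:, :h].astype(np.float16),
    )
```

Confining FP16 to the off-diagonal blocks means low precision only ever feeds the large GEMM-shaped updates, where matrix units are fastest, while the numerically sensitive diagonal factorisations stay in high precision.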
Mixed-Precision Cholesky Decomposition on MXUs
Symmetric linear solves are crucial across numerous scientific and engineering disciplines, including climate modelling and structural analysis. The work centres on Cholesky decomposition, utilising triangular solves (TRSM) and symmetric rank-k updates (SYRK) as core components.
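For readers less familiar with the blocked formulation, the sketch below shows where TRSM and SYRK arise inside a right-looking blocked Cholesky factorisation. It is a plain NumPy/SciPy illustration in uniform FP64 with an arbitrary block size, not the paper's GPU implementation.

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_cholesky(A: np.ndarray, nb: int = 256) -> np.ndarray:
    """Factor a symmetric positive-definite A into L (lower) with A = L @ L.T."""
    n = A.shape[0]
    L = A.copy()
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # POTRF: factor the current diagonal block.
        L[k:e, k:e] = np.linalg.cholesky(L[k:e, k:e])
        if e < n:
            # TRSM: solve the panel below the diagonal block against L_kk^T.
            L[e:, k:e] = solve_triangular(L[k:e, k:e], L[e:, k:e].T, lower=True).T
            # SYRK: update the trailing submatrix with the freshly computed panel.
            L[e:, e:] -= L[e:, k:e] @ L[e:, k:e].T
    return np.tril(L)
```

For large matrices the TRSM panel solve and, above all, the SYRK trailing update dominate the arithmetic, which is why the paper targets exactly these kernels for recursive subdivision and mixed precision.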
The team implemented a nested recursive formulation of Cholesky decomposition and its constituent operations, triangular solves and symmetric rank-k updates, exposing parallelism through recursive decomposition of the TRSM and SYRK sub-problems to maximise computational throughput and allow fine-grained control of numerical precision. A custom recursive data structure assigns low-precision FP16 arithmetic to large off-diagonal blocks while keeping diagonal blocks in high precision to guarantee numerical stability. Mixed-precision computation delivers up to a 27x speedup in SYRK and 5.3x in TRSM over full-precision baselines, culminating in a 5.32x overall speedup for Cholesky decomposition versus cuSOLVER FP64, while achieving 100x better accuracy than a pure FP16 implementation and retaining 88% of its peak speedup.
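The sketches below convey the flavour of that recursive, mixed-precision formulation for the two constituent operations. They are illustrative NumPy approximations under assumed cutoffs and casting rules, not the authors' GPU code: on matrix units the FP16 products would be accumulated in higher precision by the hardware, whereas here each product is simply computed in FP16 and cast back before being applied.

```python
import numpy as np

def syrk_recursive(C: np.ndarray, A: np.ndarray, cutoff: int = 512) -> None:
    """Update the lower triangle of C in place: C := C - A @ A.T."""
    n = C.shape[0]
    if n <= cutoff:
        C -= np.tril(A @ A.T)              # base case: small symmetric update in full precision
        return
    h = n // 2
    # The two diagonal sub-problems recurse, keeping the same precision policy.
    syrk_recursive(C[:h, :h], A[:h, :], cutoff)
    syrk_recursive(C[h:, h:], A[h:, :], cutoff)
    # The off-diagonal block is a plain GEMM: demote its inputs to FP16,
    # then cast the product back to C's precision before subtracting.
    C[h:, :h] -= (A[h:, :].astype(np.float16) @
                  A[:h, :].astype(np.float16).T).astype(C.dtype)
```

A recursive TRSM follows the same pattern: splitting the triangular factor exposes a large GEMM between the two recursive solves, and that GEMM is where the low precision is applied.

```python
import numpy as np
from scipy.linalg import solve_triangular

def trsm_recursive(L: np.ndarray, B: np.ndarray, cutoff: int = 512) -> np.ndarray:
    """Return X solving X @ L.T == B for a lower-triangular L (the Cholesky panel solve)."""
    n = L.shape[0]
    if n <= cutoff:
        # Base case: an ordinary triangular solve in full precision.
        return solve_triangular(L, B.T, lower=True).T
    h = n // 2
    X1 = trsm_recursive(L[:h, :h], B[:, :h], cutoff)
    # GEMM update in FP16, folded back into the high-precision right-hand side.
    B2 = B[:, h:] - (X1.astype(np.float16) @
                     L[h:, :h].astype(np.float16).T).astype(B.dtype)
    X2 = trsm_recursive(L[h:, h:], B2, cutoff)
    return np.hstack([X1, X2])
```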
A key contribution is the custom data structure enabling this selective application of low-precision FP16 arithmetic to off-diagonal blocks while keeping diagonal blocks in high precision to preserve numerical stability. It underpins the headline results: up to a 27x speedup in symmetric rank-k updates and roughly a 5x speedup in Cholesky decomposition over full-precision baselines, with substantial gains over vendor-supplied libraries such as cuBLAS and cuSOLVER, and 100x better accuracy than a pure FP16 implementation while retaining 88% of the peak speedup. The authors acknowledge that their detailed performance analysis focused primarily on NVIDIA and AMD GPU architectures; future work could extend the solver to other hardware platforms and investigate further optimisations within the recursive framework.
👉 More information
🗞 Hierarchical Precision and Recursion for Accelerating Symmetric Linear Solves on MXUs
🧠 ArXiv: https://arxiv.org/abs/2601.08082
