Solving sparse linear systems, a common task in scientific computing, is often limited less by arithmetic than by how computers access data during the solution process. Atharva Gondhalekar, Kjetil Haugen, Thomas Gibson, and Wu-chun Feng address this challenge by adapting triangular solves, a key component of these calculations, to the parallel architecture of modern GPUs. Their research introduces a novel strategy for breaking the problem into many small, independent parts, each processed simultaneously while making efficient use of the GPU's on-chip memory. This fine-grained domain decomposition sharply reduces accesses to slower main memory, yielding substantial speedups on the latest GPU hardware: over ten times faster for the triangular solves themselves and roughly three times faster for a widely used iterative solver.
Researchers have developed a new approach to solving large sparse systems of linear equations on modern graphics processing units (GPUs), achieving substantial speedups. The core innovation is a method of dividing the problem into smaller, independent subdomains, which greatly increases parallelism and overcomes the bottlenecks caused by irregular memory access on GPUs.
GPU Acceleration of Sparse Triangular Solves
This research focuses on accelerating the solution of sparse triangular systems of linear equations, which arise in applications such as finite element analysis and computational fluid dynamics. The team maps the solve onto the GPU's parallel processing units using domain decomposition, dividing the problem into smaller subproblems that are solved concurrently, and investigates multilevel methods, parallel algorithms, and data-layout optimisations to improve performance. Related techniques surveyed include the BiCGSTAB method and its variants for solving nonsymmetric linear systems, recycling Krylov subspaces to accelerate subsequent solves, and strategies for parallelising graph-based computations. The work also considers hierarchical matrices, which reduce computational cost by approximating dense blocks, with the goal of scalability and high performance on multi-GPU systems.
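To make the core operation concrete, the sketch below solves a sparse lower-triangular system Lx = b by forward substitution on a CSR matrix. It is a minimal CPU reference of the operation being accelerated, not the paper's code; the convention that each row stores its diagonal entry last is an assumption for illustration.

```cpp
#include <vector>
#include <cassert>

// Solve L x = b by forward substitution, where L is sparse lower-triangular
// in CSR format with the diagonal stored as the last entry of each row.
// This sequential loop is the operation the paper parallelises: each x[i]
// depends on previously computed entries, which is what makes the solve
// hard to map onto thousands of GPU threads.
std::vector<double> sptrsv_csr(int n,
                               const std::vector<int>&    row_ptr,
                               const std::vector<int>&    col_idx,
                               const std::vector<double>& val,
                               const std::vector<double>& b) {
    std::vector<double> x(n);
    for (int i = 0; i < n; ++i) {
        double sum = b[i];
        int diag = row_ptr[i + 1] - 1;            // diagonal kept last in row
        for (int k = row_ptr[i]; k < diag; ++k)
            sum -= val[k] * x[col_idx[k]];        // uses already-solved x[j]
        assert(val[diag] != 0.0);
        x[i] = sum / val[diag];
    }
    return x;
}
```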
GPU Speedups via Domain Decomposition
The team’s approach assigns each subdomain to a thread block on the GPU and ensures that all the data a subdomain needs fits within fast shared memory, minimising slow accesses to global memory. This introduces some redundant computation, but the gains from improved memory access and parallelism far outweigh the cost. The method is further refined by constructing a directed acyclic graph (DAG) of dependencies within each subdomain, so that independent rows can be processed together. Compared to existing implementations, the new method achieves a 10.7-fold speedup for the triangular solves and a 3.2-fold speedup for the complete iterative solver. On a challenging petroleum reservoir simulation problem with over 31 million unknowns, the runtime dropped from nearly 50 milliseconds to just over 4 milliseconds, and the method scaled effectively across multiple GPUs, suggesting potential for even greater gains. By isolating the performance of the subdomain-based method, the researchers showed that the core innovation delivers substantial benefits without being limited by communication overhead.
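The sketch below illustrates this mapping under stated assumptions: one thread block per subdomain, the subdomain's slice of the solution staged in shared memory, and a host-built level schedule (derived from the DAG) that groups independent rows. All names (subdomain_sptrsv, lvl_ptr, lvl_rows, and so on) are hypothetical, and the kernel assumes the matrix is reordered so each subdomain's unknowns are contiguous and couple only within the subdomain; this is a sketch of the idea, not the authors' implementation. It is written in CUDA-style syntax; since the paper targets AMD GPUs, the HIP equivalent would be near-identical.

```cuda
// One thread block per subdomain. The subdomain's slice of x is staged in
// shared memory, and rows are processed level by level following a
// per-subdomain dependency schedule computed on the host. Assumes every
// subdomain's schedule is padded to the same number of levels (empty
// levels are allowed).
__global__ void subdomain_sptrsv(const int*    __restrict__ row_ptr,
                                 const int*    __restrict__ col_idx,
                                 const double* __restrict__ val,
                                 const double* __restrict__ b,
                                 double*       __restrict__ x,
                                 const int* sub_start,  // first row of each subdomain
                                 const int* lvl_ptr,    // per-subdomain level offsets
                                 const int* lvl_rows,   // rows ordered by level
                                 int levels_per_sub)
{
    extern __shared__ double xs[];               // subdomain slice of x
    const int sub  = blockIdx.x;
    const int base = sub_start[sub];
    const int rows = sub_start[sub + 1] - base;

    for (int r = threadIdx.x; r < rows; r += blockDim.x)
        xs[r] = 0.0;                             // initialise staging buffer
    __syncthreads();

    const int lbase = sub * (levels_per_sub + 1);
    for (int lvl = 0; lvl < levels_per_sub; ++lvl) {
        // Rows within a level are mutually independent: solve in parallel.
        for (int i = lvl_ptr[lbase + lvl] + threadIdx.x;
             i < lvl_ptr[lbase + lvl + 1]; i += blockDim.x) {
            const int row  = lvl_rows[i];        // global row index
            const int diag = row_ptr[row + 1] - 1;
            double sum = b[row];
            for (int k = row_ptr[row]; k < diag; ++k)
                sum -= val[k] * xs[col_idx[k] - base]; // in-subdomain column
            xs[row - base] = sum / val[diag];
        }
        __syncthreads();                         // level barrier within block
    }

    for (int r = threadIdx.x; r < rows; r += blockDim.x)
        x[base + r] = xs[r];                     // write solved slice back
}
```

A launch would pass the largest subdomain size as dynamic shared memory, e.g. `subdomain_sptrsv<<<num_subdomains, 256, max_rows * sizeof(double)>>>(...)`; because each level barrier is a block-local `__syncthreads()` rather than a kernel-wide synchronisation, subdomains proceed entirely independently.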
GPU Speedups with Fine-Grained Decomposition
Researchers have developed a new approach to solving sparse linear systems, focused on optimisations for the GPU architecture. By employing a fine-grained domain-decomposition strategy, they achieve significant speedups in triangular solves and, consequently, in the ILU0-preconditioned biconjugate gradient stabilised (BiCGSTAB) solver. The method divides the problem into smaller subdomains, each processed independently by a GPU compute unit, with each subdomain's vector data sized to fit entirely in shared memory for faster access. The results demonstrate a 10.7-fold speedup for the triangular solves and a 3.2-fold speedup for the BiCGSTAB solver on a single AMD Instinct MI210 GPU, with near-linear scaling on up to eight MI210 GPUs for sufficiently large problems. Although the method requires about 1.6 times more iterations to converge than the baseline, the overall performance gain remains substantial. Future directions include reordering matrices to improve data locality, extending the triangular-solve kernels to handle non-uniform subdomain sizes, evaluating the optimisations within two-stage preconditioners, applying the decomposition strategy to other preconditioners, and exploring a combined CPU-GPU implementation.
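As a rough illustration of sizing subdomains so that their vector data fits in shared memory, the helper below estimates how many double-precision unknowns fit in a 64 KB local data share (the per-workgroup LDS capacity of the MI210). The headroom fraction is an assumption for illustration, not a figure from the paper; a real implementation would query the device at runtime.

```cpp
#include <cstddef>

// Estimate rows per subdomain such that the subdomain's slice of the
// solution vector fits in shared memory, leaving some headroom for any
// per-block scratch the kernel needs (an assumed 1/8 reserve).
int rows_per_subdomain(std::size_t lds_bytes = 64 * 1024) {
    std::size_t usable = lds_bytes - lds_bytes / 8;     // 56 KB usable
    return static_cast<int>(usable / sizeof(double));   // ~7168 unknowns
}
```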
👉 More information
🗞 Mapping Sparse Triangular Solves to GPUs via Fine-grained Domain Decomposition
🧠 ArXiv: https://arxiv.org/abs/2508.04917
