GPU Domain Decomposition Accelerates Solution of Sparse Linear Systems

Solving sparse linear systems, a common task in scientific computing, is often limited by the irregular way the solver must access memory during the solution process. Atharva Gondhalekar, Kjetil Haugen, Thomas Gibson, and Wu-chun Feng address this challenge by adapting triangular solves, a key component of these calculations, to the parallel architecture of modern GPUs. Their research introduces a strategy for breaking the problem into many small, independent parts, allowing each to be processed simultaneously while making efficient use of the GPU’s on-chip memory. This fine-grained domain decomposition sharply reduces traffic to slower main memory, yielding substantial speedups on recent GPU hardware: over ten times faster for the triangular solves themselves and roughly threefold for a widely used iterative solver.

Researchers have developed a new approach to solving large systems of linear equations, a common task in scientific computing, achieving substantial speedups on modern graphics processing units (GPUs). The core innovation lies in a method of dividing the problem into smaller, independent subdomains, allowing for significantly increased parallelism and overcoming bottlenecks associated with irregular memory access on GPUs.

GPU Acceleration of Sparse Triangular Solves

This research focuses on accelerating the solution of sparse triangular systems of linear equations, crucial for applications like finite element analysis and computational fluid dynamics. The team leverages GPU parallel processing with techniques such as domain decomposition, dividing the problem into smaller subproblems solved concurrently, and investigates multilevel methods, parallel algorithms, and data layout optimisations to enhance performance. Key techniques explored include the BiCGSTAB method and its variations for solving nonsymmetric linear systems, recycling Krylov subspaces to accelerate subsequent solves, and strategies for parallelising graph-based computations. The research also considers hierarchical matrices to reduce computational cost by approximating dense matrices, aiming for scalability and high performance on multi-GPU systems.
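To make the solver pipeline concrete, the sketch below shows an incomplete-LU-preconditioned BiCGSTAB solve in Python with SciPy. It is a minimal CPU illustration of the kind of solver the paper accelerates, not the authors’ GPU code: the test matrix, SciPy’s spilu (used here only as a stand-in for the ILU0 factorisation), and the solver defaults are illustrative assumptions.

```python
# Minimal CPU sketch (not the authors' GPU code): an incomplete-LU-preconditioned
# BiCGSTAB solve with SciPy, illustrating the solver pipeline the paper targets.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Small nonsymmetric sparse test matrix (convection-diffusion-like 5-point stencil).
n = 100                                   # grid is n x n, so 10,000 unknowns
A = sp.diags([-1.0, -1.2, 4.5, -0.8, -1.0],
             [-n, -1, 0, 1, n], shape=(n * n, n * n), format="csr")
b = np.ones(n * n)

# Incomplete LU factorisation used as the preconditioner; spilu stands in here
# for the ILU(0) factorisation used in the paper.
ilu = spla.spilu(A.tocsc(), drop_tol=1e-5, fill_factor=1.0)
M = spla.LinearOperator(A.shape, matvec=ilu.solve)

# Preconditioned BiCGSTAB: every iteration applies A and, inside ilu.solve,
# two sparse triangular solves (L then U) -- the kernels the paper maps to GPUs.
x, info = spla.bicgstab(A, b, M=M)
print("converged" if info == 0 else f"info = {info}",
      "| residual =", np.linalg.norm(b - A @ x))
```

In this pipeline the two triangular solves inside each preconditioner application are the irregular, hard-to-parallelise step, which is why the paper concentrates on mapping them efficiently to the GPU.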

GPU Speedups via Domain Decomposition

The team’s approach assigns each subdomain to a thread block on the GPU, ensuring all necessary data fits within fast shared memory and thereby minimising slow accesses to global memory. While this introduces a slight increase in computation, the gains from improved memory access and parallelism far outweigh the cost, and the scheme is further refined by constructing a directed acyclic graph (DAG) of dependencies for each subdomain so its data can be processed even more efficiently. Compared to existing implementations, the new method achieves a 10.7-fold speedup in solving the triangular systems and a 3.2-fold speedup for the complete iterative solver. On a challenging petroleum reservoir simulation problem with over 31 million unknowns, the runtime fell from nearly 50 milliseconds to just over 4 milliseconds, demonstrating effective scaling across multiple GPUs and suggesting potential for even greater performance gains. By isolating the performance of the subdomain-based method, the researchers showed that the core innovation delivers substantial benefits without being limited by communication overhead.
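The snippet below sketches the decomposition idea in Python with SciPy: the rows of a sparse lower-triangular factor are partitioned into contiguous subdomains, and each subdomain’s small triangular system is solved on its own block, the block data being what would sit in a thread block’s shared memory on the GPU. Treating the subdomains as fully independent (dropping couplings between them) is one plausible reading, consistent with the extra solver iterations reported below; the subdomain size and this treatment of the couplings are assumptions, and the paper gives the actual kernel design.

```python
# Hedged sketch of fine-grained decomposition of a lower-triangular solve:
# partition the rows of L into contiguous subdomains and solve each diagonal
# block independently, dropping couplings to other subdomains. On the GPU each
# block would map to one thread block with its vector data in shared memory;
# here the structure is emulated on the CPU with SciPy.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve_triangular

def subdomain_lower_solve(L, b, rows_per_subdomain=64):
    """Approximate solve of L x = b, one independent diagonal block at a time."""
    n = L.shape[0]
    x = np.zeros(n)
    L = L.tocsr()
    for start in range(0, n, rows_per_subdomain):
        end = min(start + rows_per_subdomain, n)
        L_blk = L[start:end, start:end]   # the data that would live in shared memory
        x[start:end] = spsolve_triangular(L_blk, b[start:end], lower=True)
    return x

# Small demo: a unit-lower-triangular factor with one subdiagonal.
n = 256
L = sp.eye(n, format="csr") + sp.diags([-0.5], [-1], shape=(n, n), format="csr")
b = np.ones(n)
x_exact = spsolve_triangular(L.tocsr(), b, lower=True)
x_approx = subdomain_lower_solve(L, b, rows_per_subdomain=64)
print("relative error of the decoupled solve:",
      np.linalg.norm(x_exact - x_approx) / np.linalg.norm(x_exact))
```

In the GPU setting, the work inside each block would itself be ordered by the per-subdomain dependency DAG so that rows with no mutual dependencies are processed by threads in parallel.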

GPU Speedups with Fine-Grained Decomposition

Researchers have developed a new approach to solving sparse linear systems, focusing on optimisations for the GPU architecture. By employing a fine-grained domain decomposition strategy, they achieved significant speedups in triangular solves and, consequently, in the ILU0-preconditioned biconjugate gradient stabilised (BiCGSTAB) solver. The method divides the problem into smaller subdomains, each processed independently by GPU compute units, with each subdomain’s vector data sized to fit entirely in shared memory for faster access. The results demonstrate a 10.7-fold speedup for triangular solves and a 3.2-fold speedup for the ILU0-preconditioned BiCGSTAB solver on a single AMD Instinct MI210 GPU, with near-linear scaling on up to eight MI210 GPUs for sufficiently large problems. While the method requires 1.6 times more iterations to converge than the baseline, the overall performance gains still represent a substantial improvement. Future research directions include reordering matrices to improve data locality, extending the triangular solve kernels to handle non-uniform subdomain sizes, and evaluating the optimisations within two-stage preconditioners, along with plans to apply the decomposition strategy to other preconditioners and to explore a combined CPU-GPU implementation.
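As a back-of-the-envelope illustration of the “fit in shared memory” constraint, the sketch below picks a subdomain size from a shared-memory budget. The 64 KiB per-workgroup budget, the three-vectors-per-subdomain accounting, and the power-of-two rounding are assumptions for illustration only, not figures taken from the paper.

```python
# Hedged sketch: choose a subdomain size so that the per-subdomain vector data
# fits in one thread block's shared memory. All constants below are assumed.
LDS_BYTES = 64 * 1024          # shared-memory budget per thread block (assumed)
BYTES_PER_VALUE = 8            # double precision
VECTORS_PER_SUBDOMAIN = 3      # e.g. rhs, solution, scratch (assumed)

def rows_per_subdomain(lds_bytes=LDS_BYTES,
                       bytes_per_value=BYTES_PER_VALUE,
                       vectors=VECTORS_PER_SUBDOMAIN):
    """Largest power-of-two row count whose vector data fits in shared memory."""
    max_rows = lds_bytes // (bytes_per_value * vectors)
    rows = 1
    while rows * 2 <= max_rows:
        rows *= 2
    return rows

n_unknowns = 31_000_000        # order of the reservoir problem mentioned above
rows = rows_per_subdomain()
print(f"rows per subdomain: {rows}")
print(f"subdomains (thread blocks) for {n_unknowns:,} unknowns: "
      f"{-(-n_unknowns // rows):,}")
```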

👉 More information
🗞 Mapping Sparse Triangular Solves to GPUs via Fine-grained Domain Decomposition
🧠 ArXiv: https://arxiv.org/abs/2508.04917

Quantum News

As the Official Quantum Dog (or hound), my role is to dig out the latest nuggets of quantum goodness. There is so much happening right now in the field of technology, whether AI or the march of robots. But Quantum occupies a special space. Quite literally a special space. A Hilbert space, in fact, haha! Here I try to provide some of the news that might be considered breaking news in the Quantum Computing space.
