Inexact Gauss-Seidel and Coarse Solvers for AMG Preserve Scalability to 64 GPUs with 700 Million Unknowns

Efficiently solving large scientific problems relies on robust iterative methods, and researchers continually seek ways to improve their performance at ever-larger scales. Stephen Thomas from Lehigh University and Pasqua D’Ambra present a new approach to these methods, focusing on the solution of the dense Gram systems that arise within communication-avoiding Krylov techniques. Their work demonstrates that a Forward Gauss-Seidel iteration not only matches the accuracy of Modified Gram-Schmidt orthogonalisation, but also scales effectively on modern GPU architectures, maintaining performance on problems exceeding 700 million unknowns. Significantly, the technique extends to Algebraic MultiGrid methods, eliminating the need for computationally expensive dense coarse-operator assembly and factorisation, a substantial advance for high-performance computing.

For weak scaling on AMD MI-series GPUs, the team demonstrates that 20 to 30 Forward Gauss-Seidel (FGS) iterations preserve scalability up to 64 GPUs with problem sizes exceeding 700 million unknowns. They further extend this approach to Algebraic MultiGrid (AMG) coarse-grid solves, removing the need to assemble or factor dense coarse operators.

Chebyshev Polynomials Accelerate Krylov Subspace Methods

The research team developed a novel approach to solving dense Gram systems, a critical bottleneck in communication-avoiding Krylov methods used for large-scale scientific simulations. Recognizing that global synchronizations dominate runtime at high processor counts, the study focused on reducing synchronization frequency within the s-step Conjugate Gradient (CG) method. This involved generating multiple Krylov basis vectors per outer iteration, necessitating efficient solutions to the resulting Gram systems. The team’s innovation centers on utilizing Forward Gauss-Seidel (FGS) iteration to solve these Gram systems, specifically those arising from Chebyshev polynomial bases.
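The sketch below (not the authors’ code) illustrates the building blocks this description assumes: a Chebyshev Krylov basis generated by a three-term recurrence over an estimated spectral interval, and the small dense Gram matrix formed from it with a single global reduction. The function names, the SPD matrix A, and the interval bounds lmin, lmax are illustrative assumptions.

# Minimal sketch, assuming A is SPD with spectrum in [lmin, lmax] and s basis vectors.
import numpy as np

def chebyshev_basis(A, r, s, lmin, lmax):
    """Chebyshev basis V = [p_0(A)r, ..., p_s(A)r] via the three-term recurrence."""
    theta = (lmax + lmin) / 2.0          # centre of the spectral interval
    delta = (lmax - lmin) / 2.0          # half-width of the spectral interval
    V = np.zeros((len(r), s + 1))
    V[:, 0] = r
    if s >= 1:
        V[:, 1] = (A @ r - theta * r) / delta
    for j in range(2, s + 1):
        # T_j(x) = 2 x T_{j-1}(x) - T_{j-2}(x), shifted and scaled to [lmin, lmax]
        V[:, j] = 2.0 * (A @ V[:, j - 1] - theta * V[:, j - 1]) / delta - V[:, j - 2]
    return V

def gram_matrix(A, V):
    """A-inner-product Gram matrix G = V^T A V (one global reduction in parallel)."""
    return V.T @ (A @ V)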

The method exploits the structure of Gram matrices generated from Chebyshev polynomial bases, establishing the mathematical equivalence between a single FGS sweep and Modified Gram-Schmidt (MGS) orthogonalization in the A-norm. This equivalence is supported by rigorous backward error bounds, giving the approach a strong theoretical foundation. Crucially, the team showed that a moderate number of FGS iterations, typically 20 to 30 sweeps, is sufficient for convergence, keeping the computational cost low. To validate the method, the researchers performed weak scaling tests on AMD MI-series GPUs, demonstrating scalability up to 64 GPUs while solving problems exceeding 700 million unknowns.
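A hedged sketch of the forward Gauss-Seidel sweep applied to the small dense Gram system G y = b follows; the 20 to 30 sweep count quoted above would be passed as the sweeps parameter. This is a generic FGS solver written for illustration, not the authors’ implementation.

import numpy as np

def forward_gauss_seidel(G, b, sweeps=30, y0=None):
    """Approximate the solution of G y = b with a fixed number of forward sweeps."""
    s = G.shape[0]
    y = np.zeros(s) if y0 is None else y0.copy()
    for _ in range(sweeps):
        for i in range(s):
            # forward sweep: row i uses the already-updated entries y[:i]
            y[i] = (b[i] - G[i, :i] @ y[:i] - G[i, i + 1:] @ y[i + 1:]) / G[i, i]
    return y

Because G is only s-by-s, each sweep costs O(s²) flops and requires no inter-process communication, which is the point of the low-synchronization design.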

Furthermore, the study extends this approach to Algebraic MultiGrid (AMG) coarse-grid solves, eliminating the need to assemble or factor dense coarse operators. The team’s analysis distinguishes between the conditioning and the decay of the Gram matrix, revealing that the effectiveness of FGS depends on the Frobenius norm of a lower triangular matrix; this explains why a relatively small number of FGS iterations is sufficient despite potential conditioning issues. This work establishes a powerful technique for accelerating large-scale simulations by minimizing communication overhead and maximizing computational efficiency.
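The distinction drawn here can be made concrete with a small diagnostic: the FGS iteration matrix M = -(D + L)^-1 U governs the geometric convergence rate, while the Frobenius norm of the strictly lower triangular part is the quantity the analysis ties to sweep counts. The sketch below computes both alongside the condition number; it is illustrative only and assumes a dense Gram matrix G.

import numpy as np

def fgs_diagnostics(G):
    """Spectral radius of the FGS iteration matrix, ||L||_F, and cond(G)."""
    D = np.diag(np.diag(G))
    L = np.tril(G, k=-1)                    # strictly lower triangular part
    U = np.triu(G, k=1)                     # strictly upper triangular part
    M = -np.linalg.solve(D + L, U)          # FGS iteration matrix
    rho = max(abs(np.linalg.eigvals(M)))    # spectral radius: < 1 gives geometric convergence
    return rho, np.linalg.norm(L, 'fro'), np.linalg.cond(G)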

Fast Linear Solvers with Gauss-Seidel Iteration

The researchers developed a low-synchronization approach to solving linear systems, employing a Forward Gauss-Seidel (FGS) iteration that is mathematically equivalent to Modified Gram-Schmidt (MGS) orthogonalization in the A-norm. They demonstrated that 20 to 30 FGS iterations preserve scalability up to 64 GPUs while solving problems exceeding 700 million unknowns, showing significant performance gains in parallel computing. The approach extends to Algebraic MultiGrid (AMG) coarse-grid solves, eliminating the need to assemble or factor dense coarse operators and enabling a streaming treatment of the coarse grid. The research also establishes that a single FGS sweep maintains backward stability, with computed solutions satisfying rigorous error bounds.
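As a rough illustration of how an inexact coarse solve slots into a multigrid cycle, the two-grid sketch below replaces the usual dense factorization of the coarse operator with a few FGS sweeps, reusing the forward_gauss_seidel helper sketched above. The prolongation P, coarse operator A_c, and smoother smooth are hypothetical placeholders, not the paper’s AMG hierarchy.

def two_grid_vcycle(A, A_c, P, b, x, smooth, coarse_sweeps=30):
    """One two-grid cycle with an inexact (FGS) coarse correction."""
    x = smooth(A, b, x)                                          # pre-smoothing
    r_c = P.T @ (b - A @ x)                                      # restrict the residual
    e_c = forward_gauss_seidel(A_c, r_c, sweeps=coarse_sweeps)   # no dense factorization
    x = x + P @ e_c                                              # prolongate and correct
    return smooth(A, b, x)                                       # post-smoothing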

Analysis reveals that for basis sizes s up to 20 in double precision, this error contribution is negligible. Further investigation into the conditioning of Chebyshev-based Gram matrices shows polynomial growth of the condition number, with a Frobenius norm scaling as O(√s). This contrasts sharply with monomial bases, whose condition numbers grow exponentially, and highlights the role of Chebyshev polynomials in maintaining numerical stability. The team also proved that the spectral radius of the FGS iteration matrix determines the convergence rate, with geometric convergence whenever this radius is less than 1.
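The conditioning contrast can be reproduced on a toy problem: building the same Krylov basis with monomials versus Chebyshev polynomials (reusing the chebyshev_basis helper from the earlier sketch) and comparing the Gram-matrix condition numbers. The 1D Laplacian model matrix and the chosen sizes are illustrative assumptions, not the paper’s test cases.

import numpy as np

n, s = 200, 12
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)     # SPD 1D Laplacian model problem
lmin, lmax = np.linalg.eigvalsh(A)[[0, -1]]
r = np.random.default_rng(0).standard_normal(n)

V_cheb = chebyshev_basis(A, r, s, lmin, lmax)             # Chebyshev basis (sketch above)
V_mono = np.column_stack([np.linalg.matrix_power(A, j) @ r for j in range(s + 1)])

for name, V in [("Chebyshev", V_cheb), ("monomial", V_mono)]:
    G = V.T @ (A @ V)
    print(f"{name:9s} cond(G) = {np.linalg.cond(G):.3e}")

On a model problem of this kind one would expect the monomial Gram matrix to approach double-precision limits well before the Chebyshev one does.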

The study mathematically confirms the algebraic equivalence of a single FGS sweep to one step of MGS orthogonalization, providing a compact view of FGS and explaining its numerical behavior. This equivalence is demonstrated through a rigorous proof by induction, establishing that FGS and MGS compute identical projection coefficients, and solidifying the method’s theoretical foundation. The results demonstrate that the method is robust and efficient, offering a significant advancement in solving large-scale linear systems.
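The equivalence stated here is easy to check numerically: one forward Gauss-Seidel sweep from a zero start on G t = c reproduces, coefficient by coefficient, what MGS computes sequentially in the A-inner product. The sketch below uses a random basis and a generic SPD matrix purely for illustration.

import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 6
Q = rng.standard_normal((n, m))              # basis to project against
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                  # generic SPD matrix for the A-inner product
a = rng.standard_normal(n)

G, c = Q.T @ (A @ Q), Q.T @ (A @ a)

# MGS projection coefficients in the A-inner product, computed vector by vector
w, t_mgs = a.copy(), np.zeros(m)
for i in range(m):
    t_mgs[i] = Q[:, i] @ (A @ w) / G[i, i]
    w -= t_mgs[i] * Q[:, i]

# One forward Gauss-Seidel sweep from a zero start on G t = c
t_fgs = np.zeros(m)
for i in range(m):
    t_fgs[i] = (c[i] - G[i, :i] @ t_fgs[:i]) / G[i, i]

print(np.allclose(t_mgs, t_fgs))             # identical up to rounding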

Low-Synchronization Solves Large Linear Systems

This research demonstrates a low-synchronization approach to solving linear systems, crucial for communication-avoiding Krylov methods, by utilising the Forward Gauss-Seidel method. The team established mathematical equivalence between this method and Modified Gram-Schmidt orthogonalization, ensuring accurate projection coefficients while minimising data transfer between processors. Results show that a limited number of Forward Gauss-Seidel iterations, between 20 and 30, maintains scalability when applied to problems exceeding 700 million unknowns across 64 GPUs. The streaming approach to coarse-grid solves closely matched direct methods, requiring a comparable number of V-cycle iterations with significant memory savings. This research provides a promising pathway towards more efficient and scalable solutions for large linear systems, particularly in communication-constrained environments.

👉 More information
🗞 Inexact Gauss Seidel and Coarse Solvers for AMG and s-step CG
🧠 ArXiv: https://arxiv.org/abs/2512.09642

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
