GPU Implementation Advances PDE Solutions with Scalable Iterative Solvers and High Throughput

Solving complex equations that describe real-world phenomena often demands immense computational power, and researchers continually seek ways to accelerate these calculations. Andrew Welter and Ngoc Cuong Nguyen, from the Department of Aeronautics and Astronautics at the Massachusetts Institute of Technology, have developed new techniques that significantly speed up solutions for a widely used method called Hybridizable Discontinuous Galerkin discretization. Their work harnesses the parallel processing capabilities of modern graphics processing units (GPUs) and introduces preconditioning strategies that avoid complex sparse data structures, instead prioritizing efficient use of the GPU’s memory and processing power. This advance unlocks faster and more accurate simulations across a broad range of applications, including fluid dynamics, structural mechanics, and weather forecasting, ultimately enabling scientists and engineers to tackle increasingly complex problems.

Researchers explore the fundamental principles of DG methods, highlighting their advantages in handling complex geometries, discontinuities, and achieving high-order accuracy. Hybridizable DG (HDG) formulations also receive significant attention, finding application in fluid dynamics, solving hyperbolic conservation laws, modeling linear elasticity, simulating thermospheric physics, and handling simulations with deformable domains. A key focus lies on leveraging high-performance computing to accelerate DG computations.

Scientists have developed techniques for GPU acceleration, particularly optimizing the sparse matrix-vector product, a core operation in DG discretizations. They use polymorphic memory access patterns, enabled by the Kokkos programming model, to improve data locality and performance on manycore architectures. Kokkos, together with the programming-model extensions introduced in Kokkos 3, provides performance portability, allowing the same code to run efficiently on CPUs and GPUs alike. Researchers also exploit parallelism at multiple levels, including element-level and data parallelism, and employ matrix-free methods to reduce the memory footprint.
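The matrix-free, element-parallel idea can be pictured with a minimal NumPy sketch. All sizes and values below are invented for illustration, and this is not the authors' Kokkos code: each DG element owns a small dense operator block, and the action of the global operator is one batched block-times-vector contraction, so no global sparse matrix is ever assembled.

```python
import numpy as np

# Hypothetical sizes: 1000 elements with 20 local DOFs each.
n_elem, n_loc = 1000, 20
rng = np.random.default_rng(0)

# One small dense operator block per element (DG element matrices are dense);
# random values stand in for a real discretization.
A_blocks = rng.standard_normal((n_elem, n_loc, n_loc))
u = rng.standard_normal((n_elem, n_loc))      # element-local vector

# Matrix-free operator action: a single batched contraction applies every
# element block at once; on a GPU this maps to one batched GEMV-style kernel.
v = np.einsum('eij,ej->ei', A_blocks, u)

# Reference: apply each block in a Python loop and compare.
v_ref = np.stack([A_blocks[e] @ u[e] for e in range(n_elem)])
assert np.allclose(v, v_ref)
```

The batched contraction touches each block exactly once, which is what gives this formulation its high arithmetic intensity compared with a scattered sparse format.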

Preconditioning techniques, including polynomial preconditioning combined with GMRES, further accelerate the solution of linear systems. The research encompasses various numerical schemes and algorithms, including Runge-Kutta methods for time integration, approximate Riemann solvers for hyperbolic problems, and techniques for controlling numerical dissipation. Scientists also develop subgrid-scale models for turbulence in large-eddy simulation and refine meshes near boundary layers to improve accuracy. High-order schemes and diagonally implicit Runge-Kutta (DIRK) methods are used to enhance accuracy and efficiency.

The work centers on efficiently solving the large linear systems that arise from HDG discretizations. Researchers engineered a GPU-tailored algorithm that eliminates local degrees of freedom in parallel, directly assembling the globally condensed system on the GPU using dense-block operations. This implementation avoids the complexities of sparse data structures, increasing arithmetic intensity and sustaining high memory throughput across a range of mesh types and polynomial orders.
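The element-wise elimination amounts to a batched Schur complement. The NumPy sketch below uses invented block sizes and random stand-in matrices, not the paper's implementation, and it treats each element's trace system independently; in the real method the condensed blocks are assembled into one globally coupled trace system. It shows only the dense-block elimination step:

```python
import numpy as np

rng = np.random.default_rng(1)
n_elem, n_u, n_t = 200, 12, 6   # hypothetical local (u) and trace DOF counts

# Per-element blocks of an HDG-style system [[A, B], [C, D]] [u; lam] = [f; g].
# Random stand-ins; the diagonal shift keeps A safely invertible.
A = rng.standard_normal((n_elem, n_u, n_u)) + 6.0 * np.eye(n_u)
B = rng.standard_normal((n_elem, n_u, n_t))
C = rng.standard_normal((n_elem, n_t, n_u))
D = rng.standard_normal((n_elem, n_t, n_t))
f = rng.standard_normal((n_elem, n_u))
g = rng.standard_normal((n_elem, n_t))

# Batched elimination of local DOFs: np.linalg.solve broadcasts over the
# leading element axis, the CPU analogue of one dense-block GPU kernel.
Ainv_B = np.linalg.solve(A, B)                     # A^{-1} B for all elements
Ainv_f = np.linalg.solve(A, f[..., None])[..., 0]  # A^{-1} f for all elements

# Condensed trace system K lam = r (Schur complement), one block per element.
K = D - C @ Ainv_B
r = g - np.einsum('eij,ej->ei', C, Ainv_f)
lam = np.linalg.solve(K, r[..., None])[..., 0]

# Local recovery once the traces are known: u = A^{-1} (f - B lam).
u = Ainv_f - np.einsum('eij,ej->ei', Ainv_B, lam)
```

Because every element's blocks are eliminated by the same dense kernels, all elements proceed in parallel with no sparse bookkeeping, which is the property the GPU implementation exploits.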

For nonlinear problems, the study combines Newton’s method with the preconditioned Generalized Minimal Residual (GMRES) method, integrating scalable preconditioners including block-Jacobi, additive Schwarz domain decomposition, and polynomial smoothers. Crucially, all preconditioners are implemented in batched form with architecture-aware optimizations, including dense linear algebra kernels, memory-coalesced vector operations, and shared-memory acceleration, minimizing memory traffic and maximizing parallel occupancy. Comprehensive studies were conducted across a variety of PDEs, including the Poisson equation, the Burgers equation, linear and nonlinear elasticity, the Euler equations, the Navier-Stokes equations, and the Reynolds-Averaged Navier-Stokes equations.
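To see why a batched block preconditioner pays off, here is a toy SciPy sketch on a synthetic block-structured matrix. The matrix, sizes, and scales are all invented (this is not the authors' GPU code): the diagonal blocks are inverted once and then applied in a single batched multiply, mirroring the batched dense kernels described above.

```python
import numpy as np
from scipy.linalg import block_diag
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(3)
nb, bs = 40, 4                 # hypothetical: 40 diagonal blocks of size 4
n = nb * bs

# Synthetic stiff system: SPD diagonal blocks whose scales span four orders
# of magnitude, plus weak random coupling -- a stand-in for a condensed
# system, not data from the paper.
Q = rng.standard_normal((nb, bs, bs))
blocks = Q @ Q.transpose(0, 2, 1) + bs * np.eye(bs)
blocks *= (10.0 ** rng.uniform(-2, 2, nb))[:, None, None]
A = block_diag(*blocks) + 1e-4 * rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Block-Jacobi: invert the diagonal blocks once, then apply all inverses in
# one batched contraction (the batched dense kernel on a GPU).
block_inv = np.linalg.inv(blocks)
def apply_bjac(r):
    r = np.asarray(r).ravel()
    return np.einsum('bij,bj->bi', block_inv, r.reshape(nb, bs)).ravel()
M = LinearOperator((n, n), matvec=apply_bjac)

iters = {'plain': [], 'bjac': []}
x_p, info_p = gmres(A, b, maxiter=200,
                    callback=iters['plain'].append, callback_type='pr_norm')
x_m, info_m = gmres(A, b, M=M, maxiter=200,
                    callback=iters['bjac'].append, callback_type='pr_norm')
# The preconditioned solve converges in far fewer GMRES iterations.
```

In a Newton loop, a solve like this runs once per Newton step on the current Jacobian, so cutting GMRES iterations multiplies through the whole nonlinear solve.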

These studies used both structured and unstructured meshes with varying element types and polynomial orders on NVIDIA GPU architectures. Experiments demonstrate the effectiveness of the approach, achieving substantial performance gains through optimized memory access and parallel processing. Across a wide range of polynomial degrees, mesh resolutions, and problem stiffness levels, the team demonstrated substantial improvements in computational efficiency. Notably, the additive Schwarz method consistently reduced the number of GMRES iterations, often decreasing solution time by factors of two to ten, particularly for challenging three-dimensional nonlinear elasticity and turbulent flow problems. Polynomial preconditioning further enhanced convergence for certain problems, such as the Poisson equation and nonlinear elasticity, achieving performance largely independent of mesh refinement and polynomial degree. The researchers acknowledge that the optimal approach depends on the spectral properties of the underlying operator, and that uniform application of polynomial preconditioning is not always beneficial.
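The polynomial-preconditioning idea can be illustrated with a small SciPy sketch. Everything here is invented for illustration (the 1D Laplacian test matrix, the degree-8 Neumann-series polynomial, the sizes); the paper's preconditioners and problems are more sophisticated:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import LinearOperator, gmres

# Standard 1D Poisson model problem (tridiagonal Laplacian), a stand-in for
# the discretized systems in the paper.
n = 100
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)
dinv = 1.0 / A.diagonal()

def poly_prec(r, degree=8):
    # Truncated Neumann-series approximation of A^{-1}: `degree` Jacobi
    # sweeps, i.e. a fixed matrix polynomial in A applied to r. It needs
    # only matvecs and vector updates, which suits GPUs well.
    r = np.asarray(r).ravel()
    z = dinv * r
    for _ in range(degree):
        z = z + dinv * (r - A @ z)
    return z

M = LinearOperator((n, n), matvec=poly_prec)
its = {'plain': [], 'poly': []}
x0, info_plain = gmres(A, b, maxiter=400,
                       callback=its['plain'].append, callback_type='pr_norm')
x1, info_poly = gmres(A, b, M=M, maxiter=400,
                      callback=its['poly'].append, callback_type='pr_norm')
# Polynomial preconditioning cuts the GMRES iteration count here.
```

Note the trade-off the researchers highlight: each preconditioner application costs `degree` extra matrix-vector products, and whether the iteration savings outweigh that cost depends on the operator's spectrum, so the polynomial is not uniformly beneficial.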

👉 More information
🗞 Preconditioning Techniques for Hybridizable Discontinuous Galerkin Discretizations on GPU Architectures
🧠 ArXiv: https://arxiv.org/abs/2512.13619

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Quantum Light’s Wave-Particle Balance Now Fully Tunable

March 2, 2026
AI Swiftly Answers Questions by Focusing on Key Areas

February 27, 2026
Machine Learning Sorts Quantum States with High Accuracy

February 27, 2026