The challenge of efficiently solving complex fluid dynamics simulations increasingly demands greater computational power, yet traditional methods often struggle to fully utilise the capabilities of modern multi-GPU systems. Seungchan Kim, Jihoo Kim, Sanghyun Ha, and Donghyun You from the Department of Mechanical Engineering at Pohang University of Science and Technology address this limitation with a new approach to the tridiagonal matrix algorithm (TDMA). Their work introduces a pipelined TDMA that overcomes scalability bottlenecks by overlapping communication with computation and enabling concurrent GPU kernel execution, effectively hiding time-consuming processes behind faster calculations. This innovation delivers substantial performance gains, achieving near-ideal weak scaling up to 64 GPUs and a significant speedup in a large-scale flow solver, paving the way for more efficient and detailed simulations of complex physical phenomena.
Scalable Turbulence Simulations with Pencil Decomposition
This research details the development of a highly scalable solver for simulating turbulent wall flows at extremely high Reynolds numbers. Simulating turbulence is computationally demanding and requires massive parallelization, yet achieving good scalability on modern supercomputers remains a significant challenge. Wall-bounded turbulent flows are especially difficult because the very fine scales near the wall must be resolved. The authors employ a pencil-distributed approach, dividing the computational domain into narrow columns of grid points, with each column assigned to a process for local calculations.
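To make the pencil layout concrete, here is a minimal sketch of how an x-aligned pencil might be assigned to each process; the grid sizes, process-grid shape, and names are illustrative and not taken from the paper.

```cuda
// Minimal sketch of an x-pencil decomposition: a global Nx x Ny x Nz grid is
// split over a Py x Pz process grid, so each rank owns the full x-extent but
// only a slab of y and z. All names and sizes here are illustrative.
#include <cstdio>

struct Pencil { int y0, ny, z0, nz; };   // local y/z offsets and extents

Pencil make_pencil(int rank, int Py, int Pz, int Ny, int Nz) {
    int py = rank % Py;                  // position along y in the process grid
    int pz = rank / Py;                  // position along z in the process grid
    Pencil p;
    p.ny = Ny / Py; p.y0 = py * p.ny;    // assumes Ny divisible by Py for brevity
    p.nz = Nz / Pz; p.z0 = pz * p.nz;    // assumes Nz divisible by Pz
    return p;
}

int main() {
    const int Py = 2, Pz = 4, Ny = 512, Nz = 1024;
    for (int rank = 0; rank < Py * Pz; ++rank) {
        Pencil p = make_pencil(rank, Py, Pz, Ny, Nz);
        std::printf("rank %d owns y[%d..%d) z[%d..%d), full x-extent\n",
                    rank, p.y0, p.y0 + p.ny, p.z0, p.z0 + p.nz);
    }
    return 0;
}
```

With this kind of layout, derivatives and tridiagonal solves along the pencil's long axis need no inter-process communication; only the other two directions require data exchange or transposes.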
They use a finite-difference method to solve the governing equations of fluid motion, arrange the computation to minimize communication between these columns, and leverage the cuSPARSE library for efficient matrix calculations. Designed from the outset for massively parallel architectures, the solver demonstrates excellent scalability, running on systems with millions of cores and enabling simulations of turbulent wall flows at unprecedentedly high Reynolds numbers. Validation against existing data confirms the solver’s accuracy, and the code is publicly available to facilitate further research. This work presents a highly optimized and scalable solver that pushes the boundaries of simulating turbulent wall flows, paving the way for a deeper understanding of these complex phenomena.
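Since the text mentions cuSPARSE for the matrix work, the following is a minimal, self-contained sketch of a batched tridiagonal solve using cuSPARSE's gtsv2StridedBatch routine; the system size, batch count, and coefficient values are placeholders and do not come from the paper.

```cuda
// Minimal sketch: solve `batch` independent tridiagonal systems of size m with
// cuSPARSE's strided-batch routine. All sizes and values are illustrative only.
#include <cuda_runtime.h>
#include <cusparse.h>
#include <vector>
#include <cstdio>

int main() {
    const int m = 256, batch = 1024, stride = m;
    const size_t n = static_cast<size_t>(m) * batch;

    // Host coefficients: a simple diagonally dominant system, identical in every batch.
    std::vector<double> dl(n, -1.0), d(n, 4.0), du(n, -1.0), x(n, 1.0);
    for (int b = 0; b < batch; ++b) {        // cuSPARSE expects dl[0] = 0 and du[m-1] = 0
        dl[b * stride] = 0.0;
        du[b * stride + m - 1] = 0.0;
    }

    double *d_dl, *d_d, *d_du, *d_x;
    cudaMalloc(&d_dl, n * sizeof(double));
    cudaMalloc(&d_d,  n * sizeof(double));
    cudaMalloc(&d_du, n * sizeof(double));
    cudaMalloc(&d_x,  n * sizeof(double));
    cudaMemcpy(d_dl, dl.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_d,  d.data(),  n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_du, du.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x,  x.data(),  n * sizeof(double), cudaMemcpyHostToDevice);

    cusparseHandle_t handle;
    cusparseCreate(&handle);

    // Query workspace size, then solve all systems in place (solution overwrites d_x).
    size_t bufSize = 0;
    cusparseDgtsv2StridedBatch_bufferSizeExt(handle, m, d_dl, d_d, d_du, d_x,
                                             batch, stride, &bufSize);
    void *buf;
    cudaMalloc(&buf, bufSize);
    cusparseDgtsv2StridedBatch(handle, m, d_dl, d_d, d_du, d_x, batch, stride, buf);
    cudaDeviceSynchronize();

    cudaMemcpy(x.data(), d_x, n * sizeof(double), cudaMemcpyDeviceToHost);
    std::printf("x[0] = %f\n", x[0]);

    cudaFree(buf); cudaFree(d_dl); cudaFree(d_d); cudaFree(d_du); cudaFree(d_x);
    cusparseDestroy(handle);
    return 0;
}
```

The strided-batch interface solves many independent systems in a single call, which is the situation that arises when a tridiagonal solve is applied line by line across a pencil of grid points.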
Pipelined Algorithm Scales Tridiagonal Matrix Solves
The researchers developed Pipelined-TDMA, a novel algorithm for solving tridiagonal matrix systems on multi-GPU systems. Traditional methods scale poorly because of sequential processing and communication delays; the new technique sidesteps these bottlenecks by overlapping computation with communication and executing GPU kernels concurrently. The batch size was tuned carefully to balance GPU occupancy against pipeline efficiency. Performance evaluations on up to 64 GPUs show that the method conceals most of the non-scalable execution time, except during the final phase of the pipeline.
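As a conceptual illustration of that overlap, the sketch below splits a batch of systems into chunks and uses CUDA streams so that each chunk's asynchronous boundary copy (a stand-in for inter-GPU communication) can proceed while the next chunk's kernel runs; it shows the pipelining pattern only and is not the authors' implementation.

```cuda
// Conceptual sketch of pipelining: split the batch of tridiagonal systems into
// chunks and, per chunk, overlap a stand-in "communication" step (an async copy
// of boundary data) with the compute kernel of the next chunk on another stream.
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for the local (per-GPU) part of a tridiagonal sweep on one chunk of systems.
__global__ void local_sweep(double *x, int m, int systems_in_chunk) {
    int sys = blockIdx.x * blockDim.x + threadIdx.x;
    if (sys >= systems_in_chunk) return;
    double *row = x + static_cast<size_t>(sys) * m;
    for (int i = 1; i < m; ++i)              // dummy forward sweep along one system
        row[i] += 0.5 * row[i - 1];
}

int main() {
    const int m = 256, systems = 4096, chunks = 8;
    const int per_chunk = systems / chunks;
    const size_t n = static_cast<size_t>(m) * systems;

    double *d_x;
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMemset(d_x, 0, n * sizeof(double));

    // Pinned host buffer for the boundary values that would be exchanged between GPUs.
    double *h_bnd;
    cudaMallocHost(&h_bnd, systems * sizeof(double));

    cudaStream_t stream[chunks];
    for (int c = 0; c < chunks; ++c) cudaStreamCreate(&stream[c]);

    for (int c = 0; c < chunks; ++c) {
        double *chunk = d_x + static_cast<size_t>(c) * per_chunk * m;
        // Compute on chunk c in its own stream ...
        local_sweep<<<(per_chunk + 127) / 128, 128, 0, stream[c]>>>(chunk, m, per_chunk);
        // ... and copy out the last entry of every system in the chunk (strided copy).
        // This transfer overlaps with the kernel launched for chunk c+1 on another stream.
        cudaMemcpy2DAsync(h_bnd + c * per_chunk, sizeof(double),
                          chunk + (m - 1), m * sizeof(double),
                          sizeof(double), per_chunk,
                          cudaMemcpyDeviceToHost, stream[c]);
    }
    cudaDeviceSynchronize();

    std::printf("first boundary value: %f\n", h_bnd[0]);

    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(stream[c]);
    cudaFreeHost(h_bnd);
    cudaFree(d_x);
    return 0;
}
```

In a real multi-GPU TDMA the asynchronous copy would be replaced by the exchange of reduced boundary coefficients between GPUs, and the chunk size sets the occupancy-versus-pipeline-efficiency trade-off described above.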
The solver achieves ideal weak scaling up to 64 GPUs, processing one billion grid cells per GPU, and reaches 74.7 percent of ideal performance in strong scaling tests for a four-billion-cell problem, relative to a baseline configuration using four GPUs. Integrating the optimized solver into a flow solver accelerated the Poisson solver component by a factor of 4.37, ultimately leading to a 1.31x speedup for the complete flow solver in a nine-billion-cell simulation. This demonstrates the method’s ability to not only accelerate individual components but also to improve the performance of complex, multi-component simulations.
Pipelined Algorithm Achieves High GPU Scalability
This research presents a highly scalable tridiagonal matrix algorithm, Pipelined-TDMA, designed to overcome performance bottlenecks in multi-GPU systems. The key innovation lies in executing multiple tridiagonal systems in a pipelined fashion, effectively overlapping communication with computation and running GPU kernels concurrently. This approach successfully hides much of the non-scalable execution time typically associated with inter-GPU communication and limited GPU occupancy. Performance evaluations using up to 64 GPUs demonstrate significant improvements in both strong and weak scalability.
The algorithm achieves near-ideal weak scaling with one billion grid cells per GPU and attains 74.7% parallel efficiency in strong scaling tests with a four-billion-cell problem, relative to a baseline of four GPUs. Integrating this optimized solver into a flow solver resulted in a 1.31x overall speedup in a nine-billion-cell simulation on 64 GPUs, with the TDMA component itself accelerated by 4.37x.
The authors acknowledge that the performance of the pipeline depends on communication time remaining shorter than computation time, a condition generally met in practical applications. The research suggests the algorithm’s effectiveness could extend to at least 256 GPUs, given similar conditions to those tested. Future work could explore the limits of this scalability and investigate its application to even larger and more complex simulations.
👉 More information
🗞 A Highly Scalable TDMA for GPUs and Its Application to Flow Solver Optimization
🧠 ArXiv: https://arxiv.org/abs/2509.03933
