Researchers are increasingly focused on ensuring computational fluid dynamics (CFD) simulations can efficiently harness the power of modern, multi-GPU supercomputing architectures. Panagiotis-Eleftherios Eleftherakis, George Anagnostopoulos, and Anastassis Kapetanakis from the National Technical University of Athens, together with Mohammad Umair of KTH Royal Institute of Technology, Jean-Yves Vet of Hewlett Packard Enterprise, and colleagues, present a detailed analysis of performance portability for CFD codes at scale. Their work, stemming from the REFMAP project, investigates the Spectral Elements simulation framework SOD2D across various NVIDIA GPU architectures, identifying critical computational hotspots and the impact of full-stack optimisation, from application code to hardware infrastructure. The research is significant because it demonstrates substantial performance variations, with speedup deviations of up to 3.91×, and highlights the limitations of simply projecting single-GPU results to larger multi-GPU systems, demanding informed, multi-level tuning for optimal CFD performance.
The REFMAP project focuses on scalable, GPU-enabled multi-fidelity CFD simulations designed for predicting urban airflow, a critical component of sustainable Innovative Air Mobility. The researchers first defined and characterised an extensive full-stack design space, encompassing application, software, and hardware parameters, to comprehensively map performance variations. This approach enabled the identification of computational hotspots within SOD2D, specifically the convection and diffusion kernels, which were then targeted for parallelisation across elements and nodes.
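To make that element/node decomposition concrete, here is a minimal CUDA C++ sketch of the parallelisation pattern, not SOD2D's actual code (which is Fortran with OpenACC-style gang/vector loops); the names convec_sketch, u, rhs, nelem and nnode are hypothetical placeholders.

#include <cuda_runtime.h>

// One thread block per spectral element, one thread per node of that element.
// Launched as: convec_sketch<<<nelem, nnode, nnode * sizeof(double)>>>(d_u, d_rhs, nelem, nnode);
__global__ void convec_sketch(const double* __restrict__ u,
                              double* __restrict__ rhs,
                              int nelem, int nnode)
{
    int e = blockIdx.x;          // element index ("gang" in OpenACC terms)
    int i = threadIdx.x;         // node index within the element ("vector")

    // Stage the element's nodal values in fast shared memory.
    extern __shared__ double u_loc[];
    u_loc[i] = u[e * nnode + i];
    __syncthreads();

    // Placeholder arithmetic standing in for the real convective-term work.
    double acc = 0.0;
    for (int j = 0; j < nnode; ++j)
        acc += u_loc[j] * u_loc[i];

    rhs[e * nnode + i] = acc;    // element-local write in this simplified sketch
}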
Experiments employed server-grade NVIDIA GPUs and leveraged vendor-specific compiler stacks to assess the impact of memory access optimisations, revealing deviations in acceleration speedup. The team then harnessed the LUMI multi-GPU cluster to examine SOD2D’s performance at scale, using profiling to uncover similar throughput variations and to highlight the limitations of performance projections. This necessitated multi-level, informed tuning strategies to optimise the solver for diverse hardware configurations. The high-fidelity branch of REFMAP relies on SOD2D, a Fortran-based Spectral Element Method (SEM) code whose key kernels are parallelised to accelerate turbulence-resolving simulations.
Furthermore, the research integrated GPU-accelerated CFD with data-driven surrogates, generating high-resolution Direct Numerical Simulation (DNS) data to train compact predictors deployable at the edge or in the cloud. This coupling supports UAV trajectory optimisation and high-fidelity sensor placement, creating a robust substrate for efficient and safe UAV navigation. On CPU-only systems, comparable simulations require weeks to compute; REFMAP overcomes this bottleneck through extensive autotuning and performance exploration of SOD2D. Experiments revealed that memory access optimisations produced deviations in acceleration speedup, demonstrating SOD2D’s sensitivity to memory-related parameters. Performance and scalability were measured across the multi-level design space spanning application, software, and hardware infrastructure parameters.
Single-GPU performance characterisation highlighted the impact of these optimisations, with the ‘full_convec’ kernel consistently identified as the primary computational bottleneck, accounting for up to 70% of total runtime on AMD GPUs and 50% on NVIDIA GPUs. Further analysis of the ‘full_diffusion’ kernel showed it contributing a smaller but still notable share of runtime on both NVIDIA and AMD systems, confirming consistent hotspot behaviour across both simulated cases: a Taylor-Green Vortex and a Channel Flow. The results show that SOD2D’s execution flow, structured around Runge-Kutta steps, relies on efficient computation of the diffusive and convective terms. The tests show that the algorithm distributes element loops across parallel gangs and vectors, fetching global quantities into local arrays before calculating isoparametric gradients and residuals.
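As a rough illustration of that execution flow, the following CUDA C++ sketch shows an assumed Runge-Kutta driver (hypothetical names, not SOD2D's actual routines): each substep evaluates the convective and diffusive terms with element-parallel kernels like the one sketched earlier, then advances the stage.

#include <cuda_runtime.h>

// Hypothetical kernels standing in for SOD2D's full_convec / full_diffusion
// and the stage update; their bodies are assumed to exist elsewhere.
__global__ void full_convec_sketch(const double* u, double* rhs, int nelem, int nnode);
__global__ void full_diffusion_sketch(const double* u, double* rhs, int nelem, int nnode);
__global__ void rk_update_sketch(double* u, const double* rhs, double coeff, int n);

void rk_step(double* d_u, double* d_rhs, int nelem, int nnode,
             const double* rk_coeff, int nstages, double dt)
{
    dim3 grid(nelem), block(nnode);
    size_t shmem = nnode * sizeof(double);

    for (int s = 0; s < nstages; ++s) {
        // Convective term: the dominant hotspot in the profiles above.
        full_convec_sketch<<<grid, block, shmem>>>(d_u, d_rhs, nelem, nnode);
        // Diffusive term: the secondary hotspot.
        full_diffusion_sketch<<<grid, block, shmem>>>(d_u, d_rhs, nelem, nnode);
        // Advance the stage: u <- u + coeff * dt * rhs (schematic).
        rk_update_sketch<<<grid, block>>>(d_u, d_rhs, rk_coeff[s] * dt,
                                          nelem * nnode);
    }
    cudaDeviceSynchronize();
}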
Measurements confirm that atomic updates to the global residual array, necessary for memory synchronisation, can serialise memory accesses and degrade performance. The study focused on the Channel Flow case, representative of urban airflow, to better understand these effects. Profiling on the LUMI multi-GPU cluster revealed throughput variations, highlighting the limits of performance projections and the need for informed, multi-level tuning. Initial results showed a single AMD MI250X Graphics Compute Die (GCD) running slower than an NVIDIA Tesla V100 for 8 million nodes, using single-precision arithmetic without optimisations. The researchers attribute this to a power cap on the LUMI GPUs, the use of a single GCD of the MI250X, a potentially unoptimised Cray compiler stack, and lower single-precision throughput on the MI250X. Memory access optimisations, including kernel splitting to exploit independent computations within the ‘full_convec’ kernel, were implemented to relieve saturated local memory usage and inefficient memory coalescing.
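The two memory effects above can be pictured with a short CUDA C++ sketch (hypothetical names; SOD2D's own implementation differs): nodes shared between neighbouring elements force atomic accumulation into the global residual array, which serialises those updates, while splitting one large kernel into independent pieces lowers per-thread register and local-memory pressure and can improve coalescing.

#include <cuda_runtime.h>

// Scatter element-local residuals into the global residual array.
// connec maps (element, local node) -> global node id; shared nodes collide,
// so atomicAdd is required and those colliding updates serialise.
__global__ void scatter_residual(const double* __restrict__ r_elem,
                                 const int* __restrict__ connec,
                                 double* __restrict__ r_glob,
                                 int nelem, int nnode)
{
    int e = blockIdx.x;              // launched with <<<nelem, nnode>>>
    int i = threadIdx.x;
    int g = connec[e * nnode + i];   // global node owning this contribution
    atomicAdd(&r_glob[g], r_elem[e * nnode + i]);  // double atomics need sm_60+
}

// Kernel splitting, schematically: instead of one monolithic full_convec
// launch, independent sub-terms become separate, lighter kernels, e.g.
//   convec_term_x<<<grid, block>>>(...);
//   convec_term_y<<<grid, block>>>(...);
//   convec_term_z<<<grid, block>>>(...);
// Each piece uses fewer registers and less local memory, so occupancy and
// memory coalescing can improve, at the cost of extra launches and re-reads.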
SOD2D Performance Varies Across GPU Architectures
Scientists have demonstrated significant challenges in achieving performance portability for GPU-accelerated computational fluid dynamics (CFD) simulations. This research, focusing on the SOD2D spectral element framework, analysed performance across different NVIDIA GPU architectures and compiler stacks, revealing substantial variations in acceleration speedup due to memory access optimisations. Further investigation on the LUMI multi-GPU cluster confirmed these throughput variations at scale, highlighting the limitations of extrapolating low-scale optimisation strategies to larger systems. The findings underscore the importance of multi-level, informed tuning for CFD frameworks to adapt efficiently to heterogeneous high-performance computing environments.
Architectural differences between GPU vendors demonstrably impact solver performance and scalability, necessitating parameter-space-informed optimisations. The authors acknowledge that performance prediction at scale, based solely on low-scale optimal memory access decisions, is precarious, and more sophisticated modelling is required. Future work will extend this analysis to new compiler and hardware infrastructure, alongside diverse simulations and input meshes, with a focus on automating parameter optimisation.
👉 More information
🗞 Multi-Partner Project: Multi-GPU Performance Portability Analysis for CFD Simulations at Scale
🧠 ArXiv: https://arxiv.org/abs/2601.14159
