Density functional theory calculations, essential for modelling materials and plasmas, often demand substantial computational resources, prompting researchers to explore the power of modern graphics processing units. Atsushi M. Ito from the National Institute for Fusion Science, along with colleagues, now presents a new implementation of the QUMASUN code designed to run efficiently on a variety of GPU architectures. The approach simplifies the complex process of adapting computational codes to different hardware. Benchmarks on cutting-edge GPUs, including the AMD MI300A and NVIDIA GH200, reveal speedups of more than two times compared to traditional CPU-based calculations. The team demonstrates that this GPU-portable approach accelerates key computational kernels, such as fast Fourier transforms and matrix operations, offering a substantial boost for a wide range of plasma-fusion simulations and materials science applications.
The work demonstrates significant acceleration of computationally intensive kernels crucial for plasma-fusion simulations, with speedups of 2.0 to 2.8 times over a 256-core Xeon node for diamond and tungsten systems. This improvement stems from accelerating compute-bound kernels, specifically fast Fourier transforms (FFT), dense matrix-matrix multiplications (GEMM), and eigenvalue solvers, which indicates broad applicability beyond the specific calculations performed. The authors combine code optimization, a novel eigenvalue solver acceleration technique, and detailed performance benchmarking. Key findings show that GPUs significantly accelerate calculations, with the GH200 achieving speedups of 3 to 7 times over the CPU baseline. Detailed performance analysis of critical kernels revealed bottlenecks and opportunities for optimization, highlighting the importance of batching FFTs for improved GPU performance. The study also found that cuSolver (NVIDIA) is currently better optimized than rocSolver (AMD). The team achieved portability through a lightweight C++ layer that enables execution on CPUs, CUDA-enabled devices, and AMD’s HIP platform without extensive code modifications. While the current work focuses on diamond and tungsten, the researchers note the potential for wider application across materials science and plasma physics simulations.
Further Optimizations Enhance GPU Performance
Scientists achieved further performance gains by implementing a novel transformation method that uses twice as many triangular-solve (TRSM) calls, yielding an additional 1. Detailed analysis of FFT performance revealed that batch processing 512 wave functions in a single call significantly improves performance on GPUs, while executing FFTs one at a time, particularly with small grid sizes (under 128), can degrade performance. Experiments demonstrated that CPUs, when processing 512 wave functions distributed across 256 cores, can outperform GPUs for very small grid sizes (under 64) because the data fits within the CPU cache. These advancements are expected to benefit a broad range of plasma-fusion simulation codes beyond the initial RS-DFT implementation.
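The batching idea described above can be illustrated with a small CPU-only sketch. It is an assumption-laden stand-in, not the paper's code: `fft_1d` and `fft_batched` are hypothetical names, the transform is a textbook one-dimensional radix-2 FFT instead of the three-dimensional transforms a real-space DFT code would use, and the batch is simply looped over on the host. What it shows is the interface shape: many wave functions stored contiguously and transformed by a single call. GPU FFT libraries expose the same pattern (for example, cuFFT's `cufftPlanMany` takes a batch count), which lets them amortize launch overhead across the whole batch.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cplx = std::complex<double>;

// In-place iterative radix-2 FFT of one signal of length n.
// n must be a power of two.
inline void fft_1d(cplx* x, std::size_t n) {
  // Reorder the input into bit-reversed index order.
  for (std::size_t i = 1, j = 0; i < n; ++i) {
    std::size_t bit = n >> 1;
    for (; j & bit; bit >>= 1) j ^= bit;
    j ^= bit;
    if (i < j) std::swap(x[i], x[j]);
  }
  // Combine sub-transforms of length 2, 4, ..., n with butterflies.
  const double pi = std::acos(-1.0);
  for (std::size_t len = 2; len <= n; len <<= 1) {
    const double ang = -2.0 * pi / static_cast<double>(len);
    const cplx wlen(std::cos(ang), std::sin(ang));
    for (std::size_t i = 0; i < n; i += len) {
      cplx w(1.0, 0.0);
      for (std::size_t k = 0; k < len / 2; ++k) {
        const cplx u = x[i + k];
        const cplx v = x[i + k + len / 2] * w;
        x[i + k] = u + v;
        x[i + k + len / 2] = u - v;
        w *= wlen;
      }
    }
  }
}

// Transforms `batch` signals of length n stored back to back in `data`.
// One call covers the whole batch; here it is a host loop, whereas a GPU
// library would process the batch inside a single batched plan.
inline void fft_batched(std::vector<cplx>& data, std::size_t n,
                        std::size_t batch) {
  assert(n != 0 && (n & (n - 1)) == 0);  // power-of-two length
  assert(data.size() == n * batch);      // contiguous batch layout
  for (std::size_t b = 0; b < batch; ++b) {
    fft_1d(data.data() + b * n, n);
  }
}
```

With 512 wave functions on a grid of `n` points, the caller would fill one buffer of `512 * n` values and invoke `fft_batched(data, n, 512)` once, instead of issuing 512 separate transforms.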
👉 More information
🗞 GPU-Portable Real-Space Density Functional Theory Implementation on Unified-Memory Architectures
🧠 ArXiv: https://arxiv.org/abs/2512.04447
