Modern scientific simulations increasingly demand efficient use of heterogeneous hardware, and researchers led by Pawel K. Radtke and Tobias Weinzierl from Durham University are addressing critical challenges in achieving this goal. Their work investigates how data layout transformations and reduced precision calculations impact performance on diverse GPU platforms, including Nvidia and AMD architectures. The team demonstrates that strategically converting between Array of Structures (AoS) and Structure of Arrays (SoA) data formats, combined with lower-precision calculations, can significantly accelerate simulations, achieving speedups of around 2.6 on certain Nvidia platforms. This research introduces compiler tools that empower programmers to fine-tune these transformations and data movement, offering a flexible approach to optimising performance for a broad range of scientific codes and unlocking the full potential of modern supercomputing systems.

The core idea is to facilitate efficient data layout transformations, converting data from an array of structures (AoS) to a structure of arrays (SoA) format, and automatically offloading computations to GPUs with minimal code changes. Many scientific codes currently use AoS layouts, which are natural for object-oriented programming but inefficient for modern processing architectures. SoA layouts are much better suited for Single Instruction, Multiple Data (SIMD) and GPU processing, but manually converting code is a tedious and error-prone process.
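To make the two layouts concrete, the following is a minimal hand-written sketch of the conversion that the compiler is described as automating. The type and function names (`ParticleAoS`, `ParticlesSoA`, `to_soa`, `to_aos`) are illustrative assumptions and do not come from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structures: one record per particle, fields interleaved in memory.
struct ParticleAoS {
    double x, y, z;  // position
    double mass;
};

// Structure of Arrays: one contiguous array per field.
struct ParticlesSoA {
    std::vector<double> x, y, z, mass;
};

// Copy an AoS buffer into a freshly allocated SoA container.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    soa.mass.reserve(aos.size());
    for (const ParticleAoS& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
        soa.mass.push_back(p.mass);
    }
    return soa;
}

// Copy (possibly updated) SoA data back into an existing AoS buffer.
void to_aos(const ParticlesSoA& soa, std::vector<ParticleAoS>& aos) {
    assert(soa.x.size() == aos.size());
    for (std::size_t i = 0; i < aos.size(); ++i) {
        aos[i] = {soa.x[i], soa.y[i], soa.z[i], soa.mass[i]};
    }
}
```

In the SoA form, neighbouring loop iterations touch neighbouring memory locations of the same field, which is the access pattern that SIMD lanes and GPU threads favour. Writing and maintaining such conversions by hand for every structure is the tedious part the article refers to.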

The team's compiler extension addresses this challenge by allowing developers to specify the desired data layout for structures using new C++ attributes, and the compiler transforms the code accordingly. A key feature is a data view abstraction, which allows access to data in either AoS or SoA format without changing the underlying storage, providing flexibility and avoiding code duplication. Experimental results demonstrate that the compiler extension significantly improves the performance of particle-based simulations, particularly on GPUs. The automatic SoA conversion and GPU offloading lead to substantial speedups compared to manually optimized code. The forked Clang/LLVM project with these extensions is publicly available, providing a powerful tool for scientific programmers to write high-performance, portable code for modern hardware architectures. It automates many of the tedious and error-prone tasks associated with data layout transformation and GPU offloading, allowing developers to focus on the core logic of their applications.
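The idea of a data view can be illustrated in plain C++ without any compiler support. The sketch below is a hand-written stand-in, not the extension's actual mechanism: the names `Storage`, `ParticleRef` and `ParticleView` are hypothetical, and the attribute syntax used by the authors is deliberately not reproduced here because the article does not state it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Underlying SoA storage: one array per field (1D position and velocity).
struct Storage {
    std::vector<double> x, v;
};

// Proxy that looks like a single particle record but refers into SoA storage.
struct ParticleRef {
    double& x;
    double& v;
};

// Hypothetical hand-written view: record-style element access over SoA arrays.
class ParticleView {
public:
    explicit ParticleView(Storage& data) : data_(data) {
        assert(data.x.size() == data.v.size());
    }
    std::size_t size() const { return data_.x.size(); }
    ParticleRef operator[](std::size_t i) {
        return ParticleRef{data_.x[i], data_.v[i]};
    }

private:
    Storage& data_;
};

// Kernel written against the record-like interface; the storage stays SoA.
void drift(ParticleView view, double dt) {
    for (std::size_t i = 0; i < view.size(); ++i) {
        ParticleRef p = view[i];
        p.x += p.v * dt;
    }
}
```

The kernel body reads as if it operated on one particle object at a time, while every access resolves to an element of a per-field array. A compiler-generated view would aim for the same effect without the boilerplate proxy types.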

GPU Particle Simulation with Data Layouts

This study systematically evaluates the performance of array-of-structures (AoS) and structure-of-arrays (SoA) data layouts in conjunction with reduced-precision data formats on modern GPU architectures. Researchers engineered a flexible simulation framework capable of operating with either contiguous or scattered AoS particle data, allowing particle properties to be accessed directly or indirectly via pointers. This design facilitates detailed investigation of data layout impacts on performance, particularly when combined with reduced precision calculations. The team focused on Smoothed-Particle Hydrodynamics (SPH), a mesh-free method representing fluids as interacting particles, but emphasises that the techniques extend beyond SPH to any domain involving repeated traversal and transformation of structured aggregates.

The simulation model implements core SPH equations, including density and momentum updates, discretised into particle interactions. These interactions involve weighted sums over neighbouring particles, determined by a smoothing kernel function, and the study meticulously tracks the computational cost of these pairwise operations. To enable detailed performance analysis, the researchers developed a system where algorithmic phases are encoded using local or pairwise functions that directly modify particle state within nested for loops, allowing precise control over data access patterns and kernel invocations. To further optimise performance, the study investigates spatial search acceleration techniques, including bounding-volume hierarchies and octrees, which either store local particle buffers for cache utilisation or particle indices into global buffers to minimise memory footprint. The researchers also implemented both tight distance checks for CPUs and relaxed neighbourhood checks with interaction masking for GPUs, allowing them to assess the impact of these choices on different hardware platforms. The work meticulously examines the trade-offs between data layout, spatial search algorithms, and hardware-specific optimisation strategies, ultimately aiming to unlock significant performance gains in particle simulations.
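The difference between a tight distance check and a relaxed check with interaction masking can be sketched on the SPH density sum, where each particle accumulates the kernel-weighted masses of its neighbours. The example below is a simplified 1D illustration under stated assumptions: the kernel is an unnormalised placeholder with compact support, not the kernel used in the study, and all pairs are visited by brute force rather than through a tree.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Placeholder kernel (assumption): unnormalised, meaningful only for r < h.
inline double kernel(double r, double h) {
    const double q = r / h;
    return 1.0 - q * q;
}

// CPU-style: tight distance check, pairs outside the support are skipped.
void density_tight(const std::vector<double>& x, const std::vector<double>& m,
                   double h, std::vector<double>& rho) {
    assert(x.size() == m.size() && x.size() == rho.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j) {
            const double r = std::fabs(x[i] - x[j]);
            if (r < h) sum += m[j] * kernel(r, h);
        }
        rho[i] = sum;
    }
}

// GPU-style: evaluate every candidate pair and multiply by a 0/1 mask,
// so all iterations execute the same instructions without branching.
void density_masked(const std::vector<double>& x, const std::vector<double>& m,
                    double h, std::vector<double>& rho) {
    assert(x.size() == m.size() && x.size() == rho.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j) {
            const double r = std::fabs(x[i] - x[j]);
            const double mask = (r < h) ? 1.0 : 0.0;
            sum += mask * m[j] * kernel(r, h);
        }
        rho[i] = sum;
    }
}
```

Both variants include the self-contribution (j equal to i) and produce the same densities. The masked form does redundant arithmetic for out-of-range pairs, which is the trade-off accepted on SIMT hardware to avoid divergent control flow.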

Data Layout and Precision Optimisation for SPH

This work presents a breakthrough in optimizing particle simulation codes, specifically for Smoothed Particle Hydrodynamics (SPH), by exploring data layout transformations and reduced-precision computing on modern GPU architectures. Researchers hypothesized that a struct-of-arrays (SoA) layout would particularly benefit Single Instruction, Multiple Thread (SIMT) execution models, while array-of-structures (AoS) remains common in many Lagrangian codes. The study rigorously investigates whether data conversions and precision reductions should occur on the CPU or be deployed directly on the GPU, especially given the increasing integration of CPUs and GPUs with shared memory spaces. Experiments demonstrate significant performance gains through careful orchestration of data layout and precision.

Specifically, on Nvidia GH200 platforms, the team achieved a speedup of approximately 2.6 for certain compute kernels. Researchers permit particle properties to be accessible directly or indirectly via particle-held pointers, and encode algorithmic phases using local or pairwise functions. The simulation model assumes a conservative baseline typical of modern SPH codes, working with either contiguous or scattered AoS particle data. The study defines a standard set of per-particle quantities, including position, velocity, internal energy, mass, and smoothing length, and explores their storage formats within both AoS and SoA layouts. These detailed investigations confirm the potential for substantial performance improvements through intelligent data management and precision control in particle simulations. The techniques developed are applicable not only to SPH but also to a wide range of Lagrangian codes and beyond.
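As an illustration of how the listed per-particle quantities might be packed into a reduced-precision SoA container, consider the sketch below. It assumes three-dimensional positions and velocities and uses `float` as the reduced format; the names `Particle`, `ParticleArrays`, `pack` and `bytes_per_particle` are hypothetical, and the precision formats actually studied in the paper may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline AoS record holding the per-particle quantities named in the text.
struct Particle {
    std::array<double, 3> position;
    std::array<double, 3> velocity;
    double internal_energy;
    double mass;
    double smoothing_length;
};

// SoA container parameterised on the storage precision (float = reduced).
template <typename Real>
struct ParticleArrays {
    std::array<std::vector<Real>, 3> position, velocity;
    std::vector<Real> internal_energy, mass, smoothing_length;
};

// Convert double-precision AoS records into SoA arrays of type Real.
template <typename Real>
ParticleArrays<Real> pack(const std::vector<Particle>& in) {
    ParticleArrays<Real> out;
    for (const Particle& p : in) {
        for (std::size_t d = 0; d < 3; ++d) {
            out.position[d].push_back(static_cast<Real>(p.position[d]));
            out.velocity[d].push_back(static_cast<Real>(p.velocity[d]));
        }
        out.internal_energy.push_back(static_cast<Real>(p.internal_energy));
        out.mass.push_back(static_cast<Real>(p.mass));
        out.smoothing_length.push_back(static_cast<Real>(p.smoothing_length));
    }
    assert(out.mass.size() == in.size());
    return out;
}

// Payload per particle: 3 + 3 + 1 + 1 + 1 = 9 scalars of the chosen type.
template <typename Real>
constexpr std::size_t bytes_per_particle() {
    return 9 * sizeof(Real);
}
```

With 8-byte doubles and 4-byte floats, the payload drops from 72 to 36 bytes per particle. Whether that halving translates into runtime gains depends on the kernel and the hardware, which is the question the measurements address.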

AoS, SoA and GPU Performance Limits

This study investigated the performance of transforming data between Array of Structures (AoS) and Structure of Arrays (SoA) formats, combined with reduced-precision data layouts, for particle simulation codes running on various GPU platforms. The results demonstrate that performance is significantly influenced by the interplay between kernel arithmetic intensity, memory access patterns, and interconnect bandwidth. Kernels heavily reliant on memory traffic, such as Kick and Drift, benefited most from these optimizations, while the computationally intensive Force kernel showed more moderate gains. The research revealed that modern GPUs, with their fast memory, do not always benefit substantially from reductions in memory footprint.
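A rough back-of-the-envelope estimate shows why a kernel such as Drift is dominated by memory traffic. The sketch below assumes that Drift only advances three position components by velocity times time step, that each scalar is moved to or from memory exactly once, and that a multiply and an add count as one operation each; the paper's actual kernel may do more work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed Drift step on SoA data: position += velocity * dt per component.
template <typename Real>
void drift(std::vector<Real>& x, std::vector<Real>& y, std::vector<Real>& z,
           const std::vector<Real>& vx, const std::vector<Real>& vy,
           const std::vector<Real>& vz, Real dt) {
    assert(x.size() == vx.size() && y.size() == vy.size() &&
           z.size() == vz.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] += vx[i] * dt;
        y[i] += vy[i] * dt;
        z[i] += vz[i] * dt;
    }
}

// Operations per byte under the stated assumptions:
// 6 operations (3 multiplies, 3 adds) against 9 scalar accesses
// (6 loads, 3 stores) per particle.
template <typename Real>
constexpr double drift_intensity() {
    return 6.0 / (9.0 * sizeof(Real));
}
```

Under these assumptions the intensity is about 0.08 operations per byte in double precision and about 0.17 in single precision, far below what a modern GPU needs to become compute-bound. This is consistent with the observation that such kernels respond most strongly to layout and footprint changes, while a compute-heavy Force kernel responds less.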

Notably, Nvidia’s GH200 platform achieved the strongest performance improvements, with speedups reaching up to 2.6 times using in-place transformations and reduced precision. However, gains on AMD systems were more limited, with improvements largely restricted to small particle counts. Reduced precision alone had a limited overall effect, and the MI200 was the only system to consistently benefit from it. The authors acknowledge that performance gains are vendor-dependent and that further research is needed to fully understand the interactions between hardware and software optimizations. Future work may focus on developing adaptive transformation strategies.

👉 More information
🗞 Compiler-supported reduced precision and AoS-SoA transformations for heterogeneous hardware
🧠 ArXiv: https://arxiv.org/abs/2512.05516

Compiler-supported Reduced Precision and AoS-SoA Transformations Enhance Heterogeneous Hardware Performance

Rohail T.
