GPU Portability Layers: Evaluating Application Characteristics for NVIDIA and Intel Deployments

GPUs now underpin much of high performance computing, and their use is expanding rapidly within High Energy Physics. Mohammad Atif (Brookhaven National Laboratory), Meghna Bhattacharya and Mark Dewing (Fermi National Accelerator Laboratory), together with Zhihua Dong, Julien Esseiva, Oliver Gutsche and colleagues, investigated how well different GPU portability layers cope with varied application demands. Avoiding the vendor lock-in of CUDA programming is crucial for wider deployment, but layers such as Kokkos, Alpaka, SYCL, OpenMP and std::par each have strengths and weaknesses. The research advances the field by identifying key application characteristics, assessed through representative workloads from CMS, DUNE and ATLAS, that demonstrably influence portability layer performance, helping developers select the technology best suited to their needs.

GPU Portability Layer Performance on Heterogeneous Hardware

Scientists have carried out a comprehensive study evaluating the performance of various GPU portability layers, which are crucial for maintaining flexibility in high-performance computing as the landscape of GPU providers diversifies. This work addresses a critical challenge: avoiding vendor lock-in by enabling code to run efficiently on hardware from NVIDIA, AMD, and Intel without being restricted to CUDA or other proprietary languages. The team achieved a detailed understanding of performance variations, identifying key characteristics within representative applications from major high energy physics experiments that significantly impact the choice of portability technology.
The study focused on analysing heterogeneous applications from CMS (patatrack and p2r), DUNE (Wire-Cell Toolkit), and ATLAS (FastCaloSim), meticulously identifying application characteristics that exhibit differing behaviours across the various portability technologies. Experiments show that launching a kernel onto a GPU isn’t instantaneous, with latency ranging from microseconds to tens of microseconds, dependent on GPU architecture, CPU speed, and driver versions. Notably, Kokkos was found to add significant launch latency, particularly on AMD GPUs, potentially negating performance gains for short-running kernels. The team discovered significant incompatibilities between Kokkos and concurrency mechanisms like Intel’s Threading Building Blocks, due to serialisation locks and restrictions on external threads, severely hindering multi-threaded application performance. Furthermore, inconsistencies were observed in SYCL implementations, with some serialising concurrent kernel calls depending on the compiler manufacturer and version, demonstrating the rapid evolution of these technologies. These findings underscore the need for careful evaluation of portability layer support for existing concurrency models.
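
To give a sense of scale for these launch-latency figures, the following minimal C++/Kokkos sketch (our illustration, not code from the study) times a batch of empty kernel launches. The trial count, labels, and output format are arbitrary, and the measured figure bundles launch overhead with the completion of trivially short kernels, so it is only an approximation of pure launch latency.

```cpp
#include <Kokkos_Core.hpp>
#include <chrono>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Warm-up launch so one-time backend initialisation is not counted.
    Kokkos::parallel_for("warmup", 1, KOKKOS_LAMBDA(const int) {});
    Kokkos::fence();

    const int trials = 1000;
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < trials; ++i) {
      // Empty kernel: almost all of the measured time is launch overhead.
      Kokkos::parallel_for("empty", 1, KOKKOS_LAMBDA(const int) {});
    }
    Kokkos::fence();
    const auto t1 = std::chrono::steady_clock::now();

    const double avg_us =
        std::chrono::duration<double, std::micro>(t1 - t0).count() / trials;
    std::printf("approximate cost per launch: %.2f us\n", avg_us);
  }
  Kokkos::finalize();
  return 0;
}
```

Running the same sketch on different backends, GPUs, and driver versions is one simple way to see the microsecond-to-tens-of-microseconds spread the study reports.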

The work opens new avenues for informed decision-making for developers, providing a clear understanding of how different portability technologies handle various use cases and code structures. Researchers also investigated external library and compiler compatibility, recognising that many HEP applications rely on diverse external libraries. By meticulously analysing these factors, the study provides a valuable resource for optimising application performance and ensuring portability across a rapidly evolving hardware landscape. Ultimately, this research empowers developers to select the most appropriate GPU portability technology, maximising efficiency and future-proofing their applications for the next generation of high-performance computing platforms.

GPU Portability Layer Performance of Scientific Applications

Scientists undertook a detailed study of application and kernel characteristics to guide the selection of GPU portability layers, analysing representative heterogeneous applications from CMS (patatrack and p2r), DUNE (Wire-Cell Toolkit), and ATLAS (FastCaloSim). The research identified key application characteristics exhibiting differing behaviours across various portability technologies, enabling developers to make informed decisions regarding optimal GPU portability solutions. Experiments employed serial and threaded backends of Kokkos, revealing that its serial backend implements a lock serialising concurrent calls, while the threads backend explicitly forbids calls from external threads, resulting in poor performance for multi-threaded applications. Concurrent GPU kernel launches were only achievable with CUDA and HIP backends, utilising architecture-specific APIs that inherently limit portability, though Kokkos is developing a prototype feature of “partitioned execution spaces” to potentially address this limitation.
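
As an illustration of the architecture-specific route described above, the hedged sketch below (not code from the study, and assuming a Kokkos build with the CUDA backend) wraps two CUDA streams in Kokkos::Cuda execution-space instances so that two kernels can be enqueued concurrently; the function name, view sizes, and kernel bodies are invented for the example.

```cpp
#include <Kokkos_Core.hpp>
#include <cuda_runtime.h>

// Concurrent kernel launches through the Kokkos CUDA backend: each
// execution-space instance wraps its own cudaStream_t, so the two
// parallel_for calls below may overlap on the device.
void launch_concurrently(int n) {
  cudaStream_t s0, s1;
  cudaStreamCreate(&s0);
  cudaStreamCreate(&s1);
  {
    // Non-owning execution-space instances built from the raw streams.
    Kokkos::Cuda exec0(s0), exec1(s1);

    Kokkos::View<double*> a("a", n), b("b", n);

    Kokkos::parallel_for("fill_a",
        Kokkos::RangePolicy<Kokkos::Cuda>(exec0, 0, n),
        KOKKOS_LAMBDA(const int i) { a(i) = 1.0 * i; });

    Kokkos::parallel_for("fill_b",
        Kokkos::RangePolicy<Kokkos::Cuda>(exec1, 0, n),
        KOKKOS_LAMBDA(const int i) { b(i) = 2.0 * i; });

    // Wait for both streams before the views go out of scope.
    exec0.fence();
    exec1.fence();
  }
  cudaStreamDestroy(s0);
  cudaStreamDestroy(s1);
}
```

Because the sketch depends on cudaStream_t and Kokkos::Cuda directly, an AMD deployment would need a parallel HIP variant, which is precisely the portability cost the authors point out.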

The study also observed inconsistencies in SYCL implementations’ ability to launch GPU kernels, with some serialising concurrent kernel calls from different threads, often depending on the compiler manufacturer and version. Furthermore, AMD GPUs appeared to lack any SYCL implementation supporting concurrent kernel launches, highlighting the rapid evolution and fragmentation within the SYCL landscape. Researchers also examined external library and compiler compatibility, noting that many versions of Eigen are incompatible with the NVIDIA CUDA compiler nvcc, which affects applications using Kokkos and Alpaka since both employ nvcc for NVIDIA device compilation. ROOT, a widely used HEP library, currently lacks full compatibility with the NVIDIA C++ compiler nvc++; working around this requires complex build rules that compile with both g++ and nvc++ and link the compatible code segments together, along with careful data allocation to avoid performance penalties.

Scientists investigated the impact of data structure complexity, finding that the complex, object-oriented data structures common in HEP experiments map poorly onto GPUs, which favour flat data layouts. Portability layers such as Kokkos, SYCL, and Alpaka offer constructs like Kokkos::Views and sycl::buffers to enable portability, but these introduce overheads for allocations and data transfers, particularly with numerous small objects; initialising Kokkos::Views can significantly impact performance, although this cost can be avoided with Kokkos::ViewAllocateWithoutInitializing. The team also observed that automatic memory transfers using Unified Shared Memory were invariably slower than explicit transfers, and that none of the APIs could gracefully represent multidimensional vectors of varying sizes, requiring manual crafting or padding. Finally, the work assessed the performance of Random Number Generators and Fast Fourier Transforms, noting that the native implementations from NVIDIA, AMD, and Intel are not portable, while Kokkos provides its own consistent implementations across architectures.
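
The allocation and transfer behaviour described above can be made concrete with a short Kokkos sketch; it is a minimal illustration assuming a standard Kokkos installation, with the function name and sizes invented for the example.

```cpp
#include <Kokkos_Core.hpp>

// Default Views are zero-initialised by an extra device kernel, which adds
// measurable cost when many small Views are created; wrapping the label in
// Kokkos::ViewAllocateWithoutInitializing skips that step. The second part
// shows an explicit host-to-device copy via a mirror view, rather than
// relying on unified/shared memory for automatic transfers.
void allocate_and_copy(int n) {
  // Default: memory is zero-initialised by an extra kernel launch.
  Kokkos::View<float*> initialised("initialised", n);

  // No initialisation kernel; contents are undefined until written.
  Kokkos::View<float*> raw(Kokkos::ViewAllocateWithoutInitializing("raw"), n);

  // Explicit transfer: fill a host mirror, then deep_copy to the device.
  auto raw_host = Kokkos::create_mirror_view(raw);
  for (int i = 0; i < n; ++i) raw_host(i) = static_cast<float>(i);
  Kokkos::deep_copy(raw, raw_host);
}
```

Skipping the initialisation kernel is only safe when every element is written before it is read, which is typically the case when a View is filled from a host-side copy as here.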

GPU Portability Layer Launch Latency Variations Are Significant

Experiments revealed that the first call to rocRAND, AMD’s GPU random number generation library, incurred launch latencies sometimes exceeding one second on AMD GPUs. Such a delay poses a significant challenge for applications with short kernel runtimes, where launch latency can dominate execution time. The team found that kernels whose runtime is comparable to the launch latency can be severely impacted, necessitating techniques such as asynchronous kernel launches to hide the overhead. The data show that careful consideration of launch latency is critical when selecting a portability layer for performance-sensitive applications.
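
One common mitigation is to enqueue work asynchronously so that the launch overhead of one kernel overlaps the execution of the previous one. The minimal Kokkos sketch below illustrates the idea; the function name and chunking scheme are invented for the example, and real applications would choose chunk sizes based on their own kernels.

```cpp
#include <Kokkos_Core.hpp>

// With a GPU backend, Kokkos::parallel_for returns once the kernel is
// enqueued. Submitting all chunks first and fencing once lets the launch
// overhead of chunk i+1 overlap the execution of chunk i, instead of
// paying launch latency plus runtime serially for every chunk.
void process_chunks(Kokkos::View<double*> data, int chunks) {
  const int n = static_cast<int>(data.extent(0));
  const int chunk = (n + chunks - 1) / chunks;

  for (int c = 0; c < chunks; ++c) {
    const int begin = c * chunk;
    const int end = (begin + chunk < n) ? begin + chunk : n;
    Kokkos::parallel_for("scale_chunk",
        Kokkos::RangePolicy<>(begin, end),
        KOKKOS_LAMBDA(const int i) { data(i) *= 2.0; });
    // No fence here: the next launch is issued while this one runs.
  }
  Kokkos::fence();  // Single synchronisation point at the end.
}
```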

Further analysis focused on concurrency and thread pool compatibility, uncovering significant incompatibilities with Kokkos. The serial backend of Kokkos implements a lock that serialises concurrent calls, while the threads backend explicitly forbids calls from external threads, resulting in poor performance for multi-threaded applications. Scientists recorded that concurrent GPU kernel launches with Kokkos are limited to the CUDA and HIP backends, restricting portability. Portability layers as a whole aim to mitigate the vendor lock-in associated with CUDA programming, enabling code to run on GPUs from NVIDIA, Intel, and AMD. Researchers analysed applications from CMS, DUNE, and ATLAS, identifying key characteristics influencing the suitability of each portability layer. The study demonstrates that no single solution is universally optimal; the best choice depends heavily on the specific application’s attributes and the runtime environment.

The findings highlight the importance of careful analysis before selecting a portability layer, considering factors such as an application’s external dependencies and the target hardware. Developers must weigh compatibility, performance, and portability to avoid portability and performance issues down the line. While these layers are rapidly evolving, with growing feature support and performance optimisation, continuous monitoring is needed to ensure a good match between the code base and the chosen layer. Future advances, such as the C++ standards for concurrency and offloading, may simplify these choices in the long term, though widespread hardware support remains several years away.

👉 More information
🗞 Evaluating Application Characteristics for GPU Portability Layer Selection
🧠 ArXiv: https://arxiv.org/abs/2601.17526

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
