DiOMP, a distributed OpenMP framework, unifies OpenMP with the Partitioned Global Address Space (PGAS) model to simplify programming for high-performance computing systems with multiple GPUs. Evaluations on NVIDIA A100, NVIDIA Grace Hopper, and AMD Instinct MI250X systems demonstrate improved scalability and performance for applications including matrix multiplication and the Minimod seismic modelling proxy.
The increasing complexity of high-performance computing (HPC) systems, characterised by escalating core counts and diverse processing units, presents significant challenges to software portability and efficient resource utilisation. Researchers are actively seeking methods to simplify programming models while maintaining scalability across heterogeneous architectures. A team led by Baodi Shan and Barbara Chapman from Stony Brook University, alongside Mauricio Araya-Polo from TotalEnergies EP Research & Technology US, detail a novel approach in their paper, ‘DiOMP-Offloading: Toward Portable Distributed Heterogeneous OpenMP’. They present DiOMP, a distributed OpenMP framework designed to unify offloading to accelerators with a partitioned global address space (PGAS) programming model, thereby streamlining the development of applications for complex HPC environments.
DiOMP: A Unified Framework for Distributed OpenMP and PGAS
Traditional hybrid programming approaches for such systems, typically MPI combined with OpenMP, often struggle with efficient distributed GPU memory management and code portability. DiOMP, the distributed OpenMP framework introduced in the paper, addresses these issues by integrating OpenMP target offloading with the Partitioned Global Address Space (PGAS) programming model.
DiOMP builds upon the LLVM/OpenMP infrastructure, leveraging its compiler and runtime capabilities. For inter-node communication, the framework uses either GASNet-EX or GPI-2, choosing between the two one-sided communication libraries according to network topology and system configuration. Crucially, DiOMP abstracts global memory management, supporting both symmetric and asymmetric GPU memory allocation strategies to maximise resource utilisation and application performance. Symmetric allocation distributes memory equally across all GPUs, while asymmetric allocation lets each GPU contribute a different amount, which suits workloads with uneven data requirements.
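To make that distinction concrete, the sketch below shows what requesting the two kinds of allocation might look like from application code. The diomp_rank, diomp_size and diomp_alloc names are hypothetical placeholders, since the article does not spell out DiOMP's actual allocation API; only the shape of the two strategies, identical sizes on every rank versus rank-dependent sizes, follows from the description above.

```c
/* Illustrative sketch only: the diomp_* prototypes are hypothetical stand-ins
 * for DiOMP's real allocation API, which this article does not name. */
#include <stddef.h>

int   diomp_rank(void);            /* hypothetical: this process's rank */
int   diomp_size(void);            /* hypothetical: number of participating GPUs */
void *diomp_alloc(size_t bytes);   /* hypothetical: device allocation registered
                                      in the global address space */

void allocate_partitions(size_t total_rows, size_t cols) {
    int rank   = diomp_rank();
    int nranks = diomp_size();

    /* Symmetric: every GPU holds an identical share of the global array. */
    size_t even_rows = total_rows / nranks;
    double *sym = diomp_alloc(even_rows * cols * sizeof(double));

    /* Asymmetric: ranks hold different shares, e.g. when the domain does not
     * divide evenly; here the last rank simply absorbs the remainder. */
    size_t my_rows = even_rows + (rank == nranks - 1 ? total_rows % nranks : 0);
    double *asym = diomp_alloc(my_rows * cols * sizeof(double));

    (void)sym; (void)asym;  /* compute and communication are not shown here */
}
```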
The framework simplifies application development by transparently handling data distribution and communication, allowing programmers to concentrate on computational logic rather than parallel programming intricacies. DiOMP accommodates a wide range of HPC systems, from CPU-based clusters to GPU-accelerated architectures, ensuring broad applicability and portability. By providing a unified programming model, it reduces the effort required to adapt applications to different hardware platforms.
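As a rough illustration of that division of labour, the kernel below is written as ordinary OpenMP target offloading over one rank's block of a matrix-vector product. Only the diomp_get call, a hypothetical one-sided read from another rank's segment of the global address space, reflects the distributed setting; it is a placeholder rather than DiOMP's documented interface, while the pragma and loop are standard OpenMP.

```c
/* Sketch under stated assumptions: diomp_get is a hypothetical one-sided read
 * from a remote rank's segment; the OpenMP pragma itself is standard. */
#include <stddef.h>

void diomp_get(void *dst, int src_rank, size_t src_offset, size_t bytes);  /* hypothetical */

/* y += A_local * x, where A_local is this rank's block of rows and the needed
 * slice of x is owned by another rank. */
void local_matvec(const double *A_local, double *x_buf, double *y,
                  size_t my_rows, size_t cols, int owner_rank) {
    /* Fetch the remote slice of x into a local buffer (one-sided, with no
     * matching send required on the owning rank). */
    diomp_get(x_buf, owner_rank, 0, cols * sizeof(double));

    /* The computation is plain OpenMP target offloading; nothing in the loop
     * is specific to the distributed setting. */
    #pragma omp target teams distribute parallel for \
            map(to: A_local[0:my_rows*cols], x_buf[0:cols]) \
            map(tofrom: y[0:my_rows])
    for (size_t i = 0; i < my_rows; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < cols; ++j)
            acc += A_local[i * cols + j] * x_buf[j];
        y[i] += acc;
    }
}
```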
DiOMP prioritises code portability and scalability, enabling efficient execution on diverse HPC systems. The framework incorporates a portable collective communication layer, OMPCCL, which ensures compatibility with vendor-specific libraries and minimises reliance on proprietary solutions. This reduces the risk of vendor lock-in and facilitates application migration.
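The article does not show OMPCCL's interface, so the fragment below is only a guess at its general shape: a vendor-neutral collective call that the runtime can map onto NCCL, RCCL, or a fallback path underneath, keeping vendor-specific libraries out of application code. The ompccl_allreduce name and signature are assumptions, not the layer's actual API.

```c
/* Hypothetical sketch of a vendor-neutral collective; the ompccl_* names are
 * placeholders, not taken from the article or the OMPCCL implementation. */
#include <stddef.h>

typedef enum { OMPCCL_SUM, OMPCCL_MAX } ompccl_op_t;                /* hypothetical */
int ompccl_allreduce(const double *sendbuf, double *recvbuf,
                     size_t count, ompccl_op_t op);                 /* hypothetical */

/* Combine per-GPU partial results (e.g. a residual norm) across all ranks.
 * The caller never touches NCCL, RCCL, or MPI directly, so the same call
 * works on NVIDIA- and AMD-based systems alike. */
int reduce_residual(const double *partial, double *global, size_t n) {
    return ompccl_allreduce(partial, global, n, OMPCCL_SUM);
}
```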
Evaluations demonstrate that DiOMP’s performance benefits stem from its effective management of data distribution, communication, and synchronisation across multiple nodes. By minimising data movement and maximising parallelism, DiOMP enables applications to achieve higher performance and scalability.
These design choices directly shape the reliability and reach of the results. Building upon established infrastructure such as LLVM/OpenMP and incorporating a portable communication layer promotes code portability and reduces vendor lock-in. The use of PGAS simplifies programming and enhances scalability, but introduces potential overhead from remote memory access; careful optimisation of the runtime system and efficient communication protocols mitigate this overhead, as the performance evaluations demonstrate.
DiOMP also prioritises ease of use and programmability, allowing developers to leverage existing OpenMP skills. The framework presents a familiar programming model and, by abstracting the complexities of distributed-memory programming, keeps developers' attention on computational logic rather than on parallelisation machinery.
The framework’s modular architecture allows developers to customise and extend its functionality, tailoring it to specific application requirements. This flexibility enables optimisation of application performance and scalability for specific hardware configurations. DiOMP’s ability to handle both symmetric and asymmetric GPU allocations expands its applicability to a wider range of HPC systems and workloads.
DiOMP represents an advancement in parallel programming, offering a unified and efficient framework for developing high-performance applications on diverse HPC platforms. By seamlessly integrating OpenMP with PGAS, DiOMP simplifies the development process and enables developers to achieve higher performance and scalability. The framework’s modular architecture and flexible memory management capabilities make it well-suited for a wide range of applications and HPC environments.
👉 More information
🗞 DiOMP-Offloading: Toward Portable Distributed Heterogeneous OpenMP
🧠 DOI: https://doi.org/10.48550/arXiv.2506.02486
