Parallel computing in Julia is emerging as a powerful alternative to traditional high-performance computing, thanks to a new library called AcceleratedKernels.jl. A team at the University of Birmingham, led by Andrei-Leonard Nicusan, Dominik Werner, and Simon Branford, developed the system to simplify the creation of parallel algorithms that run efficiently on diverse hardware, including NVIDIA, AMD, Intel, and Apple accelerators. AcceleratedKernels.jl achieves this through a unique approach to code transpilation, allowing programmers to write a single codebase and deploy it across multiple architectures without significant modification; benchmarks demonstrate performance comparable to, and sometimes exceeding, that of highly optimised C and OpenMP implementations. This advance is particularly significant because it unlocks cost-effective, high-throughput computing: tests on the Baskerville cluster achieved world-class sorting speeds and showed that the economic viability of communication-intensive tasks hinges on utilising high-bandwidth, direct GPU interconnects.
Julia Parallel Sorting for Scientific Data
This research details the development and evaluation of a high-performance parallel sorting implementation in Julia, designed for demanding scientific applications. The team aimed to create a fast and scalable solution for sorting large datasets, a common task in simulations and data analysis. The resulting implementation leverages Julia’s strengths in performance, metaprogramming, and GPU acceleration to achieve significant speed improvements. Key features include Julia’s just-in-time compilation, multiple dispatch, and metaprogramming capabilities, hybrid parallelism supporting both multi-core CPUs and GPUs, and a combination of quicksort and radix sort for optimal performance.
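The interplay of multiple dispatch, just-in-time compilation, and a quicksort/radix-sort combination can be sketched in plain Julia. The following is illustrative only, not the library's actual implementation: a hypothetical `hybridsort!` routes fixed-width unsigned integers to a byte-wise LSD radix sort and falls back to Base's comparison sort (a quicksort-family algorithm) for everything else, with the compiler specialising each method for the concrete element type.

```julia
# Minimal sketch of dispatch-based hybrid sorting (illustrative only;
# not AcceleratedKernels.jl's actual implementation).

# Generic fallback: Base's comparison sort handles arbitrary orderable types.
hybridsort!(v::AbstractVector) = sort!(v)

# Specialisation for fixed-width unsigned integers: stable LSD radix sort,
# processing one byte per pass.
function hybridsort!(v::AbstractVector{T}) where {T<:Unsigned}
    src, dst = v, similar(v)
    for pass in 0:(sizeof(T) - 1)
        shift = 8 * pass
        counts = zeros(Int, 256)
        for x in src                      # histogram of the current byte
            counts[((x >> shift) & 0xff) + 1] += 1
        end
        offsets = cumsum(counts) .- counts  # exclusive prefix sum
        for x in src                      # stable scatter by byte value
            b = ((x >> shift) & 0xff) + 1
            dst[offsets[b] + 1] = x
            offsets[b] += 1
        end
        src, dst = dst, src
    end
    src === v || copyto!(v, src)          # ensure result lands in v
    return v
end
```

Dispatch selects the radix path at compile time, so the branch costs nothing at run time; the same pattern extends naturally to GPU-specific methods.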
Numerous techniques were employed to enhance speed, including cache-aware data layouts, vectorisation, efficient memory management, and a decoupled look-back technique. Extensive benchmarks demonstrate that this implementation achieves state-of-the-art performance on a variety of datasets and hardware configurations, significantly outperforming standard Julia sorting functions and other popular sorting libraries. This advancement offers substantial benefits for computationally intensive scientific tasks, with potential applications in molecular dynamics simulations, computational fluid dynamics, data analysis, granular materials modelling, and autonomous driving algorithms. The versatility of the implementation makes it a valuable tool for a wide range of research areas.
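To see what the decoupled look-back technique improves upon, consider the classic chunked parallel prefix sum it replaces. The sketch below is plain Julia (threads stand in for GPU blocks; this is not the library's code): each chunk scans independently, chunk totals are scanned, and offsets are added in a second sweep. Decoupled look-back fuses the last two phases into the first pass by letting each block inspect predecessors' published totals, saving a full pass over the data.

```julia
# Chunked parallel inclusive prefix sum — the pattern that decoupled
# look-back accelerates. Illustrative only.

function chunked_scan(x::Vector{T}; nchunks::Int = Threads.nthreads()) where {T}
    n = length(x)
    out = similar(x)
    bounds = round.(Int, range(0, n; length = nchunks + 1))
    totals = zeros(T, nchunks)
    # Phase 1: independent local scans per chunk (runs in parallel).
    Threads.@threads for c in 1:nchunks
        acc = zero(T)
        for i in (bounds[c] + 1):bounds[c + 1]
            acc += x[i]
            out[i] = acc
        end
        totals[c] = acc
    end
    # Phase 2: exclusive scan of chunk totals gives each chunk's offset.
    offsets = cumsum(totals) .- totals
    # Phase 3: apply offsets (parallel). Decoupled look-back eliminates this
    # extra sweep by having each block "look back" at predecessors' totals
    # during phase 1.
    Threads.@threads for c in 1:nchunks
        for i in (bounds[c] + 1):bounds[c + 1]
            out[i] += offsets[c]
        end
    end
    return out
end
```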
Portable Parallel Computing via Code Transpilation
The research team developed AcceleratedKernels.jl, a novel library for parallel computing that prioritises portability and performance across diverse hardware, including NVIDIA, AMD, Intel, and Apple accelerators. Unlike many existing approaches, AcceleratedKernels.jl employs a unique transpilation architecture, effectively translating code into instructions optimised for each target processor. This allows developers to write code once and deploy it across a range of systems without significant modification, streamlining development and reducing maintenance burdens.
The core innovation lies in bridging the gap between high-level programming languages and low-level hardware instructions, balancing programmer productivity and computational efficiency. AcceleratedKernels.jl distinguishes itself by offering a high-productivity approach that avoids the limitations of standards-based approaches, API-based systems, and domain-specific languages. The library simplifies parallel programming, allowing developers to express complex computations concisely and intuitively through carefully designed abstractions and efficient code generation techniques. Benchmarks demonstrate that AcceleratedKernels.jl achieves performance comparable to hand-optimised C and OpenMP implementations, while also offering more consistent numerical results, particularly on modern accelerator hardware. The team validated this approach with large-scale tests on the Baskerville Tier 2 UK HPC cluster, achieving world-class sorting throughputs using numerous NVIDIA A100 GPUs. These results highlight the library's scalability and efficiency, demonstrating its ability to harness the power of modern high-performance computing systems. Furthermore, the research revealed that utilising direct GPU-to-GPU interconnects significantly improves performance and cost-effectiveness, suggesting such technologies are crucial for communication-intensive applications.
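The write-once idea builds on a mechanism already present in plain Julia: a single generic function body is specialised by the compiler for whatever concrete array type it receives. The sketch below uses only Base Julia (it is not AcceleratedKernels.jl's API); a transpilation-based library extends the same mechanism so that handing the function a GPU array type retargets the body to device code.

```julia
# A single generic "kernel" body: the JIT compiles a specialised native
# version for each concrete array type it is called with. Given a GPU
# array type (e.g. CuArray), the same source would be retargeted to
# device code — the mechanism transpilation-based libraries build on.
function saxpy!(y::AbstractArray, a::Number, x::AbstractArray)
    @assert axes(y) == axes(x)
    @inbounds @simd for i in eachindex(y, x)
        y[i] = a * x[i] + y[i]
    end
    return y
end
```

Note there is no explicit element type anywhere: `Float32`, `Float64`, and integer versions are all generated on demand from the one definition, which is also how the article's "minimal explicit type specification" claim cashes out.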
Julia Code Transpiles to Diverse Accelerators
AcceleratedKernels.jl represents a significant advance in parallel computing, offering a unified approach to diverse accelerator hardware from NVIDIA, AMD, Intel, and Apple. Unlike many systems requiring separate codebases for each processor type, this library uses a unique transpilation architecture to convert Julia code into the native language of the target hardware, ensuring performance comparable to hand-optimised code. This means developers can write code once and run it efficiently on a variety of platforms, reducing the complexity and effort traditionally required for parallel programming. A key strength of AcceleratedKernels.jl lies in its flexibility and composability, allowing developers to create highly specialised algorithms within a single language and seamlessly integrate them with existing Julia code. This is particularly notable as it enables simultaneous CPU-GPU processing, such as co-sorting data across both types of processors, without requiring any special modifications to the libraries involved. Performance benchmarks demonstrate that AcceleratedKernels.jl achieves speeds on par with, and sometimes exceeding, those of highly optimised C and OpenMP CPU code. In multi-node, multi-device sorting tests on a major UK high-performance computing cluster, the library attained world-class throughputs of 538-855 GB/s using numerous NVIDIA A100 GPUs, approaching the highest reported figures achieved with significantly larger CPU core counts.
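The co-sorting pattern mentioned above can be sketched as a toy in plain Julia, with two CPU tasks standing in for a CPU and a GPU (this is illustrative only; real heterogeneous co-sorting dispatches on array type rather than spawning tasks): each "device" sorts its own partition concurrently, and a final merge combines the sorted runs.

```julia
# Toy heterogeneous co-sort: each "device" (here a Julia task) sorts its
# partition concurrently; a two-pointer merge combines the runs.
function cosort(v::Vector{T}) where {T}
    mid = length(v) ÷ 2
    lo, hi = v[1:mid], v[(mid + 1):end]
    ta = Threads.@spawn sort!(lo)   # "device" 1
    tb = Threads.@spawn sort!(hi)   # "device" 2
    a, b = fetch(ta), fetch(tb)
    out = Vector{T}(undef, length(v))
    i = j = 1
    for k in eachindex(out)         # merge the two sorted runs
        if j > length(b) || (i <= length(a) && a[i] <= b[j])
            out[k] = a[i]; i += 1
        else
            out[k] = b[j]; j += 1
        end
    end
    return out
end
```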
Furthermore, the use of direct GPU-to-GPU interconnects resulted in an average speedup of 4.93x, highlighting the importance of efficient communication pathways for performance. Beyond speed, AcceleratedKernels.jl also offers improved numerical consistency compared to traditional C code, meaning the performance of compiled programs is more predictable and reliable. The library's design, built on Julia's homoiconic nature and on-demand compilation model, allows for highly generic code with minimal explicit type specification, further simplifying development and enhancing performance through aggressive inlining. The library's architecture also promises ease of adaptation to future hardware, such as TPUs and FPGA-like devices, reducing the long-term cost of maintaining parallel codebases.
👉 More information
🗞 AcceleratedKernels.jl: Cross-Architecture Parallel Algorithms from a Unified, Transpiled Codebase
🧠 DOI: https://doi.org/10.48550/arXiv.2507.16710
