Researchers are tackling a significant bottleneck in the development of increasingly large language models: the inefficient use of diverse GPU hardware. Heehoon Kim, Jaehwan Lee, and Taejeoung Kim, from Seoul National University and Samsung Research, together with Jongwon Park, Jinpyo Kim, Pyongwon Suh, and colleagues, have developed HetCCL, a new collective communication library designed to overcome limitations in current deep learning frameworks. HetCCL enables fast, RDMA-based communication between GPUs from different vendors without requiring driver modifications, offering a pathway to cost-effective, high-performance training across heterogeneous GPU clusters and making practical large language model development possible on readily available hardware.
This breakthrough addresses a critical inefficiency in deep learning frameworks, which currently lack support for collective communication across heterogeneous GPU clusters, leading to increased costs and reduced performance.
The research team achieved cross-vendor communication by leveraging optimised vendor libraries, specifically NVIDIA NCCL and AMD RCCL, through two innovative mechanisms integrated within HetCCL. Evaluations conducted on a multi-vendor GPU cluster demonstrate that HetCCL not only matches the performance of NCCL and RCCL in homogeneous environments but also uniquely scales in heterogeneous setups.
This study presents a practical solution for high-performance training that uses both NVIDIA and AMD GPUs without requiring changes to existing deep learning applications. The rapid emergence of trillion-scale deep learning models demands enormous computational capability, often supplied by heterogeneous cluster systems equipped with a variety of hardware accelerators.
GPU-based platforms, particularly those utilising NVIDIA or AMD GPUs, currently dominate deep learning, yet parallel training across GPUs from different vendors has remained a significant challenge due to incompatible communication backends. HetCCL overcomes this limitation by enabling seamless communication, thereby unlocking the potential of heterogeneous GPU clusters for large-scale model training.
The research establishes a method for direct point-to-point communication, utilising RDMA, between GPUs from different vendors. This is achieved through a carefully designed implementation of heterogeneous GPU collective communication operations, abstracting platform-specific APIs and integrating vendor-optimised operations into a unified framework.
Experiments show that HetCCL significantly accelerates the training of large language models on multi-vendor GPU clusters, outperforming homogeneous setups while avoiding the straggler effects and loss of model accuracy that often accompany mixed-hardware training. By replacing existing communication backends with HetCCL, researchers can now use GPUs from both NVIDIA and AMD within existing parallel training code, written in frameworks like PyTorch, without any code modifications.
This work represents the first demonstration of transparent use of every GPU in a multi-vendor heterogeneous cluster, supporting NVIDIA and AMD GPUs, which together dominate the accelerator market with approximately 88% and 12% market share respectively. The team’s contributions pave the way for building scalable and cost-effective AI infrastructure, essential for the advancement of next-generation distributed machine learning systems.
Heterogeneous GPU Collective Communication via Direct RDMA Interconnects enables scalable multi-GPU training
Scientists developed HetCCL, a collective communication library designed to unify vendor-specific backends and facilitate RDMA-based communication across GPUs without driver modifications. The research addresses inefficiencies arising from the lack of cross-vendor collective communication support in current deep learning frameworks, particularly within expanding GPU clusters.
HetCCL enables practical, high-performance training utilising both NVIDIA and AMD GPUs concurrently, without requiring alterations to existing deep learning applications. Researchers engineered a method for direct point-to-point communication, leveraging RDMA between GPUs from different vendors. Experiments employed standard InfiniBand and RoCE networks, bypassing the CPU to directly access GPU memory via network interface cards.
This approach circumvents the bandwidth limitations of host memory staging, a common bottleneck in inter-node communication, as depicted in Figure 1a and 1b of the work. The study pioneered the integration of vendor-optimised operations, specifically NVIDIA’s NCCL and AMD’s RCCL, into a unified framework.
The team implemented heterogeneous GPU collective communication operations, abstracting platform-specific APIs to create a seamless interface. This involved registering device memory allocated with vendor-specific APIs, such as cudaMalloc or hipMalloc, through the IB Verbs API so that RDMA-capable NICs can directly access GPU memory regions.
By replacing the original communication backends with HetCCL, existing parallel training code written in frameworks like PyTorch can utilise GPUs from both vendors, as illustrated in Figure 2b. Evaluations on a multi-vendor GPU cluster demonstrated that HetCCL achieves performance comparable to NCCL and RCCL in homogeneous setups, while uniquely scaling in heterogeneous environments.
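To make the "no code modifications" claim concrete, the sketch below shows a minimal PyTorch DDP setup in which only the backend string changes between a homogeneous and a heterogeneous run. The backend name "hetccl" is a hypothetical placeholder: the article says HetCCL replaces the native backend, but does not document the exact identifier or registration mechanism, so treat this as an illustrative assumption rather than the library's published API.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal DDP script, typically launched with `torchrun --nproc_per_node=N train.py`.
# Only the backend string changes when moving from a homogeneous NVIDIA cluster
# ("nccl") to a heterogeneous one; "hetccl" is a hypothetical name used purely
# for illustration -- the article does not state which identifier HetCCL
# registers with PyTorch.
backend = os.environ.get("DIST_BACKEND", "nccl")  # e.g. "nccl" or, hypothetically, "hetccl"
dist.init_process_group(backend=backend, init_method="env://")

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# The model, optimiser, and training loop stay exactly the same regardless of backend.
model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for _ in range(10):
    x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
    loss = model(x).sum()
    loss.backward()                 # gradient all-reduce handled by the chosen backend
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```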
Heterogeneous GPU performance via RDMA and a unified communication library enables scalable multi-GPU training
Scientists have developed HetCCL, a novel collective communication library that unifies vendor-specific backends and facilitates RDMA-based communication across GPUs without driver modifications. The research addresses the growing inefficiency and costs associated with expanding GPU clusters with hardware from multiple vendors.
Experiments revealed that HetCCL achieves performance matching NVIDIA NCCL and AMD RCCL in homogeneous GPU setups. Crucially, HetCCL uniquely scales in heterogeneous environments, enabling practical, high-performance training utilising both NVIDIA and AMD GPUs without requiring alterations to existing deep learning applications.
The team measured direct point-to-point communication via RDMA between GPUs from different vendors, a key innovation within HetCCL. This capability bypasses the CPU, significantly reducing memory-copy overhead and exploiting the higher bandwidth of the interconnect network. HetCCL supports NVIDIA and AMD GPUs, which together dominate the accelerator market with approximately 88% and 12% market share respectively.
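From the application's side, that point-to-point path is reached through the usual framework primitives. The toy example below moves a tensor between two ranks with torch.distributed's send/recv; it uses the stock gloo backend and CPU tensors so it runs on any machine, and it makes no claim about HetCCL's internal RDMA engine, which the article says would serve the same transfer GPU-to-GPU across vendors.

```python
import torch
import torch.distributed as dist

# Two-rank point-to-point transfer, e.g. `torchrun --nproc_per_node=2 p2p.py`.
# With the stock "gloo" backend the tensor travels through host memory; the
# article's claim is that HetCCL serves the same send/recv directly between
# GPU memories over RDMA, even when the two GPUs come from different vendors.
dist.init_process_group(backend="gloo", init_method="env://")
rank = dist.get_rank()

tensor = torch.zeros(4)
if rank == 0:
    tensor += 42.0
    dist.send(tensor, dst=1)            # rank 0 -> rank 1
elif rank == 1:
    dist.recv(tensor, src=0)
    print(f"rank {rank} received {tensor.tolist()}")

dist.destroy_process_group()
```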
By replacing native communication backends like NCCL and RCCL with HetCCL, researchers enabled existing parallel training code to seamlessly utilise GPUs from both vendors. The breakthrough delivers a unified framework for heterogeneous GPU clusters, abstracting platform-specific APIs and integrating vendor-optimised operations.
Evaluations on multi-vendor GPU clusters showed substantial performance gains in large language model training. Tests show that HetCCL avoids straggler effects and maintains model accuracy while achieving faster training than homogeneous setups. The authors report that HetCCL is the first cross-vendor CCL to enable deep learning model training and inference on heterogeneous clusters without source-code modifications at any level. This work establishes a crucial foundation for building scalable and cost-effective AI infrastructure.
Heterogeneous GPU cluster training via unified communication and RDMA integration enables scalable deep learning
Scientists have developed HetCCL, a collective communication library designed to improve the efficiency of deep learning training across diverse GPU clusters. Current deep learning frameworks often struggle with communication between GPUs from different vendors, leading to performance bottlenecks and increased costs.
HetCCL addresses this issue by unifying vendor-specific backends and enabling RDMA-based communication without requiring alterations to existing GPU drivers. The library introduces two key mechanisms that facilitate cross-vendor communication while still utilising the optimised vendor libraries, specifically NVIDIA NCCL and AMD RCCL.
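The article does not spell out what those two mechanisms are, so the sketch below should be read as one plausible framework-level composition rather than HetCCL's actual design: each vendor's GPUs form their own process group served by a vendor-optimised collective library, one leader rank per vendor exchanges partial results point-to-point, and the combined result is broadcast back. The group layout, leader choice, and use of gloo/CPU for the runnable demo are all assumptions for illustration.

```python
import torch
import torch.distributed as dist

# Illustrative hierarchical all-reduce over two "vendor islands".
# Launch with e.g. `torchrun --nproc_per_node=4 hier_allreduce.py` (even world size).
# Ranks 0..N/2-1 stand in for NVIDIA GPUs, the rest for AMD GPUs -- a made-up
# layout; in a real deployment each island would use NCCL or RCCL and the
# leader exchange would ride on RDMA.  Here everything uses gloo on CPU so the
# sketch runs anywhere.
dist.init_process_group(backend="gloo", init_method="env://")
rank, world = dist.get_rank(), dist.get_world_size()

nvidia_ranks = list(range(world // 2))
amd_ranks = list(range(world // 2, world))
nvidia_group = dist.new_group(nvidia_ranks)   # must be created on every rank
amd_group = dist.new_group(amd_ranks)
in_nvidia = rank in nvidia_ranks
my_group = nvidia_group if in_nvidia else amd_group
leader = nvidia_ranks[0] if in_nvidia else amd_ranks[0]
other_leader = amd_ranks[0] if in_nvidia else nvidia_ranks[0]

x = torch.full((4,), float(rank))             # each rank's local contribution

# 1) Intra-island reduction with the island's own (vendor-optimised) collective.
dist.all_reduce(x, op=dist.ReduceOp.SUM, group=my_group)

# 2) Cross-island exchange between the two leaders (point-to-point).
if rank == leader:
    peer = torch.empty_like(x)
    if rank < other_leader:
        dist.send(x, dst=other_leader)
        dist.recv(peer, src=other_leader)
    else:
        dist.recv(peer, src=other_leader)
        dist.send(x, dst=other_leader)
    x += peer

# 3) Broadcast the global sum back inside each island.
dist.broadcast(x, src=leader, group=my_group)
print(f"rank {rank}: {x.tolist()}")           # every rank now holds the element-wise global sum
dist.destroy_process_group()
```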
Evaluations conducted on a multi-vendor GPU cluster demonstrated that HetCCL achieves performance comparable to NCCL and RCCL in homogeneous environments. Importantly, HetCCL uniquely scales effectively in heterogeneous environments, allowing for high-performance training using GPUs from multiple vendors without requiring changes to existing deep learning applications.
The relative error in final loss values across all comparisons was below 7 × 10⁻³, remaining within acceptable numerical tolerances. This research significantly expands the possibilities for machine learning practitioners by enabling more flexible use of available accelerators and facilitating larger batch sizes and higher training throughput.
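For readers who want to apply the same accuracy criterion to their own runs, the check amounts to the snippet below; the 7 × 10⁻³ tolerance is the figure quoted above, while the loss values themselves are placeholders.

```python
# Relative error between the final loss of a heterogeneous (HetCCL) run and a
# homogeneous baseline run.  The two loss values are placeholders; the 7e-3
# tolerance is the bound reported in the article.
loss_baseline = 2.3415   # placeholder: final loss of the homogeneous run
loss_hetccl = 2.3298     # placeholder: final loss of the heterogeneous run

rel_err = abs(loss_hetccl - loss_baseline) / abs(loss_baseline)
assert rel_err < 7e-3, f"relative error {rel_err:.2e} exceeds tolerance"
print(f"relative error: {rel_err:.2e}")
```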
HetCCL removes a key barrier to utilising heterogeneous training infrastructure, which is becoming increasingly common. The authors acknowledge a limitation in that the full potential of the system depends on the underlying RDMA network infrastructure. Future work could explore further optimisations for specific network topologies and investigate the application of HetCCL to a wider range of deep learning workloads.
👉 More information
🗞 HetCCL: Accelerating LLM Training with Heterogeneous GPUs
🧠 ArXiv: https://arxiv.org/abs/2601.22585
