Scalable Interconnection Networks Enhance Communication for Heterogeneous Post-Exascale Supercomputers and Data Centers

The relentless demand for greater computing power, fueled by applications like artificial intelligence and complex scientific modeling, presents significant challenges for modern supercomputers and data centers. Joaquin Tarraga-Moreno, from Universidad de Castilla-La Mancha, Daniel Barley of Heidelberg University, and Francisco J. Andujar Munoz from Universidad de Valladolid, alongside their colleagues, address a critical issue: communication bottlenecks that arise as systems become more complex and incorporate numerous accelerators. Their research investigates novel interconnection networks designed to efficiently manage data flow both within individual computing nodes and between them, paving the way for the next generation of high-performance computing infrastructure. By tackling these communication limitations, the team’s work promises to unlock substantial improvements in computational speed and efficiency for a wide range of data-intensive tasks.

DGX GH200 Architecture Performance and Network Modeling

Heterogeneous and tightly integrated systems are crucial for modern computing. This paper analyzes the performance of the NVIDIA DGX GH200 architecture, a high-bandwidth system designed for demanding workloads like artificial intelligence. Researchers modeled the DGX GH200 architecture using a network simulator to understand its communication capabilities. The fundamental building block, the Grace Hopper Superchip, consists of a Grace CPU connected to a Hopper GPU via an NVLink-C2C interconnect delivering 900 GB/s of bandwidth. The Superchip interfaces with the network via PCIe 5.0 and NVLink connections. A compute tray comprises eight GH200 Superchips connected to three NVLink switches, with each Superchip establishing six links to each switch, providing 1.5 Tbps of bandwidth per Superchip. A complete DGX GH200 system with 256 Superchips consists of 32 compute trays, 96 Level 1 (L1) switches, and 36 Level 2 (L2) switches, with each L1 switch connecting to twelve L2 switches with 4 Tbps of bandwidth. The performance of four configurations (32, 64, 128, and 256 GPUs) was analyzed using a random all-to-all traffic pattern, varying the traffic load from 0% to 100% of the Superchip’s capacity. The system achieved a maximum throughput of 300 Tbps, with all four configurations saturating at similar traffic loads, around 50% of the maximum capacity. This saturation is attributed to the slimmed fat-tree topology, which performs optimally when communication is confined within individual chassis of eight GPUs.
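To see why uniform all-to-all traffic stresses the upper tier of this topology, consider how little of it stays inside a single chassis. The short Python sketch below estimates the fraction of uniformly random traffic whose destination shares the sender’s eight-GPU chassis; it is a back-of-the-envelope illustration, not the network simulator used in the study.

```python
import random

# Locality estimate for the 256-GPU DGX GH200 configuration:
# 32 chassis of 8 Superchips each, uniform random all-to-all traffic.
# Illustrative only; this is not the simulator used in the paper.
GPUS_PER_CHASSIS = 8
N_CHASSIS = 32
N_GPUS = GPUS_PER_CHASSIS * N_CHASSIS

random.seed(0)
samples = 200_000
local = 0
for _ in range(samples):
    src = random.randrange(N_GPUS)
    dst = random.randrange(N_GPUS - 1)   # pick a destination != src
    if dst >= src:
        dst += 1
    if dst // GPUS_PER_CHASSIS == src // GPUS_PER_CHASSIS:
        local += 1

# Analytically, only 7 of the other 255 GPUs share the chassis (~2.7%),
# so over 97% of uniform traffic must cross the L1-to-L2 switch level.
print(f"intra-chassis fraction: {local / samples:.3%}")
```

With more than 97% of flows forced onto the inter-chassis links, it is the slimmed upper tier, rather than the injection links, that sets the saturation point.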

Supercomputer Communication Architectures and Topologies

The study addresses communication challenges in modern supercomputers and data centers, focusing on how accelerators such as GPUs and TPUs exchange data efficiently. Researchers analyzed several real-world systems, including the NVIDIA DGX GH200, to understand existing communication architectures and identify areas for improvement. This analysis informed the proposal of novel intra- and inter-node network topologies designed to enhance overall system performance. The DGX GH200 itself is built around the NVIDIA Grace Hopper GH200 superchip, which integrates an Arm-based CPU with a Hopper GPU within a single package.

These components connect via NVLink-C2C, a high-bandwidth, low-latency interface delivering 900 GB/s of bidirectional bandwidth, a substantial improvement over PCIe Gen5 technology. The superchip incorporates 72 Arm Neoverse V2 CPU cores, 96 GB of HBM3 GPU memory with 4 TB/s of bandwidth, and up to 480 GB of LPDDR5X system memory achieving 500 GB/s of bandwidth, creating a powerful heterogeneous computing unit. To scale beyond a single superchip, NVIDIA employs the NVLink Switch System, powered by fourth-generation NVLink and third-generation NVSwitch ASICs. Each NVLink Switch delivers 25.6 Tb/s of full-duplex bandwidth.
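As a quick sanity check, 25.6 Tb/s is consistent with a switch ASIC exposing 64 NVLink4 ports at 400 Gbps each; the port count and per-port rate in the sketch below are our assumptions about the third-generation NVSwitch, not figures taken from the article.

```python
# Sanity check of the quoted 25.6 Tb/s NVLink Switch bandwidth.
# The 64-port and 400 Gbps figures are assumptions about the
# third-generation NVSwitch ASIC, not values from the article.
PORTS = 64          # assumed NVLink4 ports per NVSwitch ASIC
PORT_GBPS = 400     # assumed per-port signalling rate in Gbps

print(f"{PORTS * PORT_GBPS / 1000:.1f} Tb/s")  # -> 25.6 Tb/s
```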

This system forms a two-level, non-blocking slimmed fat-tree topology capable of connecting up to 256 GH200 superchips. Within each chassis, groups of eight Grace Hopper modules connect via three NVLink Switch trays, providing 3.6 TB/s of intra-chassis bisection bandwidth. At the system level, 36 NVLink switches interconnect 32 chassis, achieving a total bisection bandwidth of 115.2 TB/s, over nine times higher than an NDR400 InfiniBand fabric. This hierarchical topology aims to minimize communication bottlenecks and maximize data throughput in large-scale computing environments.
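The quoted bisection figures follow directly from the per-superchip NVLink bandwidth. A minimal sketch, assuming each superchip contributes 450 GB/s per direction to the bisection (half of the 900 GB/s bidirectional figure) and that an NDR400 InfiniBand port carries 400 Gb/s, i.e. 50 GB/s:

```python
# Reproducing the bisection-bandwidth figures quoted above.
# Assumption: each superchip contributes 450 GB/s per direction to the
# bisection (half of its 900 GB/s bidirectional NVLink bandwidth).
PER_SUPERCHIP_GBS = 450

print(f"intra-chassis: {8 * PER_SUPERCHIP_GBS / 1000:.1f} TB/s")    # 3.6
print(f"full system : {256 * PER_SUPERCHIP_GBS / 1000:.1f} TB/s")   # 115.2

# Comparison point: one NDR400 InfiniBand port moves 400 Gb/s = 50 GB/s,
# so a flat NDR fabric with one port per node bisects at 12.8 TB/s.
ndr_tbs = 256 * 50 / 1000
print(f"vs NDR400   : {115.2 / ndr_tbs:.1f}x")  # ~9x, matching the claim
```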

High-Bandwidth Interconnect Achieves 450 Tbps Throughput

Scientists have achieved a breakthrough in high-performance computing interconnects, demonstrating a system capable of 450 Terabits per second (Tbps) of throughput. This achievement addresses the growing demands of data-intensive applications like generative AI and scientific simulations, which require increasingly powerful and interconnected computing resources. The research focused on the NVIDIA DGX GH200 system, a novel architecture designed to overcome communication bottlenecks in accelerator-rich environments. The DGX GH200 utilizes Grace Hopper superchips, integrating Arm-based CPUs with NVIDIA Hopper GPUs within a single package.

These components are interconnected via NVLink-C2C, delivering 900 Gigabytes per second of bidirectional bandwidth, a seven-fold increase over PCIe Gen5 technology. Each superchip combines 72 Arm Neoverse V2 CPU cores, 96 Gigabytes of HBM3 GPU memory with 4 Terabytes per second bandwidth, and up to 480 Gigabytes of LPDDR5X system memory delivering 500 Gigabytes per second bandwidth. This unified memory access model allows both the CPU and GPU to share and access memory coherently, significantly reducing data movement overhead. To scale performance beyond a single superchip, researchers implemented the NVLink Switch System, powered by fourth-generation NVLink and third-generation NVSwitch ASICs.
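The practical impact of that bandwidth gap is easy to put in time terms. The sketch below compares moving the superchip’s full 96 GB of HBM3 over NVLink-C2C versus a PCIe Gen5 link, with the PCIe rate inferred from the article’s seven-fold claim rather than measured:

```python
# Rough transfer-time comparison implied by the quoted figures.
# The PCIe Gen5 rate is inferred from the "seven-fold" claim (900 / 7);
# it is an approximation, not a measured value.
DATA_GB = 96                 # full HBM3 capacity of one superchip
C2C_GBS = 900                # NVLink-C2C bidirectional bandwidth, GB/s
PCIE5_GBS = C2C_GBS / 7      # ~128.6 GB/s, inferred

print(f"NVLink-C2C: {DATA_GB / C2C_GBS * 1e3:6.1f} ms")    # ~106.7 ms
print(f"PCIe Gen5 : {DATA_GB / PCIE5_GBS * 1e3:6.1f} ms")  # ~746.7 ms
```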

Each NVLink Switch provides 25.6 Terabits per second of full-duplex bandwidth, enabling high-speed communication between multiple superchips. Experiments using traffic simulations under random all-to-all workloads demonstrate that the DGX GH200 interconnect substantially outperforms traditional RLFT networks, achieving an unprecedented 450 Tbps of throughput. This result delivers a significant advancement in intra-node communication, paving the way for more efficient and scalable high-performance computing systems.

Fat-Tree Topology Limits DGX GH200 Throughput

This research presents a detailed analysis of network performance within large-scale computing systems, specifically focusing on configurations of NVIDIA DGX GH200 clusters. The team systematically evaluated different interconnect configurations, varying the number of GPUs and associated network switches, to identify bottlenecks and maximize throughput. Results show the system achieved a peak throughput of 450 terabits per second when utilizing the entire cluster, indicating a viable solution for data-intensive artificial intelligence workloads. Importantly, the study reveals that all tested configurations saturate at similar traffic loads, around 50% of the system’s potential capacity. The findings highlight the limitations imposed by the network topology, a slimmed fat-tree design, which performs optimally when communication is confined to individual chassis of eight GPUs. While the system demonstrates high performance, the research indicates that overall throughput is constrained by the bandwidth between the first- and second-level switches.
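That constraint can be captured in a toy saturation model. Assuming a 2:1 oversubscription between the L1 and L2 tiers, a value chosen here to reproduce the reported ~50% saturation point rather than taken from the paper, and the ~97% remote-traffic share of uniform all-to-all traffic, accepted throughput plateaus at roughly half of the injection capacity:

```python
# Toy saturation model for a slimmed (oversubscribed) fat-tree.
# The 2:1 uplink oversubscription is an assumption chosen to reproduce
# the ~50% saturation point; the paper does not state it directly.
REMOTE_FRACTION = 247 / 255      # uniform all-to-all, 32 chassis of 8 GPUs

def accepted(offered: float, oversub: float = 2.0) -> float:
    """Delivered fraction of injection capacity when remote traffic is
    squeezed through uplinks provisioned at 1/oversub of injection."""
    uplink_cap = 1.0 / oversub
    remote = offered * REMOTE_FRACTION
    scale = min(1.0, uplink_cap / remote) if remote > 0 else 1.0
    return offered * scale

for load in (0.2, 0.4, 0.5, 0.6, 0.8, 1.0):
    print(f"offered {load:4.0%} -> accepted {accepted(load):5.1%}")
# Accepted throughput plateaus near 52%, mirroring the observed ~50%.
```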

👉 More information
🗞 Scalable and Efficient Intra- and Inter-node Interconnection Networks for Post-Exascale Supercomputers and Data centers
🧠 ArXiv: https://arxiv.org/abs/2511.04677

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Topology-aware Machine Learning Enables Better Graph Classification with 0.4 Gain

LLMs Enable Strategic Computation Allocation with ROI-Reasoning for Tasks under Strict Global Constraints

January 10, 2026
Lightweight Test-Time Adaptation Advances Long-Term EMG Gesture Control in Wearable Devices

January 10, 2026
Deep Learning Control Achieves Safe, Reliable Robotization for Heavy-Duty Machinery

Generalist Robots Validated with Situation Calculus and STL Falsification for Diverse Operations

January 10, 2026