OpenAI is scaling AI training to 100,000 GPUs, a leap that is quickly revealing a surprising obstacle: “system or network ‘noise’,” a phenomenon borrowed from the world of High Performance Computing. As computations expand, communication across these massive clusters is increasingly dominated by outliers, threatening the performance of synchronous pretraining, where each computational step requires lock-step coordination across thousands of processors. Because the entire system’s efficiency is bottlenecked by the slowest transfer, even sophisticated parallelism techniques cannot prevent a single slow or failed link from stalling progress. To address this, OpenAI designed and deployed Multipath RC (MRC), an extension to RoCE that automatically bypasses failed links and balances network load, aiming for resilient AI supercomputer networking.
Synchronous Pretraining Challenges at Extreme GPU Scale
Scaling AI training to 100,000 GPUs introduces a surprising bottleneck: systemic “noise” within the network itself. OpenAI and Microsoft researchers are confronting limitations previously observed in High Performance Computing (HPC) as they push the boundaries of synchronous pretraining for large language models. This approach, in which computations across vast numbers of GPUs proceed in lock step, is proving increasingly susceptible to slowdowns caused by even minor inconsistencies in network performance. The core challenge is that the duration of each communications round is determined by the slowest transfer, regardless of the parallelism technique employed (pipeline, data, tensor, or expert). Even with substantial redundancy built into the system, the entire operation can be held hostage by a single sluggish connection. As computations scale, the issue is exacerbated by what the team describes as “system or network ‘noise’,” a phenomenon well known in the HPC community.
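A toy simulation makes the arithmetic of this bottleneck concrete. Under the illustrative assumption that each of 10,000 parallel transfers suffers a 5x slowdown with probability 0.1%, nearly every synchronous step contains at least one outlier, so the whole job runs at the speed of its slowest link. All constants here are hypothetical, chosen only to show the effect:

```python
import random

random.seed(0)
N_TRANSFERS = 10_000   # parallel transfers per synchronous step (hypothetical)
BASE_MS = 10.0         # typical transfer time
OUTLIER_P = 0.001      # per-transfer chance of a "noise" event
SLOWDOWN = 5.0         # how much slower an outlier is

def transfer_ms():
    t = random.gauss(BASE_MS, 0.5)
    return t * SLOWDOWN if random.random() < OUTLIER_P else t

# A synchronous step cannot finish until the slowest transfer does.
steps = [max(transfer_ms() for _ in range(N_TRANSFERS)) for _ in range(10)]
print([round(t, 1) for t in steps])
# Each step takes ~50 ms instead of ~10 ms: with 10,000 transfers, the
# probability that at least one is an outlier is 1 - 0.999**10_000, or ~99.995%.
```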
To combat these issues, the collaboration designed Multipath RC (MRC), extending the RoCE Reliable Connection semantic layer. MRC employs packet spraying and adaptive load balancing to distribute traffic evenly across the fabric, and it routes around failed links or misbehaving switches automatically, a requirement the researchers emphasize for a self-healing network capable of operating with minimal human intervention. A key design choice was disabling dynamic routing in favor of source-routed packets using IPv6 segment routing, avoiding conflicts between adaptive mechanisms and simplifying network management, ultimately yielding a resilient, high-performance training cluster at this scale.
Multipath RC (MRC) Extends RoCE for AI Networks
The increasing scale of artificial intelligence training demands increasingly resilient network infrastructure, a challenge felt acutely as OpenAI scaled to 100,000 GPUs. To overcome these limitations, a collaboration between OpenAI, Microsoft, Broadcom, and Nvidia developed Multipath RC (MRC), an extension of the RoCE protocol. MRC employs packet spraying and adaptive load balancing based on Explicit Congestion Notification, allowing the system to bypass failed links automatically, a critical feature given the prevalence of network issues at extreme scale. Researchers acknowledge that pairing this with static routing seems counterintuitive, but they explain how the combination leads to a highly resilient, high-performance training cluster. MRC has been implemented in NICs from Nvidia, AMD, and Broadcom and is already in production, training large language models for applications like ChatGPT and Codex; the specification has been released under an open license through OCP.
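The article does not detail MRC’s load-balancing algorithm, but a generic ECN-driven path selector conveys the idea: each acknowledgement reports whether the packet was ECN-marked, the sender keeps a per-path congestion estimate, and new packets favor the least-congested paths. Everything below (the class name, the decay constant) is an illustrative assumption, not the MRC specification:

```python
import random

class EcnLoadBalancer:
    """Toy per-path congestion tracker driven by ECN marks (hypothetical)."""
    def __init__(self, n_paths, decay=0.9):
        self.congestion = [0.0] * n_paths
        self.decay = decay

    def on_ack(self, path, ecn_marked):
        # Exponentially weighted moving estimate of congestion per path.
        mark = 1.0 if ecn_marked else 0.0
        self.congestion[path] = self.decay * self.congestion[path] + (1 - self.decay) * mark

    def pick_path(self):
        # Prefer the least-congested path; break ties randomly.
        least = min(self.congestion)
        return random.choice([i for i, c in enumerate(self.congestion) if c == least])

lb = EcnLoadBalancer(n_paths=8)
lb.on_ack(path=2, ecn_marked=True)   # path 2 saw congestion
print(lb.pick_path())                # steers away from path 2 while others are clear
```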
Ultra Ethernet Transport (UET) Influences MRC Design
OpenAI and Microsoft researchers are tackling a critical challenge in scaling artificial intelligence: maintaining network resilience across increasingly massive training runs. With OpenAI now able to scale a single synchronous pretraining job to 100,000 GPUs, the teams have developed Multipath RC (MRC), a new transport protocol heavily influenced by Ultra Ethernet Transport (UET). Unlike UET, MRC is designed as a minimal extension to the existing RoCE Reliable Connection semantic layer, leveraging the Verbs API but supporting only RDMA write and write-with-immediate operations at the transport level. This co-design approach allowed the teams to build training clusters with high resilience, even disabling dynamic routing in favor of static IPv6 segment routing (SRv6) because, as the researchers explain, they didn’t want two adaptive routing mechanisms interacting with each other. The resulting multi-plane topology, which breaks each 800Gb/s NIC out into 8x100Gb/s ports, offers lower latency, one-hop reachability of 256 nodes rather than 32, reduced cost, and a smaller blast radius for individual network failures, ultimately simplifying management of these enormous AI training runs.
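The 32-versus-256 reachability figure follows directly from switch radix arithmetic: a 51.2Tb/s switch offers 64 ports at 800Gb/s but 512 ports at 100Gb/s, and with half the ports facing nodes, a single switch hop reaches eight times as many nodes:

```python
SWITCH_GBPS = 51_200   # one 51.2 Tb/s switch ASIC

for port_gbps in (800, 100):
    ports = SWITCH_GBPS // port_gbps   # 64 ports at 800G, 512 at 100G
    downlinks = ports // 2             # half face nodes, half face the next tier
    print(f"{port_gbps}G ports: {downlinks} nodes reachable in one hop")
# 800G ports: 32 nodes reachable in one hop
# 100G ports: 256 nodes reachable in one hop
```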
We took the unusual position of disabling dynamic routing in the switches because we didn’t want two adaptive routing mechanisms interacting with each other and dynamic routing wasn’t adding anything.
MRC Implementation Across NVIDIA, AMD, and Broadcom NICs
The demand for resilient networking in artificial intelligence supercomputers is driving innovation in network interface card (NIC) design, as evidenced by the collaborative development and deployment of Multipath RC (MRC) across hardware from NVIDIA, AMD, and Broadcom. OpenAI, Microsoft, and these manufacturers have implemented MRC in 400 and 800Gb/s RDMA NICs, including NVIDIA ConnectX-8, AMD Pollara and Vulcano, and Broadcom Thor Ultra, to address the challenges of scaling AI training to unprecedented levels. This isn’t simply about bandwidth; the core problem is mitigating “system or network ‘noise’,” a term borrowed from high-performance computing for the outliers that increasingly dominate performance at scale. The design prioritizes resilience, allowing for graceful degradation even with link or fabric failures, and crucially, operates with a simplified control plane, reducing the need for manual intervention. Notably, the collaboration took the unusual position of disabling dynamic routing in the switches to avoid conflicts with MRC’s adaptive routing, instead relying on static paths using IPv6 segment routing.
800 Gb/s vs. 100 Gb/s Multi-Plane Topology Options
The drive to scale AI training to unprecedented levels is forcing a re-evaluation of network topology, moving beyond conventional designs to address inherent limitations. OpenAI and Microsoft have been co-designing network infrastructure alongside their AI models, recognizing that simply adding more bandwidth isn’t enough; resilience and efficient load balancing are paramount. A conventional three-tier Clos topology utilizing 51.2Tb/s switches presents challenges at this scale, potentially requiring four tiers or network oversubscription to accommodate such a massive GPU count. An alternative approach, detailed in their recent work, involves leveraging the bandwidth of 800Gb/s NICs by breaking them out into eight 100Gb/s ports and constructing a multi-plane network. This configuration allows for a two-tier Clos topology, increasing node reachability in one hop from 32 to 256. The benefits extend to cost and power consumption, requiring two-thirds of the optics and three-fifths the number of switches for comparable bisection bandwidth.
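Textbook fat-tree counting suggests where those fractions could come from, though the production topology surely differs in detail. A full-bisection two-tier Clos built from radix-k switches serves k²/2 endpoints with 3k/2 switches and two optical hops per port, while a three-tier design serves k³/4 endpoints with 5k²/4 switches and three hops; normalizing per node reproduces the quoted ratios. This is a hedged back-of-the-envelope, not the teams’ own derivation:

```python
# Per-port switch cost: 3/k for a two-tier fat tree, 5/k for three tiers.
three_tier = 5 / 64          # one 800G port on a radix-64 (800G) fabric
two_tier = 8 * (3 / 512)     # eight 100G ports, each on a radix-512 plane
print(f"switches: {two_tier / three_tier:.2f} of the three-tier count")  # 0.60 -> 3/5

# Bandwidth-weighted optics: three 800G hops vs eight planes x two 100G hops.
print(f"optics:   {8 * 2 * 100 / (3 * 800):.2f} of the three-tier count")  # 0.67 -> 2/3
```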
Critically, the impact of individual failures is lessened; losing a link in the 800Gb/s plane reduces capacity by 12%, compared to 0.4% in the 100Gb/s plane. This multi-plane design, however, demands a robust transport protocol capable of surviving link and port failures while evenly distributing load across all available paths. Researchers note that this is hard to do with traditional single-path transport protocols, leading to the development of Multipath RC (MRC) to address these challenges and ensure high performance even with lower-speed links.
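Those failure percentages are consistent with simple path counting, under the assumption (an interpretation, since the article does not spell out the calculation) that the capacity lost scales with the fraction of parallel paths that traverse the failed link:

```python
# Assumed parallel-path counts implied by the article's figures.
print(f"800G plane:       one of 8 paths lost   -> {1/8:.1%}")    # 12.5%, i.e. ~12%
print(f"100G multi-plane: one of 256 paths lost -> {1/256:.1%}")  # ~0.4%
```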
We find that static source routing gives us very good observability and reduces operational burden, while MRC’s resilience means that many network failures are not even urgent to repair.
Full Bisection Bandwidth Achieved with Two-Tier Networks
OpenAI and Microsoft have successfully deployed a network architecture capable of achieving full bisection bandwidth, a critical advancement for scaling artificial intelligence training to unprecedented levels. Facing limitations with conventional network designs when attempting to connect 100,000 GPUs, the collaboration opted for a two-tier network topology, diverging from the typical three-tier Clos arrangement. This innovative approach breaks out 800Gb/s network interface cards into eight 100Gb/s ports, enabling the construction of eight parallel 100Gb/s Clos planes using existing 51.2Tb/s switches. The resulting network boasts several advantages: latency is reduced because the longest path traverses only three switches, and more nodes are reachable in one hop, improving locality and reducing load. The design reduces both cost and power consumption, requiring two-thirds of the optics and three-fifths the number of switches compared to a three-tier network.
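The three-switch claim is simply the geometry of a folded Clos: the longest path climbs to the top tier and back down, so a two-tier plane crosses at most T0-T1-T0, where a conventional three-tier network would cross five switches:

```python
def longest_path_switches(tiers):
    # Up to the top tier and back down: T0 ... T(tiers-1) ... T0.
    return 2 * tiers - 1

for tiers in (2, 3):
    print(f"{tiers}-tier Clos: at most {longest_path_switches(tiers)} switches per path")
# 2-tier Clos: at most 3 switches per path (the planes described here)
# 3-tier Clos: at most 5 switches per path (a conventional design at this scale)
```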
SRv6 Source Routing Enables Static Path Resilience
As AI training workloads expand to unprecedented scales, with work now complete to enable scaling to 100,000 GPUs, the challenge of maintaining stable network performance has intensified, mirroring issues previously encountered in high-performance computing (HPC) environments. The emergence of “system or network ‘noise’” as a primary bottleneck highlights a critical vulnerability: even with substantial redundancy, the entire training process can be stalled by a single slow transfer during synchronous pretraining. Recognizing the potential for conflicting adaptive routing mechanisms, the collaboration deliberately stepped away from conventional network management. Instead of relying on dynamic routing protocols, they disabled them entirely, choosing to define static paths through IPv6 segment routing (SRv6). The team implemented support for SRv6 in NVIDIA Spectrum-4 and Spectrum-5 switches, as well as Broadcom Tomahawk 5 switches, and has released the MRC specification under an open license through OCP for wider adoption.
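To make the mechanism concrete, here is a minimal scapy sketch of what a statically source-routed RoCEv2 packet can look like, pinned through one chosen spine via an SRv6 segment list. The addresses are invented and the sketch is not taken from the MRC implementation:

```python
from scapy.layers.inet import UDP
from scapy.layers.inet6 import IPv6, IPv6ExtHdrSegmentRouting

SPINE = "fd00:1::1"   # hypothetical segment ID of the chosen T1 switch
DEST = "fd00:2::42"   # hypothetical destination NIC address

pkt = (
    IPv6(dst=SPINE)                    # the active segment sits in the IPv6 dst
    / IPv6ExtHdrSegmentRouting(
        addresses=[DEST, SPINE],       # SRH lists segments in reverse order
        segleft=1,                     # one segment left to visit after the spine
    )
    / UDP(dport=4791)                  # RoCEv2 runs over UDP port 4791
    / b"BTH and RDMA payload would follow here"
)
pkt.show()
```

At the spine, standard SRv6 processing decrements segleft and rewrites the destination to the final segment, so the path is fixed end to end with no dynamic routing decisions along the way.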
MRC maps out EVs that traverse the failed switch and restores them afterwards, with negligible effect on job performance.
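The quote is terse, but the mechanism it implies is straightforward to sketch: if each connection sprays over a set of entropy values (EVs) and each EV deterministically maps to a path through a particular switch, then failure handling reduces to set arithmetic. The class and mapping below are illustrative assumptions, not MRC’s data structures:

```python
class MultipathConnection:
    """Toy model of per-connection entropy values (EVs); assumes each EV
    deterministically selects one network path, as the quote above hints."""
    def __init__(self, evs, ev_to_switch):
        self.active = set(evs)            # EVs currently used for spraying
        self.parked = set()               # EVs mapped out due to failures
        self.ev_to_switch = ev_to_switch  # which switch each EV traverses

    def on_switch_failed(self, switch):
        bad = {ev for ev in self.active if self.ev_to_switch[ev] == switch}
        self.active -= bad
        self.parked |= bad                # stop spraying onto the failed switch

    def on_switch_restored(self, switch):
        good = {ev for ev in self.parked if self.ev_to_switch[ev] == switch}
        self.parked -= good
        self.active |= good               # resume using those paths

conn = MultipathConnection(evs=range(16), ev_to_switch={ev: ev % 4 for ev in range(16)})
conn.on_switch_failed(2)     # EVs whose path crosses switch 2 are mapped out
conn.on_switch_restored(2)   # ...and restored afterwards, with traffic still flowing
```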
Incast Mitigation via Packet Spraying and Trimming
Researchers found that the entire system’s performance is dictated by “the slowest transfer” during synchronous pretraining, even with sophisticated parallel processing techniques in place. The team deliberately disabled dynamic routing within the switches, believing it unnecessary given MRC’s adaptive capabilities; instead, data packets are source-routed using IPv6 segment routing. Crucially, MRC also incorporates packet spraying and packet trimming to mitigate incast, a form of congestion in which numerous packets arrive simultaneously at a single destination. By strategically managing packet delivery, the system maintains stability even under heavy load, and the multi-plane topology of eight parallel 100Gb/s planes further enhances resilience, reducing the impact of individual link or node failures.
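A toy incast scenario shows why spraying helps: if 32 senders each burst 16 packets at one receiver, per-sender round-robin over eight planes lands exactly 64 packets on each plane instead of piling hundreds onto one queue. Packet trimming (named in this section’s heading, though the article does not detail MRC’s variant) would then handle any residual overflow by forwarding a cut-down header so the receiver can request a fast retransmission. All constants are hypothetical:

```python
import collections

PLANES = 8
QUEUE_LIMIT = 64

def spray(sender, n_packets):
    # Round-robin over planes, offset per sender to decorrelate bursts.
    return [(sender + i) % PLANES for i in range(n_packets)]

queues = collections.Counter()
for sender in range(32):              # 32 senders, one receiver: classic incast
    for plane in spray(sender, 16):
        queues[plane] += 1

print(dict(queues))                   # exactly 64 packets per plane
overflow = {p: max(0, q - QUEUE_LIMIT) for p, q in queues.items()}
print(overflow)                       # nothing left for trimming to cut here
```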
Operational Simplicity: Disabling Dynamic Routing in Switches
OpenAI and Microsoft’s work to enable AI training at a scale of up to 100,000 GPUs has necessitated a radical simplification of network management, moving away from conventional dynamic routing protocols. This counterintuitive decision stemmed from a desire to avoid conflicts between adaptive routing mechanisms and to streamline operations, given the complexity of managing networks supporting multiple supercomputers and simultaneous training jobs. The team found that MRC proved so effective at adaptive load balancing and failure recovery that dynamic routing offered no additional benefit. The benefits extend to topology co-design; the team opted for a multi-plane network topology, breaking out 800Gb/s NICs into eight 100Gb/s planes to reduce latency and improve failure tolerance. Researchers note that it is possible to lose a NIC-T0 link without bringing down the training job, highlighting the increased robustness.
Any solution needs to do three things:
• Load balance the network evenly, so as to prevent congestion due to flow collisions;
• Handle incast-based congestion without creating outliers;
• Handle link and fabric failures gracefully, without bringing down the training job.
