OpenAI is overhauling its supercomputer network infrastructure to accommodate surging demand, with more than 900 million people now using ChatGPT weekly. The company has released Multipath Reliable Connection (MRC), a new networking protocol developed in partnership with AMD, Broadcom, Intel, Microsoft, and NVIDIA, through the Open Compute Project. The move signals a shift for OpenAI, previously focused on proprietary supercomputer development, toward fostering industry-wide standards for scaling AI systems. According to OpenAI, MRC is designed to “deliver better models to everyone faster” by building multi-plane, high-speed networks that create redundancy and virtually eliminate core congestion, and by using static source routing to bypass failures and eliminate entire classes of routing failure. This resilience-focused design is critical for maintaining the momentum of large-scale AI training, particularly for synchronous pretraining workloads, which act as failure amplifiers.
Stargate Supercomputer Design Drives MRC Development
OpenAI’s new Stargate supercomputer is becoming core infrastructure helping people and businesses around the world build with increasingly capable models, supporting more than 900 million people using ChatGPT every week. The sheer scale of demand has prompted the company to move beyond solely proprietary infrastructure, releasing the Multipath Reliable Connection (MRC) protocol through the Open Compute Project. The impetus for MRC stems from the challenges of maintaining consistent performance within Stargate, a supercomputer designed to minimize network congestion and mitigate the impact of failures. Prior to Stargate, OpenAI “co-developed, brought up, and maintained our first three generations of supercomputers with great care and close collaboration with our partners over the span of a few years.” This experience underscored the need for a drastically simplified network design to efficiently utilize compute at such a massive scale.
Traditional networks struggle with the millions of data transfers involved in a single AI training step; a single delayed transfer can leave GPUs sitting idle, especially in synchronous pretraining, where GPUs cooperate in lockstep. At large enough scale, even the best network will have a constant background level of link and switch failures, necessitating a more resilient approach. MRC addresses these issues through a multi-plane network topology: instead of relying on single 800Gb/s links, the system splits each interface into multiple smaller connections, creating parallel networks, or “planes,” operating at 100Gb/s. This design connects more than 131,000 GPUs with only two tiers of switches, reducing power consumption, component count, and overall cost compared to conventional designs.
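The arithmetic behind the two-tier claim is easy to check. The back-of-the-envelope sketch below assumes a 51.2 Tb/s switch ASIC, which is the class of silicon commonly paired with these speeds; the article itself specifies only the 800Gb/s and 100Gb/s figures, so the switch capacity is an assumption for illustration.

```python
# Back-of-the-envelope capacity math for one two-tier (leaf/spine) plane.
# Assumed figure (not from the article): a 51.2 Tb/s switch ASIC.

NIC_GBPS = 800          # per-GPU network interface (from the article)
LANE_GBPS = 100         # speed of each split lane / plane (from the article)
SWITCH_TBPS = 51.2      # assumed total switching capacity of one ASIC

planes = NIC_GBPS // LANE_GBPS                   # parallel planes per GPU
radix = int(SWITCH_TBPS * 1000 // LANE_GBPS)     # ports per switch at 100G

# Two-tier Clos: each leaf splits its ports half down (to GPUs), half up
# (to spines); each spine port feeds one leaf, so at most `radix` leaves.
down_ports = radix // 2
max_leaves = radix
gpus_per_plane = max_leaves * down_ports

print(f"{planes} planes of {LANE_GBPS} Gb/s per GPU")
print(f"radix-{radix} switches -> {gpus_per_plane:,} GPUs in two tiers")
# -> 8 planes of 100 Gb/s per GPU
# -> radix-512 switches -> 131,072 GPUs in two tiers
```

Under these assumptions the fabric tops out at 131,072 endpoints per plane, which lines up with the “over 131,000 GPUs” figure OpenAI cites.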
However, simply increasing path diversity isn’t enough; MRC employs “adaptive packet spraying,” taking the packets of a single transfer and spraying them across hundreds of paths through the network, spanning all of the distinct planes, to avoid congestion and use all available bandwidth. Packets may arrive out of order, but they are reassembled at the destination. This approach, combined with load balancing, lets the system detect network failures and route around them on a microsecond timescale, minimizing disruption to training jobs. MRC extends RDMA over Converged Ethernet, an InfiniBand Trade Association standard that enables hardware-accelerated remote direct memory access among GPUs and CPUs, drawing on techniques from the Ultra Ethernet Consortium and SRv6-based source routing.
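To make the spraying behavior concrete, here is a minimal sketch: packets of one transfer are fanned out across many candidate paths, always picking the path with the least data in flight. The Path structure and the in-flight-bytes congestion signal are illustrative stand-ins, not MRC’s actual mechanism.

```python
import heapq
from dataclasses import dataclass, field

# Minimal sketch of adaptive packet spraying: one transfer's packets are
# distributed across many paths, always choosing the least-loaded path.
# Path names and the congestion signal are invented for illustration.

@dataclass(order=True)
class Path:
    in_flight: int                      # bytes outstanding; crude load signal
    name: str = field(compare=False)    # excluded from heap ordering

def spray(packets: list[bytes], paths: list[Path]) -> list[tuple[str, bytes]]:
    heap = list(paths)
    heapq.heapify(heap)
    schedule = []
    for pkt in packets:
        p = heapq.heappop(heap)     # least-loaded path wins this packet
        schedule.append((p.name, pkt))
        p.in_flight += len(pkt)     # account for the newly in-flight bytes
        heapq.heappush(heap, p)
    return schedule

paths = [Path(0, f"plane{i}/route{j}") for i in range(8) for j in range(4)]
packets = [b"x" * 4096 for _ in range(10)]
for name, pkt in spray(packets, paths)[:3]:
    print(name, len(pkt))
```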
Currently deployed across OpenAI’s NVIDIA GB200 supercomputers, including those at Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft’s Fairwater facility, MRC has already been used to train multiple OpenAI models leveraging hardware from NVIDIA and Broadcom. The MRC specification is now available as an Open Compute Project contribution, allowing the broader community to benefit from this advancement in AI infrastructure.
MRC Enables Multi-Plane Networks for GPU Redundancy
OpenAI’s rapid expansion, fueled by the popularity of services like ChatGPT with more than 900 million people using it every week, has necessitated a fundamental overhaul of its supercomputer networking infrastructure, moving beyond previous generations of bespoke systems toward a more scalable and resilient design. The company’s new approach, centered on a protocol called Multipath Reliable Connection (MRC), isn’t simply about increasing bandwidth; it’s a strategic shift toward proactively mitigating the inherent challenges of maintaining performance across increasingly massive GPU clusters. Prior to developing Stargate, OpenAI refined its supercomputer builds over several years, gaining “invaluable experience” that underscored the need to reduce complexity in every layer of the computing stack, including network design. MRC addresses network congestion and failures, which grow both more frequent and more costly as cluster size increases. OpenAI identified two key challenges: minimizing congestion everywhere except at unavoidable bottlenecks, and minimizing the impact of failures on training jobs.
The core of MRC lies in its multi-plane network topology, which dramatically increases redundancy. Because each transfer is sprayed across many paths, packets may arrive out of order, but the destination GPU reassembles them based on the memory address each packet carries. This actively avoids congestion hotspots, preventing the performance bottlenecks that plague traditional AI training workloads. OpenAI released the MRC specification through the Open Compute Project, signaling a shift toward fostering industry-wide standards for AI scaling; the company believes shared standards in key infrastructure layers can help scale AI systems more efficiently, more reliably, and across a broader partner ecosystem.
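The address-based reassembly is worth illustrating, since it is what makes out-of-order delivery harmless. In this sketch, each packet carries the memory offset it belongs at, so the receiver writes payloads directly into place, RDMA-style; the (offset, payload) layout is invented for clarity and is not MRC’s wire format.

```python
import random

# Sketch of RDMA-style direct data placement: each packet carries the
# memory offset it belongs at, so the receiver writes payloads straight
# into the target buffer and arrival order never matters.

message = b"all packets carry their own destination offset " * 20
CHUNK = 64
packets = [(off, message[off:off + CHUNK])
           for off in range(0, len(message), CHUNK)]

random.shuffle(packets)                 # simulate out-of-order arrival

buffer = bytearray(len(message))        # pre-registered receive buffer
for offset, payload in packets:
    buffer[offset:offset + len(payload)] = payload  # place, don't queue

assert bytes(buffer) == message
print("reassembled correctly despite out-of-order delivery")
```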
Adaptive Packet Spraying Eliminates Core Congestion
OpenAI’s relentless pursuit of scaling AI training has driven a significant overhaul of its supercomputer networking, culminating in the development of Multipath Reliable Connection (MRC), a protocol designed to eliminate core congestion and bolster network resilience. With more than 900 million people using ChatGPT every week, OpenAI describes its systems as core infrastructure helping people and businesses around the world build with increasingly capable models. MRC achieves its goals by fundamentally altering how data packets traverse the network: adaptive packet spraying, visually demonstrated in animations created by Mark Handley, steers traffic away from congestion hotspots and prevents individual transactions from experiencing disproportionately long delays, a critical factor for synchronous AI training workloads. OpenAI explains that by spreading traffic across many paths, MRC avoids hotspots in the network and prevents some transactions from taking much longer than others, maintaining consistent performance.
The implementation of packet spraying is coupled with a sophisticated system for handling failures and congestion. If a packet is lost, the system proactively assumes a failure and immediately retransmits the data via alternative routes, while probe packets verify whether the retired path has recovered. MRC also uses “packet trimming,” a technique in which a congested switch discards a packet’s payload but still forwards its header, so the destination learns of the loss immediately, triggers an explicit retransmission request, and avoids the false positives of timeout-based loss detection. In contrast, conventional networks can require seconds to stabilize after a failure.
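A rough sketch of how trimming could work, under the usual interpretation of the technique (the switch keeps the header, drops the payload, and the receiver responds with an explicit retransmission request); the field names and message format here are illustrative, not MRC’s.

```python
from dataclasses import dataclass

# Sketch of packet trimming: instead of silently dropping a packet under
# congestion, the switch forwards just its header marked "trimmed", so the
# receiver learns of the loss at once and can request a retransmission
# on a different path. Field names are invented for illustration.

@dataclass
class Packet:
    seq: int
    path: str
    payload: bytes | None    # None once the payload has been trimmed
    trimmed: bool = False

def switch_forward(pkt: Packet, congested: bool) -> Packet:
    if congested:
        # Keep only the header; the receiver still sees seq and path.
        return Packet(pkt.seq, pkt.path, None, trimmed=True)
    return pkt

def receiver(pkt: Packet) -> str | None:
    if pkt.trimmed:
        # Explicit, immediate feedback: no timeout, no guessing about
        # whether the packet was actually lost (fewer false positives).
        return f"RETX seq={pkt.seq} avoid_path={pkt.path}"
    return None

pkt = Packet(seq=7, path="plane2/route1", payload=b"gradients")
print(receiver(switch_forward(pkt, congested=True)))
# -> RETX seq=7 avoid_path=plane2/route1
```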
Static Source Routing Bypasses Network Failures
OpenAI’s ambitious scaling of artificial intelligence, driven by more than 900 million people using ChatGPT every week, has necessitated a fundamental rethink of supercomputer networking, moving beyond incremental improvements toward proactive resilience. While increasing network capacity is crucial, simply adding more bandwidth doesn’t address inherent vulnerabilities; the company’s new Multipath Reliable Connection (MRC) protocol directly tackles the issue of failure, employing a technique called static source routing to bypass disruptions and maintain consistent performance. This isn’t merely about speed, but about ensuring that the immense computational power of systems like Stargate isn’t undermined by unpredictable network behavior. MRC’s design philosophy centers on anticipating and circumventing failures rather than reacting to them. Traditional network protocols rely on switches to dynamically calculate routes, a process that introduces latency and potential instability when links or devices fail.
Instead, MRC utilizes “static source routing,” where the originating GPU dictates the complete path a packet will take. This pre-determined route allows for immediate bypass of known or suspected faulty components, eliminating entire classes of routing failures that plague conventional networks. As OpenAI explains, the system proactively avoids problematic paths, ensuring data reaches its destination with minimal delay, even amidst ongoing network events. This approach is particularly critical for synchronous pretraining, where even minor interruptions can cascade into significant performance losses across thousands of GPUs. The effectiveness of static source routing is amplified by MRC’s multi-plane network topology.
By splitting network interfaces into multiple smaller links, OpenAI has created a highly redundant system in which data can be distributed across hundreds of paths. OpenAI details that if a packet is lost, the system assumes something on that path may have failed, immediately stops using it, and retransmits any packets that may have been dropped. The result is a network that not only minimizes the impact of failures but also simplifies network management by reducing reliance on complex dynamic routing protocols, ultimately delivering models to users faster and more reliably.
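A small sketch can show why this eliminates failure-time route computation: the sender holds a set of precomputed explicit hop lists (in the spirit of SRv6 segment lists) and simply filters out any route touching a hop it currently suspects is bad. The topology and names below are invented for illustration.

```python
import itertools

# Sketch of static source routing: the sending GPU picks the complete hop
# list up front (much like an SRv6 segment list); no switch recomputes
# routes at failure time. Topology and link names are invented.

ROUTES = {   # precomputed explicit routes per destination
    "gpu-B": [
        ["tor1", "spine3", "tor9"],
        ["tor1", "spine5", "tor9"],
        ["tor2", "spine3", "tor9"],
    ]
}
suspected_bad = {"spine3"}   # populated when a lost packet implicates a hop

def pick_routes(dst: str, n: int) -> list[list[str]]:
    """Return up to n precomputed routes avoiding suspected-bad hops."""
    healthy = (r for r in ROUTES[dst] if not suspected_bad.intersection(r))
    return list(itertools.islice(healthy, n))

# The source simply stops choosing routes through "spine3" and keeps sending.
print(pick_routes("gpu-B", 2))   # -> [['tor1', 'spine5', 'tor9']]
```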
MRC Deployment Across OpenAI & OCI Infrastructure
The assumption that scaling artificial intelligence demands ever more complex networking infrastructure is being challenged by OpenAI’s deployment of Multipath Reliable Connection (MRC) across its supercomputer network, including facilities co-located with Oracle Cloud Infrastructure (OCI) in Abilene, Texas. This isn’t merely about accommodating growth; it’s about sustaining current, massive usage levels with increasingly sophisticated AI. The flattened design also improves performance by keeping more traffic local to Tier 0 switches, while static source routing lets OpenAI’s deployments bypass failures and eliminate whole classes of routing failure. The protocol doesn’t assign a transfer to a single path; instead, it distributes packets across hundreds of paths, accepting that they may arrive out of order and relying on destination hardware to reassemble them. OpenAI details that if the system detects a path becoming congested, it swaps that path for another one, evening out load across the network, underscoring MRC’s ability to adjust dynamically to changing conditions.
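That swap-on-congestion behavior can be sketched as a small control loop: each active path carries a congestion score (imagine an ECN-mark rate), and crossing a threshold retires it in favor of a standby path. The threshold, scores, and path names below are invented for illustration.

```python
import random

# Sketch of congestion-driven path swapping: a path whose congestion
# score crosses a threshold is retired in favor of a standby path,
# evening out load. Scores and threshold are invented for illustration.

random.seed(0)
ECN_THRESHOLD = 0.3

active = {f"route{i}": 0.0 for i in range(4)}    # path -> congestion score
standby = [f"route{i}" for i in range(4, 8)]

def observe(path: str, ecn_mark_rate: float) -> None:
    active[path] = ecn_mark_rate
    if ecn_mark_rate > ECN_THRESHOLD and standby:
        del active[path]                  # retire the congested path...
        replacement = standby.pop(0)
        active[replacement] = 0.0         # ...and swap in a fresh one
        standby.append(path)              # old path may be retried later
        print(f"swapped {path} -> {replacement}")

for path in list(active):                 # one round of measurements
    observe(path, random.uniform(0.0, 0.6))
print("active paths:", sorted(active))
```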
Failures are handled with similar speed: MRC assumes a lost packet indicates a problem and immediately reroutes around it while proactively probing the retired path for recovery. The company notes that this speed matters most for synchronous pretraining, where many GPUs across many computers cooperate in lockstep to train one AI model, so a single slow path can stall the entire job.
“At meaningful scale, that reliability and efficiency is not a nice-to-have; it is part of what makes synchronous frontier model training possible.”
— OpenAI
