OpenAI announced yesterday that it has joined forces with five industry giants (AMD, Broadcom, Intel, Microsoft, and NVIDIA) to launch the Multipath Reliable Connection (MRC) protocol. The protocol targets the network latency and failures that are common in large-scale AI training, and it has been opened to the global industry through the Open Compute Project (OCP).


Breaking "Single Point of Failure": A Leap from Three-Tier Architecture to Two-Tier Design

In traditional AI model training, congestion or a minor failure on a single link can set off a domino effect: tens of thousands of GPUs sit idle while they wait, and a significant amount of computing power is wasted.

To make the system fundamentally more resilient, the MRC protocol introduces a multi-plane network design that splits a single 800Gb/s interface into multiple lower-speed links. With this layout, just two layers of switches can support a massive cluster of approximately 131,000 GPUs. Compared to traditional three-tier or four-tier architectures, this significantly reduces the number of physical components and the energy consumption, and it also lowers construction costs considerably.
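The 131,000 figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes a 51.2 Tb/s switch ASIC and a split into eight 100Gb/s links; neither number is stated in the announcement as summarized here, so treat both as illustrative assumptions. It uses the standard result that a non-blocking two-tier leaf-spine fabric built from switches with k ports can attach k × k/2 endpoints.

```python
# Back-of-the-envelope check, not a description of the actual MRC deployment.
NIC_GBPS = 800                 # per-GPU interface speed (from the article)
SWITCH_CAPACITY_GBPS = 51_200  # ASSUMPTION: 51.2 Tb/s switch ASIC


def two_tier_endpoints(radix: int) -> int:
    """Endpoints in a non-blocking two-tier leaf-spine fabric.

    Each leaf uses half its ports for endpoints and half for uplinks,
    and each spine port connects one leaf, so there are `radix` leaves.
    """
    return radix * (radix // 2)


def endpoints_per_plane(planes: int) -> int:
    """Endpoints reachable when the NIC is split into `planes` equal links."""
    link_gbps = NIC_GBPS // planes
    radix = SWITCH_CAPACITY_GBPS // link_gbps  # slower links -> more ports
    return two_tier_endpoints(radix)


for planes in (1, 2, 4, 8):
    print(f"{planes} plane(s) of {NIC_GBPS // planes} Gb/s links: "
          f"{endpoints_per_plane(planes):,} endpoints")
```

Under these assumptions, an unsplit 800Gb/s port gives each switch only 64 ports and a two-tier limit of 2,048 endpoints, while eight 100Gb/s links give 512 ports and 131,072 endpoints per plane, which lines up with the "approximately 131,000 GPUs" quoted above.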

New Traffic Scheduling: Data Packet "Spraying" and Microsecond-Level Self-Healing

Beyond simplifying the architecture, MRC also takes a new approach to traffic distribution. Instead of sending a flow along a single path, it uses adaptive packet spraying: the packets of a transfer are spread across hundreds of paths and transmitted in parallel. Even if packets arrive out of order, the receiving end can reassemble them accurately, which effectively avoids local congestion in the core network.
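The idea of spraying with order-independent reassembly can be illustrated with a toy model. This is a minimal sketch, not MRC's wire format or path-selection policy, neither of which is described here: the `spray` and `reassemble` functions, the `congested` set, and the packet fields are all hypothetical names. Each packet carries its byte offset, so the receiver can write it straight into place no matter when it arrives.

```python
import random


def spray(message: bytes, mtu: int, num_paths: int,
          congested=frozenset(), seed: int = 0):
    """Split a message into packets and spread them over usable paths.

    'Adaptive' is modeled very simply: paths listed in `congested`
    are skipped, and the rest are picked at random per packet.
    """
    rng = random.Random(seed)
    usable = [p for p in range(num_paths) if p not in congested]
    packets = []
    for offset in range(0, len(message), mtu):
        packets.append({
            "offset": offset,               # where the payload belongs
            "path": rng.choice(usable),     # path chosen for this packet
            "payload": message[offset:offset + mtu],
        })
    return packets


def reassemble(packets, total_len: int) -> bytes:
    """Place each payload at its offset; arrival order does not matter."""
    buf = bytearray(total_len)
    for pkt in packets:
        start = pkt["offset"]
        buf[start:start + len(pkt["payload"])] = pkt["payload"]
    return bytes(buf)


message = bytes(range(256)) * 4
packets = spray(message, mtu=64, num_paths=128, congested={3, 7})
random.Random(1).shuffle(packets)           # simulate out-of-order arrival
print(reassemble(packets, len(message)) == message)
```

Because placement depends only on the offset carried in each packet, the receiver needs no reordering queue, which is what makes it safe to use many paths with different delays at once.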

On the control side, MRC drops complex dynamic routing protocols (such as BGP) in favor of SRv6 source routing. The sender specifies the path directly, and each switch only has to perform simple static forwarding. Because recovery no longer waits for routing to reconverge, this design cuts network fault recovery time from seconds to microseconds, letting the system heal almost seamlessly when links flap.
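Why source routing shortens recovery can be sketched with a small simulation. The model below is a deliberate simplification and does not reflect the real SRv6 header encoding or MRC's failure detection: the node names, the `deliver` function, and the `Sender` class are hypothetical. The point it illustrates is that the sender already holds several precomputed paths, so avoiding a failed link is a local decision that requires no routing update in the switches.

```python
def deliver(src: str, segments: list, down_links: set) -> bool:
    """Follow the hop list chosen by the sender.

    Switches make no routing decision here: each hop is taken from the
    packet itself, and delivery fails only if that exact link is down.
    """
    node = src
    for next_hop in segments:
        if frozenset((node, next_hop)) in down_links:
            return False
        node = next_hop
    return True


class Sender:
    """Holds precomputed paths and steers around the ones that fail."""

    def __init__(self, src: str, paths: list):
        self.src = src
        self.paths = list(paths)
        self.bad = set()  # indices of paths observed to fail

    def send(self, down_links: set):
        """Return the index of the path that delivered, or None."""
        for i, path in enumerate(self.paths):
            if i in self.bad:
                continue
            if deliver(self.src, path, down_links):
                return i
            self.bad.add(i)  # stop using this path immediately
        return None


paths = [
    ["leaf1", "spine1", "leaf2", "nic-b"],
    ["leaf1", "spine2", "leaf2", "nic-b"],
]
sender = Sender("nic-a", paths)
down = {frozenset(("leaf1", "spine1"))}  # simulate a failed uplink
print(sender.send(down))
```

In a real network the sender would learn about the failure from missing acknowledgements or explicit signals rather than from a function return value; the sketch only shows that switching paths needs no change to switch state.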

Field Testing: The "Anti-Shake" Tool for Supercomputers

The MRC protocol is already in use on NVIDIA's GB200 supercomputer and Oracle's cloud infrastructure. Test data shows that in real training runs, when unexpected events such as link flapping or switch restarts occur, MRC automatically routes around the fault points and keeps complex training jobs running without interruption.