NVIDIA Blackwell Achieves Record Training Scale With 8,192 GPUs

NVIDIA has achieved the largest-scale training demonstrated in MLPerf Training 6.0, utilizing systems with 8,192 GPUs. The company’s Blackwell platform led across every category in the latest industry benchmarks, signifying increased infrastructure capacity as artificial intelligence models grow in size and complexity. This round of testing included two new mixture-of-experts (MoE) pretraining workloads, DeepSeek-V3 671B and GPT-OSS-20B, reflecting a shift toward these increasingly complex AI architectures, and NVIDIA was the only platform submitted across all seven benchmarks. NVIDIA combines performance, scale, and reliability in a single platform engineered to enable AI model builders to launch models faster, minimize training costs, and begin generating revenue sooner.

NVIDIA Blackwell Achieves Fastest Training Times on MLPerf 6.0

NVIDIA Blackwell scaled to 8,192 GPUs in MLPerf Training 6.0, the largest-scale training demonstrated in the benchmark’s history. The round’s two new MoE workloads, DeepSeek-V3 671B and GPT-OSS-20B, reflect an architecture that is gaining prominence in the industry and that places heavy demands on GPU-to-GPU communication during training. To address this, NVIDIA used its fifth-generation NVLink Switches, which connect 72 GPUs within each rack-scale system into a unified pool of compute and memory, according to the company. Beyond interconnect, NVIDIA also showcased NVFP4 training methods, which increase performance while maintaining accuracy across various workloads; it previously used this method to pretrain the 550-billion-parameter NVIDIA Nemotron 3 Ultra model. The GB300 NVL72 rack-scale system delivered up to 1.6 times faster training than the GB200 NVL72 at the same scale, driven by higher compute density, expanded memory, and increased power capacity.

Partner submissions further demonstrated the scale of Blackwell deployments. Microsoft Azure scaled Llama 3.1 405B training to 8,192 GPUs using GB200 NVL72 systems and reached the reference quality target in 7.07 minutes. CoreWeave achieved the fastest time for DeepSeek-V3 671B, reaching the quality target in 2.02 minutes at 8,192-GPU scale using GB300 NVL72 systems connected with Spectrum-X Ethernet networking. For resiliency, the company states that its platform is engineered across two dimensions: preventing failures, and recovering rapidly when they do occur. Both matter for long-duration training runs.

NVLink and NVFP4 Enhance Large-Scale MoE Performance

The push for ever-larger artificial intelligence models is reshaping demands on training infrastructure, with mixture-of-experts (MoE) architectures becoming increasingly prominent. These models route each token’s computation to a subset of “expert” subnetworks, which creates significant communication challenges when the experts are spread across many GPUs. NVIDIA is addressing these challenges through its interconnect technology and numerical precision, and MLPerf Training 6.0 provides a standardized measure of the results. A key enabler of large-scale MoE training is fifth-generation NVLink Switches, which connect GPUs within rack-scale systems into a unified compute and memory pool. This high-bandwidth interconnect is crucial for efficiently routing tokens across GPUs to the appropriate expert, a process that becomes more demanding as model size grows. In addition to connectivity, NVIDIA has focused on the precision of calculations with NVFP4 training methods, which the company says increase performance while meeting accuracy requirements. NVIDIA achieved the fastest training times on every benchmark in the MLPerf Training 6.0 suite.
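To make the communication pattern concrete, the following is a minimal sketch of top-k expert routing in plain Python. It is an illustration only, not NVIDIA’s or any benchmark’s implementation: the function names, the toy router scores, and the assumption that experts are placed contiguously on GPUs are all hypothetical.

```python
import math
from collections import defaultdict

def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """For each token, pick the k highest-probability experts.

    Returns one list per token of (expert_id, weight) pairs, with the
    weights renormalized so they sum to 1 over the chosen experts.
    """
    assignments = []
    for logits in router_logits:
        probs = softmax(logits)
        top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
        norm = sum(probs[e] for e in top)
        assignments.append([(e, probs[e] / norm) for e in top])
    return assignments

def dispatch_counts(assignments, experts_per_gpu):
    """Count token copies sent to each GPU.

    Assumption (hypothetical placement): expert i lives on GPU
    i // experts_per_gpu.
    """
    counts = defaultdict(int)
    for token_routes in assignments:
        for expert_id, _weight in token_routes:
            counts[expert_id // experts_per_gpu] += 1
    return dict(counts)

# Toy router scores: 3 tokens, 4 experts (made-up values).
router_logits = [
    [2.0, 0.1, 0.3, 1.5],
    [0.2, 3.0, 0.1, 0.4],
    [0.5, 0.4, 2.5, 2.4],
]
assignments = route_top_k(router_logits, k=2)
gpu_load = dispatch_counts(assignments, experts_per_gpu=2)
print(gpu_load)
```

Even in this toy case, every token is sent to two experts that may sit on different GPUs, and the load per GPU is uneven. At the scale of thousands of GPUs, this all-to-all exchange is the traffic that a high-bandwidth interconnect has to carry.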

GB300 NVL72 Delivers Performance Gains Over GB200 NVL72

Microsoft Azure and CoreWeave submissions show the practical benefits of NVIDIA’s latest hardware, with both achieving record training times on the Blackwell platform. Azure scaled Llama 3.1 405B training to 8,192 GPUs and reached the benchmark’s reference quality target in 7.07 minutes. CoreWeave delivered the fastest time to train for DeepSeek-V3 671B, reaching the quality target in 2.02 minutes, also at 8,192-GPU scale. These results highlight not only raw performance but also NVIDIA’s co-engineering with key partners on system architecture, networking, and software. The generational gain is substantial: in this MLPerf Training 6.0 round, the GB300 NVL72 delivered up to 1.6 times faster training than the GB200 NVL72 at the same scale.

This improvement stems from key Blackwell Ultra capabilities, including higher compute density with NVFP4, expanded memory capacity, and an increased power ceiling that allows the GPU to sustain peak performance for extended periods. NVIDIA also showcased NVFP4 training methods that “increase performance while meeting strict accuracy requirements across large- and small-scale pretraining as well as fine-tuning workloads,” demonstrating a commitment to precision alongside speed. Beyond speed, NVIDIA emphasizes the reliability of its platform, crucial for training runs that can span weeks or months. The company’s approach to resiliency is built on minimizing interruptions through rigorous GPU screening, with over 30 manufacturing test stages, and employing a Reliability, Availability and Serviceability Engine that monitors chip health and automatically reroutes around detected faults. NVIDIA Resiliency Extension, or NVRx, minimizes downtime by resuming training from recent checkpoints when interruptions occur, rather than restarting the entire process. This combination of performance and stability positions NVIDIA as a key enabler for the next generation of frontier AI models.
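The recovery idea described above, resuming from a recent checkpoint rather than restarting, can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the NVRx API: the helper names, the JSON checkpoint file, and the simulated failure are all hypothetical, and a single floating-point number stands in for the model state.

```python
import json
import os
import tempfile

def save_checkpoint(path, step, state):
    """Write the checkpoint to a temp file, then rename it into place so a
    crash mid-write cannot leave a truncated checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return (step, state) from the last checkpoint, or a fresh start."""
    if not os.path.exists(path):
        return 0, 0.0
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]

def train(path, total_steps, ckpt_every, fail_at=None):
    """Run a toy training loop that resumes from the latest checkpoint.

    fail_at simulates an interruption just before the given step runs.
    """
    step, state = load_checkpoint(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated interruption")
        state += 1.0  # stand-in for one optimizer step
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(path, step, state)
    return step, state

ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")

# First attempt is interrupted after 57 completed steps.
try:
    train(ckpt_path, total_steps=100, ckpt_every=10, fail_at=57)
except RuntimeError:
    pass

# The second attempt picks up from the last saved checkpoint.
resumed_from, _ = load_checkpoint(ckpt_path)
final_step, final_state = train(ckpt_path, total_steps=100, ckpt_every=10)
print(resumed_from, final_step, final_state)
```

In this sketch the interruption costs only the seven steps completed since the last checkpoint instead of all 57. The checkpoint interval is the design trade-off: shorter intervals lose less work per failure but spend more time writing checkpoints.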

Quantum InfiniBand & Spectrum-X Enable 8,192-GPU Scale

The MLPerf Training 6.0 results show that AI model training can now scale to 8,192 GPUs, a significant step up in infrastructure capacity. Reaching this scale is not only a matter of GPU count; it requires maintaining performance and reliability while distributing workloads across a massive cluster, a challenge NVIDIA addresses through its networking products. The company offers both NVIDIA Quantum InfiniBand and NVIDIA Spectrum-X Ethernet, giving data centers options for optimizing large-scale clusters for their specific needs. On DeepSeek-V3 671B, the largest MoE model in the suite, NVIDIA scaled its own submission to 8,192 GPUs using GB200 NVL72 systems, the largest-scale Blackwell-based submission in MLPerf Training to date. Microsoft Azure likewise scaled Llama 3.1 405B training to 8,192 GPUs using GB200 NVL72 systems and reached the reference quality target in 7.07 minutes.

CoreWeave delivered the fastest time to train for DeepSeek-V3 671B, reaching the quality target in 2.02 minutes at 8,192-GPU scale using GB300 NVL72 systems connected with Spectrum-X Ethernet networking. Speed is only part of the picture: the platform is also engineered for sustained operation. Taken together, the networking options and the emphasis on sustained operation are aimed at infrastructure that can support increasingly complex and resource-intensive AI models.


The Neuron
