Moonshot AI’s Kimi K2 Thinking, alongside DeepSeek AI’s DeepSeek-R1 and Mistral AI’s Mistral Large 3, represents a significant advance in artificial intelligence built on a mixture-of-experts (MoE) architecture. These models, which rank among the top 10 most intelligent open-source options, see large performance gains on NVIDIA GB200 NVL72 rack-scale systems, with Kimi K2 Thinking achieving up to a 10x increase over the NVIDIA HGX H200. The MoE approach, which divides work among specialized “experts” activated by a router, delivers faster, more efficient token generation without a proportional increase in compute, establishing MoE as the architecture of choice for frontier AI models.
Mixture-of-Experts Architecture in Frontier AI Models
The top 10 most intelligent open-source AI models currently use a mixture-of-experts (MoE) architecture. The design mirrors the human brain: work is divided among specialized “experts,” and only the experts relevant to each token are activated. Models such as Kimi K2 Thinking, DeepSeek-R1, and Mistral Large 3 see major performance gains on NVIDIA GB200 NVL72 systems, with Kimi K2 Thinking running up to 10x faster than on the previous-generation HGX H200. This approach allows faster, more efficient token generation without a proportional increase in computing power, making it well suited to scaling AI capabilities.
MoE models achieve higher intelligence without a matching rise in computational cost by selectively engaging only the most relevant experts. Whereas traditional dense models use every parameter for every token, MoE models with hundreds of billions of parameters overall activate only a subset, often tens of billions, per token. This has led to a nearly 70x increase in model intelligence since early 2023, and nearly all leading frontier models now employ the architecture, with over 60% of open-source releases adopting it this year.
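To make the routing concrete, here is a minimal sketch of top-k expert routing in PyTorch. The layer sizes, expert count, and top-k value are illustrative assumptions, not the configuration of Kimi K2 Thinking, DeepSeek-R1, or Mistral Large 3.

```python
# Minimal sketch of top-k mixture-of-experts routing (illustrative sizes only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)            # scores every expert for each token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                       # x: [tokens, d_model]
        scores = self.router(x)                                 # [tokens, n_experts]
        weights, idx = torch.topk(scores, self.top_k, dim=-1)   # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)                    # normalize over the chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (idx == e).nonzero(as_tuple=True)     # tokens routed to expert e
            if rows.numel():
                out[rows] += weights[rows, slots, None] * expert(x[rows])
        return out

x = torch.randn(8, 1024)                  # 8 tokens
print(TopKMoE()(x).shape)                 # each token touched only 4 of the 64 experts
```

Because only top_k of n_experts run for each token, the active parameter count is roughly top_k/n_experts of the total expert parameters, which is how a model with hundreds of billions of parameters overall can activate only tens of billions per token.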
NVIDIA’s GB200 NVL72 system addresses the key MoE scaling bottlenecks. By distributing experts across up to 72 GPUs connected via NVLink, it reduces parameter-loading pressure on each GPU’s memory and accelerates expert-to-expert communication over 130 TB/s of NVLink connectivity. This extreme codesign, together with software optimizations such as the NVIDIA Dynamo framework and the low-precision NVFP4 format, lets MoE models scale expert parallelism far beyond previous limits, enabling faster and more efficient AI inference.
How MoE Models Enhance Intelligence and Efficiency
Mixture-of-experts (MoE) models improve AI intelligence and efficiency by mimicking the human brain. Rather than activating all parameters for every task, they divide work among specialized “experts” and activate only those relevant to each token. This selective activation yields higher intelligence and adaptability without a proportional increase in computational cost, delivering more intelligence per unit of energy and capital invested. Over 60% of open-source AI model releases this year use this architecture.
Scaling MoE models has traditionally hit bottlenecks in memory capacity and in latency during expert communication. The NVIDIA GB200 NVL72 addresses both with 72 Blackwell GPUs interconnected via NVLink so that the rack behaves as a single system. Distributing experts across up to 72 GPUs reduces the number of experts each GPU must hold and accelerates the communication between them, allowing expert parallelism to scale far beyond previous limits.
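A rough back-of-the-envelope sketch shows why spreading experts over more GPUs lowers per-GPU memory pressure. The expert count and per-expert weight size below are hypothetical assumptions, not the figures of any particular model.

```python
# Back-of-the-envelope: how expert parallelism spreads weight-loading pressure.
# Numbers are illustrative assumptions, not the specs of any named model or system.
n_experts        = 256            # routed experts in a hypothetical MoE layer
bytes_per_expert = 2 * 1024**3    # assume ~2 GB of weights per expert

for n_gpus in (8, 72):
    experts_per_gpu = -(-n_experts // n_gpus)        # ceiling division
    gb_per_gpu = experts_per_gpu * bytes_per_expert / 1024**3
    print(f"{n_gpus:>2} GPUs -> {experts_per_gpu:>2} experts/GPU, "
          f"~{gb_per_gpu:.0f} GB of expert weights each")
# 8 GPUs -> 32 experts/GPU (~64 GB each); 72 GPUs -> 4 experts/GPU (~8 GB each)
```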
The GB200 NVL72’s extreme codesign resolves these MoE scaling bottlenecks by minimizing parameter-loading pressure on GPU memory and by accelerating expert communication over the NVLink Switch, which delivers 130 TB/s of connectivity across the rack. Further optimizations, such as the NVIDIA Dynamo framework and the low-precision NVFP4 format, unlock high inference performance for MoE models, enabling greater efficiency and scalability for demanding AI applications.
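As a rough illustration of why a low-precision format helps, the sketch below quantizes weights in small blocks to 4-bit values with a per-block scale. It is a simplified stand-in for NVFP4, intended only to show the memory and bandwidth savings, not the exact encoding.

```python
# Simplified sketch of block-scaled 4-bit quantization. Real NVFP4 uses an FP4
# element format with hardware-supported block scaling; here symmetric int4 per
# block is used purely to illustrate the idea.
import numpy as np

def quantize_blocked_4bit(w, block=16):
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0        # 4-bit signed range is -8..7
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # 4-bit codes (stored in int8 here)
    return q, scale

def dequantize(q, scale):
    return (q * scale).astype(np.float32)

w = np.random.randn(4096 * 16).astype(np.float32)
q, s = quantize_blocked_4bit(w)
err = np.abs(dequantize(q, s).ravel() - w).mean()
print(f"mean abs reconstruction error: {err:.4f}")
# A real 4-bit format also stores the per-block scale compactly (e.g., in FP8),
# so total storage is roughly 4-5 bits per weight instead of 16 or 32.
```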
Our pioneering work with OSS mixture-of-experts architecture, starting with Mixtral 8x7B two years ago, ensures advanced intelligence is both accessible and sustainable for a broad range of applications.
Guillaume Lample, cofounder and chief scientist at Mistral AI
Scaling MoE Models Presents Unique Challenges
Scaling mixture-of-experts (MoE) models presents challenges in memory capacity and latency when distributing experts across multiple GPUs. Previously, spreading experts beyond eight GPUs meant communicating over slower networking, which eroded the benefits of expert parallelism. The NVIDIA GB200 NVL72 addresses this with 72 Blackwell GPUs connected by NVLink, creating a single, massive interconnect fabric offering 130 TB/s of connectivity. This allows experts to be distributed across a much larger set of GPUs, resolving the scaling bottleneck.
The GB200 NVL72’s design directly tackles these issues by reducing the number of experts each GPU must handle: distributing experts across up to 72 GPUs minimizes pressure on each GPU’s high-bandwidth memory. The NVLink Switch then accelerates the exchange of tokens between experts, making that communication nearly instantaneous. Together, this enables greater expert parallelism and supports more concurrent users with longer input lengths, improving overall performance.
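To show the communication pattern that the interconnect is accelerating, here is a single-process simulation of expert-parallel dispatch. The GPU and expert counts are illustrative only, and no real distributed communication happens here.

```python
# Single-process simulation of expert-parallel dispatch: each "GPU" owns a slice
# of the experts, and tokens are exchanged all-to-all so every token reaches the
# device that holds its assigned expert. In a real deployment this exchange runs
# over the GPU interconnect (e.g., NVLink); here it is just Python lists.
import random

N_GPUS, N_EXPERTS = 4, 16                       # illustrative counts, not real hardware
experts_per_gpu = N_EXPERTS // N_GPUS           # contiguous slice of experts per device

def owner(expert_id):
    return expert_id // experts_per_gpu         # which "GPU" holds this expert

# Each GPU starts with its own batch of tokens, each routed to one expert.
tokens = {g: [(f"g{g}_tok{i}", random.randrange(N_EXPERTS)) for i in range(6)]
          for g in range(N_GPUS)}

# All-to-all dispatch: bucket every token by the GPU that owns its expert.
inbox = {g: [] for g in range(N_GPUS)}
for src, batch in tokens.items():
    for tok, expert_id in batch:
        inbox[owner(expert_id)].append((src, tok, expert_id))

for g in range(N_GPUS):
    print(f"GPU {g} computes experts {g*experts_per_gpu}-{(g+1)*experts_per_gpu-1} "
          f"for {len(inbox[g])} incoming tokens")
# After the experts run, a mirror-image all-to-all returns results to the source GPUs.
```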
NVIDIA’s full-stack optimizations, including the Dynamo framework and NVFP4 format, further unlock MoE’s potential. Open-source inference frameworks like TensorRT-LLM, SGLang, and vLLM support these optimizations. Specifically, SGLang has been instrumental in validating techniques used with the GB200 NVL72, helping to mature large-scale MoE deployments. Cloud providers like Amazon Web Services and Google Cloud are deploying the GB200 NVL72 to bring this MoE performance to enterprises.
With GB200 NVL72 and Together AI’s custom optimizations, we are exceeding customer expectations for large-scale inference workloads for MoE models like DeepSeek-V3.
Vipul Ved Prakash, cofounder and CEO of Together AI
NVIDIA GB200 NVL72: Resolving MoE Scaling Bottlenecks
The NVIDIA GB200 NVL72 addresses MoE scaling bottlenecks through extreme codesign, integrating 72 Blackwell GPUs as a unified system that delivers 1.4 exaflops of AI performance alongside 30 TB of fast shared memory. A key element is the NVLink Switch, which provides 130 TB/s of connectivity for rapid GPU-to-GPU communication. This architecture allows experts to be distributed across up to 72 GPUs, minimizing parameter-loading pressure on individual GPU memory and increasing concurrent-user capacity.
Scaling MoE models previously ran into memory limits and communication latency between experts. The GB200 NVL72 resolves this by reducing the number of experts per GPU, lessening the load on high-bandwidth memory, while fast NVLink communication accelerates collaboration between experts. NVIDIA’s Dynamo framework and inference software such as TensorRT-LLM, SGLang, and vLLM further optimize performance, enabling disaggregated serving and large-scale expert parallelism.
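For a sense of what disaggregated serving means, here is a toy sketch that separates prompt processing (prefill) from token generation (decode) into different workers joined by a queue. It illustrates only the concept; it does not use the Dynamo API, and all names here are hypothetical.

```python
# Conceptual sketch of disaggregated serving: prefill and decode run on separate
# worker pools connected by a queue, so each pool can be sized and scaled
# independently. Purely illustrative; not the API of any real framework.
from queue import Queue

prefill_out = Queue()

def prefill_worker(prompt):
    # In a real system, prefill-optimized GPUs build the KV cache for the prompt.
    kv_state = f"<kv for '{prompt}'>"
    prefill_out.put((prompt, kv_state))

def decode_worker():
    # Decode-optimized GPUs pull the finished prefill state and stream tokens.
    prompt, kv_state = prefill_out.get()
    return f"response to '{prompt}' generated from {kv_state}"

prefill_worker("Explain mixture-of-experts routing")
print(decode_worker())
```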
The Kimi K2 Thinking MoE model, ranked highest among open-source models on the Artificial Analysis leaderboard, achieves a 10x performance increase on the GB200 NVL72 compared with the NVIDIA HGX H200. The gains extend to models such as DeepSeek-R1 and Mistral Large 3, underscoring MoE’s growing dominance in frontier AI. Cloud providers including Amazon Web Services, Google Cloud, and Microsoft Azure are deploying the GB200 NVL72 to support these advanced AI workloads.
Deployment of GB200 NVL72 by Cloud Providers
The NVIDIA GB200 NVL72 is a rack-scale system designed to overcome scaling bottlenecks in mixture-of-experts (MoE) models. It features 72 NVIDIA Blackwell GPUs working as one, delivering 1.4 exaflops of AI performance and 30 TB of fast shared memory, with an NVLink Switch providing 130 TB/s of connectivity for fast GPU-to-GPU communication. By distributing experts across up to 72 GPUs, the GB200 NVL72 reduces memory pressure and accelerates communication, allowing far greater scalability than previous-generation systems such as the HGX H200.
Several leading cloud providers are deploying the GB200 NVL72, including Amazon Web Services, Core42, Google Cloud, and Microsoft Azure. This deployment allows their customers to leverage the system’s capabilities for running MoE models in production. CoreWeave notes that its customers are already using the platform for agentic workflows, benefiting from the combined performance, scalability, and reliability. DeepL is also utilizing the Blackwell NVL72 design to build and deploy next-generation AI models.
MoE architectures, which power the top 10 most intelligent open-source models, including Kimi K2 Thinking and DeepSeek-R1, have seen significant performance gains on the GB200 NVL72. Kimi K2 Thinking, for instance, achieves a 10x performance leap compared with the NVIDIA HGX H200. The system’s architecture and optimizations, including the NVIDIA Dynamo framework and NVFP4 format, address key limitations in MoE scaling, allowing for more efficient and powerful AI inference.
