The increasing size of modern artificial intelligence models presents a significant challenge, as limited GPU memory often restricts their deployment. Zixu Shen, Kexin Chu, and Yifan Zhang, along with colleagues at their institutions, address this problem with a new system called ExpertFlow, designed to improve the efficiency of Mixture-of-Experts (MoE) models. Current MoE approaches, which activate only parts of a model during use, frequently suffer from delays caused by transferring data between different types of memory. ExpertFlow overcomes this limitation by dynamically predicting which parts of the model will be needed next, and proactively loading them into faster memory. This adaptive approach, which considers factors like data transfer speeds and model size, dramatically reduces delays and allows MoE models to operate with greater speed and efficiency under tight memory constraints.
MoE models scale large language models by dividing the network into experts and routing input tokens to a subset of those experts, increasing model capacity without a proportional increase in computational cost. Realizing the benefits of MoE requires optimizing the inference process, addressing challenges such as communication overhead, load imbalance, and expert activation cost. ExpertFlow tackles these issues through optimized expert activation and token allocation, reducing communication, balancing load, and minimizing activation cost.
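As background, the sketch below shows the basic top-k token-to-expert routing that MoE layers use to activate only a subset of experts per token; the dimensions, expert count, and value of k are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def topk_moe_route(hidden: torch.Tensor, router_weight: torch.Tensor, k: int = 2):
    """Toy top-k MoE routing: each token is dispatched to its k highest-scoring experts.

    hidden:        [num_tokens, d_model] token representations
    router_weight: [d_model, num_experts] learned gating matrix
    """
    logits = hidden @ router_weight                  # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    gate_vals, expert_ids = probs.topk(k, dim=-1)    # per-token expert weights and choices
    return expert_ids, gate_vals

# Illustrative shapes only: 8 tokens, d_model = 64, 16 experts.
tokens = torch.randn(8, 64)
router = torch.randn(64, 16)
expert_ids, gate_vals = topk_moe_route(tokens, router)
```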
Related research demonstrates the breadth of work in this area, with foundational contributions establishing the sparsely-gated MoE layer. Subsequent work, including Switch Transformers and investigations into expert choice routing, has focused on efficient sparsity and expert selection. Optimization efforts, such as ProMoE and Hobbit, have targeted faster inference through proactive caching and mixed-precision expert offloading. This body of work highlights the importance of load balancing, communication reduction, and hardware considerations in achieving efficient MoE performance. ExpertFlow builds on these techniques and contributes new approaches to the challenges of MoE inference, aiming to make MoE models practical and efficient for real-world deployment by intelligently managing expert activation and token allocation.
Expert Prefetching Optimizes Mixture-of-Experts Inference
The research team developed ExpertFlow, a runtime system that accelerates Mixture-of-Experts (MoE) inference by intelligently managing memory access and expert activation. Recognizing that conventional MoE approaches suffer from latency due to frequent data transfers, scientists engineered a system that predicts future expert needs and proactively prefetches parameters. This work centers on dynamically adjusting a ‘step size’, representing the number of layers ahead for which experts are predicted and loaded, to minimize communication overhead and maximize GPU utilization. The initial step size is calculated based on factors including the number of experts to activate, expert size, device communication bandwidth, and per-layer compute time, establishing a baseline for efficient prefetching.
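A minimal sketch of what that baseline calculation could look like, assuming the goal is for a prefetch to finish within the compute time of the intervening layers; the formula, units, and example numbers are illustrative readings of the description above, not the paper's exact formulation.

```python
import math

def initial_step_size(num_experts_to_activate: int,
                      expert_size_bytes: float,
                      bandwidth_bytes_per_s: float,
                      per_layer_compute_s: float,
                      min_step: int = 1) -> int:
    """Estimate how many layers ahead experts must be prefetched so that the
    transfer completes before the layer that needs them starts computing.

    Transfer time for one layer's experts: (num_experts * expert_size) / bandwidth.
    Each layer of lookahead hides roughly one layer's compute time.
    """
    transfer_s = num_experts_to_activate * expert_size_bytes / bandwidth_bytes_per_s
    step = math.ceil(transfer_s / per_layer_compute_s)
    return max(step, min_step)

# Example: 2 experts of 350 MB each over a 25 GB/s link, 6 ms of compute per layer.
print(initial_step_size(2, 350e6, 25e9, 6e-3))  # -> prefetch about 5 layers ahead
```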
To achieve accurate predictions, the study pioneered a hybrid approach combining ‘pre-gating’ with a predictive mechanism that analyzes both token identifiers and recent expert activation states. Pre-gating passes hidden states to the next layer’s router, providing initial guidance, while analysis of token data and activation history refines the prediction, improving accuracy and reducing cache misses. Scientists compute cumulative Euclidean distances among tokens within each batch to estimate expected expert activations per layer, informing the dynamic adjustment of the step size. The system continuously monitors prediction accuracy and adjusts the step size accordingly; if predictions degrade, the step size increases to reduce reliance on uncertain forecasts, and when predictions remain stable, the step size decreases to maximize performance.
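The sketch below illustrates one way such a hybrid prediction could be wired, assuming the next layer's router can be evaluated on the current hidden states (pre-gating) and that recent activations are tracked as exponentially decayed counts; the class, blending weight, and data structures are hypothetical, and token-identifier features are omitted for brevity.

```python
import torch
import torch.nn.functional as F

class HybridExpertPredictor:
    """Blend a pre-gating signal with a decayed history of recent expert activations."""

    def __init__(self, n_experts: int, history_decay: float = 0.9, mix: float = 0.7):
        self.history = torch.zeros(n_experts)   # decayed activation counts
        self.decay = history_decay
        self.mix = mix                          # weight on the pre-gating signal

    def observe(self, activated_experts: torch.Tensor):
        """Record which experts the current layer actually used."""
        self.history *= self.decay
        self.history[activated_experts] += 1.0

    def predict(self, hidden: torch.Tensor, next_router_weight: torch.Tensor, k: int):
        """Predict the k experts most likely to be needed in the next layer."""
        pre_gate = F.softmax(hidden @ next_router_weight, dim=-1).mean(dim=0)  # pre-gating
        hist = self.history / self.history.sum().clamp(min=1e-6)               # history prior
        score = self.mix * pre_gate + (1.0 - self.mix) * hist
        return score.topk(k).indices            # experts to prefetch for the next layer
```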
The research team implemented a feedback loop that monitors indicators such as cache miss rate and wait time, enabling real-time adjustments to the step size and keeping inference stable and responsive. This adaptive process prevents cumulative latency from sequential expert swapping, aligning expert activation with GPU memory availability and interconnect bandwidth. The step size itself is computed from an explicit formula over these measured quantities rather than tuned by hand.
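A controller implementing that feedback rule might look roughly like the sketch below, widening the prefetch horizon when cache misses and waiting time indicate degrading predictions and shrinking it back toward the baseline when they stay low; the thresholds and one-step adjustment rule are assumptions, not values from the paper.

```python
class StepSizeController:
    """Adjust the prefetch step size from runtime cache-miss and wait-time feedback."""

    def __init__(self, base_step: int, max_step: int,
                 miss_rate_high: float = 0.10, wait_ms_high: float = 1.0):
        self.step = base_step
        self.base_step = base_step
        self.max_step = max_step
        self.miss_rate_high = miss_rate_high
        self.wait_ms_high = wait_ms_high

    def update(self, cache_miss_rate: float, avg_wait_ms: float) -> int:
        """Called once per layer (or per batch) with the latest measurements."""
        if cache_miss_rate > self.miss_rate_high or avg_wait_ms > self.wait_ms_high:
            # Predictions are degrading: widen the horizon, as described in the text.
            self.step = min(self.step + 1, self.max_step)
        else:
            # Predictions are stable: shrink back toward the bandwidth/compute baseline.
            self.step = max(self.step - 1, self.base_step)
        return self.step

# Example: start from the baseline step and react to per-layer measurements.
controller = StepSizeController(base_step=3, max_step=8)
step = controller.update(cache_miss_rate=0.15, avg_wait_ms=2.3)  # -> 4
```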
ExpertFlow Dramatically Reduces MoE Model Stall Time
The research team developed ExpertFlow, a system designed to significantly improve the efficiency of Mixture-of-Experts (MoE) models, which are increasingly limited by GPU memory capacity. Experiments demonstrate that ExpertFlow reduces stall time to less than 0.1% of the baseline, a substantial improvement achieved through adaptive expert prefetching and cache-aware routing. The core innovation lies in dynamically adjusting the prediction horizon for expert activation, leveraging runtime statistics such as transfer bandwidth, parameter dimensionality, and feedback signals to optimize performance.
Investigations revealed a non-monotonic relationship between batch size and latency; while increasing batch size initially improves throughput, performance plateaus and then declines after a certain point. Specifically, average waiting latency increases significantly, even as cache-miss latency decreases, demonstrating the complex interplay between workload scale, GPU memory pressure, and cache behavior. Researchers quantified intra-batch diversity using cumulative Euclidean distance among token embeddings, which correlates more strongly with expert demand than batch size alone. This metric provides a stable signal for anticipating expert reuse, swap pressure, and cache residency.
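One way to compute such a diversity score is sketched below as the cumulative pairwise Euclidean distance over a batch's token embeddings; how ExpertFlow normalizes or thresholds this quantity is not described here, so those details are left out.

```python
import torch

def intra_batch_diversity(token_embeddings: torch.Tensor) -> float:
    """Cumulative pairwise Euclidean distance among token embeddings in a batch.

    token_embeddings: [num_tokens, d_model]
    Returns the sum of distances over all unordered token pairs; larger values
    indicate more diverse tokens and, per the observation above, higher expert demand.
    """
    pairwise = torch.cdist(token_embeddings, token_embeddings, p=2)  # [n, n] distances
    return pairwise.triu(diagonal=1).sum().item()                    # count each pair once

# Example: a batch of 32 tokens with 64-dimensional embeddings.
print(intra_batch_diversity(torch.randn(32, 64)))
```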
The team established that the expert loading cost can be approximated using a clear mathematical relationship, highlighting the importance of balancing prefetching latency against layer compute time. ExpertFlow incorporates a predictive mechanism utilizing both token identifiers and recent expert activation states, combined with a pre-gate strategy, to improve the accuracy of expert prefetching and sustain high GPU utilization. The system dynamically determines an adaptive step size for cross-layer expert activation, continuously updating it based on runtime conditions and feedback from cache miss rates and wait times.
Adaptive Expert Prefetching Eliminates Latency
ExpertFlow represents a significant advance in the efficient execution of Mixture-of-Experts models, addressing limitations imposed by constrained GPU memory capacity. The system dynamically adjusts expert prefetching based on runtime statistics, including data transfer bandwidth and parameter dimensionality, and incorporates a hybrid prediction scheme that integrates both prior information and intermediate computational states. This adaptive approach demonstrably reduces expert loading and waiting latency during inference, cutting stall time to less than 0.1% of the baseline.
Experiments confirm that ExpertFlow improves expert prediction accuracy by over 30% and eliminates up to 99.9% of waiting latency while maintaining consistent performance across diverse hardware and workloads, highlighting its scalability for real-world applications. Future work could explore extending the system to handle even larger models and investigating alternative prediction strategies to further optimize performance and resource utilization.
👉 More information
🗞 ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference
🧠 ArXiv: https://arxiv.org/abs/2510.26730
