The increasing scale of Mixture of Experts (MoE) models presents significant challenges for deployment, requiring distributed systems spanning multiple GPUs and nodes. Bowen Zhou, Jinrui Jia, and Wenhao He from Southeast University, alongside Yong Zhang from The Chinese University of Hong Kong and Fang Dong, also of Southeast University, address this issue with their new system, MixServe. Their research focuses on overcoming communication bottlenecks inherent in current distributed MoE serving systems, which typically rely on either tensor parallelism or expert parallelism, each with its own limitations in efficiency and load balancing. MixServe introduces a novel automatic system employing a hybrid parallelism strategy, fusing all-reduce and all-to-all communication algorithms to optimise performance based on network and hardware configurations. Experiments utilising DeepSeek-R1 and Qwen3 demonstrate substantial improvements in inference speed, achieving up to a 50.3% increase in throughput and significant reductions in both time to first token and inter-token latency compared to existing methods.
Mixture Method
The deployment of large language models (LLMs) employing a Mixture of Experts (MoE) architecture is often hampered by memory limitations, necessitating distributed systems utilising multiple GPUs and nodes. This work introduces MixServe, a novel automatic distributed serving system designed to overcome communication bottlenecks inherent in these systems. Researchers engineered a hybrid tensor parallelism (TP) and expert parallelism (EP) strategy, underpinned by a fused all-reduce (AR) and all-to-all (A2A) communication algorithm, to optimise performance.
MixServe operates in two distinct stages: an offline analysis phase and an online serving phase. The offline stage begins with the system profiling model behaviour using varying batch sizes and sequence lengths, gathering observational data. Simultaneously, the system assesses network and hardware configurations, including computational power and both intra- and inter-node bandwidth. This data is then fed into an automatic analyser which determines the optimal parallel strategy for the model, providing crucial input for subsequent weight loading and partitioning. During the online stage, MixServe loads and partitions model weights according to the strategy derived offline. The team injected collective communication operators directly into the model’s forward method, utilising mixed parallel communication groups.
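The paper's actual analyser is not reproduced in this digest, but the offline stage can be illustrated with a small sketch. In the Python snippet below, the class names, the candidate TP/EP degrees, and the toy cost model are all assumptions made for illustration; MixServe's real analyser is driven by the profiling and hardware data described above and would account for far more factors.

```python
# Illustrative sketch of an offline analyser in the spirit of MixServe's first
# stage: combine profiled model shapes with cluster characteristics and pick a
# TP/EP split. All names and the cost model are assumptions, not the paper's code.
from dataclasses import dataclass

@dataclass
class ClusterProfile:
    gpus_per_node: int      # devices sharing the fast intra-node interconnect
    num_nodes: int
    intra_bw_gbps: float    # e.g. NVLink / HCCS bandwidth
    inter_bw_gbps: float    # e.g. InfiniBand / RoCE bandwidth

@dataclass
class LayerProfile:
    tokens: int             # batch_size * sequence_length from profiling runs
    hidden_dim: int

def estimate_cost(tp: int, ep: int, layer: LayerProfile, cluster: ClusterProfile) -> float:
    """Toy cost model: AR traffic grows with the TP group, A2A with the EP group."""
    bytes_per_token = layer.hidden_dim * 2                        # fp16 activations
    ar_volume = 2 * (tp - 1) / tp * layer.tokens * bytes_per_token
    a2a_volume = (ep - 1) / ep * layer.tokens * bytes_per_token
    intra = cluster.intra_bw_gbps * 1e9 / 8                       # bytes per second
    inter = cluster.inter_bw_gbps * 1e9 / 8
    # Assume TP groups are kept inside a node while large EP groups cross nodes.
    ar_time = ar_volume / intra
    a2a_time = a2a_volume / (intra if ep <= cluster.gpus_per_node else inter)
    return ar_time + a2a_time

def choose_strategy(layer: LayerProfile, cluster: ClusterProfile) -> dict:
    world = cluster.gpus_per_node * cluster.num_nodes
    best = None
    for tp in (1, 2, 4, 8):
        if tp > cluster.gpus_per_node or world % tp:
            continue
        ep = world // tp
        cost = estimate_cost(tp, ep, layer, cluster)
        if best is None or cost < best[0]:
            best = (cost, {"tp": tp, "ep": ep})
    return best[1]

# Example: 4 nodes x 8 devices, a DeepSeek-R1-sized hidden dimension.
cluster = ClusterProfile(gpus_per_node=8, num_nodes=4, intra_bw_gbps=400, inter_bw_gbps=100)
layer = LayerProfile(tokens=4096, hidden_dim=7168)
print(choose_strategy(layer, cluster))   # e.g. {'tp': 4, 'ep': 8} for this configuration
```

In the real system, the analyser would also weigh computation time and expert load balance, and it emits per-layer strategies expressed in the grammar described below.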
This approach overlaps intra-node AR communication with inter-node A2A communication, reducing latency. The serving layer itself builds on the vLLM system, which manages memory and schedules requests. To explore the strategy space systematically, the researchers defined a context-free grammar to represent parallel strategies for each decoder layer, allowing for orthogonal and complementary combinations of TP, EP, and data parallelism (DP). A detailed analysis of collective communication operators was undertaken, focusing on the overhead associated with different parallel strategies.
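The grammar itself is not reproduced in this digest; the sketch below shows one plausible shape such a per-layer strategy description could take. Both the EBNF-style productions and the `LayerStrategy` representation are illustrative assumptions, not the paper's definition.

```python
# Illustrative only: a plausible EBNF-style shape for a per-layer parallel
# strategy grammar, plus a minimal Python representation of one derivation.
# The actual productions in the paper may differ.
#
#   layer_strategy ::= attn_strategy " | " moe_strategy
#   attn_strategy  ::= "TP(" degree ")" | "DP(" degree ")"
#   moe_strategy   ::= "EP(" degree ")" | "TP(" degree ")" | hybrid
#   hybrid         ::= "TP(" degree ") x EP(" degree ")"
#   degree         ::= a positive integer dividing the device count
from dataclasses import dataclass

@dataclass
class LayerStrategy:
    attn_tp: int    # tensor-parallel degree for the attention block
    moe_tp: int     # intra-node TP degree applied inside the MoE block
    moe_ep: int     # expert-parallel degree layered on top of it
    dp: int = 1     # data-parallel replicas of the whole layer

    def device_count(self) -> int:
        # TP, EP, and DP compose orthogonally, so their product
        # must equal the number of devices serving the layer.
        return self.dp * self.moe_tp * self.moe_ep

# Example derivation: 4-way TP inside each node x 8-way EP, over 32 devices.
strategy = LayerStrategy(attn_tp=4, moe_tp=4, moe_ep=8)
assert strategy.device_count() == 32
```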
The study revealed that AR communication, decomposed into reduce-scatter (RS) and all-gather (AG) operations, exhibits a communication volume of O(bs·h_d) per round. Experiments conducted on DeepSeek-R1 and Qwen3 demonstrated that MixServe achieves significant performance gains, delivering 1.08 to 3.80x acceleration in time to first token (TTFT), 1.03 to 1.66x acceleration in inter-token latency (ITL), and a 5.2% to 50.3% improvement in throughput compared to existing methods. This demonstrates the efficacy of the fused communication algorithm and the automated parallel strategy selection in optimising MoE deployment.
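To make the quoted complexity concrete, the standard accounting for ring-based all-reduce can be spelled out. The notation here is added for this digest (b batch size, s sequence length, h_d hidden dimension, p devices in the TP group); the paper's own derivation may differ in detail.

```latex
% Ring all-reduce over p devices: a reduce-scatter followed by an all-gather,
% each moving (p-1)/p of the b*s*h_d activation tensor per device.
V_{\mathrm{RS}} = V_{\mathrm{AG}} = \frac{p-1}{p}\, b\, s\, h_d,
\qquad
V_{\mathrm{AR}} = V_{\mathrm{RS}} + V_{\mathrm{AG}}
               = 2\,\frac{p-1}{p}\, b\, s\, h_d
               = O(bs \cdot h_d).
```

Because the per-device volume is nearly independent of p, the dominant practical factor is whether each step of the ring crosses the slower inter-node links, which is exactly the effect quantified in the latency measurements below.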
MixServe Accelerates MoE Inference with Hybrid Parallelism
Scientists have developed MixServe, a novel automatic distributed serving system designed for the efficient deployment of Mixture of Experts (MoE) large language models. The research addresses a critical bottleneck in distributed systems: communication overhead, particularly inter-node communication, which limits the scalability of MoE models with billions or trillions of parameters. MixServe introduces TP-EP hybrid parallelism, leveraging a fused All-Reduce (AR) and All-to-All (A2A) communication algorithm to optimise performance.
Experiments conducted on the DeepSeek-R1 and Qwen3 models demonstrate substantial improvements in inference speed. The team measured a 1.08 to 3.80x acceleration in time to first token (TTFT), indicating a significantly faster initial response time. Inter-token latency (ITL) was also improved, with acceleration ranging from 1.03 to 1.66x, demonstrating quicker processing of subsequent tokens. Throughput measurements revealed a 5.2% to 50.3% improvement compared to existing approaches, signifying a greater volume of processed data per unit of time.
Detailed analysis of communication overhead using DeepSeek-R1 and Qwen3 models with varying degrees of parallelism revealed key insights. The study quantified latency for both AR and A2A operators, showing that communication latency within a node remained low but increased substantially once the parallel degree exceeded eight devices, owing to limitations in inter-node network bandwidth. The data show that with a parallel degree of 32, AR-based tensor parallelism (TP) performed worse than expert parallelism (EP). Measurements confirmed that intra-node communication, utilising four Network Processing Units (NPUs), exhibited lower latency than inter-node communication across four nodes with a single NPU each.
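The paper's measurement harness is not included in this digest; the sketch below shows one common way such AR-versus-A2A latency comparisons are made with torch.distributed. It assumes NCCL and GPUs, whereas the reported measurements use NPUs (which would require the corresponding backend), and the tensor shape and iteration counts are illustrative.

```python
# Minimal latency comparison of all-reduce vs all-to-all at a given parallel
# degree. Not the authors' harness; an illustrative torch.distributed sketch.
# Launch with: torchrun --nproc_per_node=<N> bench_comm.py
import time
import torch
import torch.distributed as dist

def bench(op, iters: int = 20) -> float:
    # Warm up, then time the collective; all ranks must call this together.
    for _ in range(3):
        op()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        op()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())
    world = dist.get_world_size()

    tokens, hidden = 4096, 7168            # illustrative activation shape
    x = torch.randn(tokens, hidden, dtype=torch.float16, device="cuda")
    y = torch.empty_like(x)

    ar = lambda: dist.all_reduce(x)                 # tensor-parallel style traffic
    a2a = lambda: dist.all_to_all_single(y, x)      # expert-parallel style traffic

    t_ar, t_a2a = bench(ar), bench(a2a)
    if rank == 0:
        print(f"world={world}  all_reduce={t_ar*1e3:.2f} ms  all_to_all={t_a2a*1e3:.2f} ms")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```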
Further profiling of a DeepSeek-R1 decoder layer, visualised through a Gantt chart, demonstrated the benefits of decoupling intra-node and inter-node communication. The team recorded that integrating TP within nodes assisted the EP component, reducing communication demands and improving overall efficiency. Specifically, the work highlights how the fused AR-A2A algorithm overlaps intra-node AR communication with inter-node A2A communication, optimising resource utilisation and delivering a breakthrough in MoE model serving.
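As a conceptual illustration of that overlap (not the authors' fused implementation), the snippet below launches the inter-node A2A asynchronously and runs the intra-node AR while it is in flight, using two separate torch.distributed process groups; the group construction and tensor shapes are assumed.

```python
# Conceptual sketch of decoupling intra-node and inter-node communication:
# the intra-node all-reduce proceeds while the inter-node all-to-all is in
# flight. An illustration of the overlap described above, not MixServe's kernel.
import torch
import torch.distributed as dist

def fused_ar_a2a(hidden: torch.Tensor,
                 dispatch_in: torch.Tensor,
                 dispatch_out: torch.Tensor,
                 intra_group: dist.ProcessGroup,
                 inter_group: dist.ProcessGroup):
    """intra_group spans the ranks within one node (built via dist.new_group);
    inter_group spans the matching ranks across nodes."""
    # Kick off the cross-node token dispatch first, without blocking.
    a2a_work = dist.all_to_all_single(dispatch_out, dispatch_in,
                                      group=inter_group, async_op=True)
    # The tensor-parallel reduction stays on the fast intra-node links
    # and overlaps with the in-flight A2A.
    ar_work = dist.all_reduce(hidden, group=intra_group, async_op=True)
    ar_work.wait()
    a2a_work.wait()
    return hidden, dispatch_out
```

Whether the two collectives genuinely overlap in practice depends on the backend and on the two groups using distinct communicators and streams.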
MixServe Accelerates Large Language Model Serving
MixServe, a novel automatic distributed serving system, has been developed to efficiently deploy Mixture of Experts models. The system achieves this through a hybrid tensor parallelism-expert parallelism approach, underpinned by a fused algorithm for all-reduce and all-to-all communication. MixServe dynamically selects the most effective parallel strategy based on model parameters and network configurations, optimising communication overhead and enhancing performance.
Evaluations utilising DeepSeek-R1 and Qwen3 models demonstrate significant improvements in serving large language models. Specifically, the system delivers acceleration in time to first token, inter-token latency, and throughput compared to existing methods, achieving gains of up to 3.80x, 1.66x, and 50.3% respectively. The authors acknowledge that their work concentrates on parallel strategies and communication optimisation, and can be integrated with other LLM serving techniques such as request scheduling and data disaggregation. Future research could explore extending the system’s capabilities to a wider range of model architectures and hardware platforms, though the current implementation represents a substantial advance in efficient MoE deployment.
👉 More information
🗞 MixServe: An Automatic Distributed Serving System for MoE Models with Hybrid Parallelism Based on Fused Communication Algorithm
🧠 ArXiv: https://arxiv.org/abs/2601.08800
