MoDES Accelerates Mixture-of-Experts Multimodal Large Language Models, Retaining 97.33% Accuracy While Skipping 88% of Experts

The increasing complexity of multimodal large language models, designed to process both images and text, often comes at the cost of computational efficiency. Researchers Yushi Huang from Hong Kong University of Science and Technology, Zining Wang and Zhihang Yuan from Beihang University, alongside Yifu Ding, Ruihao Gong and Jinyang Guo, address this challenge with a novel approach to expert skipping. They demonstrate that existing methods, originally developed for text-only models, significantly reduce performance when applied to multimodal systems because they fail to account for the unique contributions of different experts and the varying behaviours of visual and textual data. To overcome this limitation, the team introduces MoDES, a training-free framework that dynamically skips redundant experts, enabling faster and more accurate inference. Through extensive testing across multiple benchmarks, MoDES consistently outperforms previous methods, achieving substantial performance gains and accelerating both the initial processing and ongoing generation of responses.

Dynamic Expert Skipping for Multimodal Models

Scientists have developed MoDES, a new method to improve the efficiency of large multimodal models without sacrificing accuracy. MoDES intelligently selects which parts of the model, called experts, to activate for each input, reducing computational cost. The core idea is to dynamically determine the most relevant experts based on the input’s type and content, rather than relying on static or random selection rules. Researchers demonstrated MoDES’s effectiveness through extensive experiments and visual analyses.

The method recognizes that different types of data benefit from different levels of expert activation: image data contains more redundancy among experts than text data, so experts can be skipped more aggressively for images. This results in improved efficiency, and MoDES consistently outperforms existing approaches on various multimodal benchmarks. It offers a promising way to deploy these powerful models in environments with limited computational resources.

Globally-Modulated Gating for Efficient Multimodal Inference

Researchers have created MoDES, a novel framework that enhances the efficiency of Mixture-of-Experts (MoE) Multimodal Large Language Models (MLLMs) during inference while maintaining high accuracy. Recognizing that existing expert skipping methods performed poorly on MLLMs, the team conducted an in-depth analysis revealing critical differences in how experts contribute across layers and modalities. They identified that experts in shallower layers play a more critical role than those in deeper layers, and that experts have a larger effect on updating text data than on image data. Building on these insights, the scientists engineered a globally-modulated local gating (GMLG) mechanism.

This technique combines global layer-specific importance, determined through offline calibration, with local routing probabilities to accurately estimate expert importance. They then implemented a dual-modality thresholding (DMT) method, processing text and image data separately to derive a skipping schedule tailored to each modality. To efficiently determine optimal thresholds for expert skipping, the researchers developed a frontier search algorithm, reducing the search time from several days to just a few hours. Experiments across three model series and 13 benchmarks demonstrate MoDES’s superior performance.
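
To make the mechanism concrete, the following is a minimal, hypothetical Python sketch of how GMLG and DMT could combine: local routing probabilities are modulated by offline-calibrated, layer-specific expert importance, and each token is compared against a threshold chosen by its modality. All function names, tensor shapes, and threshold values here are illustrative assumptions, not the paper’s implementation.

# Hypothetical sketch of how GMLG and DMT could work together; names, shapes,
# and threshold values are illustrative assumptions, not the paper's code.
import torch

def skip_mask(router_logits: torch.Tensor,    # [num_tokens, num_experts] local gate logits
              layer_importance: torch.Tensor, # [num_experts] offline-calibrated global weights for this layer
              is_image_token: torch.Tensor,   # [num_tokens] bool, True for image tokens
              tau_image: float = 0.12,        # hypothetical modality-specific thresholds
              tau_text: float = 0.05) -> torch.Tensor:
    """Return a boolean mask [num_tokens, num_experts]; True means skip that expert."""
    # Local routing probabilities from the MoE gate.
    local_prob = torch.softmax(router_logits, dim=-1)
    # GMLG (assumed combination): modulate local probabilities by per-expert,
    # layer-specific importance, so experts in critical shallow layers score higher.
    score = local_prob * layer_importance.unsqueeze(0)
    # DMT (assumed): image tokens get a looser threshold than text tokens,
    # reflecting the higher redundancy observed for visual data.
    tau = tau_text + (tau_image - tau_text) * is_image_token.to(score.dtype)
    return score < tau.unsqueeze(-1)

An expert flagged by this mask would simply be left out of that layer’s forward pass for the corresponding token, which is where the reported prefilling and decoding speedups would come from.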

For example, when skipping 88% of experts, the team achieved a performance boost of up to 10.67%, reaching 97.33% accuracy. Furthermore, MoDES significantly accelerated inference speed, improving prefilling time by 2.16× and decoding time by 1.26×. These results demonstrate that MoDES effectively balances computational efficiency with model accuracy, offering a substantial advancement in MLLM inference.

Adaptive Expert Skipping Boosts Multimodal LLMs

Researchers have developed MoDES, a new framework that significantly improves the efficiency of Mixture-of-Experts (MoE) Multimodal Large Language Models (MLLMs) without sacrificing accuracy. MoE models, while powerful for vision-language tasks, often suffer from computational bottlenecks during inference due to activating all model parameters for each data point. MoDES addresses this by adaptively skipping redundant experts, reducing computational cost while maintaining performance. Experiments demonstrate that MoDES outperforms existing expert skipping methods, achieving performance boosts of up to 10.67% when skipping 88% of experts, resulting in an accuracy of 97.33%.

The breakthrough lies in recognizing that experts contribute differently across layers and modalities within MLLMs, a factor overlooked by previous methods. To address this, the team introduced a globally-modulated local gating (GMLG) mechanism, which combines layer-specific importance with local routing probabilities to accurately estimate expert importance. A dual-modality thresholding (DMT) method then selectively skips experts based on the modality of the input data, further enhancing efficiency.

The team also developed a novel frontier search algorithm to determine optimal thresholds for expert skipping, reducing the search time from several days to just a few hours. Measurements confirm substantial improvements in inference speed, with prefilling time improved by 2.16× and decoding time by 1.26×. Extensive testing across three model series and 13 benchmarks consistently showed MoDES surpassing state-of-the-art methods, delivering significant performance gains even with extremely high expert skipping ratios exceeding 80%, while retaining over 95% of the original model’s accuracy. These results demonstrate MoDES’s potential to accelerate MLLM inference and enable more efficient processing of complex vision-language tasks.
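
The frontier search itself is not detailed in this summary, so the sketch below only illustrates the search space it navigates: candidate pairs of text and image thresholds scored on a calibration set, with only Pareto-optimal pairs (those where no other pair is both more accurate and skips more compute) retained. The evaluate callback and all names are hypothetical assumptions; a naive grid sweep like this is what would take days, which is the cost the authors’ algorithm avoids.

# Illustrative brute-force view of the threshold search space only; the paper's
# frontier search algorithm is not reproduced here, and `evaluate` is a
# hypothetical callback returning (accuracy, skip_ratio) on a calibration set.
from typing import Callable, List, Tuple

def pareto_frontier(text_taus: List[float],
                    image_taus: List[float],
                    evaluate: Callable[[float, float], Tuple[float, float]]
                    ) -> List[Tuple[float, float, float, float]]:
    """Score every (text, image) threshold pair and keep the Pareto-optimal ones."""
    points = []
    for t_text in text_taus:
        for t_img in image_taus:
            accuracy, skip_ratio = evaluate(t_text, t_img)
            points.append((t_text, t_img, accuracy, skip_ratio))

    def dominated(p, q):
        # q dominates p if it is at least as good on both accuracy and skip ratio,
        # and strictly better on at least one of them.
        return q[2] >= p[2] and q[3] >= p[3] and (q[2] > p[2] or q[3] > p[3])

    return [p for p in points if not any(dominated(p, q) for q in points)]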

Adaptive Expert Skipping in Multimodal Models

Researchers have developed MoDES, a new framework designed to improve the efficiency of Mixture-of-Experts (MoE) multimodal large language models. Their work addresses the computational demands of these models by adaptively skipping redundant experts during inference, refining an approach that had previously been found to degrade performance when applied directly to multimodal systems. The team identified that expert contributions vary across layers and that different modalities, such as vision and text, exhibit distinct behaviours within these models, necessitating a more nuanced approach to expert selection. MoDES incorporates a globally-modulated local gating mechanism and a dual-modality thresholding method, allowing the model to skip experts based on layer-specific importance and the characteristics of each modality.

This approach ensures that the model retains its strong performance while significantly reducing computational costs, demonstrated through improvements in both prefilling and decoding speeds. Extensive experiments across multiple benchmarks show substantial performance gains when skipping a large percentage of experts. The authors acknowledge that the optimal thresholds for expert skipping require careful tuning, but they have also developed an efficient frontier search algorithm to streamline this process. Future work could explore the application of MoDES to other MoE architectures and investigate the potential for further optimization of the thresholding process. The team’s findings highlight the importance of considering modality-specific behaviours and layer-wise contributions when designing efficient inference strategies for complex multimodal models.

👉 More information
🗞 MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
🧠 ArXiv: https://arxiv.org/abs/2511.15690

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
