Mixture-of-Experts (MoE) models represent a promising path towards scaling artificial intelligence, activating only specific parts of the network to reduce computational demands, but current optimisation techniques offer limited gains in real-world performance. Krishna Teja Chitty-Venkata, Sandeep Madireddy, Murali Emani, and Venkatram Vishwanath, all from Argonne National Laboratory, address this challenge with LExI, a novel method that dynamically adjusts the number of active experts within each layer of a pre-trained MoE model. Their research demonstrates that simply reducing the model’s size does not necessarily translate to faster processing, and that a more nuanced approach is required to unlock the full potential of MoE architectures. By intelligently allocating resources based on the importance of each layer, LExI achieves significant improvements in inference efficiency, with experiments showing that Qwen1.5-MoE maintains the same processing speed while improving accuracy by ten percent on powerful hardware. This work represents a crucial step towards deploying large, efficient MoE models for a wider range of applications.
Layerwise Top-k Optimisation for Mixture-of-Experts
Researchers are refining how Mixture-of-Experts (MoE) models, a powerful type of large language model, balance performance and efficiency. These models activate only a portion of their parameters, but determining the optimal number of active experts, the ‘top-k’ value, has proven challenging. This work introduces a method that dynamically adjusts the top-k value for each layer of the model, rather than applying a single value across the entire network. This approach aims to maximise performance while minimising computational cost, offering a more nuanced control over the trade-off between speed and accuracy.
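To make the layer-wise idea concrete, the sketch below (illustrative PyTorch, not the authors' implementation) shows a toy sparse MoE layer whose router keeps a layer-specific `top_k`, so that different layers of the same model can activate different numbers of experts instead of sharing one global value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy sparse MoE feed-forward layer with a per-layer top_k (illustrative only)."""
    def __init__(self, d_model: int, n_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k  # layer-specific: a LExI-style allocation would set this per layer
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        logits = self.router(x)                           # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)    # keep only the top_k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# A global top-k uses the same value everywhere; a layer-wise scheme might instead use
# e.g. [4, 4, 2, 1, 2, 4] across six layers rather than a fixed 4.
layers = [SparseMoELayer(d_model=64, n_experts=8, top_k=k) for k in (4, 4, 2, 1, 2, 4)]
```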
Evaluations across a diverse set of MoE models, including DeepSeek, OLMoE, and Mixtral, demonstrate the effectiveness of this layer-wise optimisation. Analysis reveals that different layers exhibit varying sensitivities to changes in the top-k value; some layers are less affected by reductions, while others require more experts to maintain performance. This understanding allows the method to intelligently allocate resources, improving overall efficiency without sacrificing accuracy. By activating fewer experts per layer, the method reduces computational cost and speeds up inference, paving the way for faster and more resource-efficient large language models.
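One simple way to probe that sensitivity, shown below as a hedged sketch rather than the paper's procedure, is to measure how much a layer's output changes when its top-k is reduced on random inputs; layers whose outputs barely move are natural candidates for a smaller per-layer k.

```python
import torch

@torch.no_grad()
def topk_sensitivity(moe_layer, reduced_k: int, d_model: int, n_tokens: int = 512) -> float:
    """Relative output change of one MoE layer when its top_k is temporarily reduced."""
    x = torch.randn(n_tokens, d_model)      # random probe inputs, no calibration data
    original_k = moe_layer.top_k
    baseline = moe_layer(x)
    moe_layer.top_k = reduced_k              # activate fewer experts
    reduced = moe_layer(x)
    moe_layer.top_k = original_k             # restore the original setting
    return ((baseline - reduced).norm() / baseline.norm()).item()

# e.g. with the toy SparseMoELayer from the previous sketch:
# sensitivities = [topk_sensitivity(layer, reduced_k=1, d_model=64) for layer in layers]
```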
Dynamic Expert Capacity Optimisation for MoE Models
Researchers have developed LExI, a new technique for optimising Mixture-of-Experts (MoE) models without requiring any training data. LExI addresses limitations of traditional pruning methods by dynamically adjusting the number of active experts per layer, recognising that different layers have varying computational needs. The method analyses pretrained model weights to estimate the relative importance of each layer, then adaptively assigns the number of active experts accordingly. This layer-adaptive allocation contrasts with existing methods that rely on dataset-driven pruning or routing adjustments, making LExI suitable for deployment settings where data access is limited.
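A minimal sketch of that allocation step follows, assuming a simple weight-based importance proxy (the norm of each layer's router matrix) and a proportional, clamped allocation rule; both are illustrative stand-ins rather than the paper's exact criterion.

```python
import torch

def layer_importance(router_weights):
    """Data-free importance proxy: one scalar per MoE layer from its pretrained router matrix.

    router_weights: list of (n_experts, d_model) tensors taken from the pretrained model.
    """
    return torch.tensor([w.norm().item() for w in router_weights])

def allocate_top_k(importance, k_default, k_min, k_max):
    """Scale the default top-k by normalised layer importance, clamped to [k_min, k_max]."""
    scale = importance / importance.mean()
    k = torch.clamp((scale * k_default).round(), k_min, k_max)
    return [int(v) for v in k]

# Example with six hypothetical layer-importance scores and a default top-k of 4.
imp = torch.tensor([1.2, 0.8, 1.0, 0.5, 1.5, 1.0])
print(allocate_top_k(imp, k_default=4, k_min=1, k_max=8))  # -> [5, 3, 4, 2, 6, 4]
```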
Experiments demonstrate that LExI significantly outperforms traditional expert pruning, achieving a 10% improvement in accuracy on the Qwen1.5-MoE model while maintaining the same throughput on H100 GPUs. By reducing the average number of activated experts per layer, LExI minimises both latency and memory bandwidth usage, offering a more efficient alternative to fixed top-k routing. The method is designed as a simple, plug-and-play solution that can be readily integrated into various inference frameworks, providing a versatile tool for optimising MoE models across diverse tasks and architectures. The team also strategically minimised the total number of activated experts across the entire model, reducing communication volume and improving overall performance.
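The sketch below illustrates what such plug-and-play integration might look like, assuming the serving framework exposes a per-block top-k attribute; the module and attribute names (`block_sparse_moe`, `top_k`) follow the Mixtral implementation in Hugging Face transformers and may differ elsewhere, and the per-layer values are a toy allocation rather than the paper's.

```python
from transformers import AutoModelForCausalLM

# Toy per-layer allocation (layer index -> active experts), e.g. from a LExI-style analysis.
per_layer_k = {0: 2, 1: 2, 2: 1, 3: 2}

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
for layer_idx, layer in enumerate(model.model.layers):
    moe_block = layer.block_sparse_moe                            # Mixtral's sparse MoE block
    moe_block.top_k = per_layer_k.get(layer_idx, moe_block.top_k) # override with layer-specific k
# The model can then be served as usual; each layer routes tokens to its own number of experts.
```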
LExI Optimises MoE Models Without Pruning
Researchers have developed LExI, a novel data-free optimisation technique that significantly enhances the efficiency of Mixture-of-Experts (MoE) models without sacrificing accuracy. LExI addresses limitations of traditional post-training optimisations like pruning by intelligently determining the optimal number of active experts per layer within a pre-trained MoE model, adapting the allocation based on the relative importance of each layer. Experiments across state-of-the-art language and vision models demonstrate that LExI consistently outperforms traditional pruning methods. For example, using LExI, the Qwen1.5-MoE model achieves the same throughput as traditional expert pruning, but with a 10% improvement in accuracy. On the OLMoE-1B-7B model, LExI achieves a 10% higher accuracy than 50% intra-expert pruning while maintaining equivalent throughput. Across multiple models, including MiniCPM-MoE and Mixtral-8x7B, LExI delivers accuracy improvements ranging from 6.5% to 15% at comparable throughput levels. The team’s approach navigates the complex search space of expert allocation without requiring any calibration data during optimisation, making it particularly well-suited for large-scale models. Evaluations on long-context tasks, such as the Qasper dataset, reveal that LExI maintains or improves both F1 scores and throughput, demonstrating its ability to preserve performance on complex reasoning tasks.
LExI Optimises Experts, Boosts Throughput and Accuracy
This research introduces LExI, a novel method for optimising Mixture-of-Experts (MoE) models, which are increasingly popular due to their efficient scaling capabilities. LExI determines the optimal number of active experts per layer within a pretrained MoE model, improving computational efficiency without significant accuracy loss. The method achieves this by analysing model weights to assess layer importance and adaptively assigning the number of experts accordingly, representing a data-free post-training optimisation technique. Experiments across multiple language and vision MoE models demonstrate that LExI consistently outperforms traditional pruning methods, delivering substantial throughput gains while maintaining, or even improving, accuracy. For instance, on several models, LExI matches the throughput of heavily pruned models but with notably better accuracy, and in some cases, even surpasses the performance of the original unpruned model. Importantly, LExI requires no retraining or calibration data, making it a practical and efficient inference-time optimisation technique that reduces latency and resource usage.
👉 More information
🗞 LExI: Layer-Adaptive Active Experts for Efficient MoE Model Inference
🧠 ArXiv: https://arxiv.org/abs/2509.02753
