Learnable Permutation Improves Transformer Model Sparsity Performance

Researchers are tackling the challenge of efficiently pruning large language models (LLMs) with a novel approach to weight reordering. Zekai Li, Ji Liu, and Guanchen Li, all from Advanced Micro Devices, Inc., together with Yixing Xu and colleagues, present a learnable permutation framework designed to optimise structured sparsity. The work addresses a key limitation of existing methods: the computationally prohibitive search space for optimal weight permutations. It does so by introducing a differentiable system that learns the best reordering directly. By quantifying the cost of swapping weights and employing a bipartite matching solver, the technique demonstrably improves post-pruning performance on Transformer models and achieves state-of-the-art results, potentially paving the way for more efficient and deployable LLMs.

Learning weight permutations for efficient Transformer sparsity enables faster and more memory-friendly inference

Scientists have developed a novel, end-to-end learnable framework for optimising structured sparsity in Transformer models, achieving state-of-the-art results in both vision and language tasks. The research addresses a key limitation in model pruning: the mismatch between rigid sparsity patterns and the inherent distribution of weights within complex architectures like Transformers.
This work introduces a method to reorder weights before pruning, a process known as weight permutation, aligning weight importance with the sparsity pattern to minimise accuracy loss. The team achieved this by moving beyond traditional heuristic and greedy algorithms, which struggle with the exponential growth of the permutation search space in large models.

Central to this breakthrough is a learnable permutation cost matrix, which quantifies the impact of swapping input channels within a weight matrix. This cost matrix is then used within a differentiable bipartite matching solver, enabling the framework to determine the optimal binary permutation matrix.
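To make this concrete, here is a minimal sketch (in PyTorch, not the authors' code) of how a learnable cost matrix can be relaxed into a soft permutation using Sinkhorn normalisation, a standard differentiable surrogate for bipartite matching; the matrix size, temperature, and iteration count are illustrative assumptions.

import torch

def sinkhorn(log_alpha: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    # Iteratively normalise rows and columns so the matrix becomes
    # (approximately) doubly stochastic, i.e. a soft permutation.
    for _ in range(n_iters):
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=1, keepdim=True)
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=0, keepdim=True)
    return log_alpha.exp()

n_channels = 8
# Learnable cost matrix: entry (i, j) scores the cost of moving input channel j to slot i.
cost = torch.nn.Parameter(0.01 * torch.randn(n_channels, n_channels))
soft_perm = sinkhorn(-cost / 0.1)  # lower cost -> higher matching score; 0.1 is a temperature
print(soft_perm.sum(dim=0), soft_perm.sum(dim=1))  # rows and columns each sum to roughly 1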

Crucially, the researchers designed this solver to be differentiable, allowing for gradient-based training despite the inherently discrete nature of permutation operations. Furthermore, a sparsity optimisation loss function was implemented to directly optimise the permutation operator, balancing task performance with alignment to a dense teacher model via knowledge distillation.
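A hedged sketch of what such an objective can look like is given below: the pruned (student) model's task cross-entropy combined with a KL-divergence distillation term against the dense teacher's logits. The weighting lambda_kd and the temperature are assumptions for illustration, not values from the paper.

import torch
import torch.nn.functional as F

def sparsity_opt_loss(student_logits, teacher_logits, labels,
                      lambda_kd: float = 1.0, temperature: float = 2.0):
    # Task term: standard cross-entropy on the pruned model's predictions.
    task_loss = F.cross_entropy(student_logits, labels)
    # Distillation term: match the dense teacher's softened output distribution.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return task_loss + lambda_kd * kd_loss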

The innovation lies in the framework’s ability to jointly optimise channel permutation and structured pruning in an end-to-end manner, deriving a unique permutation matrix for each weight tensor. Experiments conducted on vision Transformers such as ViT, DETR, and DiT, alongside large language models including GPT, LLaMA, Qwen, and DeepSeek, demonstrate significant improvements over existing methods.

Results show the proposed framework achieves state-of-the-art structured sparsity, substantially reducing accuracy degradation compared to traditional greedy baselines across various benchmarks. This research opens avenues for deploying large-scale Transformer models on resource-constrained hardware, addressing the growing computational demands of modern AI applications.

By enabling more efficient pruning without significant accuracy loss, the framework facilitates the development of faster and more accessible AI systems, particularly in areas like computer vision and natural language processing. The team’s approach promises to unlock the full potential of these powerful models in real-world scenarios.

Learning Optimal Weight Permutations via Differentiable Bipartite Matching enables efficient and effective model compression

Researchers developed a novel, end-to-end learnable permutation framework to improve structured sparsity in model pruning, particularly for large language models and Transformer architectures. The study tackled the challenge of reordering model weights to better suit pruning patterns, overcoming limitations of existing greedy and heuristic algorithms.

This work introduces a learnable permutation cost matrix, quantifying the expense of swapping input channels within a weight matrix, enabling a more nuanced approach to reordering. Scientists engineered a differentiable bipartite matching solver, utilising the cost matrix to derive an optimal binary permutation matrix.

This innovative technique circumvents the non-differentiability inherent in discrete permutation operations, facilitating integration with gradient-based training procedures. The system delivers an efficient and accurate method for learning the permutation matrix with minimal computational burden. Experiments employed an end-to-end sparsity optimisation loss function, simultaneously optimising the permutation operator and aligning it with a dense teacher model via knowledge distillation.
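One common way to realise this kind of discrete-yet-trainable step, sketched below under the assumption that a soft (doubly stochastic) permutation is already available, is a straight-through estimator wrapped around a Hungarian assignment solve; the paper's actual solver may differ in its details.

import torch
from scipy.optimize import linear_sum_assignment

def hard_permutation(soft_perm: torch.Tensor) -> torch.Tensor:
    # Project the soft permutation onto the closest binary permutation matrix.
    with torch.no_grad():
        row, col = linear_sum_assignment(soft_perm.detach().cpu().numpy(), maximize=True)
        hard = torch.zeros_like(soft_perm)
        hard[torch.as_tensor(row), torch.as_tensor(col)] = 1.0
    # Straight-through trick: the forward pass uses the hard matrix,
    # while gradients flow through the soft matrix in the backward pass.
    return hard + soft_perm - soft_perm.detach()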

This approach achieves a balance between task performance and adherence to the desired sparsity pattern. The team multiplied the original weights by the derived permutation matrix, generating reordered weights that naturally align with the target sparsity. Validation was conducted on both vision and language Transformers, demonstrating state-of-the-art permutation results for structured sparsity.

The method achieves improved performance by addressing the mismatch between rigid sparsity patterns and inherent weight distributions, preventing the unintentional pruning of important weights. This technique enables significant parameter reduction while maintaining hardware compatibility, particularly on GPUs that accelerate 2:4 sparsity patterns.
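The sketch below, a simplified illustration rather than the authors' pipeline, shows the two steps just described: reordering a weight matrix's input channels with a permutation matrix P (here a random stand-in for the learned one), then applying a 2:4 mask that keeps the two largest-magnitude weights in every group of four.

import torch

def two_four_mask(w: torch.Tensor) -> torch.Tensor:
    # Keep the 2 largest-magnitude entries in each contiguous group of 4 input weights.
    out_dim, in_dim = w.shape
    groups = w.abs().reshape(out_dim, in_dim // 4, 4)
    keep_idx = groups.topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, keep_idx, 1.0)
    return mask.reshape(out_dim, in_dim)

W = torch.randn(16, 32)                # dense weight (out_features x in_features)
P = torch.eye(32)[torch.randperm(32)]  # stand-in for the learned binary permutation matrix
W_perm = W @ P                         # reorder input channels before pruning
W_sparse = W_perm * two_four_mask(W_perm)
print((W_sparse != 0).float().mean())  # ~0.5: exactly 2 of every 4 weights survive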

Learned channel permutation optimises structured sparsity in Transformers by reordering feature maps

Scientists have developed a novel, end-to-end learnable permutation framework for structured sparsity, addressing limitations in current model pruning techniques. The research introduces a learnable permutation cost matrix which quantifies the cost associated with swapping input channels within a weight matrix, enabling a more refined approach to reordering.

A differentiable bipartite matching solver was designed to obtain the optimal binary permutation matrix, guided by the cost matrix, facilitating efficient learning with minimal computational overhead. Experiments revealed that this framework achieves state-of-the-art results in structured sparsity across both vision and language Transformers.

The team measured significant reductions in accuracy degradation compared to traditional greedy baselines, demonstrating improved performance on diverse benchmarks. Specifically, the method introduces an end-to-end sparsity optimisation loss function, jointly optimising channel permutation and structured pruning to balance task performance with alignment to a dense teacher model.

Results demonstrate the framework’s ability to derive a dedicated permutation matrix for each weight tensor, reordering weights to align more naturally with the target sparsity pattern. This approach was validated on models including ViT, large language models, and vision-language models. The differentiable approximation of bipartite matching allows for efficient and accurate learning of the binary permutation matrix, overcoming the challenges posed by the discrete nature of permutation operations.
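As a rough illustration of "one permutation per weight tensor", the snippet below attaches an independent learnable score matrix to every Linear layer of a model; the function name, initialisation, and returned structure are assumptions made for the sketch.

import torch

def attach_permutation_scores(model: torch.nn.Module) -> dict:
    # One learnable n_in x n_in score/cost matrix per Linear weight tensor.
    scores = {}
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            n_in = module.in_features
            scores[name] = torch.nn.Parameter(torch.zeros(n_in, n_in))
    return scores

toy = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.GELU(), torch.nn.Linear(8, 8))
perm_scores = attach_permutation_scores(toy)  # {'0': 8x8 parameter, '2': 8x8 parameter}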

Measurements confirm the effectiveness of the proposed innovations in achieving a fine balance between task-specific performance and alignment with the dense teacher model through knowledge distillation. The breakthrough delivers significant improvements in structured sparsity, outperforming existing heuristic methods and paving the way for more efficient deployment of large-scale models on resource-constrained hardware.

Learned weight reordering enhances transformer sparsity and performance significantly

Scientists have developed a novel end-to-end learnable permutation framework designed to improve structured sparsity in large-scale transformer-based models. This method introduces a learnable permutation cost matrix which quantifies the expense of swapping input channels within a weight matrix, alongside a differentiable bipartite matching solver to determine the optimal reordering of weights given this cost.

A sparsity optimisation loss function then directly refines the permutation process itself. Extensive validation across both vision and language models demonstrates consistent performance gains over existing state-of-the-art techniques. Ablation studies using cross-entropy and layer-wise distillation losses confirm the complementary benefits of combining these approaches, achieving peak results on benchmarks such as ARC-Easy, ARC-Challenge, and MMLU, alongside reduced perplexity.
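A minimal sketch of the kind of combined objective described in that ablation, under the assumption that the layer-wise term is a mean-squared error between student and teacher hidden states, is:

import torch
import torch.nn.functional as F

def ce_plus_layerwise_distillation(student_logits, labels,
                                   student_hiddens, teacher_hiddens,
                                   beta: float = 1.0):
    # Cross-entropy on the sparse model's predictions ...
    loss = F.cross_entropy(student_logits, labels)
    # ... plus a distillation term at every layer against the dense teacher.
    for h_s, h_t in zip(student_hiddens, teacher_hiddens):
        loss = loss + beta * F.mse_loss(h_s, h_t.detach())
    return loss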

The authors acknowledge that the effectiveness of the method may be influenced by the specific architecture and sparsity constraints employed. Future research could explore the application of this framework to other model compression techniques and investigate its adaptability to diverse model architectures and datasets. This work offers a powerful and generalisable strategy for model compression, achieving substantial reductions in model size with minimal impact on performance.

👉 More information
🗞 Learnable Permutation for Structured Sparsity on Transformer Models
🧠 ArXiv: https://arxiv.org/abs/2601.22980

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
