Large language models demonstrate remarkable capabilities, but their substantial memory requirements hinder deployment on everyday devices. Bingxin Xu, together with Zhen Dong from UCSB, Oussama Elachqar from Oumi, and Yuzhang Shang from Oumi and UCF, addresses this challenge with a novel approach to model compression. The team introduces ButterflyQuant, a technique that dramatically reduces memory usage by quantizing models to extremely low bit-widths, specifically two bits, without sacrificing performance. Unlike existing methods that rely on fixed transformations, ButterflyQuant employs learnable yet mathematically constrained transformations, inspired by butterfly decompositions, that adapt to the unique characteristics of each layer within the model. This adaptivity allows ButterflyQuant to suppress the outlier activations that typically arise during extreme quantization, achieving significantly improved accuracy and opening the door to deploying powerful language models on resource-constrained hardware.
Learnable Butterfly Transforms for Efficient LLM Quantization
This research introduces ButterflyQuant, a new technique for compressing large language models (LLMs) that improves quantized accuracy while remaining computationally efficient. The method addresses a key challenge in reducing model size: maintaining accuracy when using extremely low precision. ButterflyQuant adopts a structured, learnable approach based on butterfly transforms, which are known for their computational efficiency. Unlike existing methods that rely on fixed transformations, ButterflyQuant learns a layer-specific transformation during a short calibration process, requiring only a small dataset and minimal computational resources.
The core innovation lies in replacing fixed orthogonal transformations with learnable butterfly transforms parameterized by continuous angles, enabling optimization via gradient descent. This offers a significant advantage over methods limited by discrete, non-differentiable transformations. The method’s efficiency stems from its O(n log n) complexity and its modest parameter count of only (n log n)/2 rotation angles. Because the transforms are orthogonal by construction, the theoretical guarantees on outlier suppression carry over and model accuracy is maintained. Researchers discovered that different layers within transformer models exhibit distinct outlier patterns, motivating the development of layer-adaptive rotations for optimal performance.
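To make this concrete, the sketch below is a minimal, assumed PyTorch-style module, not the authors' implementation: it parameterizes an orthogonal butterfly transform as log2(n) stages of n/2 Givens rotations, each governed by a continuous, learnable angle, giving (n log n)/2 parameters, O(n log n) application cost, and orthogonality for any setting of the angles.

```python
# A minimal sketch of an orthogonal butterfly transform built from learnable
# Givens rotation angles: log2(n) stages of n/2 rotations each, i.e.
# (n/2) * log2(n) angles, applied in O(n log n) time. Illustrative only.
import math
import torch
import torch.nn as nn

class ButterflyRotation(nn.Module):
    def __init__(self, n: int):
        super().__init__()
        assert n & (n - 1) == 0 and n > 1, "n must be a power of two"
        self.n = n
        self.stages = int(math.log2(n))
        # One continuous angle per (stage, pair): (n/2) * log2(n) parameters in total.
        self.theta = nn.Parameter(torch.zeros(self.stages, n // 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., n). Stage s pairs coordinates at distance 2^s and rotates each pair.
        shape = x.shape
        for s in range(self.stages):
            stride = 1 << s
            xs = x.reshape(*shape[:-1], self.n // (2 * stride), 2, stride)
            a, b = xs[..., 0, :], xs[..., 1, :]
            theta = self.theta[s].reshape(self.n // (2 * stride), stride)
            c, sn = torch.cos(theta), torch.sin(theta)
            # Apply the 2x2 Givens rotation [[c, -sn], [sn, c]] to every pair.
            x = torch.stack((c * a - sn * b, sn * a + c * b), dim=-2).reshape(shape)
        return x

# Orthogonality holds by construction, for any angles.
bf = ButterflyRotation(8)
with torch.no_grad():
    bf.theta.uniform_(-1.0, 1.0)                # random angles
Q = bf(torch.eye(8))                            # materialize the 8x8 transform
print(torch.allclose(Q @ Q.T, torch.eye(8), atol=1e-5))    # True
print("learnable angles:", bf.theta.numel())               # (8/2) * log2(8) = 12
```

Because every operation is differentiable in the angles, such a transform can be tuned with ordinary gradient descent, which is precisely what a fixed Hadamard matrix or a discrete permutation does not allow.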
Learnable Butterfly Transforms for Extreme Quantization
To address the challenges of deploying large language models on limited hardware, this study pioneers ButterflyQuant, a novel quantization technique that significantly reduces memory footprint without sacrificing performance. The research focuses on mitigating performance loss caused by outlier activations during extreme 2-bit quantization. Scientists developed a method that replaces fixed Hadamard transforms with learnable butterfly transforms parameterized by continuous Givens rotation angles, enabling smooth optimization via gradient descent. The team constructed the butterfly transforms to guarantee orthogonality, ensuring theoretical benefits in outlier suppression while achieving computational complexity of O(n log n) with only (n log n)/2 learnable parameters.
Researchers identified that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations. During calibration, the study additionally employs a uniformity regularization on post-transformation activations, promoting smoother distributions that are more amenable to quantization. Experiments demonstrate that ButterflyQuant achieves a substantial improvement in perplexity on LLaMA-2-7B with 2-bit quantization, highlighting the effectiveness of the learnable, layer-adaptive approach. Because the transforms are orthogonal, the method redistributes activations across channels, smoothing outlier features without altering the layer’s output, and thereby enables significant memory compression.
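The toy sketch below (assumed sizes, not the paper's code) illustrates both properties: folding an orthogonal rotation Q into a linear layer leaves its output untouched, since Wx = (WQᵀ)(Qx), while the rotation spreads an outlier channel's energy evenly across dimensions. The channel-energy imbalance used here is only a simple stand-in for the paper's uniformity regularizer, and the fixed Hadamard rotation of prior work stands in for the learned butterfly transform.

```python
# Toy sketch: an orthogonal rotation folded into a linear layer leaves the output
# unchanged, because W x = (W Q^T)(Q x), while spreading the outlier channel makes
# the activations far more uniform. Q is a fixed Hadamard rotation here;
# ButterflyQuant would learn Q instead.
import torch

torch.manual_seed(0)
n, m, batch = 8, 16, 32
W = torch.randn(m, n)                                    # a linear layer's weight
x = torch.randn(batch, n)
x[:, 0] *= 20.0                                          # one outlier-heavy channel

# Normalized 8x8 Hadamard matrix (orthogonal), built from Kronecker products.
H2 = torch.tensor([[1.0, 1.0], [1.0, -1.0]])
Q = torch.kron(torch.kron(H2, H2), H2) / 8.0 ** 0.5

# Output preservation: quantizers see Q x and W Q^T, but their product is still W x.
y_plain = x @ W.T
x_rot, W_rot = x @ Q.T, W @ Q.T
print(torch.allclose(y_plain, x_rot @ W_rot.T, atol=1e-2))   # True (float32 round-off)

def uniformity_penalty(acts):
    """Stand-in uniformity measure: how unevenly energy is spread over channels."""
    p = acts.pow(2).mean(dim=0)
    p = p / p.sum()
    return ((p - 1.0 / p.numel()) ** 2).sum()

print(f"channel-energy imbalance before rotation: {uniformity_penalty(x).item():.3f}")
print(f"channel-energy imbalance after rotation:  {uniformity_penalty(x_rot).item():.3f}")
```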
Adaptive Quantization Boosts Large Language Model Compression
ButterflyQuant, a new quantization technique, achieves significant improvements in compressing large language models by addressing the limitations of existing methods. The research focuses on reducing the memory footprint of models like LLaMA-2-7B through 2-bit quantization. ButterflyQuant introduces learnable orthogonal butterfly transforms, replacing fixed Hadamard rotations with a more adaptable approach. These transforms utilize continuous Givens rotation angles, enabling optimization via gradient descent. Experiments demonstrate that ButterflyQuant achieves a substantial improvement in perplexity on LLaMA-2-7B with 2-bit quantization.
The method leverages the principle of incoherence, minimizing quantization error by distributing information evenly across dimensions, and guarantees orthogonality by construction. This ensures theoretical benefits in suppressing outliers while maintaining computational efficiency with only (n log n)/2 learnable parameters. The team discovered that different layers within transformer models exhibit distinct outlier patterns, motivating the development of layer-adaptive rotations. ButterflyQuant’s structured parameterization achieves O(n log n) complexity, balancing expressiveness with efficiency, and requires only minimal calibration data, converging quickly on a single GPU.
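A rough illustration of the incoherence argument (a toy setup with assumed sizes, not the paper's experiment): when a few channels dominate, a per-tensor 2-bit quantizer spends its entire range on the outliers and crushes everything else, whereas rotating first distributes the energy across all dimensions, shrinking both the quantization step and the reconstruction error. A fixed Hadamard matrix plays the rotation's role below; in ButterflyQuant a learned butterfly rotation does.

```python
# Toy illustration: direct 2-bit quantization of outlier-heavy activations versus
# rotate -> quantize -> rotate back with an orthogonal matrix.
import torch

def quantize_2bit(x):
    """Symmetric 2-bit uniform quantizer with a per-tensor scale (4 levels)."""
    scale = x.abs().max() / 1.5                           # levels at ±0.5s and ±1.5s
    return (torch.clamp(torch.round(x / scale - 0.5), -2, 1) + 0.5) * scale

torch.manual_seed(0)
n = 64
x = torch.randn(1024, n)
x[:, :2] *= 25.0                                          # two outlier channels

# Normalized n x n Hadamard matrix via repeated Kronecker products (n = 2^k).
H2 = torch.tensor([[1.0, 1.0], [1.0, -1.0]])
Q = H2
for _ in range(5):
    Q = torch.kron(Q, H2)
Q = Q / float(n) ** 0.5

mse_direct = (quantize_2bit(x) - x).pow(2).mean()
x_rot = x @ Q.T                                           # rotate, quantize, rotate back
mse_rotated = (quantize_2bit(x_rot) @ Q - x).pow(2).mean()
print(f"2-bit MSE, direct:        {mse_direct.item():.1f}")
print(f"2-bit MSE, with rotation: {mse_rotated.item():.1f}")   # far smaller
```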
Learnable Butterfly Transforms Enable Extreme Quantization
ButterflyQuant introduces a new approach to quantizing large language models, achieving significant performance improvements over prior methods at extremely low precision. The research demonstrates that replacing fixed orthogonal transformations with learnable butterfly transforms enables more effective quantization. This method adapts to layer-specific patterns within the model, unlike previous techniques that relied on a single, universal transformation. The result is a substantial reduction in model size without significant loss of accuracy. By parameterizing the butterfly transforms with continuous angles, the method avoids the limitations of discrete, non-differentiable fixed transformations, allowing for smooth and efficient learning. Furthermore, a uniformity regularization technique prevents pathological concentration of activations, improving generalization beyond the calibration dataset. While the authors acknowledge the complexity of scaling these techniques to even larger models, they demonstrate a promising path toward deploying large language models on resource-constrained hardware.
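Finally, a toy calibration sketch (assumed objective and hyperparameters, not the paper's loss) shows why the continuous parameterization matters: a single Givens angle is tuned by plain gradient descent on a small calibration batch to even out per-channel energy, and the learned rotation then cuts the 2-bit quantization error well below that of the identity. The full method does the same with (n log n)/2 butterfly angles per layer and its own calibration objective.

```python
# Toy calibration sketch: learn one continuous Givens angle by gradient descent so
# that per-channel energy becomes uniform, then check how much the learned
# rotation lowers 2-bit quantization error versus no rotation.
import torch

def givens(theta):
    """2x2 Givens rotation matrix for an angle tensor of shape (1,)."""
    c, s = torch.cos(theta), torch.sin(theta)
    return torch.stack((torch.cat((c, -s)), torch.cat((s, c))))

def quantize_2bit(x):
    """Symmetric 2-bit uniform quantizer with a per-tensor scale (4 levels)."""
    scale = x.abs().max() / 1.5
    return (torch.clamp(torch.round(x / scale - 0.5), -2, 1) + 0.5) * scale

def uniformity(acts):
    """Stand-in uniformity objective: penalize uneven per-channel energy."""
    p = acts.pow(2).mean(dim=0)
    p = p / p.sum()
    return ((p - 1.0 / p.numel()) ** 2).sum()

torch.manual_seed(0)
calib = torch.randn(512, 2) * torch.tensor([10.0, 0.5])   # channel 0 carries the outliers

theta = torch.nn.Parameter(torch.zeros(1))                 # continuous, differentiable angle
opt = torch.optim.Adam([theta], lr=0.02)
for _ in range(300):                                       # short calibration run
    loss = uniformity(calib @ givens(theta).T)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    z = calib @ givens(theta).T
    mse_id = (quantize_2bit(calib) - calib).pow(2).mean()
    mse_rot = (quantize_2bit(z) - z).pow(2).mean()
print(f"learned angle {theta.item():+.2f} rad | 2-bit MSE {mse_id.item():.1f} -> {mse_rot.item():.1f}")
```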
👉 More information
🗞 ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
🧠 ArXiv: https://arxiv.org/abs/2509.09679
