Reducing the computational demands of large language models remains a significant challenge, and researchers are increasingly turning to techniques like quantization to improve performance. Huanqi Hu, Bowen Xiao, Shixuan Sun, and colleagues present LiquidGEMM, a new approach designed to accelerate these models through efficient 4-bit weight and 8-bit activation quantization. The team addresses a key bottleneck in existing systems, the slow dequantization process on standard processing cores, by developing a hardware-efficient method that streamlines this step and allows for faster computation. Results show that LiquidGEMM significantly outperforms current state-of-the-art kernels, achieving substantial speedups both at the kernel level and within complete language model serving systems, a considerable advance in efficient artificial intelligence.
4-bit Weights, 8-bit Activations, and Accuracy Loss
Quantization is a vital technique for accelerating large language model (LLM) inference by reducing memory requirements and improving computational efficiency. Using 4-bit weights and 8-bit activations (W4A8) offers a strong balance between accuracy and performance. However, existing W4A8 methods can experience accuracy degradation, particularly with larger models or complex tasks, stemming from information loss during quantization. Current approaches typically use post-training quantization (PTQ) or quantization-aware training (QAT). While QAT generally produces better results, it demands substantial computational resources and access to original training data.
PTQ, though more efficient, can still lead to performance drops when applied to models not specifically designed for low-bit quantization. This work investigates new methods to improve the accuracy and robustness of W4A8 quantization within a PTQ framework, aiming to minimise the accuracy difference between the full-precision model and its quantized version without extensive fine-tuning or access to original training data. By addressing these challenges, the team seeks to unlock the full potential of low-bit quantization for LLM deployment, enabling faster and more efficient inference on resource-constrained devices.
Low-Precision Quantization for Faster Inference
Large language models (LLMs) are increasingly powerful, but deploying them efficiently presents significant challenges. A key strategy is quantization: lowering the precision of the numbers representing the model’s weights and activations, which cuts memory usage, accelerates computation, and lowers energy consumption. The goal is to deploy LLMs with high throughput, low latency, and reasonable resource usage. Quantization can, however, cost accuracy, and much of the research focuses on minimising this loss while maximising the benefits. Techniques explored include low-bit quantization down to 8-bit, 4-bit, and even 1-bit precision.
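To make the arithmetic concrete, the sketch below shows symmetric round-to-nearest quantization to 4-bit integers and the matching dequantization, written as plain host code. The per-tensor scale, the clamping range, and the function names are illustrative assumptions rather than the recipe of any specific method discussed here.

```cuda
// Minimal sketch of symmetric 4-bit quantization and dequantization.
// Per-tensor scaling and all names here are illustrative assumptions.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Map floats to signed 4-bit values in [-8, 7], stored one per int8_t for clarity.
std::vector<int8_t> quantize_int4(const std::vector<float>& w, float& scale) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    scale = (max_abs > 0.0f) ? max_abs / 7.0f : 1.0f;  // largest magnitude maps to +/-7
    std::vector<int8_t> q(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        int v = static_cast<int>(std::round(w[i] / scale));
        q[i] = static_cast<int8_t>(std::max(-8, std::min(7, v)));
    }
    return q;
}

// Recover an approximation of the original weights: w_hat[i] = q[i] * scale.
std::vector<float> dequantize_int4(const std::vector<int8_t>& q, float scale) {
    std::vector<float> w(q.size());
    for (size_t i = 0; i < q.size(); ++i) w[i] = static_cast<float>(q[i]) * scale;
    return w;
}
```

Eight-bit activation quantization follows the same pattern with a [-128, 127] range; the gap between the recovered values and the originals is the information loss that the methods below try to minimise.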
Activation-aware weight quantization (AWQ) considers activation distributions when quantizing weights, while SmoothQuant migrates quantization difficulty from activations to weights to reduce the impact of outliers. Researchers also investigate dedicated outlier-suppression techniques. Rotation-based quantization (SpinQuant) uses learned rotations to improve the quantization process, and dual transformation (DuQuant) redistributes outliers to improve performance. Mixed-precision quantization uses different precision levels for different parts of the model, and system co-design (QServe) optimises the quantization algorithm and the serving system together.
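As a rough illustration of the outlier-handling idea, the sketch below computes SmoothQuant-style per-channel smoothing scales that shift quantization difficulty from activations to weights; the alpha value, the calibration inputs, and the function name are simplified assumptions, not the exact published procedure.

```cuda
// Sketch of SmoothQuant-style smoothing: per-input-channel scales move
// quantization difficulty from activations (which carry outliers) to weights.
// Alpha, the calibration statistics, and all names are illustrative assumptions.
#include <cmath>
#include <vector>

// x_absmax[j]: calibrated max |activation| seen for input channel j
// w_absmax[j]: max |weight| over that input channel
std::vector<float> smoothing_scales(const std::vector<float>& x_absmax,
                                    const std::vector<float>& w_absmax,
                                    float alpha = 0.5f) {
    const float eps = 1e-8f;  // avoid division by zero for unused channels
    std::vector<float> s(x_absmax.size());
    for (size_t j = 0; j < s.size(); ++j) {
        // s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
        s[j] = std::pow(x_absmax[j] + eps, alpha) /
               std::pow(w_absmax[j] + eps, 1.0f - alpha);
    }
    return s;
}

// The linear layer is then rewritten as Y = (X * diag(1/s)) * (diag(s) * W):
// activations are divided by s, taming their outliers, while weights absorb s,
// which they tolerate better under low-bit quantization.
```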
PagedAttention is an efficient memory management technique, and FlashAttention-3 speeds up attention calculations. Distillation trains a smaller, quantized model to mimic a larger, more accurate model. The research utilises popular open-source LLMs like Llama 2, Llama 3, Mistral 7B, Mixtral of Experts, and Yi as benchmarks. Frameworks like TensorRT-LLM, FlashAttention, and COMET are also employed. The key findings demonstrate that low-bit quantization is feasible with acceptable accuracy loss if the right techniques are used.
Activation-aware methods and system-level optimisation are crucial, and handling outliers is vital for successful quantization. The combination of algorithmic and system optimizations yields the best results. Overall, the research highlights a strong trend toward making LLMs more accessible and efficient through quantization and optimised serving. It pushes the boundaries of how far LLMs can be compressed without sacrificing too much performance, enabling them to run on a wider range of hardware, including edge devices and consumer GPUs.
LiquidGEMM Accelerates LLM Inference with Quantization
Researchers have developed LiquidGEMM, a new technique for accelerating large language model (LLM) inference by leveraging 4-bit weight and 8-bit activation quantization (W4A8). This approach significantly reduces memory usage and boosts computational efficiency. Experiments demonstrate that LiquidGEMM achieves up to a 2.90x speedup compared to state-of-the-art W4A8 kernels and an impressive 4.94x end-to-end system-level speedup.
The core innovation lies in overcoming limitations in existing W4A8 implementations, which suffer from inefficient dequantization processes on standard CUDA Cores. Current methods struggle to keep pace with the high throughput of Tensor Cores. LiquidGEMM introduces LiquidQuant, a novel quantization scheme that streamlines dequantization, requiring only two arithmetic instructions per four elements, and avoids potential overflow issues. This allows for fast and efficient conversion of 4-bit weights into a format suitable for computation. Furthermore, the team designed an implicit fine-grained pipeline that seamlessly overlaps weight loading, dequantization, and matrix multiplication across different processing units within the GPU, maximising hardware utilization by concurrently performing these operations.
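For context, the plain-CUDA sketch below shows the kind of per-thread work a W4A8 kernel must otherwise do on CUDA cores: unpacking eight signed 4-bit weights from one 32-bit word into int8 values that INT8 Tensor Cores can consume. This is a naive version of that conversion, not LiquidGEMM's two-instruction LiquidQuant sequence; it is only meant to show why shortening this path, and overlapping it with weight loads and Tensor Core work, matters.

```cuda
// Naive int4 -> int8 unpacking on CUDA cores, for illustration only.
// Each 32-bit word packs eight signed 4-bit weights; every nibble must be
// extracted and sign-extended before it can feed an INT8 Tensor Core GEMM.
#include <cstdint>

__device__ void unpack_int4x8_naive(uint32_t packed, int8_t out[8]) {
    for (int i = 0; i < 8; ++i) {
        int v = (packed >> (4 * i)) & 0xF;              // extract the i-th nibble
        out[i] = static_cast<int8_t>((v ^ 0x8) - 0x8);  // sign-extend 4 -> 8 bits
    }
}

__global__ void unpack_weights(const uint32_t* packed, int8_t* out, int n_words) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_words) unpack_int4x8_naive(packed[i], &out[8 * i]);
}
```

According to the paper, LiquidQuant collapses this per-element shifting and sign handling into two arithmetic instructions per four weights without risking overflow, and the implicit fine-grained pipeline keeps the conversion hidden behind weight loading and Tensor Core matrix multiplies.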
Compared to various quantized kernels within the NVIDIA TensorRT-LLM framework, LiquidGEMM delivers performance gains ranging from 1.12x to 1.63x, and achieves up to 1.63x system-level speedup, demonstrating its practical impact on LLM serving performance.
LiquidGEMM Accelerates LLM Inference Significantly
This research addresses a key bottleneck in accelerating large language model (LLM) inference: the dequantization process within weight and activation quantization schemes. The team presents LiquidGEMM, a new hardware-efficient kernel designed to improve performance when using 4-bit weights and 8-bit activations. LiquidGEMM incorporates two main innovations: LiquidQuant, a fast and overflow-safe dequantization algorithm, and an implicit fine-grained pipeline that maximises parallelism across the graphics processing unit (GPU). Experimental results demonstrate significant speedups with LiquidGEMM, achieving up to 2.90 times faster kernel performance and 4.94 times faster end-to-end system performance compared to existing 4-bit weight and 8-bit activation kernels. Furthermore, LiquidGEMM delivers improvements of 1.12 to 1.63 times over the NVIDIA TensorRT-LLM library. These findings demonstrate that careful hardware-aware design can make 4-bit weight and 8-bit activation quantization both efficient and scalable for high-performance LLM inference.
👉 More information
🗞 LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving
🧠 ArXiv: https://arxiv.org/abs/2509.01229
