Power-of-Two Quantization Improves LLM Accuracy and Accelerates Inference on GPUs

Large Language Models (LLMs) are transforming natural language processing, but their immense computational demands hinder widespread deployment. Xinyu Wang, Vahid Partovi Nia, and Peng Lu, from McGill University, along with colleagues, address this challenge by improving a technique called power-of-two (PoT) quantization, which reduces the precision of the numbers LLMs use. Their framework, named PoTPTQ, not only achieves greater accuracy than existing methods at extremely low-precision number formats, but also accelerates the process of converting these compact numbers back into a usable format, particularly on modern graphics processing units. By introducing a two-step post-training algorithm that carefully calibrates the quantization process, the team demonstrates significant performance gains, paving the way for faster and more efficient LLMs that can run on a wider range of hardware. This advancement promises to make powerful language models more accessible and sustainable.

Background

Large Language Models (LLMs) demonstrate remarkable capabilities across a wide range of natural language processing (NLP) tasks, including text generation, summarization, and question answering. Deploying these models in real-world applications remains challenging, however, due to their substantial memory and computational requirements. Quantization offers an effective strategy to reduce these costs by converting full-precision weights into lower-bit representations, thereby reducing memory usage and accelerating computation. Despite its benefits, aggressive quantization can degrade accuracy, particularly in generation tasks.
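As a point of reference, the sketch below shows what plain uniform quantization looks like: weights are mapped to a small integer grid plus a scale and zero-point, shrinking storage from 16 or 32 bits per weight down to a few bits. This is a minimal illustration of the general idea, not the paper's PoT scheme.

```python
import numpy as np

def uniform_quantize(w: np.ndarray, bits: int = 4):
    """Map a weight tensor onto a small integer grid (per-tensor, asymmetric).

    Storage drops from 16/32 bits per weight to `bits` per weight; only the
    scale and zero-point are kept in full precision.
    """
    qmax = 2 ** bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / qmax              # step size of the integer grid
    zero_point = round(-w_min / scale)          # integer code that represents 0.0
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floating-point weights from the integer codes."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s, z = uniform_quantize(w, bits=4)
print("max reconstruction error:", np.abs(w - dequantize(q, s, z)).max())
```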

Extreme Quantization for Large Language Models

Post-training quantization (PTQ) methods offer a practical trade-off, compressing pretrained models without full retraining and requiring only small calibration sets. Maintaining accuracy at aggressive quantization levels, such as 2-bit, nevertheless remains difficult. Power-of-Two (PoT) quantization offers a promising direction by enabling multiplications to be replaced with simple shift-and-add operations, leading to substantial speed gains. This structure makes PoT not only hardware-efficient but also statistically well-suited to deep models, since its non-uniform levels concentrate precision near zero, where most weight values lie. However, existing PoT methods fail to retain accuracy when applied to LLMs due to coarse rounding and the lack of effective post-training calibration.
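To make the shift-and-add claim concrete, here is a hypothetical sketch of PoT quantization and an integer matrix-vector product that uses only shifts and adds. The codebook layout, rounding rule, and the point where the per-row scale is applied are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def pot_quantize(w: np.ndarray, scale: float, bits: int = 3):
    """Round each weight to a signed power-of-two level: sign * scale * 2^{-k}.

    With `bits` bits, one bit stores the sign and the rest store the exponent
    k in {0, ..., 2^(bits-1) - 1} (one of several possible codebook layouts).
    """
    n_exp = 2 ** (bits - 1)
    sign = np.where(w < 0, -1, 1).astype(np.int8)
    mag = np.maximum(np.abs(w) / scale, 1e-12)
    k = np.clip(-np.round(np.log2(mag)), 0, n_exp - 1).astype(np.int8)
    return sign, k

def pot_matvec_int(x_int: np.ndarray, sign: np.ndarray, k: np.ndarray):
    """Multiply integer activations by PoT weights with shifts and adds only.

    x * 2^{-k} becomes an arithmetic right shift by k; the per-row `scale`
    is applied once to the accumulated sum afterwards (shift rounding is
    ignored for simplicity).
    """
    shifted = (x_int[None, :] >> k) * sign        # shift replaces the multiply
    return shifted.sum(axis=1)                    # accumulate with adds

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 8)).astype(np.float32)
scale = float(np.abs(w).max())
sign, k = pot_quantize(w, scale, bits=3)
x = rng.integers(-128, 128, size=8)
print(pot_matvec_int(x, sign, k))                 # multiply by `scale` to finish
```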

Naive dequantization can also be inefficient on modern GPUs due to bit-level dependencies and sign-bit entanglement. This work proposes a novel post-training quantization framework for LLMs using power-of-two values, achieving both high accuracy in the low-bit regime and fast, hardware-friendly inference. The method introduces a two-stage algorithm that combines robust scale initialization with lightweight calibration tailored to the PoT structure. Results demonstrate consistent outperformance of strong PTQ baselines at 2- and 3-bit precision across standard benchmarks. A GPU-optimized dequantization kernel leverages bitwise parallelism, resulting in up to 3.67× speed-up on an NVIDIA V100 and 1.63× on an RTX 4090 compared to standard integer dequantization.
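One way to see why PoT codes dequantize so cheaply on a GPU: multiplying a scale by 2^{-k} amounts to an integer subtraction on the exponent field of its FP16 bit pattern, so no floating-point multiplies are needed. The sketch below illustrates that bit-level view with NumPy; it is a conceptual illustration under simplifying assumptions (positive levels only, no subnormals or overflow), not the paper's actual kernel.

```python
import numpy as np

def pot_dequant_fp16_bit_trick(codes: np.ndarray, scale: float) -> np.ndarray:
    """Dequantize PoT exponent codes to FP16 via integer arithmetic on the bits.

    FP16 layout: [sign:1][exponent:5][mantissa:10]. Subtracting k << 10 from the
    packed bits of `scale` divides it by 2^k, so scale * 2^{-k} needs no float
    multiply. Sign bits and the zero code are omitted here for clarity.
    """
    s_bits = np.array(scale, dtype=np.float16).view(np.uint16)   # packed bits of scale
    out_bits = (s_bits - (codes.astype(np.uint16) << 10)).astype(np.uint16)
    return out_bits.view(np.float16)

codes = np.arange(4, dtype=np.uint16)              # exponents k = 0, 1, 2, 3
print(pot_dequant_fp16_bit_trick(codes, 0.5))      # -> [0.5, 0.25, 0.125, 0.0625]
```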

Two-Bit Quantization Achieves Leading LLM Performance

This research introduces PoTPTQ, a novel quantization technique for Large Language Models (LLMs) that achieves state-of-the-art performance at extremely low bit-widths, down to 2 bits. The core idea combines Power-of-Two (PoT) quantization with a two-stage quantization process and output-aligned fine-tuning: scales are first initialized using a grid search and then fine-tuned on a small calibration dataset. Key results demonstrate state-of-the-art accuracy, outperforming existing post-training quantization (PTQ) methods across various LLMs. The method achieves excellent performance with 2-bit and 3.25-bit quantization, significantly reducing model size and memory footprint.
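Read in that light, the two stages might look roughly like the sketch below: a grid search picks an initial scale that minimizes weight reconstruction error, and a short calibration loop then nudges the scale so the quantized layer reproduces the full-precision layer's outputs. The candidate grid, error metric, straight-through rounding, and optimizer settings here are illustrative assumptions rather than the paper's exact algorithm.

```python
import torch

def pot_round_ste(w, scale, levels=4):
    """PoT rounding with a straight-through estimator so `scale` stays trainable.
    (Illustrative parameterization; the paper's rounding rule may differ.)"""
    mag = (w.abs() / scale).clamp_min(1e-12)
    log2m = torch.log2(mag)
    # forward: rounded, clipped exponent; backward: identity on log2m
    exp = log2m + (log2m.round().clamp(-(levels - 1), 0) - log2m).detach()
    return torch.sign(w) * scale * torch.pow(2.0, exp)

def init_scale_grid_search(w, levels=4, candidates=100):
    """Stage 1 (sketch): pick the scale that minimizes weight reconstruction error."""
    best_scale, best_err = None, float("inf")
    for frac in torch.linspace(0.2, 1.0, candidates):
        s = frac * w.abs().max()
        err = ((w - pot_round_ste(w, s, levels)) ** 2).sum().item()
        if err < best_err:
            best_scale, best_err = s.item(), err
    return best_scale

def calibrate_scale(weight, scale_init, calib_inputs, steps=200, lr=1e-3):
    """Stage 2 (sketch): align the quantized layer's outputs with the
    full-precision layer's outputs on a small calibration batch."""
    scale = torch.tensor(scale_init, requires_grad=True)
    opt = torch.optim.Adam([scale], lr=lr)
    for _ in range(steps):
        w_q = pot_round_ste(weight, scale)
        loss = torch.nn.functional.mse_loss(calib_inputs @ w_q.T,
                                            calib_inputs @ weight.T)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            scale.clamp_(min=1e-8)      # keep the scale positive
    return scale.detach()

# Hypothetical usage on one linear layer (weight: (out, in), inputs: (N, in))
weight = torch.randn(64, 128)
calib_inputs = torch.randn(32, 128)
scale = calibrate_scale(weight, init_scale_grid_search(weight), calib_inputs)
```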

The PoT-based dequantization kernel is significantly faster than standard integer-to-FP16 dequantization, achieving up to 3.67× speed-up on an NVIDIA V100 and 1.63× on an RTX 4090. The framework is scalable and can be applied to larger LLMs while preserving performance on downstream tasks. PoTPTQ is a promising technique for deploying LLMs in resource-constrained environments. The significant reduction in model size and memory footprint, combined with fast inference, makes it well suited to edge devices and other applications where computational resources are limited. The two-stage quantization process with output-aligned fine-tuning is crucial for achieving high accuracy at ultra-low bit-widths. PoT quantization offers a compelling trade-off between accuracy and efficiency, enabling faster hardware implementation without significant performance degradation.

Power-of-Two Quantization Improves LLM Inference

This work presents a novel post-training quantization framework that leverages power-of-two (PoT) representations to enable efficient and hardware-friendly inference. The method introduces a two-stage algorithm consisting of data-agnostic scale initialization and data-driven fine-tuning, effectively addressing the accuracy limitations commonly observed in traditional PoT quantization schemes. Through comprehensive experiments on LLaMA models at 2-bit and 3-bit precision, the approach consistently outperforms existing PTQ methods in both perplexity and real-world deployment scenarios. In addition to its strong accuracy, the framework supports parallelizable grid search and lightweight calibration, making it practical for deployment on standard hardware without requiring retraining or large calibration datasets. The optimized integer-only dequantization kernel significantly accelerates inference, achieving up to 3.67× speed-up on NVIDIA V100 and 1.63× on RTX 4090 compared to traditional FP16-based methods. These results highlight the potential of PoT quantization as a scalable and effective solution for low-latency LLM deployment.

👉 More information
🗞 PoTPTQ: A Two-step Power-of-Two Post-training for LLMs
🧠 DOI: https://doi.org/10.48550/arXiv.2507.11959
