Quantizing vision-language models presents a significant challenge for practitioners seeking to reduce computational cost without sacrificing performance. Gautom Das, Vincent La, and Ethan Lau from the University of Maryland, College Park, together with Abhinav Shrivastava and Matthew Gwilliam, investigate best practices for aggressively quantizing these complex multimodal pipelines. Their research explores how techniques such as GPTQ and AWQ affect captioning, retrieval, and question answering when applied to the vision, language, and connection components of such models. Crucially, the team demonstrates that the visual transformer (ViT) and the large language model (LLM) contribute comparably to overall performance despite their very different parameter counts, and that lower-bit quantization of the LLM can retain surprisingly high accuracy, offering valuable guidance for deploying efficient multimodal large language models (MLLMs).
The team conducted a dense grid search over bit widths and combinations of model components, block groups, and layer types to understand how sensitive each part of the MLLM is to quantization. This systematic approach allowed a detailed analysis of how different quantization strategies affect performance across tasks and model architectures such as BLIP-2 and LLaVA. The research reveals several key principles governing how MLLMs respond to quantization, providing practical guidance for efficient deployment.
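As an illustration, the sweep can be thought of as the loop below. This is a minimal sketch restricted to whole components (the paper also varies block groups and layer types), and `quantize_fn` and `eval_fn` are hypothetical callables standing in for an actual GPTQ/AWQ backend and benchmark harness, not the authors' released code.

```python
# Minimal sketch of the bit-width sweep, assuming hypothetical callables:
#   quantize_fn(method, config) -> model with each component quantized to config[component] bits
#   eval_fn(model)              -> scalar task score (e.g. CIDEr or QA accuracy)
from itertools import product

COMPONENTS = ("vit", "connector", "llm")   # vision encoder, connection module, language model
BIT_WIDTHS = (3, 4, 8, 16)                 # candidate bits per weight; 16 ~ full-precision baseline
METHODS = ("gptq", "awq")

def run_grid_search(quantize_fn, eval_fn):
    """Sweep method x per-component bit-width combinations and record task scores."""
    results = {}
    for method in METHODS:
        for bits in product(BIT_WIDTHS, repeat=len(COMPONENTS)):
            config = dict(zip(COMPONENTS, bits))   # e.g. {"vit": 4, "connector": 16, "llm": 3}
            results[(method, bits)] = eval_fn(quantize_fn(method, config))
    return results
```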
Beyond component sensitivity, task characteristics play a vital role: reasoning tasks favour LLM precision, while visual-textual alignment tasks have more balanced requirements. Moreover, the choice of quantization method dramatically redistributes component importance, with AWQ concentrating on LLM preservation and GPTQ distributing importance more evenly. Architectural dependencies create interaction effects, necessitating holistic pipeline analysis rather than independent component evaluation. These findings show that not all components are equally sensitive to reduced precision, which allows targeted quantization strategies that optimize the model size/task performance trade-off. By minimizing information loss across the most salient model components, the team paves the way for deploying quantized multimodal models in resource-constrained environments and broadening access to these powerful technologies.
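Such results translate directly into targeted strategies, for instance picking the best-scoring configuration that fits a memory budget. The sketch below assumes the `results` dictionary produced by the grid-search sketch above; the per-component parameter counts are illustrative placeholders, not figures from the paper.

```python
# Hypothetical parameter counts per component, for illustration only.
PARAMS = {"vit": 1.0e9, "connector": 0.1e9, "llm": 7.0e9}
COMPONENTS = ("vit", "connector", "llm")

def weight_size_gb(bits):
    """Approximate weight memory (GB) for a (vit_bits, connector_bits, llm_bits) tuple."""
    return sum(PARAMS[c] * b / 8 / 1e9 for c, b in zip(COMPONENTS, bits))

def best_under_budget(results, budget_gb):
    """results: {(method, (vit_bits, connector_bits, llm_bits)): score} from the grid search."""
    feasible = {k: v for k, v in results.items() if weight_size_gb(k[1]) <= budget_gb}
    return max(feasible, key=feasible.get) if feasible else None
```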
Quantization of BLIP-2 for Multimodal Performance
Experiments harnessed the BLIP-2 model for captioning tasks, measuring performance with the CIDEr metric as a function of bits per weight (bpw) and quantization method. The team quantified captioning performance at bit widths ranging from 4 to 16 bits per weight, establishing a detailed performance profile for both GPTQ and AWQ quantization strategies. The team also assessed the LLaVA model, focusing on question-answering accuracy and comparing configurations with 3-bit and 16-bit ViTs and with 3-bit and 8-bit LLMs. This work pioneered a comparative analysis of ViT and LLM sensitivity to quantization, revealing that despite substantial differences in parameter size, both components are of comparable importance to overall model performance. By meticulously mapping component sensitivities and optimal bit allocations, the research provides practical insights for deploying efficient MLLMs and optimizing the trade-off between model size and task performance.
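For reference, captioning quality could be scored roughly as follows. This sketch assumes the `pycocoevalcap` CIDEr implementation and a hypothetical `generate_caption` wrapper around the quantized BLIP-2 pipeline; it is not the authors' evaluation code.

```python
# Sketch of CIDEr scoring for a quantized captioning model (pycocoevalcap assumed installed).
from pycocoevalcap.cider.cider import Cider

def cider_score(generate_caption, images, references):
    """images: {image_id: image}; references: {image_id: [reference captions]}, e.g. COCO Captions."""
    hypotheses = {img_id: [generate_caption(img)] for img_id, img in images.items()}
    score, _per_image = Cider().compute_score(references, hypotheses)
    return score  # this is the value plotted against bits per weight
```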
Quantization Optimises MLLM Memory and Latency
Scientists achieved significant reductions in both memory and latency within multimodal large language models (MLLMs) through aggressive quantization techniques. Experiments revealed that ViT and LLM components exhibit comparable importance in overall model performance, despite substantial differences in their parameter sizes. Results demonstrate that the BLIP-2 model, when evaluated on the COCO Captions dataset using the CIDEr metric, maintains performance across a range of quantization levels from 4 to 16 bits per weight. Specifically, the team measured CIDEr scores that remained relatively stable as bit width decreased, indicating effective preservation of captioning quality.
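To put the memory side of this in perspective, weight storage scales roughly linearly with bits per weight. The arithmetic below is purely illustrative (real GPTQ/AWQ formats add a small overhead for per-group scales and zero-points), and the 7B-parameter example is an assumption rather than a model from the paper.

```python
# Back-of-envelope weight-memory estimate as a function of bits per weight (bpw).
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    return num_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

for bpw in (16, 8, 4):
    # e.g. a 7B-parameter LLM: ~14 GB at 16 bpw, ~7 GB at 8 bpw, ~3.5 GB at 4 bpw
    print(f"{bpw:>2} bpw: {weight_memory_gb(7e9, bpw):.1f} GB")
```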
Further analysis using the LLaVA model and GPTQ quantization revealed that accuracy was maintained across varying bit allocations for both the ViT and LLM components. Tests show that a 3-bit ViT combined with a 16-bit LLM yields comparable results to an 8-bit LLM paired with a 3-bit ViT, highlighting the nuanced interplay of component precisions. The data show that task characteristics fundamentally determine optimal bit allocations, with reasoning tasks favouring LLM precision and visual-textual alignment tasks exhibiting more balanced requirements. The researchers also found that the choice of quantization method, GPTQ versus AWQ, dramatically redistributes component importance, with AWQ concentrating on LLM preservation and GPTQ distributing importance more evenly. These findings highlight the value of holistic pipeline analysis, as architectural dependencies create interaction effects that necessitate evaluating the entire system rather than individual components. This work provides a foundation for deploying quantized MLLMs that optimise the model size/task performance trade-off, addressing critical issues of cost and accessibility.
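One way to read such findings is as a per-component sensitivity analysis. The sketch below uses the same hypothetical `quantize_fn` and `eval_fn` as earlier (not the authors' released tooling): it quantizes one component at a time, then the ViT and LLM together, which exposes the interaction effects described above.

```python
FULL_PRECISION = {"vit": 16, "connector": 16, "llm": 16}

def sensitivity_report(quantize_fn, eval_fn, method="gptq", bits=4):
    """Score drop when each component is quantized alone, and when ViT + LLM are quantized jointly."""
    baseline = eval_fn(quantize_fn(method, FULL_PRECISION))
    drops = {}
    for component in FULL_PRECISION:                      # individual sensitivity
        config = dict(FULL_PRECISION, **{component: bits})
        drops[component] = baseline - eval_fn(quantize_fn(method, config))
    joint = dict(FULL_PRECISION, vit=bits, llm=bits)      # joint quantization of ViT and LLM
    drops["vit+llm"] = baseline - eval_fn(quantize_fn(method, joint))
    # If drops["vit+llm"] > drops["vit"] + drops["llm"], the effects are non-additive.
    return drops
```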
Quantization Sensitivity in Vision-Language Models
Scientists have presented a systematic investigation into the effects of quantization on vision-language models, focusing on how the components of multimodal architectures respond to reduced precision. Notably, simultaneously quantizing both the ViT and the LLM degrades performance more than quantizing either alone, highlighting non-additive effects that arise from sequential dependencies within the pipeline. These findings offer practical guidance for optimising the performance-efficiency trade-off in multimodal systems, potentially enabling wider deployment in resource-constrained settings. The authors acknowledge that the study is limited to simulated quantization, which does not capture end-to-end latency or hardware-specific optimisations, but they suggest future work could address these constraints. They have released open-source tools, including calibration implementations and component analysis, to facilitate further research by the wider community. The methodology developed in this paper offers a systematic approach to quantifying component importance across diverse architectures, informing practical trade-offs in multimodal model compression for deployment on devices with limited resources, such as mobile phones or edge computing platforms.
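To make the notion of simulated quantization above concrete, a minimal round-to-nearest fake-quantization of a weight matrix might look like the following. This is a generic symmetric scheme for illustration, not the GPTQ or AWQ algorithms themselves.

```python
# Minimal simulated ("fake") quantization: weights are rounded to a low-bit uniform
# grid and immediately dequantized, so accuracy effects can be measured without
# specialised low-bit kernels. Generic symmetric round-to-nearest, per output row.
import torch

def fake_quantize(weight: torch.Tensor, bits: int) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1                                        # e.g. 7 for 4-bit signed
    scale = weight.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q * scale                                                  # dequantize back to float

w = torch.randn(4096, 4096)
w4 = fake_quantize(w, bits=4)
print(f"mean |error| at 4 bits: {(w - w4).abs().mean().item():.4f}")
```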
👉 More information
🗞 Towards Understanding Best Practices for Quantization of Vision-Language Models
🧠 ArXiv: https://arxiv.org/abs/2601.15287
