Large Vision-Language Models Advance Reliability with CoFi-Dec’s Hallucination Resistance and Multi-Level Grounding

Large vision-language models excel at understanding and generating content from images and text, but a persistent problem limits their practical use: these models frequently ‘hallucinate’, producing details inconsistent with the original visual input. Zongsheng Cao from the University of Minnesota, alongside Yangfan He, Anran Liu, Jun Xie, Feng Chen, and Zepeng Wang from Lenovo, addresses this challenge with a new decoding framework called CoFi-Dec. The team’s innovation lies in mimicking the human visual process, starting with a broad understanding of a scene before focusing on finer details, and integrating this approach with generative self-feedback. This training-free method significantly reduces both factual errors and semantic inconsistencies in generated text, demonstrably improving the reliability of large vision-language models across a range of challenging benchmarks and offering a broadly applicable solution for mitigating hallucinations.

CoFi-Dec Hallucination Reduction: Detailed Experiments

This section presents extensive experimental results and analysis supporting the effectiveness of CoFi-Dec in reducing hallucinations and improving performance in large vision-language models. It validates claims made in the main paper, provides detailed data and qualitative examples, demonstrates robustness across benchmarks and model configurations, addresses computational efficiency, and enables reproducibility. Experiments on the MMVP benchmark show that CoFi-Dec improves performance, particularly on tasks requiring fine-grained visual discrimination, highlighting its ability to enhance precision in visual recognition. On LLaVA-Bench, CoFi-Dec generates more grounded, informative, and accurate responses, as confirmed by a GPT-4V evaluation with higher accuracy and detail scores; visual examples further illustrate the reduction in hallucinations.

Efficiency comparisons reveal CoFi-Dec is less efficient than simpler methods but more efficient than computationally intensive approaches, with a detailed breakdown of time spent in each stage of the process. Results on POPE demonstrate consistent high scores across different model configurations, while performance on popular and adversarial benchmarks further reinforces the generalizability of the approach. This supplemental material comprehensively covers multiple benchmarks and evaluation metrics, combining numerical results with visual examples and providing insights into computational cost and trade-offs, ultimately supporting the claim that CoFi-Dec is an effective method for improving large vision-language models.

Coarse-to-Fine Decoding Reduces Hallucinations in Vision-Language Models

The study introduces CoFi-Dec, a training-free decoding framework that reduces hallucinations in large vision-language models by integrating generative self-feedback with coarse-to-fine visual conditioning. Inspired by the human visual process, the method first generates textual responses conditioned on coarse- and fine-grained views of an image, creating multi-level visual hypotheses that enrich grounding cues. The core innovation is a Wasserstein-based fusion mechanism that aligns the predictive distributions from multiple visual conditions into a geometrically consistent decoding trajectory, reconciling semantic consistency with fine-grained visual grounding. This process decomposes the input image into coarse and fine representations, allowing the model to weigh both the overall scene and specific details, mitigating errors that stem from misleading global interpretations.
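
The paper's exact fusion rule is not reproduced here, but the idea of merging per-view next-token distributions with a Wasserstein barycenter can be sketched with standard tools. The following numpy sketch implements an entropic-regularized Wasserstein barycenter via Sinkhorn-style iterative Bregman projections; the function name `wasserstein_fuse`, the toy 5-token vocabulary, and the squared-index cost matrix are all illustrative assumptions, not the authors' implementation (a real system would use, e.g., distances between token embeddings as the cost).

```python
import numpy as np

def wasserstein_fuse(dists, cost, weights=None, eps=1.0, n_iter=200):
    """Fuse several next-token distributions into one via an entropic
    Wasserstein barycenter (Sinkhorn-style iterative Bregman projections).
    Illustrative sketch only -- not the authors' implementation."""
    dists = [np.asarray(d, dtype=float) for d in dists]
    k = len(dists)
    if weights is None:
        weights = np.full(k, 1.0 / k)
    gibbs = np.exp(-np.asarray(cost, dtype=float) / eps)  # Gibbs kernel
    v = [np.ones_like(dists[0]) for _ in range(k)]
    bary = np.full_like(dists[0], 1.0 / dists[0].size)
    for _ in range(n_iter):
        u = [dists[i] / (gibbs @ v[i]) for i in range(k)]
        # barycenter = weighted geometric mean of the marginal estimates
        bary = np.exp(sum(w * np.log(gibbs.T @ u[i])
                          for i, w in enumerate(weights)))
        v = [bary / (gibbs.T @ u[i]) for i in range(k)]
    return bary / bary.sum()

# Toy example: 5-token vocabulary; squared index distance stands in for a
# real semantic cost between tokens (e.g. embedding distances).
coarse = np.array([0.70, 0.10, 0.10, 0.05, 0.05])  # view favoring token 0
fine   = np.array([0.05, 0.05, 0.10, 0.10, 0.70])  # view favoring token 4
cost = (np.arange(5)[:, None] - np.arange(5)[None, :]) ** 2.0
fused = wasserstein_fuse([coarse, fine], cost)
```

Unlike a plain average, which would leave two separate modes at tokens 0 and 4, the barycenter transports mass along the cost geometry, so the fused distribution concentrates between the two views' modes.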

Experiments on six hallucination-focused benchmarks demonstrate substantial reductions in entity-level and semantic-level hallucinations compared to existing decoding strategies. Notably, the team engineered a model-agnostic framework requiring no additional training and seamlessly applicable to a wide range of large vision-language models. Operating entirely at decoding time, it leverages existing model capabilities without parameter modification or annotated data, delivering more robust and faithful outputs and expanding the utility of these models.

CoFi-Dec Reduces Hallucinations in Vision-Language Models

Researchers have developed CoFi-Dec, a new framework to address inaccurate or hallucinated content generated by large vision-language models. Drawing inspiration from human visual perception, it integrates broad contextual understanding and detailed visual inspection to improve model reliability. CoFi-Dec generates multiple textual responses based on different levels of image detail and refines the model’s output through internal verification and consistency checks. The team demonstrates that CoFi-Dec significantly reduces factual errors and inconsistencies between generated text and visual input across challenging benchmarks.
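
At a high level, this multi-view decoding loop can be sketched as follows. Everything here is a hypothetical stand-in: `toy_view_model` plays the role of querying a vision-language model conditioned on one visual view, and the simple weighted average plays the role of the paper's Wasserstein-based fusion; names, vocabulary, and stopping heuristic are invented for illustration.

```python
import numpy as np

VOCAB = ["a", "cat", "on", "mat", "<eos>"]
EOS = VOCAB.index("<eos>")

def toy_view_model(view, prefix):
    """Stand-in for querying an LVLM conditioned on one visual view:
    returns a next-token distribution over VOCAB (purely illustrative)."""
    seed = 31 * len(prefix) + sum(ord(c) for c in view)
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=len(VOCAB))
    logits[EOS] += len(prefix)  # bias toward stopping as the text grows
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def cofi_decode(views, max_len=10, weights=None):
    """Coarse-to-fine decoding sketch: query the model once per visual
    view at each step, fuse the per-view next-token distributions (a
    weighted average here, standing in for the paper's Wasserstein
    fusion), and decode greedily."""
    if weights is None:
        weights = [1.0 / len(views)] * len(views)
    prefix = []
    for _ in range(max_len):
        dists = [toy_view_model(v, prefix) for v in views]
        fused = sum(w * d for w, d in zip(weights, dists))
        token = int(np.argmax(fused))
        prefix.append(token)
        if token == EOS:
            break
    return [VOCAB[t] for t in prefix]

caption = cofi_decode(["coarse", "fine"])
```

Because fusion happens at every decoding step, a detail supported only by the global view (or only by a local crop) must survive agreement across views before it is emitted, which is the intuition behind the hallucination reduction.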

This framework is adaptable and can be applied to various existing large vision-language models without additional training, offering a broadly applicable solution. While the current implementation utilizes a specific image generation model for efficiency, results indicate stable performance across different options. Future work will explore methods for adaptively selecting the most relevant levels of detail and improving the quality of the internal feedback mechanism, further enhancing self-correction capabilities. This research represents a step towards more trustworthy and accurate multimodal generation, paving the way for wider real-world applications of these powerful models.

👉 More information
🗞 CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models
🧠 ArXiv: https://arxiv.org/abs/2512.23453

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
