Multimodal large language models represent a significant advance in artificial intelligence, yet they frequently generate fabricated details, known as hallucinations, that contradict the visual information they process. Shangpin Peng from Harbin Institute of Technology, Shenzhen, Senqiao Yang and Li Jiang from The Chinese University of Hong Kong, and their colleagues now demonstrate a method to dramatically reduce these inaccuracies. Their research focuses on the crucial finding that hallucinations typically emerge early in the text generation process and then persist throughout the output. To address this, the team developed SENTINEL, a framework that proactively intervenes at the sentence level, identifying and correcting potential hallucinations without relying on extensive human annotation and achieving an over 90% reduction in fabricated content relative to the original model. This advancement not only improves the reliability of multimodal models but also enhances their overall performance on a range of tasks, paving the way for more trustworthy and capable AI systems.
Mitigating Hallucinations in Multimodal Large Language Models
Recent advancements in multimodal large language models (MLLMs) have significantly improved the ability of artificial intelligence systems to understand and connect visual and textual information. However, a persistent challenge remains: the tendency of these models to “hallucinate”, that is, to generate fabricated details or inconsistencies that do not align with the provided image content. This issue undermines user trust and poses risks in real-world applications, hindering the development of truly reliable AI systems. Researchers recognized that hallucinations tend to escalate as the model generates longer responses, but crucially, that intervening early, at the sentence level, can significantly reduce their spread.
To address this, the team developed SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a novel framework designed to identify and correct hallucinations at their point of origin. Unlike existing methods, SENTINEL operates entirely within the model’s own data distribution, avoiding the need for external resources or manual intervention. The system works by iteratively sampling model outputs, validating object existence with open-vocabulary object detectors, and classifying sentences as either hallucinated or factual. This process creates a dataset of “preference pairs”, examples of correct and incorrect sentences, which are then used to train the model to prioritize accurate responses. Experimental results demonstrate that SENTINEL significantly reduces object hallucinations, by over 90% on certain benchmarks, while simultaneously preserving the model’s ability to perform other tasks. By focusing on early intervention and utilizing in-domain data, SENTINEL offers a cost-effective and efficient solution for improving the reliability and trustworthiness of multimodal large language models, paving the way for more robust and dependable AI systems.
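To make the pipeline concrete, the sketch below shows one way such sentence-level preference pairs could be bootstrapped. It is a minimal illustration, not the paper's implementation: generate_sentences, mentioned_objects, and detect_objects are hypothetical placeholder callables standing in for the MLLM sampler, an object-noun extractor, and the detector ensemble, and the simple set-containment rule for labeling sentences is an assumption.

```python
# Minimal sketch of SENTINEL-style preference-pair bootstrapping (assumed logic).
# The three callables are hypothetical placeholders, not APIs from the paper.
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class PreferencePair:
    context: str      # previously accepted (factual) sentences
    preferred: str    # candidate sentence whose objects are all grounded in the image
    rejected: str     # candidate sentence mentioning at least one unsupported object


def bootstrap_pairs(
    image: object,
    generate_sentences: Callable[[object, str], List[str]],  # samples candidate next sentences
    mentioned_objects: Callable[[str], Set[str]],             # extracts object nouns from a sentence
    detect_objects: Callable[[object], Set[str]],             # objects the detectors find in the image
    max_sentences: int = 6,
) -> List[PreferencePair]:
    """Iteratively sample next sentences and split them into factual vs. hallucinated pairs."""
    grounded = detect_objects(image)          # objects actually present in the image
    context, pairs = "", []
    for _ in range(max_sentences):
        candidates = generate_sentences(image, context)
        factual = [s for s in candidates if mentioned_objects(s) <= grounded]
        hallucinated = [s for s in candidates if mentioned_objects(s) - grounded]
        if factual and hallucinated:          # keep only contrastive (positive, negative) pairs
            pairs.append(PreferencePair(context, factual[0], hallucinated[0]))
        if not factual:                       # no verified continuation: stop extending this caption
            break
        context += factual[0] + " "           # grow the context with a verified sentence
    return pairs
```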
Early Intervention Reduces Multimodal Hallucinations
Researchers developed SENTINEL, a novel methodology designed to reduce fabricated content, known as hallucinations, in multimodal large language models (MLLMs). The core insight driving this method is that hallucinations tend to intensify as text length increases, and addressing them at the sentence level significantly reduces their propagation throughout the generated output. To implement this early intervention strategy, the team bootstrapped a high-quality dataset without human annotation, a significant departure from many existing techniques. This involved iteratively sampling model outputs and then validating the existence of objects mentioned in the text using two independent open-vocabulary detectors, effectively confirming whether the model’s statements align with the visual input.
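As a rough sketch of how such cross-validation could work, the function below checks a sentence's object mentions against the outputs of two detectors. The agreement rule, counting an object as confirmed only when both detectors find it and as hallucinated only when both miss it, is an illustrative assumption rather than the paper's exact criterion.

```python
# Illustrative cross-validation of object mentions with two open-vocabulary detectors.
# The agreement rule below is an assumption for illustration only.
from typing import Set, Tuple


def cross_validate(
    mentioned: Set[str],    # object nouns extracted from a generated sentence
    detector_a: Set[str],   # labels returned by the first open-vocabulary detector
    detector_b: Set[str],   # labels returned by the second open-vocabulary detector
) -> Tuple[Set[str], Set[str]]:
    """Split mentioned objects into confirmed and hallucinated sets."""
    confirmed = {o for o in mentioned if o in detector_a and o in detector_b}
    hallucinated = {o for o in mentioned if o not in detector_a and o not in detector_b}
    # Objects found by only one detector are ambiguous; they are left out of both sets
    # so that they never contribute a potentially noisy training label.
    return confirmed, hallucinated
```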
Sentences were then classified as either hallucinated or non-hallucinated, creating pairs of positive and negative examples for training. This builds context-aware preference data, allowing the model to learn which outputs are consistent with the provided image. The team then trained the MLLM using a context-aware preference loss, which emphasizes learning at the sentence level, precisely where hallucinations initially manifest. This discriminative learning approach encourages the model to distinguish between factual and fabricated content early in the generation process, preventing the escalation of hallucinations in subsequent sentences. The resulting system demonstrably reduces hallucinations by over 90% compared to the original model and outperforms existing state-of-the-art techniques on relevant benchmarks.
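The paper describes this objective as a context-aware preference loss applied at the sentence level; the PyTorch snippet below sketches a generic DPO-style version of that idea, raising the likelihood of the factual sentence and lowering that of the hallucinated one given the shared context. It is a hedged approximation under standard preference-optimization assumptions, not the exact SENTINEL loss.

```python
# Generic DPO-style sentence-level preference loss (an approximation, not SENTINEL's exact objective).
import torch
import torch.nn.functional as F


def sentence_preference_loss(
    logp_pos: torch.Tensor,      # log prob of the factual sentence under the policy, given the context
    logp_neg: torch.Tensor,      # log prob of the hallucinated sentence under the policy
    ref_logp_pos: torch.Tensor,  # same quantities under a frozen reference model
    ref_logp_neg: torch.Tensor,
    beta: float = 0.1,           # strength of the preference margin
) -> torch.Tensor:
    """Prefer the factual continuation of the shared context over the hallucinated one."""
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    return -F.logsigmoid(beta * margin).mean()


# Toy usage: a batch of two preference pairs with precomputed sentence log-probabilities.
loss = sentence_preference_loss(
    torch.tensor([-12.3, -9.8]), torch.tensor([-10.1, -8.0]),
    torch.tensor([-12.0, -9.5]), torch.tensor([-10.0, -7.9]),
)
```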
Early Hallucinations Drive Multimodal Output Errors
Recent advances in multimodal large language models have unlocked significant progress in understanding both images and text, yet these systems still struggle with a critical problem: hallucinations, where they generate content that contradicts the visual information provided. Researchers have now identified that hallucinations tend to emerge early in the text generation process and then propagate throughout the remaining output, suggesting a targeted intervention at the beginning of generation could be highly effective. To address this, a new framework called SENTINEL has been developed, focusing on sentence-level early intervention through in-domain preference learning. Unlike existing approaches, SENTINEL avoids reliance on large external language models or human annotators, preserving the model’s original distribution and expression patterns while curbing hallucination propagation.
The system bootstraps high-quality data by repeatedly sampling outputs, verifying object existence within the image, and classifying sentences as either hallucinated or factual. SENTINEL then employs a context-aware preference learning technique, maximizing the likelihood of generating contextually appropriate positive samples while minimizing the generation of hallucinated negative samples. Experimental results demonstrate a substantial reduction in hallucinations, with over 90% fewer instances observed on benchmark datasets like Object HalBench and a 65% reduction on AMBER. Importantly, this improvement in accuracy does not come at the cost of general capabilities, as the model maintains or even improves performance on tasks like VQAv2, TextVQA, ScienceQA, and MM-Vet.
SENTINEL Sharply Reduces Hallucinations in Multimodal Models
The research presents SENTINEL, a new framework designed to reduce hallucinations, the generation of fabricated content, in multimodal large language models (MLLMs). The key innovation lies in identifying that hallucinations originate early in the text generation process and then propagate through subsequent outputs. SENTINEL addresses this by iteratively building a dataset of high-quality examples, validating object existence using multiple open-vocabulary detectors, and classifying sentences as either hallucinated or non-hallucinated. This data is then used to train the model to discriminate between accurate and fabricated content at the sentence level. Experimental results demonstrate that SENTINEL reduces hallucinations by over 90% compared to the original model, outperforms existing methods, and improves performance on both hallucination and general capability benchmarks. The framework’s effectiveness stems from its reliance on in-domain preference data and the use of cross-validation with multiple object detectors, ensuring a higher degree of accuracy in identifying and mitigating hallucinations.
👉 More information
🗞 Mitigating Object Hallucinations via Sentence-Level Early Intervention
🧠 DOI: https://doi.org/10.48550/arXiv.2507.12455
