Reward-Guided Decoding Improves Precision and Recall in Multimodal Large Language Models

As multimodal large language models become increasingly prevalent, adapting them to specific user requirements presents a significant challenge, and researchers are now exploring ways to exert greater control over their outputs. Oscar Mañas, Pierluca D’Oro, and Koustuv Sinha, along with colleagues at Mila and Meta FAIR, present the first method for reward-guided decoding of these complex models, demonstrating a powerful new approach to improving visual grounding. The team’s technique builds rewards that independently control the precision and recall of objects in an image, allowing users to dynamically adjust the balance between accurate detail and comprehensive coverage during image captioning. This on-the-fly controllability extends to managing computational cost, offering a trade-off between processing power and the quality of visual grounding, and the results demonstrate substantial improvements over existing methods designed to reduce inaccuracies in multimodal outputs.

Mitigating Hallucinations in Image Captioning Models

This document provides supplementary material for research on controlling Multimodal Large Language Models (MLLMs) via Reward-guided Decoding (MRGD). It details experimental setups, results, and comparisons with existing methods for mitigating visual hallucinations in image captioning, explaining how the team evaluated their approach using key metrics: the number of hallucinated objects, object recall, and caption length. Results show that MRGD reduces hallucinations more effectively than simpler prompting techniques, particularly for certain model architectures. The team also investigated using a pre-trained vision-language model as a reward signal, finding that it performs well but is surpassed by a model specifically fine-tuned on preference data. Adjusting the weight of the hallucination reward allows a flexible trade-off between precision and recall, letting users tailor the model's output to their needs.
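The metrics above are object-level: a caption is rewarded for mentioning objects that are actually in the image and penalized for hallucinating ones that are not. The paper's reward models are learned, but the underlying quantities can be illustrated with a minimal sketch; the `object_rewards` helper below and its set-based inputs are assumptions for illustration, with object sets presumed to come from some upstream extractor or detector.

```python
def object_rewards(caption_objects, image_objects):
    """Object-level precision and recall for a candidate caption.

    caption_objects: set of object names mentioned in the caption
    image_objects: set of objects actually present in the image
    (both assumed to come from an upstream extractor/detector).
    """
    if not caption_objects:
        # An empty caption hallucinates nothing but also recalls nothing.
        return 0.0, 0.0
    matched = caption_objects & image_objects
    # Precision drops when the caption names objects absent from the image.
    precision = len(matched) / len(caption_objects)
    # Recall drops when the caption omits objects present in the image.
    recall = len(matched) / len(image_objects) if image_objects else 0.0
    return precision, recall
```

A caption mentioning only objects that exist scores perfect precision even if it misses some; a caption naming everything plus invented objects scores perfect recall but lower precision, which is exactly the tension the two reward models capture.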

Reward-Guided Decoding for Visual Grounding Accuracy

Researchers have developed a novel method for controlling the behaviour of multimodal large language models (MLLMs), focusing on improving how accurately these models ground their responses in visual information. This approach, termed reward-guided decoding, actively shapes the model’s output during inference, allowing for greater user control. The core innovation lies in building dedicated reward models that assess different aspects of visual grounding, specifically object precision and recall, and then using these rewards to guide the model’s search for the best possible output. This method addresses a key challenge in MLLMs: balancing the need for accurate detail with comprehensive coverage of visual elements. By creating separate reward functions, the system can independently evaluate how precisely the model identifies objects and how thoroughly it captures all relevant objects within an image, allowing users to dynamically adjust the trade-off between these two factors.
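One common way to use rewards to "guide the model's search" at inference time is to rescore candidate continuations proposed by the base model, biasing selection toward high-reward text. The sketch below is a generic reranking step under that assumption, not the paper's exact procedure; `guided_step`, its `(text, logprob)` candidate format, and the `temperature` knob are all hypothetical names introduced here.

```python
def guided_step(candidates, reward_fn, temperature=1.0):
    """Rerank candidate continuations by base-model likelihood plus reward.

    candidates: list of (text, logprob) pairs proposed by the base MLLM.
    reward_fn: maps a candidate text to a scalar grounding reward.
    temperature scales how strongly the reward overrides the base model.
    Returns (text, score) pairs sorted best-first.
    """
    scored = [(text, lp + reward_fn(text) / temperature)
              for text, lp in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Lowering `temperature` makes the reward dominate the base model's preferences, which is one way a compute/quality knob like the one described above can arise: more candidates and stronger guidance cost more forward passes but yield better-grounded text.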

Visual Grounding Improves MLLM Output Quality

Researchers have developed a new method for controlling the outputs of multimodal large language models (MLLMs), addressing the common problem of hallucinations, in which the model generates details not present in the original image, while giving users control over the precision and recall of image descriptions. The team's approach guides the decoding process, influencing how the MLLM constructs its responses, rather than retraining the entire model or relying on prompting techniques. The innovation lies in building reward models that assess the quality of generated text based on visual grounding, that is, how well the text aligns with the image content. Two separate reward models are employed: one to encourage precision, minimizing hallucinated objects, and another to maximize recall, ensuring comprehensive descriptions. By combining these rewards, users can dynamically adjust the balance between accuracy and completeness in the generated captions, tailoring the output to their specific needs.
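"Combining these rewards" with a user-controlled balance can be expressed as a simple convex combination; the `combined_reward` function and `alpha` weight below are illustrative names, a sketch of the weighting idea rather than the paper's exact formulation.

```python
def combined_reward(r_precision, r_recall, alpha=0.5):
    """Scalarize the two grounding rewards with a user-chosen weight.

    alpha in [0, 1]: values near 1 prioritize precision (fewer hallucinated
    objects); values near 0 prioritize recall (more comprehensive captions).
    """
    assert 0.0 <= alpha <= 1.0
    return alpha * r_precision + (1.0 - alpha) * r_recall
```

Because `alpha` is applied at decoding time, the same trained reward models serve every point on the precision-recall curve; no retraining is needed to move along it.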

Controllable Captioning Balances Precision and Recall

This research introduces a new method for controlling the outputs of multimodal large language models, specifically focusing on improving visual grounding in generated captions. The team developed a reward-guided decoding technique that allows users to adjust the balance between identifying all relevant objects (recall) and ensuring the accuracy of those identifications (precision). This control is achieved through a system that evaluates candidate captions against two independent reward models, one for precision and one for recall, and iteratively refines the output. The method demonstrates significant control over the model's outputs at inference time, consistently outperforming existing techniques designed to mitigate object hallucinations. By adjusting the weighting of the reward models, users can fine-tune the model's behavior to prioritize either comprehensive object identification or accurate descriptions.
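The "evaluate candidates, then iteratively refine" loop can be sketched as greedy segment-level decoding: at each step, the base model proposes several continuations, and the one whose extended caption scores highest under the combined reward is kept. Everything here (`reward_guided_decode`, the `propose` callback, `max_segments`) is a hypothetical illustration of that loop, not the paper's algorithm.

```python
def reward_guided_decode(propose, reward_fn, max_segments=3):
    """Greedy segment-level reward-guided decoding (illustrative sketch).

    propose(prefix) -> list of candidate continuation segments from the
                       base model (empty list means generation is done).
    reward_fn(text) -> scalar score for a partial caption.
    At each step, keep the continuation whose extended prefix scores highest.
    """
    prefix = ""
    for _ in range(max_segments):
        segments = propose(prefix)
        if not segments:
            break
        # Score each extended prefix and commit to the best one.
        prefix = max((prefix + seg for seg in segments), key=reward_fn)
    return prefix
```

The number of candidates per step and the segment length are the natural compute knobs mentioned in the opening summary: more candidates mean more reward-model evaluations per emitted segment, trading inference cost for grounding quality.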

👉 More information
🗞 Controlling Multimodal LLMs via Reward-guided Decoding
🧠 ArXiv: https://arxiv.org/abs/2508.11616
