ViSIL Achieves Unified Evaluation of Information Loss in Multimodal Video Captioning

Multimodal video captioning offers a powerful method for condensing lengthy video footage into concise summaries of keyframes and natural language. Po-han Li, Shenghui Chen, and Ufuk Topcu, all from The University of Texas at Austin, alongside Sandeep Chinchali, present a novel approach to evaluating the effectiveness of these summaries. Current evaluation metrics struggle to accurately measure information retained across different formats, such as comparing text with visual keyframes. Their research introduces the Video Summary Information Loss (ViSIL) score, a framework that quantifies lost information using vision-language model inference, providing a unified metric for comparison. Demonstrating a significant correlation with both human judgement and performance on Video Question Answering tasks, ViSIL enables optimised summary selection, achieving improved accuracy without increasing computational demands.

The research addresses a critical gap in current evaluation metrics, such as BLEU and ROUGE, which struggle to quantify information coverage when comparing text with visual data like keyframes. This innovative metric enables direct comparison of diverse summary formats, overcoming limitations imposed by their structural differences.

The study unveils ViSIL as a unified evaluation tool, capable of quantifying information loss by assessing the ability of a vision-language model (VLM) to reconstruct a detailed caption from either the original video or a condensed multimodal summary. Researchers first generate a comprehensive textual proxy for the video, then measure the information loss by evaluating the VLM’s performance in recovering this caption from the summary compared with the original footage. Defined as the conditional pointwise mutual information I(C; V | Ṽ), where C is the detailed caption, V the original video, and Ṽ the summary, the metric effectively captures visual details overlooked by the summary, with lower scores indicating better information retention. The research demonstrates that ViSIL enables the selection of summaries that establish a Pareto-optimal frontier, achieving a 7% improvement in VQA accuracy compared to text-only summaries without increasing computational load. By quantifying information loss, ViSIL provides a means to balance information richness with processing efficiency, addressing a key challenge in applications like security surveillance and rapid video analysis. The work opens new avenues for enhancing human-in-the-loop systems and improving the performance of generative AI models reliant on rich semantic grounding.
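To make the reconstruction test concrete, here is a minimal sketch of how such a score could be computed from a VLM’s caption log-likelihoods. The wrapper function, type names, and call signature below are illustrative assumptions, not the authors’ implementation.

```python
from typing import Any, List

Frame = Any  # stand-in type for a decoded keyframe / image object

def vlm_caption_logprob(caption: str, frames: List[Frame], text: str) -> float:
    """Hypothetical wrapper: return the VLM's summed token log-probability
    log P(caption | frames, text). Plug in whichever autoregressive VLM you use."""
    raise NotImplementedError

def visil_score(caption: str,
                video_frames: List[Frame],
                summary_frames: List[Frame],
                summary_text: str) -> float:
    # log P(C | V): caption scored against the full original video
    logp_given_video = vlm_caption_logprob(caption, video_frames, "")
    # log P(C | summary): caption scored against the condensed multimodal summary
    logp_given_summary = vlm_caption_logprob(caption, summary_frames, summary_text)
    # Conditional pointwise mutual information I(C; V | summary)
    # = log P(C | V) - log P(C | summary); lower values mean less information lost.
    return logp_given_video - logp_given_summary
```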

The team’s contributions extend beyond a novel metric, offering a framework for evaluating the spectrum of multimodal video summaries, ranging from text-only to hybrid formats with varying keyframe densities. This research highlights that summary format, rather than inherent video understanding, primarily dictates processing demands such as response time and token consumption. By leveraging ViSIL for summary selection, scientists can optimise summaries for both accuracy and efficiency, paving the way for more effective video understanding and retrieval systems. Researchers addressed the limitations of traditional metrics like BLEU and ROUGE, which struggle to compare text with visual keyframes, by developing an information-theoretic framework grounded in vision-language model (VLM) inference. The core of ViSIL lies in measuring the information lost when condensing a video into a summary, providing a unified score applicable to various summary formats regardless of structural differences. This approach enables direct comparison of summaries, a significant advancement over embedding-similarity methods hampered by differing encoder structures and vector dimensions.
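To picture this spectrum concretely, a candidate summary can be modelled as a single type whose keyframe list is empty for a text-only summary and grows with keyframe density, so that one scalar ViSIL score ranks all formats alike. The class and function names below are an illustrative sketch, not the paper’s code.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

Frame = Any  # stand-in type for a keyframe image

@dataclass
class MultimodalSummary:
    text: str                                             # natural-language part
    keyframes: List[Frame] = field(default_factory=list)  # empty => text-only summary

    @property
    def keyframe_density(self) -> int:
        return len(self.keyframes)

def select_summary(candidates: List[MultimodalSummary],
                   visil: Callable[[MultimodalSummary], float]) -> MultimodalSummary:
    # One scalar score per candidate, regardless of format: take the lowest loss.
    return min(candidates, key=visil)
```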

To establish ViSIL’s theoretical basis, the team began by defining Mutual Information (MI) and Pointwise Mutual Information (PMI), noting that PMI measures the association between individual events rather than between random variables. The study formulates the problem of multimodal video summarization by defining a video as a combination of image frames and an audio track, and a summary as a subset of keyframes paired with textual descriptions. Researchers then aimed to evaluate summary quality by calculating the PMI between the original video and its summary, positing that a higher score indicates greater information retention. Recognizing that this PMI is computationally intractable to calculate directly, scientists employed autoregressive models to approximate both the conditional probability of generating the original video given the summary and the marginal likelihood of the video itself.
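For reference, the quantities described above can be written out explicitly; the notation here is ours and may differ from the paper’s symbols.

```latex
% Mutual information is an expectation over random variables X, Y:
\[
  I(X;Y) \;=\; \mathbb{E}_{(x,y)\sim p}\!\left[\log \frac{p(x,y)}{p(x)\,p(y)}\right],
\]
% whereas pointwise mutual information scores one particular pair of outcomes:
\[
  \operatorname{pmi}(x;y) \;=\; \log \frac{p(x,y)}{p(x)\,p(y)}
  \;=\; \log p(x \mid y) \;-\; \log p(x).
\]
% Applied to a specific video V and summary S, the summary-quality criterion becomes
\[
  \operatorname{pmi}(V;S) \;=\; \log P(V \mid S) \;-\; \log P(V),
\]
% with both terms approximated by autoregressive model likelihoods,
% since neither can be evaluated exactly.
```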

The work details a mathematical formulation where the quality of a summary is assessed by maximizing the PMI, but acknowledges the need for approximation due to limitations in current video generation models. Further validation involved establishing a Pareto-optimal frontier, optimizing the trade-off between information loss and processing speed, and demonstrating that ViSIL-selected summaries outperformed text summaries in VQA accuracy without increasing computational load. Human-centric evaluation, considered the gold standard, was also integrated, building on prior work aligning model representations with human perception of attention, temporal dynamics, and conceptual structures. The research addresses a critical gap in evaluation metrics, as traditional methods like BLEU and ROUGE are inadequate for comparing information across different modalities, such as text and keyframes. Experiments reveal that ViSIL measures the semantic information not captured by a summary through vision-language model (VLM) inference, providing a unified metric applicable to diverse multimodal summary formats. Defined as the conditional pointwise mutual information I(C; V | Ṽ) = log [P(C | V) / P(C | Ṽ)], the metric effectively captures visual details lost during summarization, with lower scores indicating better information retention. Results demonstrate that ViSIL enables the selection of summaries that optimize the balance between information loss and processing speed. Scientists established a Pareto-optimal frontier, achieving a 7% improvement in VQA accuracy compared to text summaries without increasing processing load.
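As a rough sketch of how such a frontier can be extracted once each candidate summary carries a ViSIL score and a measured processing cost (e.g. response time or token count), a standard dominance sweep suffices. The function and example values below are illustrative, not taken from the paper.

```python
from typing import List, Tuple

def pareto_frontier(candidates: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """candidates: list of (visil_score, processing_cost), lower is better on both.
    Returns the subset not dominated by any other candidate."""
    frontier = []
    best_score = float("inf")
    # Sort by cost, then keep only points that improve on the best score seen so far.
    for score, cost in sorted(candidates, key=lambda c: (c[1], c[0])):
        if score < best_score:
            frontier.append((score, cost))
            best_score = score
    return frontier

# Example: three hypothetical summaries; the second is dominated and drops out.
print(pareto_frontier([(-2.1, 5.0), (-1.0, 5.0), (-2.8, 9.0)]))
# -> [(-2.1, 5.0), (-2.8, 9.0)]
```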

Measurements confirm that the framework can evaluate a spectrum of multimodal summaries, ranging from text-only to hybrid formats with varying keyframe densities. This breakthrough delivers a powerful tool for evaluating and refining video summarization techniques, with potential applications in areas like security surveillance and rapid video analysis. Further tests recorded a ViSIL score of -2.84, correlating with strong video understanding, alongside VQA and correspondence accuracy scores of 71.15 and 99.54. The study highlights the importance of combining visual keyframes with linguistic descriptors to create rich semantic grounding for video understanding and efficient retrieval, paving the way for advancements in text-to-video generation and retrieval-augmented generation.

👉 More information
🗞 ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning
🧠 ArXiv: https://arxiv.org/abs/2601.09851

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
