Haven Achieves 84.1% Long Video Understanding with Audiovisual Entity Cohesion

Researchers are tackling the formidable challenge of enabling vision-language models to comprehend extremely long videos. Xinlei Yin, Xiulian Peng, and Xiao Li, alongside Zhiwei Xiong and Yan Lu from the University of Science and Technology of China, introduce a novel framework, HAVEN, designed to overcome the limitations of current approaches, which often fragment information and lose crucial global context. Their work integrates audiovisual entity cohesion with hierarchical video indexing and an agentic search mechanism, allowing for dynamic reasoning across multiple layers of video content. This method achieves a new state-of-the-art accuracy of 84.1% on the LVBench benchmark, and a remarkable 80.1% on challenging reasoning tasks, demonstrating a significant leap forward in comprehensive and context-consistent long-form video understanding.

Scientists are addressing information fragmentation and loss of global coherence in long-video understanding. They present HAVEN, a unified framework that enables coherent and comprehensive reasoning by integrating audiovisual entity cohesion and hierarchical video indexing with agentic search. First, the framework preserves semantic consistency by integrating entity-level representations across visual and auditory streams, while organising content into a structured hierarchy spanning global summary, scene, segment, and entity levels. It then employs an agentic search mechanism to enable dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking.
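
To make the hierarchy concrete, here is a minimal Python sketch of what such a four-level index could look like. The class and field names (VideoIndex, Scene, Segment, Entity) are illustrative assumptions, not the authors' actual schema, which the article does not spell out at this level of detail.

# Minimal sketch of a four-level video index: global summary -> scenes
# -> segments -> entities. All names and fields here are assumptions
# made for illustration, not HAVEN's actual data model.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str                                               # a tracked person or object
    descriptions: list[str] = field(default_factory=list)   # fused visual + audio mentions

@dataclass
class Segment:
    start_s: float                                          # segment boundaries in seconds
    end_s: float
    caption: str                                            # local caption for this span
    entities: list[Entity] = field(default_factory=list)

@dataclass
class Scene:
    summary: str                                            # scene-level summary text
    segments: list[Segment] = field(default_factory=list)

@dataclass
class VideoIndex:
    global_summary: str                                     # whole-video summary at the top
    scenes: list[Scene] = field(default_factory=list)

A real index would also carry embeddings and timestamps at every level; this skeleton only shows the containment structure.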

Long-video understanding requires effective structured multimodal reasoning

Scientists demonstrate that their method achieves strong temporal coherence, entity consistency, and retrieval efficiency, establishing a new state of the art with an overall accuracy of 84.1% on LVBench. Notably, it reaches 80.1% in the challenging reasoning category. These results highlight the effectiveness of structured, multimodal reasoning. The approach is based on two core innovations: audiovisual entity cohesion and hierarchical video indexing with agentic search, which organise content into structured representations rather than raw units (e.g., frames, clip captions). The research addresses the challenges posed by extremely long context windows in vision-language models, which often suffer from information fragmentation and loss of global coherence.

Experiments show that HAVEN's combination of audiovisual entity cohesion, hierarchical video indexing, and an agentic search mechanism enables coherent and comprehensive reasoning. The approach preserves semantic consistency by integrating entity-level representations across both visual and auditory streams, organising content into a structured hierarchy spanning global summary, scene, segment, and entity levels. The team measured temporal coherence, entity consistency, and retrieval efficiency, finding significant improvements over existing methods. The agentic search mechanism facilitates dynamic retrieval and reasoning across these hierarchical layers, enabling coherent narrative reconstruction and fine-grained entity tracking.
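
As an illustration of how entity-level representations could be fused across streams, the toy sketch below merges visual and audio entity embeddings whenever their cosine similarity exceeds a threshold, so one identity persists across modalities. The function name, the 0.8 threshold, and the simple averaging rule are assumptions made for this example; the paper's actual cohesion mechanism may differ.

# Toy cross-modal entity cohesion: match audio entities to visual ones by
# cosine similarity and fuse the matches, keeping unmatched audio entities
# (e.g. off-screen speakers) as separate identities. Assumption-only sketch.
import numpy as np

def cohere_entities(visual: dict[str, np.ndarray],
                    audio: dict[str, np.ndarray],
                    threshold: float = 0.8) -> dict[str, np.ndarray]:
    merged = dict(visual)
    for a_name, a_vec in audio.items():
        best_name, best_sim = None, threshold
        for v_name, v_vec in merged.items():
            sim = float(a_vec @ v_vec /
                        (np.linalg.norm(a_vec) * np.linalg.norm(v_vec)))
            if sim > best_sim:
                best_name, best_sim = v_name, sim
        if best_name is not None:
            # Same entity seen in both streams: average the embeddings.
            merged[best_name] = (merged[best_name] + a_vec) / 2.0
        else:
            # Audio-only entity: keep it as its own identity.
            merged[a_name] = a_vec
    return merged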

Notably, the system achieved outstanding performance in the challenging reasoning category, reaching an accuracy of 80.1% and highlighting the effectiveness of structured, multimodal reasoning. Measurements confirm that the hierarchical indexing allows multi-granularity retrieval, letting the agent navigate and access information at different levels of detail. The researchers constructed a hierarchical information database to organise video content for effective information access and scalable reasoning. This database spans four levels (global summary, scene, segment, and entity), providing a comprehensive representation of the video's content.
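
Building on the VideoIndex sketch above, a multi-granularity lookup might look like the following: the same query can be served from the global summary, scene summaries, segment captions, or entity records, depending on the level requested. The embed argument stands in for any text-embedding model returning a vector; this function and its ranking rule are hypothetical, not the paper's retrieval code.

# Hypothetical multi-granularity retrieval over the VideoIndex sketched
# earlier: pick a level, gather its texts, rank them by dot-product
# similarity to the query embedding, and return the top matches.
def retrieve(index: "VideoIndex", query: str, level: str, embed, top_k: int = 3):
    if level == "global":
        return [index.global_summary]          # single top-level summary
    if level == "scene":
        texts = [s.summary for s in index.scenes]
    elif level == "segment":
        texts = [seg.caption for s in index.scenes for seg in s.segments]
    else:                                      # "entity"
        texts = [d for s in index.scenes for seg in s.segments
                 for e in seg.entities for d in e.descriptions]
    q = embed(query)
    ranked = sorted(texts, key=lambda t: -float(q @ embed(t)))
    return ranked[:top_k]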

The work delivers a cross-modal entity integration mechanism, termed audiovisual entity cohesion, that maintains semantic consistency across time and modality, improving entity tracking and narrative coherence throughout the video. Tests show that by dynamically querying and reasoning over the hierarchy, the agentic search mechanism enables a holistic understanding of long videos, bridging fragmented information while preserving semantic consistency. The study also validated the framework on long-video understanding benchmarks, showing superior performance compared to existing baselines. These results indicate that the approach addresses the limitations of previous methods, which often struggle to balance global coherence with local detail and to maintain entity continuity over time. This research paves the way for more advanced and accurate long-video understanding systems with applications in entertainment, education, and surveillance.
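
The agentic loop itself could then be a coarse-to-fine controller over the retrieve function above: consult the global summary first and drill down only while the evidence remains insufficient. The decide callback, standing in for an LLM judgment step, and the fixed level ordering are assumptions; this is a sketch of the idea, not the authors' implementation.

# Illustrative agentic search: accumulate evidence from coarse to fine
# levels and stop as soon as the (hypothetical) decide step is confident.
LEVELS = ["global", "scene", "segment", "entity"]

def agentic_search(index, query, embed, decide):
    evidence, answer = [], None
    for level in LEVELS:
        evidence.extend(retrieve(index, query, level, embed))
        done, answer = decide(query, evidence)   # hypothetical LLM call -> (bool, str)
        if done:
            break                                # confident: stop drilling down
    return answer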

HAVEN excels at long-video reasoning tasks

Scientists have developed HAVEN, a new framework designed to improve long-video understanding for vision-language models. The system addresses the challenges posed by lengthy videos by integrating audiovisual entity cohesion and hierarchical video indexing with an agentic search mechanism, allowing for more coherent and comprehensive reasoning. This approach structures video content across global, scene, segment, and entity levels, preserving semantic consistency over extended durations. Extensive experimentation on the LVBench dataset demonstrates that HAVEN significantly outperforms existing state-of-the-art methods, achieving an overall accuracy of 84.1% and reaching 80.1% in the challenging reasoning category.

These results highlight the effectiveness of structured, multimodal reasoning for long-form video comprehension, suggesting that hierarchical indexing and multimodal entity integration are key to improved performance. The authors acknowledge limitations inherent in relying on pre-trained models and the computational demands of processing long videos. Future research could explore more efficient indexing techniques and investigate the framework’s adaptability to diverse video genres and tasks.

👉 More information
🗞 Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search
🧠 ArXiv: https://arxiv.org/abs/2601.13719

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
