Omni-captioner Pipeline and Models Advance Detailed Perception, Addressing Hallucination in Omni Language Models

Advancing human-computer interaction requires artificial intelligence that perceives and understands the world with increasing subtlety, and researchers are now focusing on ‘omni detailed perception’, the ability to process both audio and visual information with exceptional granularity. Ziyang Ma, Ruiyang Xu, and Zhenghao Xing, alongside colleagues Yunfei Chu, Yuxuan Wang, and Jinzheng He, present a comprehensive investigation into this challenging area, addressing limitations in current audio-visual models that struggle to capture fine-grained details without generating inaccurate information. The team developed Omni-Detective, a novel data generation pipeline that autonomously creates highly detailed and reliable multimodal data, and used this to train advanced captioning models, achieving state-of-the-art performance on existing benchmarks and surpassing leading commercial models like Gemini 2.5 Flash. Crucially, recognising the lack of suitable evaluation tools, they also designed Omni-Cloze, a new benchmark that provides a stable and efficient method for assessing detailed audio, visual, and audio-visual captioning, representing a significant step forward in the pursuit of truly perceptive artificial intelligence.

With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and accurately describe fine-grained details remains limited. In this work, researchers present a systematic and comprehensive investigation of omni detailed perception, examining the data used to train these models, the models themselves, and the benchmarks used to evaluate them. They identified a key challenge: as models generate longer, more detailed captions, they also tend to introduce more inaccuracies, a phenomenon termed “co-growth” between detail and hallucination.

Cloze Test Analysis of Video Content

Researchers evaluated a model’s ability to understand video content using a cloze test, a fill-in-the-blank exercise designed to assess contextual understanding. The analysis focused on a series of video descriptions, including tours of restaurants and scenes from electronic music videos. Each description was followed by cloze test questions, and the model’s answers were carefully analysed. The results demonstrate a high level of accuracy, with the model correctly answering the vast majority of questions. Errors tended to fall into a few categories, such as a lack of specificity or difficulty with subjective judgements like musical timbre.
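To make the cloze-style setup concrete, the snippet below sketches how a single fill-in-the-blank item could be scored. The caption blanks, candidate options, and predictions are invented for illustration; the exact format of the questions used in this analysis may differ.

```python
# Illustrative cloze item: two blanks from a (made-up) video description,
# each paired with candidate options and a gold answer.
blanks = [
    {"options": ["seafood", "dessert", "noodle"], "gold": "seafood"},
    {"options": ["jazz", "electronic", "classical"], "gold": "electronic"},
]

# Model predictions for each blank (hard-coded here; in practice they would
# be parsed from the model's output for the corresponding description).
predictions = ["seafood", "jazz"]

def cloze_accuracy(blanks, predictions):
    """Fraction of blanks where the predicted option matches the gold answer."""
    correct = sum(pred == blank["gold"] for blank, pred in zip(blanks, predictions))
    return correct / len(blanks)

print(f"Cloze accuracy: {cloze_accuracy(blanks, predictions):.2f}")  # prints 0.50
```

Averaging such per-blank scores across many clips is one way a cloze-style evaluation can produce the kind of stable accuracy figures discussed here.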

The model occasionally missed subtle cues in the video description, leading to incorrect predictions, but consistently excelled at identifying concrete details like colours, objects, and basic actions, while struggling more with abstract or artistic concepts. The analysis revealed strong performance across different video types, including restaurant tours, electronic music videos, and scenes inspired by Dragon Ball Z. Overall, the model demonstrates a strong ability to understand context and predict missing information in video descriptions. This analysis provides valuable insights into the model’s strengths and weaknesses, which can be used to guide further development and improvement.

Detail Gain Without Hallucination Growth

Recent advances in Omni Language Models have enabled machines to produce increasingly rich descriptions of audio-visual scenes, yet capturing fine-grained details without introducing inaccuracies remains a challenge. Researchers identified a “co-growth” phenomenon in which longer captions, while containing more detail, also exhibit a rise in hallucinated content. To address this, the team designed Omni-Detective, an agentic data pipeline in which an LLM agent iteratively gathers evidence through tool calls to modality-specific observers, explicitly aiming to decouple detail gain from hallucination growth. This pipeline yields detailed caption datasets with minimal noise, which were then used to train Audio-Captioner and Omni-Captioner with a two-stage curriculum, described below.
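The snippet below is a minimal sketch of what such an agentic, evidence-gathering loop might look like. The observer functions, seed questions, and composition step are hypothetical stand-ins rather than the actual prompts and tools used in Omni-Detective.

```python
from dataclasses import dataclass

# Hypothetical observer interfaces; in a real pipeline these would wrap
# modality-specific models (e.g. an audio captioner and a video captioner).
def audio_observer(clip_id: str, question: str) -> str:
    """Stub: answer a question about the audio track of a clip."""
    return f"[audio evidence for '{question}' in {clip_id}]"

def visual_observer(clip_id: str, question: str) -> str:
    """Stub: answer a question about the visual track of a clip."""
    return f"[visual evidence for '{question}' in {clip_id}]"

@dataclass
class Evidence:
    question: str
    answer: str
    modality: str

def detective_loop(clip_id: str, max_rounds: int = 4) -> str:
    """Toy version of an agentic caption-generation loop: targeted questions
    are answered via tool calls, and caption text is composed only from the
    collected answers."""
    # Seed questions; a real agent would let an LLM propose these each round
    # based on the evidence gathered so far.
    questions = [
        "What sounds are present?",
        "Who or what is on screen?",
        "What action is happening?",
        "What is the setting?",
    ]
    evidence: list[Evidence] = []
    for question in questions[:max_rounds]:
        modality = "audio" if "sound" in question.lower() else "visual"
        observer = audio_observer if modality == "audio" else visual_observer
        evidence.append(Evidence(question, observer(clip_id, question), modality))
    # Because every sentence traces back to an observer answer, extra rounds
    # add detail without adding unsupported (hallucinated) claims.
    return " ".join(item.answer for item in evidence)

print(detective_loop("demo_clip_001"))
```

The design choice mirrored here is that the caption is assembled only from evidence returned by the observers, which is the intuition behind decoupling detail gain from hallucination growth.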

In the first stage of this curriculum, the visual encoder was frozen to ensure precise alignment with sparse audio cues; in the second, both modalities were jointly optimized to produce coherent, cross-modal narratives. Experiments demonstrate the effectiveness of this approach, with Omni-Captioner achieving new state-of-the-art performance on the VDC benchmark and the best trade-off between detail coverage and hallucination on the video-SALMONN 2 test set. Furthermore, Audio-Captioner attained the best results on both the MMAU and MMAR benchmarks, surpassing all open-source models and achieving performance comparable to Gemini 2.5 Flash, while Omni-Captioner achieved the highest overall score on Video-MME and Video-Holmes. These results demonstrate a significant advancement in omni-modal detailed perception, delivering models capable of generating richly detailed and factually accurate descriptions of complex audio-visual scenes.
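As a rough illustration of that two-stage curriculum, the sketch below freezes a placeholder visual encoder during the first stage and unfreezes it for joint optimization in the second. The module sizes, learning rates, and PyTorch framing are assumptions made for illustration, not the paper's actual training recipe.

```python
import torch
import torch.nn as nn

# Tiny placeholder modules standing in for the real audio/visual encoders
# and caption decoder; the actual architecture is not reproduced here.
class TinyEncoder(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(x)

class TinyCaptioner(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.audio_encoder = TinyEncoder(dim)
        self.visual_encoder = TinyEncoder(dim)
        self.decoder = nn.Linear(2 * dim, dim)

    def forward(self, audio, video):
        fused = torch.cat([self.audio_encoder(audio), self.visual_encoder(video)], dim=-1)
        return self.decoder(fused)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for param in module.parameters():
        param.requires_grad = trainable

model = TinyCaptioner()

# Stage 1: freeze the visual encoder so the model first aligns with the
# sparser audio cues.
set_trainable(model.visual_encoder, False)
stage1_optim = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

# Stage 2: unfreeze everything and jointly optimize both modalities to
# produce coherent, cross-modal captions.
set_trainable(model.visual_encoder, True)
stage2_optim = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Dummy forward pass to confirm the pieces fit together.
out = model(torch.randn(1, 32), torch.randn(1, 32))
print(out.shape)  # torch.Size([1, 32])
```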

Detailed Multimodal Perception with Omni-Detective

This work presents a comprehensive framework for advancing detailed perception in systems processing multiple modalities, such as audio and video. Researchers addressed limitations in current models’ ability to capture fine-grained details by introducing Omni-Detective, a data generation pipeline that autonomously creates highly detailed and accurate captions. Leveraging the data produced by Omni-Detective, the team trained two captioning models, Audio-Captioner and Omni-Captioner, achieving state-of-the-art results on established benchmarks and demonstrating a superior balance between detailed description and avoidance of hallucination. Furthermore, recognising the need for robust evaluation, the researchers designed Omni-Cloze, a novel benchmark specifically for assessing detailed multimodal perception.

This cloze-style evaluation, covering audio, visual, and combined inputs, provides a stable and reliable method for measuring a model’s ability to understand and describe complex scenes. The team highlights that Omni-Cloze’s design, which focuses on information extraction, contributes to its consistent and trustworthy assessment of model performance. While acknowledging the progress made, the authors suggest that future work should focus on developing even more reliable and fine-grained multimodal perception systems and extending evaluation protocols to transparently reflect model capabilities.

👉 More information
🗞 Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception
🧠 ArXiv: https://arxiv.org/abs/2510.12720

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
