Researchers are increasingly focused on assessing the capabilities of multimodal large language models (MLLMs), but current benchmarks largely overlook their ability to process dynamic audio-visual information. To address this, Ahmed Y. Radwan from the Vector Institute for Artificial Intelligence, Christos Emmanouilidis from the University of Groningen, Hina Tabassum from York University, and colleagues have introduced SONIC-O1, a new, fully human-verified benchmark comprising nearly 5,000 annotations across 13 real-world conversational scenarios. This work is significant because it provides a robust, real-world evaluation of MLLMs on tasks requiring open-ended summarization, question answering and, crucially, temporal reasoning, revealing substantial performance gaps between closed- and open-source models and highlighting persistent biases across demographic groups.
Prior research has largely focused on static image understanding, leaving a significant gap in the assessment of MLLMs’ ability to process sequential audio-video inputs, which this new benchmark directly addresses. The study unveils a dataset spanning 13 real-world conversational domains, comprising 4,958 annotations alongside detailed demographic metadata, enabling nuanced analysis of model behaviour across diverse groups. SONIC-O1 assesses MLLMs on three key tasks: open-ended summarization, multiple-choice question answering, and temporal localization with supporting rationales, demanding both comprehension and reasoning capabilities from the models.
Experiments conducted on both closed- and open-source models reveal notable limitations in current MLLM performance. While the accuracy gap in multiple-choice question answering between the two model families remains relatively modest, the research establishes a substantial 22.6% difference in temporal localization between the best-performing closed-source and open-source models, highlighting a significant disparity in their ability to pinpoint events within the audio-video stream. Further investigation reveals consistent performance disparities across demographic groups, indicating persistent biases in model behaviour that require attention. This finding underscores the need for socially robust and equitable AI systems, particularly in sensitive applications.
The team achieved a robust evaluation suite for temporally grounded and socially robust multimodal understanding by meticulously annotating a diverse range of real-world interactions, including patient-doctor consultations, job interviews, and emergency response scenarios. This work opens new avenues for assessing MLLMs beyond static image analysis, pushing the boundaries of their capabilities in understanding dynamic, real-world conversations. The researchers released SONIC-O1 publicly, providing access to the dataset, code, project page, and leaderboard to facilitate reproducibility and encourage further research in the field. This open-source approach fosters collaboration and accelerates progress towards building more reliable and equitable MLLMs.
This breakthrough reveals the importance of evaluating MLLMs not only on accuracy but also on fairness and temporal reasoning, particularly as these models are increasingly deployed in high-stakes applications. The detailed demographic metadata included in SONIC-O1 allows for granular analysis of model performance across different groups, enabling researchers to identify and mitigate potential biases. Experiments using LLM-judge scores across 13 conversational domains, comparing models like Gemini-3.0-Pro, Qwen3-Omni, and UniMoE-2, demonstrate the potential for improvement in video summarization and other key tasks, paving the way for more effective and responsible AI systems.
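To make the idea of per-domain LLM-judge scoring concrete, the sketch below averages hypothetical judge ratings per model and domain into a comparison table. The rating scale, the domain names, and the aggregation rule are illustrative assumptions, not the paper's actual evaluation protocol.

```python
from collections import defaultdict
from statistics import mean

def aggregate_judge_scores(rows):
    """Average LLM-judge ratings per (model, domain) pair.

    `rows` is an iterable of (model, domain, score) triples, where `score` is
    assumed to be a 1-5 rating assigned by a judge model to one summary.
    """
    buckets = defaultdict(list)
    for model, domain, score in rows:
        buckets[(model, domain)].append(score)
    return {key: mean(values) for key, values in buckets.items()}

# Hypothetical ratings for two models on one conversational domain.
table = aggregate_judge_scores([
    ("model_a", "job_interview", 4), ("model_a", "job_interview", 5),
    ("model_b", "job_interview", 3), ("model_b", "job_interview", 4),
])
print(table)  # {('model_a', 'job_interview'): 4.5, ('model_b', 'job_interview'): 3.5}
```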
SONIC-O1 benchmark evaluates multimodal temporal understanding and reasoning
Researchers developed SONIC-O1, a comprehensive benchmark comprising 4,958 fully human-verified annotations spanning 13 real-world conversational domains to rigorously evaluate Multimodal Large Language Models (MLLMs). The study pioneered a methodology for assessing MLLM performance on temporally grounded tasks, specifically open-ended summarization, multiple-choice question answering, and temporal localization with supporting rationales. Data collection involved meticulous annotation of audio-video segments, incorporating demographic metadata to facilitate analysis of potential biases. This approach enables detailed investigation into model behaviour across diverse population groups.
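A minimal sketch of how one such annotation might be represented is shown below; the field names (clip_id, domain, summary, mcqs, temporal spans, demographics) are illustrative assumptions and do not reflect SONIC-O1's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class MCQItem:
    question: str
    options: list[str]     # candidate answers shown to the model
    answer_index: int      # index of the correct option

@dataclass
class AnnotationRecord:
    clip_id: str           # identifier of the audio-video segment
    domain: str            # one of the 13 conversational domains
    summary: str           # human-written open-ended summary
    mcqs: list[MCQItem] = field(default_factory=list)
    # Ground-truth temporal span (start, end) in seconds supporting each MCQ,
    # together with a short annotator-written rationale.
    temporal_spans: list[tuple[float, float]] = field(default_factory=list)
    rationales: list[str] = field(default_factory=list)
    demographics: dict[str, str] = field(default_factory=dict)  # e.g. {"age_group": "...", "gender": "..."}
```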
To construct SONIC-O1, the team systematically gathered conversational data from real-world scenarios, ensuring a broad representation of everyday interactions. Annotators then meticulously labelled each segment, providing detailed summaries, formulating multiple-choice questions with correct answers, and pinpointing the precise temporal locations relevant to each question. The research employed a rigorous quality control process, involving multiple annotators per segment and adjudication to ensure annotation consistency and accuracy. This process yielded a high-quality dataset suitable for robust MLLM evaluation.
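The paper's exact adjudication rule is not reproduced here, but a simple majority-vote-with-escalation step, sketched below under that assumption, illustrates how multiple annotations per segment can be reconciled.

```python
from collections import Counter

def adjudicate(labels: list[str]):
    """Return the majority label for a segment, or None when annotators
    disagree and the segment should be escalated to an expert adjudicator."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else None

# Hypothetical answer keys from three annotators for the same MCQ.
print(adjudicate(["B", "B", "C"]))  # -> "B" (strict majority)
print(adjudicate(["A", "B", "C"]))  # -> None (sent to adjudication)
```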
Experiments were conducted utilising both closed-source and open-source MLLMs, assessing their performance across the defined tasks. The team measured MCQ accuracy and, crucially, temporal localization performance, comparing the two model families to quantify their ability to accurately identify relevant moments in the audio-video data. This temporal localization assessment evaluated the alignment between model-generated rationales and the ground-truth timestamps, providing a nuanced understanding of reasoning capabilities. A substantial 22.6% difference in temporal localization performance was observed between the best-performing closed-source and open-source models.
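The paper's exact scoring rule is not detailed here, but temporal localization is commonly scored with temporal intersection-over-union (IoU) against the ground-truth span; the sketch below assumes that formulation and uses hypothetical time spans.

```python
def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """Intersection-over-union of two time spans given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

def localization_accuracy(preds, golds, threshold: float = 0.5) -> float:
    """Fraction of predictions whose temporal IoU with the ground truth meets a threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(golds) if golds else 0.0

# Hypothetical spans for two models against the same ground truth.
golds   = [(12.0, 18.5), (40.0, 47.0)]
model_a = [(11.5, 19.0), (41.0, 46.0)]
model_b = [(5.0, 9.0), (42.0, 44.0)]
gap = localization_accuracy(model_a, golds) - localization_accuracy(model_b, golds)
print(f"localization accuracy gap: {gap:.1%}")
```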
Further analysis focused on identifying performance disparities across demographic groups, revealing persistent biases in model behaviour. The study harnessed linguistic analysis tools, such as LIWC-22, to examine the language used in both the input data and model outputs, seeking correlations between demographic factors and model responses. This innovative application of linguistic inquiry provided insights into potential sources of bias and areas for improvement in model fairness. The release of SONIC-O1, alongside its associated dataset and code, facilitates reproducibility and encourages further research into temporally grounded and socially robust MLLMs.
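A minimal sketch of the kind of per-group breakdown such a disparity analysis relies on is given below; the grouping variable and group names are hypothetical, and the actual study additionally draws on LIWC-22 linguistic features.

```python
from collections import defaultdict

def per_group_accuracy(records):
    """Aggregate answer correctness by demographic group.

    `records` is an iterable of (group, is_correct) pairs; the grouping
    variable (e.g. an age band or gender label) is an illustrative assumption.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}

# Hypothetical per-item results for two groups.
scores = per_group_accuracy([
    ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", True),
])
gap = max(scores.values()) - min(scores.values())
print(scores, f"largest inter-group gap: {gap:.0%}")
```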
SONIC-O1 benchmark reveals MLLM performance disparities across different demographic groups
Scientists have introduced SONIC-O1, a new benchmark designed to rigorously evaluate multimodal large language models (MLLMs) on real-world audio-video data. The research addresses a gap in existing benchmarks, which largely focus on static image understanding, by providing a comprehensive dataset with 4,958 fully human-verified annotations and demographic metadata spanning 13 conversational domains. Experiments were conducted across key tasks including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales. Results demonstrate a substantial performance difference of 22.6% in temporal localization between the best-performing closed-source and open-source models.
SONIC-O1 reveals MLLM limitations and biases in complex real-world audio-video understanding
Scientists have introduced SONIC-O1, a new benchmark designed to evaluate the performance of multimodal large language models (MLLMs) on real-world audio-video data. The benchmark comprises 4,958 human-verified annotations across 13 conversational domains, assessing models on open-ended summarization, multiple-choice question answering, and temporal localization with supporting rationales. Experiments utilising both closed- and open-source models revealed considerable limitations in temporal localization, with a 22.6% performance difference between the best-performing closed-source and open-source models. Furthermore, performance disparities were observed across different demographic groups, suggesting potential biases in model behaviour.
The research highlights that audio or transcripts frequently offer the most significant cues for comprehension, while accurately identifying timing remains a substantial challenge. The authors acknowledge several limitations within the study, including a potential reliance on relative playback position rather than robust temporal representations and unequal sample sizes across demographic groups. They also note that context-length limitations necessitated video segmentation for many models, potentially amplifying temporal errors. Future research could benefit from addressing these limitations and expanding benchmark coverage to encompass a wider range of scenarios. Overall, SONIC-O1 offers a valuable, open-access evaluation suite for temporally grounded and socially robust understanding in MLLMs, providing a practical testbed for evaluating these models in realistic audio-video settings and guiding future work towards improved temporal reasoning and more equitable multimodal evaluation.
👉 More information
🗞 SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding
🧠 ArXiv: https://arxiv.org/abs/2601.21666
