Determining video similarity presents a complex challenge, as humans naturally assess videos along multiple dimensions, such as actions and locations, rather than forming a single overall impression. Benedetta Liberatori, Alessandro Conti, and Lorenzo Vaquero from the University of Trento, together with Yiming Wang and Elisa Ricci from Fondazione Bruno Kessler (FBK), address this issue by introducing a new task and benchmark called Concept-based Video Similarity (ConViS). The team proposes that comparing videos through the lens of specific semantic concepts, such as ‘cooking’ or ‘beach’, more closely mirrors human reasoning and enables more nuanced understanding. To support this approach, they created ConViS-Bench, a dataset of video pairs annotated with concept-level similarity scores and descriptive explanations, and demonstrated that current video understanding models vary considerably in how accurately they assess similarity across different concepts, paving the way for advances in video analysis and retrieval.
The ConViS task moves beyond simply recognizing what happens in a video to understanding how videos relate to each other according to specific criteria. The benchmark presents pairs of videos and asks models to score them along key aspects: the main action, the subjects involved, the objects present, the location, and the order of events. Each criterion receives its own score, allowing for a detailed, concept-by-concept comparison of video relationships. The annotations also include tags that highlight specific similarities and differences between the two videos, offering insight into why a particular score is assigned.
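To make the annotation format concrete, the sketch below shows one plausible way to represent a scored video pair in Python; the field names, concept labels, and 0–4 score range are illustrative assumptions rather than the benchmark's official schema.

```python
# A plausible in-memory representation of one annotated video pair; field
# names, concept labels, and the 0-4 score range are illustrative assumptions,
# not the benchmark's official schema.
from dataclasses import dataclass

CONCEPTS = ["main action", "subjects", "objects", "location", "order of events"]

@dataclass
class VideoPairAnnotation:
    video_a: str                 # ID or path of the first video
    video_b: str                 # ID or path of the second video
    scores: dict[str, float]     # one human similarity score per concept
    similarities: list[str]      # free-form tags describing shared aspects
    differences: list[str]       # free-form tags describing diverging aspects

# Example instance (contents are invented for illustration)
pair = VideoPairAnnotation(
    video_a="videos/cooking_pasta.mp4",
    video_b="videos/cooking_stir_fry.mp4",
    scores={"main action": 4, "subjects": 3, "objects": 2,
            "location": 4, "order of events": 3},
    similarities=["both show someone cooking a meal in a home kitchen"],
    differences=["different ingredients and utensils are used"],
)
print(pair.scores["main action"])  # -> 4
```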
Results show that the benchmark supports fine-grained analysis, pinpointing exactly which aspects two videos share and where they diverge. It captures contextual relationships, such as the similarity between cooking videos that use different ingredients or techniques, as well as subtle distinctions, such as variations in yoga poses, with the detailed tags revealing the reasoning behind each score. ConViS-Bench therefore offers a valuable tool for researchers working on video understanding, action recognition, and similarity search: it could underpin more accurate video retrieval systems, help identify redundant clips for editing, or support content moderation by flagging videos similar to known harmful material. In line with human perception, each video pair receives human-annotated similarity scores for five broad concepts, alongside free-form textual descriptions detailing both similarities and differences, and the videos average 28.2 seconds in length, offering substantial content for detailed comparison.
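One of these applications, concept-conditioned retrieval, can be sketched in a few lines: given any scorer that returns per-concept similarities for a pair of videos, candidates are ranked on the single concept a user cares about. The `score_pair` callable and its output format are assumptions for illustration, not part of the released benchmark.

```python
# Minimal sketch of concept-conditioned retrieval built on pairwise scores.
# `score_pair` stands in for any model or lookup that returns per-concept
# similarities for two videos; its name and signature are assumptions.
from typing import Callable

def retrieve_by_concept(query: str,
                        candidates: list[str],
                        concept: str,
                        score_pair: Callable[[str, str], dict[str, float]],
                        top_k: int = 5) -> list[tuple[str, float]]:
    """Rank candidate videos by their similarity to the query on one concept."""
    scored = [(cand, score_pair(query, cand).get(concept, 0.0))
              for cand in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Usage: find clips whose *location* matches the query, regardless of the
# action being performed.
# top_matches = retrieve_by_concept("query.mp4", library, "location", my_scorer)
```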
To situate ConViS, the researchers compared it to existing video datasets, including StepDiff and VidDiffBench, revealing key distinctions in focus and annotation style. Unlike these prior datasets, which concentrate on limited domains or specific aspects such as cooking actions, ConViS-Bench emphasizes broad conceptual comparisons, with its video pairs spanning a much wider range of domains. During annotation, humans assigned similarity scores to each video pair for every concept, with the concepts expressed in natural language, allowing for flexible and nuanced evaluation. The work thereby addresses the challenge of comparing videos in a way that reflects human cognition, where similarity judgements depend on which aspects, such as activity, location, or the order of actions, are being prioritized. Extensive benchmarking of recent Large Multimodal Models (LMMs) on ConViS-Bench reveals significant performance differences across concepts.
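As a rough illustration of how such benchmarking might look in practice, the snippet below prompts a multimodal model for a single concept-level score and parses the reply; the prompt wording, the 0–4 scale, and the `ask_lmm` helper are hypothetical and not the paper's actual evaluation protocol.

```python
# Hypothetical sketch of querying a large multimodal model for one
# concept-level similarity score; the prompt wording, the 0-4 scale, and
# the `ask_lmm` helper are assumptions, not the paper's protocol.
import re
from typing import Callable

PROMPT_TEMPLATE = (
    "You are shown two videos. On a scale from 0 (completely different) to 4 "
    "(nearly identical), how similar are they with respect to the {concept}? "
    "Answer with a single integer."
)

def score_concept(video_a: str, video_b: str, concept: str,
                  ask_lmm: Callable[..., str]) -> int:
    """Ask the model for a per-concept score and parse the first integer."""
    reply = ask_lmm(videos=[video_a, video_b],
                    prompt=PROMPT_TEMPLATE.format(concept=concept))
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else 0
```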
The analysis shows that while some models reliably identify visual similarities, they consistently struggle with more abstract notions such as the temporal structure of events, a limitation also noted in previous studies. In other words, certain concepts are substantially harder for models to judge than others, highlighting clear areas for improvement in video understanding capabilities.
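A simple way to surface such per-concept gaps is to correlate model scores with the human annotations concept by concept, for example with Spearman correlation as sketched below; the exact metric used by the benchmark may differ, so this is only an assumed setup.

```python
# Sketch of a per-concept evaluation: correlate model-predicted similarity
# scores with human annotations. Spearman correlation is an assumed choice
# here; the benchmark's official metric may differ.
from scipy.stats import spearmanr

def per_concept_correlation(human: dict[str, list[float]],
                            model: dict[str, list[float]]) -> dict[str, float]:
    """Return one correlation value per concept over all annotated video pairs."""
    results = {}
    for concept, human_scores in human.items():
        rho, _ = spearmanr(human_scores, model[concept])
        results[concept] = rho
    return results

# Concepts with low correlation (e.g. the order of events) point to aspects
# that models judge poorly, mirroring the gaps reported on ConViS-Bench.
```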
These gaps highlight a bias towards certain types of representations and underscore the need for methods specifically designed to understand concepts within videos. By enabling evaluation and retrieval grounded in such concepts, the research offers a pathway towards more interpretable and user-aligned video understanding systems. The authors acknowledge that the current concept set, while grounded in cognitive science, may not capture every domain-specific nuance of video similarity, and they suggest expanding the dataset to further enhance its value.
👉 More information
🗞 ConViS-Bench: Estimating Video Similarity Through Semantic Concepts
🧠 ArXiv: https://arxiv.org/abs/2509.19245
