Researchers are increasingly focused on evaluating the perceptual abilities of large multimodal models (LMMs). However, one critical aspect—video aesthetics understanding—has received comparatively limited attention. While prior work has largely emphasized object recognition, action understanding, and semantic reasoning, the nuanced evaluation of visual form, style, and affect remains underexplored.
To address this gap, Yunhao Li, Sijing Wu, and Zhilin Gao from Shanghai Jiao Tong University, together with Zicheng Zhang from Shanghai AI Laboratory, Qi Jia, Huiyu Duan, and colleagues, introduce VideoAesBench, a benchmark specifically designed to rigorously assess LMMs’ comprehension of video aesthetic quality. This work moves beyond surface-level perception and establishes a structured framework for evaluating aesthetic judgment in videos, an ability increasingly relevant to creative, social, and entertainment-oriented AI applications.
VideoAesBench Dataset and Design
VideoAesBench comprises 1,804 videos curated from a diverse set of sources, including user-generated content (UGC), AI-generated content (AIGC), robotic-generated content (RGC), compressed videos, and game footage. By incorporating both high- and low-quality examples across multiple creation methods, the benchmark enables a comprehensive assessment of how LMMs perceive aesthetics under varied visual and technical conditions.
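To make this composition concrete, the sketch below shows one way such a mixed-source corpus could be represented in code. The class names, fields, and file paths are illustrative assumptions, not the authors' released data schema.

```python
from dataclasses import dataclass
from enum import Enum

class VideoSource(Enum):
    """The five source categories reported for VideoAesBench."""
    UGC = "user-generated content"
    AIGC = "AI-generated content"
    RGC = "robotic-generated content"
    COMPRESSED = "compressed video"
    GAME = "game footage"

@dataclass
class BenchmarkVideo:
    # Hypothetical per-video record; the real release may differ.
    video_id: str
    path: str
    source: VideoSource

# The full corpus would hold 1,804 such entries spanning all five sources.
corpus = [
    BenchmarkVideo("vab_0001", "videos/vab_0001.mp4", VideoSource.UGC),
    BenchmarkVideo("vab_0002", "videos/vab_0002.mp4", VideoSource.AIGC),
]
```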
A key contribution of VideoAesBench lies in its multifaceted question design. In addition to conventional single-choice and multiple-choice questions, the benchmark includes True or False questions and a novel open-ended descriptive format aimed at eliciting detailed explanations of video aesthetics. To ensure clarity and reduce ambiguity, annotators carefully constructed questions so that aesthetic descriptions focused on salient visual attributes and avoided irrelevant details.
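The four question formats can be pictured as variations on a single record, as in the minimal sketch below. Field names and the example item are hypothetical, chosen only to illustrate how single-choice, multiple-choice, true-or-false, and open-ended questions might share one structure.

```python
from dataclasses import dataclass, field

@dataclass
class AestheticQuestion:
    # All field names here are assumptions for illustration.
    qtype: str   # "single_choice" | "multiple_choice" | "true_false" | "open_ended"
    prompt: str
    options: list = field(default_factory=list)  # empty for open-ended questions
    answer: str = ""  # reference answer, or a free-form description for open-ended

# A true-or-false item probing a salient compositional attribute.
q = AestheticQuestion(
    qtype="true_false",
    prompt="The shot keeps the main subject centred throughout the clip.",
    options=["True", "False"],
    answer="True",
)
```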
Holistic Aesthetic Dimensions
The benchmark organises video aesthetics into three high-level dimensions:
- Visual Form, encompassing five aspects related to composition, structure, and spatial organisation
- Visual Style, covering four aspects such as colour usage, texture, and stylistic coherence
- Visual Affectiveness, consisting of three aspects capturing emotional tone and expressive impact
This structured taxonomy allows for fine-grained evaluation of LMMs’ aesthetic perception, empathy, and interpretative reasoning, rather than relying on coarse accuracy metrics alone.
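A compact way to picture the taxonomy is as a mapping from each dimension to its aspects. The sketch below records only the example aspects named above; the full aspect names are given in the paper.

```python
# Hypothetical encoding of the three-dimension taxonomy; only the
# example aspects mentioned in the text are listed here.
AESTHETIC_TAXONOMY = {
    "Visual Form": {
        "count": 5,
        "examples": ["composition", "structure", "spatial organisation"],
    },
    "Visual Style": {
        "count": 4,
        "examples": ["colour usage", "texture", "stylistic coherence"],
    },
    "Visual Affectiveness": {
        "count": 3,
        "examples": ["emotional tone", "expressive impact"],
    },
}

# Twelve aspects in total across the three dimensions.
assert sum(d["count"] for d in AESTHETIC_TAXONOMY.values()) == 12
```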
Experimental Evaluation and Findings
Using VideoAesBench, the researchers benchmarked 23 open-source and commercial large multimodal models. The results indicate that, despite recent progress, current LMMs exhibit only rudimentary video aesthetic perception: their judgments remain incomplete and imprecise, particularly when subtle composition, colour harmony, or emotional expression is involved.
The findings reveal notable performance variation across models and video types. Larger models generally demonstrate stronger aesthetic understanding, suggesting that model scale and training diversity play an important role. Certain models perform better on specific categories—for example, some show relative strength in evaluating user-generated or game videos, while others perform better on compressed content—highlighting an imbalance in current LMM capabilities.
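One way to surface this imbalance is to break graded answers down by source category. The helper below is a minimal sketch of that aggregation; the input format is assumed rather than taken from the paper's evaluation pipeline.

```python
from collections import defaultdict

def accuracy_by_source(results):
    """Per-category accuracy over (source, is_correct) pairs.

    `results` is a hypothetical flat log of graded answers, e.g.
    [("UGC", True), ("GAME", False), ...].
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for source, correct in results:
        totals[source] += 1
        hits[source] += int(correct)
    return {source: hits[source] / totals[source] for source in totals}

# Example: a model that is strong on game footage but weak on AIGC.
print(accuracy_by_source([("GAME", True), ("GAME", True), ("AIGC", False)]))
# {'GAME': 1.0, 'AIGC': 0.0}
```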
Importantly, even top-performing models struggle with complex aesthetic scenarios, especially when visual quality is degraded or when artistic intent is subtle. These limitations underscore the difficulty of translating human aesthetic sensibility into automated systems.
Implications and Future Directions
VideoAesBench establishes a robust testbed for future research into explainable video aesthetics assessment. By integrating diverse video sources, structured aesthetic dimensions, and multiple evaluation formats, the benchmark provides valuable insights into where current models succeed and where they fall short.
The authors emphasise that future work should focus on developing LMMs with more consistent and generalised aesthetic understanding across video types. Addressing biases in training data, improving cross-domain generalisation, and enhancing models’ ability to articulate aesthetic reasoning are key challenges moving forward.
Beyond academic evaluation, improved video aesthetic perception has practical implications for applications such as content recommendation, video editing, creative assistance, social media moderation, and AI-assisted content creation. As AI systems increasingly interact with human creativity, benchmarks like VideoAesBench will be essential for guiding progress toward more perceptive and artistically aware models.
👉 More information
🗞 VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models
🧠 arXiv: https://arxiv.org/abs/2601.21915
