Recognising human emotion is a complex challenge, yet it is crucial for creating truly intelligent systems, and researchers are now pushing the boundaries of what’s possible with advanced artificial intelligence. Jing Han from the University of Cambridge, Zhiqiang Gao and Shihao Gao from Hunan University, and their colleagues present the first large-scale evaluation of how well large multimodal models (those capable of processing text, audio, and video) understand emotion in real-world scenarios. The work moves beyond identifying a limited set of emotions and instead tackles the harder task of recognising a vast, open range of emotional expressions, establishing essential benchmarks for the field. By systematically testing 19 leading models, the team shows that combining audio, video, and text achieves the best results, with video proving particularly important, and that open-source models are surprisingly competitive with their closed-source counterparts, offering valuable insights for developing more sophisticated and nuanced emotion recognition technologies.
Google, Alibaba, and Multimodal LLM Advances
Recent developments showcase significant progress in Large Language Models (LLMs) and multimodal AI, with models like Google’s Gemini and Alibaba’s Qwen leading the way. Gemini, a family of models that processes text, images, audio, and video, is designed to function as an agentic AI, while Gemma offers open-source language models prioritizing practicality. Alibaba’s Qwen series includes both audio and language models, with Qwen2.5 reporting substantial performance improvements, and DeepSeek’s models use reinforcement learning to enhance reasoning capabilities. A key theme in current research is the development of prompting techniques that improve LLM performance, particularly on reasoning tasks.
Strategies such as chain-of-thought prompting, self-consistency, self-refine, and least-to-most prompting are being explored to elicit and refine the models’ thought processes. Direct Preference Optimization, which tunes a model directly on human preference data rather than through a separate reinforcement learning loop, is also being applied to refine model responses. These advancements extend beyond text, with research focusing on multimodal understanding, specifically video and audio processing. Models like LLaVA-Video and Tarsier2 are advancing video understanding, while Qwen-Audio provides unified audio-language processing. Researchers are also investigating methods to enhance temporal understanding in video LLMs and to scale the performance of open-source multimodal models. Furthermore, efforts are underway to improve the explainability of AI systems, including through sonification, to make them more user-friendly. These models find application in areas such as medical education and emotion recognition, demonstrating the diverse potential of this rapidly evolving field.
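To make the prompting strategies above concrete, here is a minimal sketch of chain-of-thought prompting combined with self-consistency voting: several reasoning chains are sampled and the most frequent final answer wins. The `query_llm` function and the prompt wording are illustrative placeholders, not part of any model API mentioned above.

```python
# Minimal sketch: chain-of-thought prompting with self-consistency voting.
# `query_llm` is a hypothetical placeholder for whatever chat-completion
# endpoint is available; it is not an API from any specific library.
from collections import Counter


def query_llm(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder: send `prompt` to an LLM and return its raw text reply."""
    raise NotImplementedError("wire this to your model endpoint of choice")


COT_TEMPLATE = (
    "Describe the speaker's tone, wording, and context step by step, "
    "then finish with a line of the form 'Answer: <emotion>'.\n\n{utterance}"
)


def self_consistent_emotion(utterance: str, n_samples: int = 5) -> str:
    """Sample several reasoning chains and return the majority-vote answer."""
    answers = []
    for _ in range(n_samples):
        reply = query_llm(COT_TEMPLATE.format(utterance=utterance), temperature=0.7)
        # Keep only the final answer line; the reasoning itself is discarded.
        for line in reversed(reply.splitlines()):
            if line.lower().startswith("answer:"):
                answers.append(line.split(":", 1)[1].strip().lower())
                break
    return Counter(answers).most_common(1)[0][0] if answers else ""
```

In a self-consistency setup, increasing `n_samples` trades extra model calls for a more stable final answer.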
Multimodal LLMs Benchmarked for Open-Vocabulary Emotion Recognition
A recent study presents a large-scale benchmark investigation of nineteen mainstream multimodal large language models (MLLMs) for open-vocabulary emotion recognition. The researchers constructed a comprehensive evaluation framework around the OV-MERD dataset to analyze model reasoning, fusion strategies, and prompt design, revealing both the capabilities and limitations of current MLLMs in understanding nuanced emotions. The research builds upon a previous emotional clue-based method, extending it with new architectures for improved performance. Experiments show that a two-stage trimodal fusion approach, integrating audio, video, and text, achieves the best performance among the configurations tested.
Video emerges as the most critical modality for accurate emotion assessment, contributing more to performance than audio or text alone. A detailed analysis of prompt engineering shows that carefully designed prompts consistently improve results, and the comparison across models reveals a surprisingly narrow performance gap between open-source and closed-source LLMs. Together, these results establish essential benchmarks, offer practical guidelines for advancing open-vocabulary and fine-grained affective computing, and lay the groundwork for more nuanced and interpretable emotion systems capable of capturing complex emotional descriptions.
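Because open-vocabulary predictions are free-form label sets rather than picks from a fixed class list, evaluation has to compare a predicted set of emotion words against the annotated one. The paper’s exact scoring protocol is not reproduced here; the sketch below assumes a simple set-level precision/recall/F1, with lower-casing standing in for proper synonym grouping, and the function names are illustrative.

```python
# Hedged sketch of a set-level score for open-vocabulary emotion labels.
# The `normalise` step is a simplifying assumption (lower-casing instead of
# true synonym grouping); it is not the evaluation used in the paper.

def normalise(labels):
    """Very rough stand-in for synonym grouping: lower-case and strip."""
    return {label.strip().lower() for label in labels}


def set_f1(predicted, reference):
    """Precision, recall, and F1 between two emotion-label sets."""
    pred, ref = normalise(predicted), normalise(reference)
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    overlap = len(pred & ref)
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Example: a model predicting a superset of the annotated emotions.
print(set_f1(["happy", "excited", "proud"], ["happy", "proud"]))
# -> (0.666..., 1.0, 0.8)
```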
Trimodal Fusion Excels in Emotion Recognition
Comprehensive benchmarking of multimodal large language models (MLLMs) on open-vocabulary emotion recognition reveals the benefits of multimodal fusion. Scientists evaluated nineteen models using the OV-MERD dataset, which contains a rich vocabulary of emotion terms. Results confirm that integrating audio, video, and text achieves the best performance, surpassing both one-stage and bimodal configurations, and video proves to be the most impactful single modality. The emotional clue-based two-stage method achieves the highest performance: audio and video LLMs first use their reasoning capabilities to extract salient emotional clues, which a text-based LLM then synthesizes into a final prediction. The study also finds a surprisingly small performance difference between open-source and closed-source models, establishing essential benchmarks and offering practical guidelines for advancing fine-grained affective computing. These findings pave the way for more nuanced emotion systems capable of recognizing a wider range of emotional states.
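A rough sketch of how such a clue-based, two-stage pipeline could be wired together is shown below. The three `query_*` functions, the prompt text, and the comma-separated output format are assumptions made for illustration, not the paper’s actual implementation.

```python
# Minimal sketch of a two-stage, clue-based trimodal pipeline: stage one asks
# audio and video LLMs for emotional clues, stage two asks a text LLM to fuse
# those clues with the transcript. All three calls are hypothetical placeholders.

def query_audio_llm(audio_path: str, prompt: str) -> str:
    raise NotImplementedError("placeholder for an audio-language model call")


def query_video_llm(video_path: str, prompt: str) -> str:
    raise NotImplementedError("placeholder for a video-language model call")


def query_text_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for a text-only LLM call")


CLUE_PROMPT = "List every cue in this clip that hints at the speaker's emotion."


def recognise_emotions(video_path: str, audio_path: str, transcript: str) -> list[str]:
    # Stage 1: extract salient emotional clues from each non-text modality.
    audio_clues = query_audio_llm(audio_path, CLUE_PROMPT)
    video_clues = query_video_llm(video_path, CLUE_PROMPT)

    # Stage 2: a text LLM fuses the clues with the transcript and returns a
    # free-form, comma-separated set of emotion labels.
    fusion_prompt = (
        "Audio clues:\n" + audio_clues + "\n\n"
        "Visual clues:\n" + video_clues + "\n\n"
        "Transcript:\n" + transcript + "\n\n"
        "Based on all of the above, list the speaker's emotions as a "
        "comma-separated set of words."
    )
    reply = query_text_llm(fusion_prompt)
    return [label.strip().lower() for label in reply.split(",") if label.strip()]
```

The design choice mirrors the description above: the non-text models do the perceptual reasoning, while the final open-vocabulary decision is delegated to a text-based LLM.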
Multimodal Emotion Recognition Reveals Key Insights
Recent research provides a comprehensive investigation into open-vocabulary multimodal emotion recognition, evaluating nineteen mainstream models across text, audio, and video processing. Results demonstrate that combining information from all three modalities yields the best performance, with video proving the most informative source for accurate emotion assessment. The study reveals a surprisingly small performance gap between leading open-source and closed-source language models, although closed-source models specialized for video gain an advantage as model size increases. Carefully designed prompts consistently improve results, yet complex reasoning models do not necessarily outperform simpler approaches for this task; the researchers note that advanced reasoning may be unnecessarily elaborate for direct emotion identification. They suggest that future work focus on more comprehensive datasets, multilingual evaluations, and more sophisticated multimodal fusion techniques to further refine emotion understanding in artificial intelligence.
👉 More information
🗞 Pioneering Multimodal Emotion Recognition in the Era of Large Models: From Closed Sets to Open Vocabularies
🧠 ArXiv: https://arxiv.org/abs/2512.20938
