Advances in Emotion AI: Enabling Open-Vocabulary Understanding with Large Models

Recognising human emotion is a complex challenge, yet it is crucial for creating truly intelligent systems, and researchers are now pushing the boundaries of what is possible with advanced artificial intelligence. Jing Han from the University of Cambridge, Zhiqiang Gao and Shihao Gao from Hunan University, and their colleagues present the first large-scale evaluation of how well large multimodal models, those capable of processing text, audio and video, understand emotion in real-world scenarios. This work moves beyond identifying a limited set of emotions and instead tackles the more difficult task of recognising a vast, open range of emotional expressions, establishing essential benchmarks for the field. By systematically testing 19 leading models, the team shows that combining audio, video and text achieves the best results, with video proving particularly important, and that open-source models are surprisingly competitive with their closed-source counterparts, offering valuable insights for developing more sophisticated and nuanced emotion recognition technologies.

Google, Alibaba, and Multimodal LLM Advances

Recent developments showcase significant progress in Large Language Models (LLMs) and multimodal AI, with models like Gemini from Google and Qwen from Alibaba leading the way. Gemini, a family of models processing text, image, audio, and video, is designed to function as an agentic AI, while Google's Gemma line offers open models prioritizing practicality. Alibaba's Qwen series spans both audio and language models, with Qwen2.5 reporting notable performance improvements, and DeepSeek's models use reinforcement learning to enhance reasoning capabilities. A key theme in current research is the development of prompting techniques to improve LLM performance, particularly in reasoning.

Strategies such as chain-of-thought prompting, self-consistency, self-refine, and least-to-most prompting are being explored to elicit and refine the models' thought processes. Direct Preference Optimization, a preference-alignment technique that avoids explicit reinforcement learning, is also being applied to refine model responses. These advancements extend beyond text, with research focusing on multimodal understanding, specifically video and audio processing. Models like LLaVA-Video and Tarsier2 are advancing video understanding, while Qwen-Audio provides unified audio-language processing. Researchers are also investigating methods to enhance temporal understanding in video LLMs and to scale the performance of open-source multimodal models. Furthermore, efforts are underway to improve the explainability of AI systems, including the use of sonification to make them more user-friendly. These models find application in areas such as medical education and emotion recognition, demonstrating the diverse potential of this rapidly evolving field.
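To make these prompting ideas concrete, here is a minimal Python sketch of chain-of-thought prompting combined with self-consistency for emotion labelling. The `call_llm` callable and the prompt wording are illustrative assumptions, not an API or template from any of the papers mentioned; self-consistency is implemented as a simple majority vote over several sampled reasoning paths.

```python
# A minimal sketch of chain-of-thought prompting with self-consistency.
# `call_llm` is a hypothetical placeholder for whatever chat-completion
# client is in use; it is not an API from the paper or a specific library.
from collections import Counter
from typing import Callable

COT_TEMPLATE = (
    "You will see a short description of a speaker's words, tone, and facial "
    "expression.\n"
    "Think step by step about the emotional clues, then finish with a line of "
    "the form 'Emotions: <comma-separated labels>'.\n\n"
    "Description: {description}"
)

def extract_labels(response: str) -> tuple[str, ...]:
    """Pull the final 'Emotions: ...' line out of a chain-of-thought answer."""
    for line in reversed(response.strip().splitlines()):
        if line.lower().startswith("emotions:"):
            labels = line.split(":", 1)[1]
            return tuple(sorted(w.strip().lower() for w in labels.split(",") if w.strip()))
    return ()

def self_consistent_emotions(
    description: str,
    call_llm: Callable[[str, float], str],  # (prompt, temperature) -> completion text
    n_samples: int = 5,
) -> tuple[str, ...]:
    """Sample several reasoning paths and keep the most frequent answer set."""
    prompt = COT_TEMPLATE.format(description=description)
    votes = Counter(
        extract_labels(call_llm(prompt, 0.7))  # temperature > 0 so paths differ
        for _ in range(n_samples)
    )
    answer, _ = votes.most_common(1)[0]
    return answer
```

The design point is simply that reasoning is elicited in free text and only the final label set is voted on, which is what distinguishes self-consistency from a single greedy chain-of-thought pass.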

Multimodal LLMs Benchmarked for Open-Vocabulary Emotion Recognition

A recent study presents a large-scale benchmark investigation of nineteen mainstream multimodal large language models (MLLMs) for open-vocabulary emotion recognition. Researchers constructed a comprehensive evaluation framework using the OV-MERD dataset to analyze model reasoning, fusion strategies, and prompt design, revealing both the capabilities and limitations of current MLLMs in understanding nuanced emotions. The research builds upon a previous emotional clue-based method, extending it with novel architectures for improved performance. Experiments demonstrate that a two-stage trimodal fusion approach, integrating audio, video, and text, achieves optimal performance.

Video emerges as the most critical modality for accurate emotion assessment, influencing results far more than audio or text alone. The study also includes a detailed analysis of prompt engineering and reveals a surprisingly narrow performance gap between open-source and closed-source LLMs, establishing essential benchmarks and offering practical guidelines for advancing open-vocabulary and fine-grained affective computing. These findings lay the groundwork for more nuanced and interpretable emotion systems capable of capturing complex emotional descriptions.
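Open-vocabulary recognition means model outputs and reference annotations are free-form sets of emotion words rather than picks from a fixed label list. One simple way to score such outputs is set-level precision, recall, and F1 between the predicted and reference terms; the sketch below is a generic illustration only, not the benchmark's official protocol, which may additionally normalise synonymous terms before matching.

```python
# A generic set-level precision / recall / F1 between a predicted
# open-vocabulary emotion set and a reference annotation. Illustrative only;
# it does not reproduce the benchmark's official scoring protocol.
def set_prf(predicted: set[str], reference: set[str]) -> tuple[float, float, float]:
    predicted = {p.strip().lower() for p in predicted}
    reference = {r.strip().lower() for r in reference}
    if not predicted or not reference:
        return 0.0, 0.0, 0.0
    overlap = len(predicted & reference)
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: a model answers {"anxious", "worried", "sad"} against the reference
# {"worried", "sad", "disappointed"} -> P = 2/3, R = 2/3, F1 = 2/3.
print(set_prf({"anxious", "worried", "sad"}, {"worried", "sad", "disappointed"}))
```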

Trimodal Fusion Excels in Emotion Recognition

Comprehensive benchmarking of multimodal large language models (MLLMs) in open-vocabulary emotion recognition reveals the benefits of multimodal fusion. Scientists evaluated nineteen models using the OV-MERD dataset, which contains a rich vocabulary of emotion terms. Results confirm that integrating audio, video, and text achieves optimal performance, surpassing both one-stage and bimodal configurations. Video proves to be the most impactful modality, significantly enhancing accuracy. The emotional clue-based two-stage method achieves the highest performance, leveraging the reasoning capabilities of audio and video LLMs to extract salient emotional clues and then synthesizing them with text-based LLMs, as sketched below. The study demonstrates a surprisingly small performance difference between open-source and closed-source models, establishing essential benchmarks and offering practical guidelines for advancing fine-grained affective computing. These findings pave the way for more nuanced emotion systems capable of recognizing a wider range of emotional states.
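As a rough illustration of how such a clue-based two-stage pipeline can be wired together, the Python sketch below assumes hypothetical `audio_llm`, `video_llm`, and `text_llm` callables and illustrative prompts; it is not the authors' implementation or their exact prompt wording.

```python
# A minimal sketch of a clue-based two-stage trimodal pipeline. The three
# callables are hypothetical stand-ins for an audio LLM, a video LLM, and a
# text LLM; the prompts are illustrative, not the paper's exact wording.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    audio_path: str      # e.g. a .wav file for the clip
    video_path: str      # e.g. an .mp4 file for the clip
    transcript: str      # subtitle / ASR text for the clip

def two_stage_emotion(
    sample: Sample,
    audio_llm: Callable[[str, str], str],  # (audio_path, prompt) -> clue text
    video_llm: Callable[[str, str], str],  # (video_path, prompt) -> clue text
    text_llm: Callable[[str], str],        # (prompt) -> label text
) -> list[str]:
    # Stage 1: each modality-specific model extracts salient emotional clues
    # as natural-language descriptions rather than fixed labels.
    audio_clues = audio_llm(
        sample.audio_path,
        "Describe the emotional clues in this speech: tone, pitch, pace, pauses.",
    )
    video_clues = video_llm(
        sample.video_path,
        "Describe the emotional clues in this video: facial expression, gaze, gestures.",
    )

    # Stage 2: a text LLM fuses the clues with the transcript and produces an
    # open-vocabulary set of emotion words (no fixed label inventory).
    fusion_prompt = (
        "Audio clues: " + audio_clues + "\n"
        "Video clues: " + video_clues + "\n"
        "Transcript: " + sample.transcript + "\n"
        "List the emotions this person most plausibly feels, as a "
        "comma-separated set of words."
    )
    labels = text_llm(fusion_prompt)
    return [w.strip().lower() for w in labels.split(",") if w.strip()]
```

The key idea is that stage one converts raw audio and video into textual emotional clues, so that stage two can reason over all three modalities within a single text prompt.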

Multimodal Emotion Recognition Reveals Key Insights

Recent research provides a comprehensive investigation into open-vocabulary multimodal emotion recognition, evaluating nineteen mainstream models across text, audio, and video processing. Results demonstrate that combining information from all three modalities yields the best performance, with video proving to be the most informative source for accurate emotion assessment. The study reveals a surprisingly small performance difference between leading open-source and closed-source language models, although video-specialized closed-source models show advantages with increased size. Carefully designed prompts consistently improve results, yet complex reasoning models do not necessarily outperform simpler approaches for this specific application. Researchers acknowledge that the models’ advanced reasoning skills may be unnecessarily complex for direct emotion identification, suggesting future work should focus on more comprehensive datasets, multilingual evaluations, and sophisticated multimodal fusion techniques to further refine emotion understanding in artificial intelligence.

👉 More information
🗞 Pioneering Multimodal Emotion Recognition in the Era of Large Models: From Closed Sets to Open Vocabularies
🧠 ArXiv: https://arxiv.org/abs/2512.20938

Rohail T.


I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
