The ability of artificial intelligence to understand and respond to human emotions, cultural context, and ethical considerations remains a significant challenge, despite advances in large language models. A team led by Jiaxin Liu, Peiyi Tu, and Wenyu Chen addresses this gap with HeartBench, a novel framework for evaluating anthropomorphic intelligence in models designed for the Chinese language. This benchmark moves beyond simple reasoning tasks to assess a model's capacity for emotional, cultural, and ethical understanding, using realistic psychological counseling scenarios vetted by clinical experts. The research reveals a considerable performance gap: even current state-of-the-art language models achieve only 60% of the ideal score, with particular difficulty on nuanced emotional cues and complex ethical dilemmas. The work establishes a crucial new standard for measuring and improving truly human-like AI.
Recognizing a deficit in anthropomorphic intelligence within current large language models (LLMs), researchers developed HeartBench, a novel framework designed to rigorously evaluate the emotional, cultural, and ethical intelligence of LLMs specifically within the Chinese linguistic context.
This benchmark moves beyond traditional cognitive assessments to address the nuanced demands of emotionally and culturally sensitive applications like AI companionship and digital mental health. HeartBench's structure centers on a theory-driven taxonomy encompassing five primary dimensions and 15 secondary capabilities, providing a comprehensive assessment of human-like traits. It employs a case-specific, rubric-based methodology that translates abstract qualities into granular, measurable criteria, and uses a "reasoning-before-scoring" evaluation protocol to ensure thorough analysis.
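To make the evaluation protocol concrete, here is a minimal sketch of what a case-specific, rubric-based "reasoning-before-scoring" evaluation could look like. The class names, criteria, point values, and scenario are all invented for illustration; they are not taken from the HeartBench paper.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One granular, case-specific criterion (illustrative example)."""
    description: str
    max_points: int

def score_response(reasoning: str, points_awarded: dict,
                   rubric: dict) -> float:
    """Reasoning-before-scoring: written reasoning must be supplied
    before any points are accepted; scores are normalized to [0, 1]."""
    if not reasoning.strip():
        raise ValueError("Reasoning must precede scoring.")
    total = sum(min(points_awarded.get(name, 0), c.max_points)
                for name, c in rubric.items())
    return total / sum(c.max_points for c in rubric.values())

# Hypothetical rubric for one counseling scenario
rubric = {
    "acknowledge_emotion": RubricCriterion("Names the client's underlying emotion", 2),
    "cultural_sensitivity": RubricCriterion("Respects family-centered norms", 2),
    "ethical_boundary": RubricCriterion("Avoids offering a clinical diagnosis", 1),
}
score = score_response(
    "The reply validates sadness but only partly addresses the family conflict.",
    {"acknowledge_emotion": 2, "cultural_sensitivity": 1, "ethical_boundary": 1},
    rubric,
)
print(round(score, 2))  # 0.8
```

The key design point the paper's protocol implies is the ordering constraint: a judge cannot emit scores without first committing to an explicit analysis, which the sketch enforces by rejecting empty reasoning.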
Researchers assessed 13 state-of-the-art LLMs using HeartBench, revealing a substantial performance ceiling, with even leading models achieving only 60% of the expert-defined ideal score. Further analysis utilized a difficulty-stratified “Hard Set” of scenarios, demonstrating significant performance decay in cases involving subtle emotional cues and complex ethical considerations. This detailed evaluation highlights the challenges LLMs face in navigating complex social dynamics.
The work establishes a standardized metric for anthropomorphic AI evaluation and delivers a methodological blueprint for constructing high-quality, human-aligned training data, ultimately aiming to cultivate models with a more profound and humanistic intelligence.
HeartBench Evaluates LLM Emotional Intelligence in Chinese
Scientists have developed HeartBench, a new framework for evaluating the emotional, cultural, and ethical intelligence of large language models (LLMs) specifically within the Chinese linguistic context. Recognizing a gap in existing benchmarks, the team constructed HeartBench around a detailed theory-driven taxonomy encompassing five primary dimensions and 15 secondary capabilities, grounded in authentic psychological counseling scenarios and validated by clinical experts.
The method employs a “reasoning-before-scoring” protocol, translating abstract human traits into measurable criteria for rigorous assessment. Experiments involving 13 state-of-the-art LLMs reveal a substantial performance ceiling, with even the leading models achieving only 60% of the expert-defined ideal score. Further analysis using a difficulty-stratified “Hard Set” of scenarios demonstrates a significant performance decline, confirming the Hard Set’s ability to isolate challenging nuances beyond simple pattern recognition.
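The Hard Set analysis amounts to stratifying scenarios by difficulty and measuring how much a model's mean score decays on the hard stratum. The sketch below illustrates that computation; the scenario IDs and scores are invented, not results from the paper.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical per-scenario scores for one model (normalized 0-1)
full_set = {"s1": 0.72, "s2": 0.65, "s3": 0.40, "s4": 0.35, "s5": 0.68}

# Scenarios flagged as difficult (e.g., subtle emotional cues,
# complex ethical trade-offs) form the "Hard Set" stratum
hard_ids = {"s3", "s4"}

full_mean = mean(list(full_set.values()))
hard_mean = mean([v for k, v in full_set.items() if k in hard_ids])
decay = full_mean - hard_mean  # performance decay on the hard stratum
print(f"full={full_mean:.2f} hard={hard_mean:.2f} decay={decay:.2f}")
```

A large positive decay indicates the hard stratum isolates genuinely difficult cases rather than items the model can solve by surface pattern matching.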
Detailed evaluation across dimensions reveals specific limitations, notably in humor comprehension, where models frequently adopt literal interpretations of jokes and sarcasm, and in emotional intelligence, where they exhibit over-accommodation tendencies. Measurements show models struggle with curiosity, often providing generic suggestions, and frequently employ didactic communication styles. Among the models tested, 0.5-Sonnet demonstrated the most stability, while Gemini-3-pro-preview experienced a significant drop in performance, indicating a reliance on "social specialization" rather than robust reasoning.
These results establish HeartBench as a standardized metric for anthropomorphic evaluation and a blueprint for creating high-quality training data, paving the way for more sophisticated, human-centric LLMs.
HeartBench Measures Chinese Socio-Emotional Intelligence
The researchers developed HeartBench, a new framework for evaluating how well large language models demonstrate anthropomorphic intelligence, the ability to understand and respond appropriately to complex emotional, cultural, and ethical situations, specifically within a Chinese context. This framework moves beyond simple assessments of linguistic ability and instead focuses on socio-emotional resonance, using a detailed “reasoning-before-scoring” rubric grounded in clinical psychology and anthropology.
By translating abstract human traits into measurable criteria, HeartBench provides a systematic way to assess a model’s capacity for nuanced understanding. Evaluation of thirteen leading language models using HeartBench reveals a significant performance ceiling, with even the most advanced models achieving only 60% of the score defined as ideal by human experts. The data indicates particular weaknesses in areas such as curiosity and ethical autonomy, suggesting these capabilities require dedicated development and alignment.
Importantly, the framework demonstrates high reliability, with an 87% agreement rate between automated scores and evaluations from psychological experts, establishing a trustworthy metric for assessing these complex qualities. The authors acknowledge that HeartBench is currently focused on the Chinese linguistic and cultural context, and further work is needed to adapt the framework for other languages and cultures.
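The reported 87% agreement rate can be understood as a simple reliability measure: the fraction of items on which the automated judge and the human expert reach the same verdict. A minimal sketch, with invented toy labels (the real comparison likely uses the benchmark's own scoring scale):

```python
def percent_agreement(auto_labels, expert_labels):
    """Fraction of items where the automated judge and the human
    expert give the same verdict (simple percent agreement)."""
    assert len(auto_labels) == len(expert_labels)
    matches = sum(a == e for a, e in zip(auto_labels, expert_labels))
    return matches / len(auto_labels)

# Toy per-item verdicts (hypothetical)
auto   = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
expert = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]
print(percent_agreement(auto, expert))  # 0.75
```

Percent agreement is the simplest such statistic; chance-corrected measures like Cohen's kappa are often reported alongside it when label distributions are skewed.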
Future research could explore the creation of larger, more diverse datasets to further train and refine models in these crucial areas of anthropomorphic intelligence, ultimately fostering more human-centered artificial intelligence systems.
👉 More information
🗞 HeartBench: Probing Core Dimensions of Anthropomorphic Intelligence in LLMs
🧠 ArXiv: https://arxiv.org/abs/2512.21849
