Researchers are tackling the complex challenge of assessing unified multimodal models, artificial intelligence systems capable of generating both images and text. Bo Li, Yida Yin, and Wenhao Chai, together with Xingyu Fu, Zhuang Liu, and colleagues, introduce UEval, a new benchmark comprising 1,000 expert-curated questions designed to rigorously test these models across eight real-world tasks. Unlike previous evaluation methods, UEval employs a rubric-based scoring system, built from 10,417 validated criteria drafted by a frontier multimodal model and refined by human experts, to provide a more nuanced and scalable assessment of multimodal generation quality. This research is significant because current leading models, including GPT-5-Thinking, struggle with UEval, scoring only 66.4 out of 100 and highlighting a critical need for stronger reasoning capabilities in unified multimodal AI.
This work addresses a critical gap in current evaluation methods, which largely focus on either visual question answering or text-to-image generation, overlooking the essential ability to produce cohesive multimodal responses to complex queries. UEval comprises 1,000 expert-curated questions sourced from eight diverse real-world tasks, spanning areas such as space exploration, textbook explanations, and technical diagrams, and demanding both visual and textual outputs. The research team crafted these questions to cover a wide range of reasoning types, from detailed step-by-step guides to textbook-style explanations, ensuring a challenging and thorough assessment of model capabilities.
Evaluating open-ended multimodal generation presents significant challenges, as simplistic “LLM-as-a-judge” approaches often fail to capture nuanced details. To overcome this, the researchers developed a novel rubric-based scoring system, departing from previous methods that relied on multimodal large language models (MLLMs) simply to rate image quality or text accuracy. This system begins with a frontier MLLM generating an initial rubric, a set of evaluation criteria, based on the reference images and text answers provided for each question. Subsequently, human experts refine and validate these rubrics, resulting in a total of 10,417 validated criteria, enabling scalable and fine-grained automatic scoring.
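To make the shape of such a rubric concrete, here is a minimal Python sketch of how one benchmark item and its criteria might be represented; the field names and types below are illustrative assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RubricCriterion:
    """One validated grading criterion (field names are illustrative)."""
    description: str   # what a good response must contain or get right
    points: float      # weight of this criterion in the final score
    modality: str      # "text", "image", or "both"


@dataclass
class UEvalSample:
    """One benchmark item: a question plus its expert-validated rubric."""
    question: str
    reference_text: str
    reference_images: List[str]                          # paths to reference images
    rubric: List[RubricCriterion] = field(default_factory=list)
```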
This rubric-based approach supports consistent and reproducible evaluation, addressing a key limitation of existing benchmarks. Experiments demonstrate that UEval poses a substantial challenge to current unified models, with GPT-5-Thinking achieving a score of only 66.4 out of 100, while the best-performing open-source model attained a mere 49.1. The study reveals a clear advantage for reasoning models over non-reasoning counterparts, and surprisingly, transferring reasoning traces from a reasoning model to a non-reasoning model significantly improves the latter's performance. This suggests that robust reasoning capabilities are crucial for tasks requiring complex multimodal understanding and generation, opening new avenues for research into integrating reasoning mechanisms within unified models. The work establishes a new standard for evaluating these models and highlights the importance of multimodal reasoning for advanced AI systems.
UEval benchmark construction and rubric scoring are crucial
Scientists developed UEval, a benchmark comprising 1,000 expert-curated questions designed to evaluate unified models capable of generating both images and text. The study sourced these questions from eight real-world tasks, encompassing diverse reasoning types such as step-by-step guides and textbook explanations. Researchers meticulously constructed this benchmark to address limitations in existing evaluation paradigms, which primarily focus on either visual question answering or text-to-image generation, neglecting the integrated multimodal reasoning required for unified generation. To overcome the challenges of evaluating open-ended multimodal outputs, the team pioneered a rubric-based scoring system, diverging from methods relying on multimodal large language models (MLLMs) for simple image quality or text accuracy assessments.
The methodology began with the manual collection of reference text and image answers for each question within the UEval dataset. Subsequently, a leading MLLM, specifically Gemini-2.5-Pro, was employed to generate initial evaluation rubrics based on the questions and corresponding references. These initial rubrics, consisting of multiple criteria, were then subjected to rigorous refinement and validation by human experts, ensuring clarity and comprehensiveness. This process culminated in a total of 10,417 validated rubric criteria, facilitating scalable and fine-grained automatic scoring of model responses.
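As a rough illustration of this rubric-drafting step, the sketch below builds a prompt from a question and its references and parses the criteria returned by the model. The `call_mllm` helper is a placeholder for whatever API wrapper is actually used (e.g. around Gemini-2.5-Pro), and the prompt wording and JSON output format are assumptions, not the paper's own.

```python
import json
from typing import Callable, List


def generate_initial_rubric(
    question: str,
    reference_text: str,
    reference_image_paths: List[str],
    call_mllm: Callable[[str, List[str]], str],   # placeholder MLLM wrapper
) -> List[dict]:
    """Draft evaluation criteria for one UEval question with a frontier MLLM.

    `call_mllm(prompt, image_paths)` stands in for an actual API call and is
    assumed to return the model's raw text; the prompt and the JSON format
    requested below are illustrative only.
    """
    prompt = (
        "You are writing a grading rubric for a multimodal answer.\n"
        f"Question: {question}\n"
        f"Reference answer (text): {reference_text}\n"
        "Reference images are attached. Return a JSON list of criteria, each "
        'of the form {"criterion": <string>, "points": <number>}, covering '
        "both the textual and the visual parts of a correct answer."
    )
    raw = call_mllm(prompt, reference_image_paths)
    # The drafted criteria are then refined and validated by human experts.
    return json.loads(raw)
```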
The team harnessed Gemini-2.5-Pro as a judge model, utilising the established rubrics to score outputs and demonstrating strong correlation with human judgements. Experiments involved evaluating nine unified models on the UEval benchmark, revealing its challenging nature for current state-of-the-art systems. GPT-5-Thinking achieved a score of 66.4 out of 100, while the highest-performing open-source model attained only 49.1. Analysis of the results indicated that reasoning-based models consistently outperformed non-reasoning models, and transferring reasoning traces from a reasoning model to a non-reasoning model significantly reduced the performance gap. This observation suggests that robust reasoning capabilities are crucial for tasks demanding complex multimodal understanding and generation, highlighting the importance of the innovative evaluation framework developed in this work.
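A minimal sketch of how rubric-based automatic scoring could then be carried out is shown below, assuming a `judge(criterion, response_text, response_images)` helper that wraps the judge model and returns whether a single criterion is satisfied. Normalising to a 0-100 scale matches the scores reported above, but the all-or-nothing aggregation per criterion is an assumption made for illustration.

```python
from typing import Callable, Dict, List


def score_response(
    rubric: List[Dict],                               # [{"criterion": ..., "points": ...}, ...]
    response_text: str,
    response_image_paths: List[str],
    judge: Callable[[str, str, List[str]], bool],     # placeholder judge-MLLM wrapper
) -> float:
    """Score one model response against its rubric, normalised to 0-100."""
    earned = total = 0.0
    for item in rubric:
        total += item["points"]
        # The judge model decides whether this single criterion is met.
        if judge(item["criterion"], response_text, response_image_paths):
            earned += item["points"]
    return 100.0 * earned / total if total else 0.0
```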
UEval benchmark reveals multimodal model limitations in complex tasks
Scientists have introduced UEval, a new benchmark designed to rigorously evaluate unified multimodal models capable of generating both images and text. The benchmark comprises 1,000 expert-curated questions sourced from eight real-world tasks, demanding outputs that integrate both visual and textual information. Researchers developed a rubric-based scoring system, utilising a multimodal large language model (MLLM) to initially generate evaluation criteria, subsequently refined and validated by human experts, resulting in a total of 10,417 validated rubric criteria for scalable and fine-grained automatic scoring. Experiments revealed that current unified models face significant challenges on UEval, with GPT-5-Thinking achieving a score of 66.4 out of 100, while the highest-performing open-source model, Emu3.5, reached only 49.1.
The team measured performance across a range of reasoning types, from step-by-step guides to textbook explanations, and observed that reasoning-based models consistently outperformed their non-reasoning counterparts. Specifically, transferring reasoning traces from a reasoning model to a non-reasoning model substantially narrowed the performance gap, underscoring the importance of reasoning for complex multimodal understanding and generation. The results also show that models struggle to maintain consistent labelling across multiple images in multi-step planning tasks, such as drawing a cat step by step: researchers recorded instances of mislabelled sub-images, including two images both tagged as step five in the cartoon cat drawing task.
Interestingly, appending reasoning traces generated by GPT-5-Thinking to prompts for non-reasoning models like GPT-5-Instant and Gemini-2.5-Flash significantly improved visual outputs, while open-source models like BAGEL showed no such improvement. Tests also show that Gemini-2.5-Pro demonstrates strong agreement with human judgements when used as a judge model to score responses against the developed rubrics. The study encompasses both closed-ended tasks, focusing on factual understanding and grounded explanations, and open-ended tasks, concentrating on step-by-step visual guides. Each sample within UEval includes a question prompt and a detailed grading rubric, enabling comprehensive evaluation of model outputs across diverse tasks ranging from scientific knowledge to specialised academic content.
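To make the reasoning-trace transfer described above concrete, the sketch below shows one plausible way to prepend a trace from a reasoning model to the prompt sent to a non-reasoning model. The exact template used in the paper is not specified, so the wording, the function name, and the example strings are all assumptions.

```python
def build_trace_transfer_prompt(question: str, reasoning_trace: str) -> str:
    """Prepend a reasoning model's trace to a non-reasoning model's prompt.

    `reasoning_trace` would come from a model such as GPT-5-Thinking; how the
    trace is extracted and formatted is assumed here for illustration only.
    """
    return (
        "Here is a step-by-step analysis of the task:\n"
        f"{reasoning_trace}\n\n"
        "Using that analysis, answer the question with both text and images, "
        "keeping labels consistent across all generated images.\n"
        f"Question: {question}"
    )


# Example usage with placeholder strings:
prompt = build_trace_transfer_prompt(
    question="Show how to draw a cartoon cat step by step.",
    reasoning_trace="Plan: sketch the head, add the ears, draw the body, ...",
)
```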
UEval findings, limitations, and future directions
Scientists have introduced UEval, a new benchmark designed to assess unified multimodal models capable of generating both images and text in a single response. This benchmark comprises 1,000 expert-curated questions drawn from eight real-world tasks, each requiring outputs that contain both visual and textual elements. Unlike previous evaluation methods, UEval employs a rubric-based scoring system, utilising multimodal large language models to generate initial evaluation criteria which are then refined by human experts, resulting in 10,417 validated criteria for scalable and detailed assessment. The findings demonstrate that current unified models find UEval challenging, with GPT-5-Thinking scoring 66.4 out of 100 and the best open-source model only 49.1.
Researchers observed that reasoning models consistently outperformed their non-reasoning counterparts, and that transferring reasoning traces from a reasoning model to a non-reasoning model improved the latter's performance, underscoring the importance of reasoning for complex multimodal understanding and generation. The authors acknowledge a limitation in relying on large language models for rubric generation, despite human validation, and suggest that future work could explore more robust automated evaluation techniques. Further research directions include developing stronger unified multimodal models and benchmarks to address the challenges highlighted by UEval and advance the field of multimodal generation.
👉 More information
🗞 UEval: A Benchmark for Unified Multimodal Generation
🧠 ArXiv: https://arxiv.org/abs/2601.22155
