The ability to explain how an artificial intelligence arrives at a conclusion remains a significant challenge, despite recent advances in multimodal large language models. Haobo Yuan, Yueyi Sun, and Yanwei Li, along with colleagues at their respective institutions, now address this opacity with a new benchmark designed to evaluate the reasoning processes of these systems. Their work introduces the Visual Reasoning Tracer (VRT) task, which demands that an AI not only identify an object but also explicitly predict the intermediate steps and objects that form its line of reasoning. By creating a large-scale dataset, VRT-80k, and a new evaluation metric, the team demonstrates that current models frequently succeed in providing the correct answer without revealing how they reached it, and that training specifically on reasoning traces substantially improves performance in this crucial area.
The research investigates visual reasoning capabilities, specifically focusing on identifying correspondences between objects and attributes within complex scenes. The approach involves analysing images to locate individuals and objects, then establishing connections based on shared characteristics, such as colour matching between clothing and nearby items. The methodology centres on sequential observation, beginning with a broad scan of the image to identify key elements, followed by a focused examination of individuals and surrounding objects. The system then correlates attributes, for example, determining which person’s attire matches the colour of branded umbrellas present in the scene, and can infer association based on safety equipment, such as a helmet indicating operation of a two-wheeled vehicle.
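To make this concrete, the sketch below shows one way such an object-level reasoning trace could be represented as data, with each intermediate step paired with the object it grounds to. The field names and the bounding-box grounding are illustrative assumptions for this sketch, not the actual VRT-80k schema.

```python
# A minimal sketch of how an object-level grounded reasoning trace could be
# represented. The field names (rationale, label, box) are illustrative
# assumptions, not the VRT-80k schema.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TraceStep:
    rationale: str                           # natural-language description of the step
    label: str                               # object the step refers to
    box: Tuple[float, float, float, float]   # grounding as an (x1, y1, x2, y2) box


@dataclass
class ReasoningTrace:
    question: str
    steps: List[TraceStep]                   # intermediate, grounded reasoning steps
    answer: str                              # final answer, supported by the steps


# Example: matching a person's clothing colour to branded umbrellas in the scene.
trace = ReasoningTrace(
    question="Whose outfit matches the colour of the branded umbrellas?",
    steps=[
        TraceStep("Locate the branded umbrellas and note their colour.",
                  "umbrella", (120.0, 40.0, 260.0, 180.0)),
        TraceStep("Scan the people in the scene and compare clothing colours.",
                  "person", (300.0, 90.0, 380.0, 310.0)),
    ],
    answer="The person in the red jacket.",
)

for i, step in enumerate(trace.steps, 1):
    print(f"Step {i}: {step.rationale} -> {step.label} at {step.box}")
```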
Spatial Relationships Reveal Interaction Potential
The research explores how spatial relationships between objects can reveal potential interactions. The system analyses the positions and arrangement of objects within a scene and uses their proximity to judge the potential for collision or interaction. For example, a large red ball positioned near a small blue box is more likely to interact with it than with a green plant located farther away. This analysis demonstrates a method for understanding how objects might influence one another based on their spatial context.
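As a toy illustration of this idea, the following sketch ranks candidate interactions by a simple proximity score between bounding boxes. The scoring function is an assumption made for illustration and is not the method described in the paper.

```python
# A minimal sketch of a proximity heuristic for ranking possible interactions.
# Scoring by centre distance normalised by object size is an illustrative
# assumption, not the approach used in the paper.
import math
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def centre(box: Box) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0


def interaction_score(a: Box, b: Box) -> float:
    """Higher score = objects are closer relative to their combined size."""
    (ax, ay), (bx, by) = centre(a), centre(b)
    dist = math.hypot(ax - bx, ay - by)
    scale = (a[2] - a[0]) + (a[3] - a[1]) + (b[2] - b[0]) + (b[3] - b[1])
    return scale / (dist + 1e-6)


objects: Dict[str, Box] = {
    "red ball":    (100.0, 100.0, 180.0, 180.0),
    "blue box":    (190.0, 110.0, 230.0, 150.0),
    "green plant": (500.0,  60.0, 560.0, 220.0),
}

# Rank candidate interaction partners for the red ball by proximity.
anchor = objects["red ball"]
ranked = sorted(
    ((name, interaction_score(anchor, box))
     for name, box in objects.items() if name != "red ball"),
    key=lambda pair: pair[1], reverse=True,
)
for name, score in ranked:
    print(f"red ball <-> {name}: score {score:.2f}")
```

Running this ranks the nearby blue box well above the distant plant, mirroring the example above.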
Visual Reasoning Benchmark Reveals Intermediate Steps
Recent research introduces a new benchmark, VRT-Bench, designed to evaluate visual reasoning capabilities in multimodal models. The study addresses a key limitation of current models, which often produce correct final answers without revealing the intermediate steps or reasoning processes used to arrive at those conclusions. VRT-Bench comprises 304 complex question-answer samples, categorized into comparison, function, location, and visual features, with 184 samples requiring multiple reasoning types. Experiments demonstrate that models trained on the accompanying VRT-80k reasoning traces exhibit substantial improvements in tracing the reasoning path.
Researchers evaluated several state-of-the-art multimodal large language models, including Gemini-2.5 Pro and Qwen3-VL, both with and without integration with the SAM-2 segmentation model. The team reports performance across four core reasoning capabilities, with results showing significant variation depending on the reasoning type. Through supervised fine-tuning, reinforcement learning, and the incorporation of segmentation loss, researchers achieved improvements in Logic Quality, Visual Quality, and mIoU metrics, demonstrating the effectiveness of the proposed benchmark and training techniques.
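Of the metrics listed, mIoU is the most standard: it averages the mask-level intersection-over-union between predicted and ground-truth object regions. The sketch below shows the generic computation on toy binary masks; the exact matching protocol used by VRT-Bench may differ.

```python
# A minimal sketch of the mask IoU / mIoU computation referenced above,
# shown on toy binary masks; this is the generic formulation, not the exact
# matching protocol used by VRT-Bench.
import numpy as np


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union for two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)


def mean_iou(pred_masks, gt_masks) -> float:
    """Average IoU over paired predicted / ground-truth object masks."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(pred_masks, gt_masks)]))


# Toy example: one 4x4 ground-truth mask and a slightly smaller prediction.
gt = [np.array([[1, 1, 0, 0],
                [1, 1, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 0]])]
pred = [np.array([[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])]
print(f"mIoU: {mean_iou(pred, gt):.2f}")  # 3 overlapping pixels / 4 in union -> 0.75
```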
Grounded Reasoning Traces Reveal Model Limitations
Current Multimodal Large Language Models often succeed in providing correct answers to visual questions, yet lack transparency in their reasoning processes. To facilitate research in this area, the team developed VRT-Bench, a benchmark for evaluating visual reasoning, and VRT-80k, a large-scale dataset for training models to perform this task. Experiments demonstrate that existing models struggle to generate these grounded reasoning traces, often failing to connect their final answers to supporting visual evidence. However, a model fine-tuned on the VRT-80k dataset successfully acquires the ability to trace the reasoning path, achieving strong performance in generating explicit, visually grounded explanations. This work establishes a foundational framework for developing more interpretable and reliable models, moving beyond simply providing correct answers to enabling verifiably grounded decision-making.
👉 More information
🗞 Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark
🧠 ArXiv: https://arxiv.org/abs/2512.05091
