AI Verifiers Enable Label-Free Visual Reasoning

Visual reasoning, the ability to understand and interpret the relationships between objects in an image, presents a significant challenge for artificial intelligence systems. Damiano Marsili and Georgia Gkioxari of the California Institute of Technology lead a team that addresses this problem with a novel training framework that eliminates the need for extensive labelled datasets. Their approach leverages AI-driven ‘verifiers’ (large language models and vision-language models) to refine reasoning and strengthen visual grounding, effectively teaching the system to ‘check its work’. The method combines the strengths of advanced language processing with robust visual understanding, achieving superior performance on diverse spatial reasoning tasks, surpassing both open-source and proprietary models, and improving upon recent text-only visual reasoning techniques.

Visual Reasoning via Language Model Integration

Researchers developed a system that accurately answers complex questions about images requiring spatial reasoning, object recognition, and an understanding of relationships between objects. The system leverages large language models and vision-language models, employing a multi-stage verification process to ensure the quality of object detections: prompts sent to a vision-language model filter inaccurate detections and resolve duplicates, while separate, carefully crafted prompts generate challenging spatial reasoning questions and evaluate predicted answers. The generated questions demand multiple reasoning steps, clear spatial terms, and quantitative analysis, and are firmly grounded in image content; an illustrative question-generation prompt is sketched below.
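The following is a rough sketch of what such a question-generation prompt could look like, based only on the criteria described above; the wording and the placeholder name are assumptions for illustration, not the authors' exact prompt.

```python
# Illustrative question-generation prompt in the spirit of the description above;
# the wording and placeholder names are assumptions, not the authors' exact prompt.
QUESTION_GEN_PROMPT = """You are given an image and its verified object detections
with bounding boxes: {detections}.
Write one challenging spatial-reasoning question about this image that:
- requires multiple reasoning steps,
- uses clear spatial terms (left of, behind, closer to, taller than),
- calls for a quantitative comparison where possible,
- can be answered from the image content alone.
Return only the question."""
```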

A coarse filtering prompt initially removes inaccurate object detections based on label accuracy and bounding box quality. Subsequent prompts verify the accuracy of cropped images, confirming visible objects match predicted labels, and a deduplication prompt resolves remaining duplicates, including nested objects, ensuring only accurate and non-redundant detections remain. This rigorous verification process, combined with carefully engineered prompts, significantly improves the system’s ability to accurately interpret visual information.
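A minimal sketch of such a three-stage verification pass is shown below, assuming a generic `ask_vlm(image, prompt)` callable that returns a yes/no answer; the `Detection` structure, prompt wording, and IoU threshold are illustrative assumptions rather than the paper's exact implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Detection:
    label: str                          # predicted object label
    box: Tuple[int, int, int, int]      # (x1, y1, x2, y2) in pixels

def iou(a, b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def verify_detections(image, detections: List[Detection],
                      ask_vlm: Callable[[object, str], str]) -> List[Detection]:
    """Three-stage filter: coarse label/box check, crop check, deduplication."""
    kept = []
    for det in detections:
        # Stage 1: coarse filter on label accuracy and bounding-box quality.
        q1 = f"Does box {det.box} tightly enclose a {det.label}? Answer yes or no."
        if ask_vlm(image, q1).strip().lower() != "yes":
            continue
        # Stage 2: verify the cropped region actually shows the predicted label.
        crop = image.crop(det.box)      # assumes a PIL.Image-style interface
        q2 = f"Does this image show a {det.label}? Answer yes or no."
        if ask_vlm(crop, q2).strip().lower() != "yes":
            continue
        kept.append(det)
    # Stage 3: drop duplicate and nested detections of the same label.
    deduped: List[Detection] = []
    for det in kept:
        if not any(d.label == det.label and iou(d.box, det.box) > 0.5 for d in deduped):
            deduped.append(det)
    return deduped
```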

VALOR Framework Improves Reasoning and Visual Grounding

Scientists introduced VALOR, a novel training framework that enhances both reasoning and visual grounding in artificial intelligence systems tackling complex spatial reasoning tasks. Addressing limitations in existing methods, VALOR employs AI-powered verifiers (a language model and a vision-language model) to refine reasoning and strengthen visual grounding without requiring labelled data. The framework uses a reinforcement learning approach in which a language-model verifier improves reasoning through iterative refinement, while a vision-model verifier automatically identifies challenging cases. The system decomposes spatial queries into simpler subtasks, leveraging advanced language models and robust vision specialists.
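As a rough illustration of how verifier feedback can stand in for ground-truth labels during reinforcement learning, the sketch below combines an LLM-based reasoning check with a VLM-based grounding check into a single scalar reward; the function names, binary scoring interface, and equal weighting are assumptions made here for illustration, not the paper's exact reward.

```python
# A minimal sketch of a verifier-based reward, assuming two scoring callables:
# `llm_verifier(question, reasoning, answer)` judges logical consistency and
# `vlm_verifier(image, reasoning)` checks that referenced objects are visible.
# The interface, binary scores, and equal weighting are illustrative assumptions.
def verifier_reward(image, question, reasoning, answer,
                    llm_verifier, vlm_verifier,
                    w_reason: float = 0.5, w_ground: float = 0.5) -> float:
    reason_score = float(llm_verifier(question, reasoning, answer))  # 1.0 if reasoning holds, else 0.0
    ground_score = float(vlm_verifier(image, reasoning))             # 1.0 if grounding checks out, else 0.0
    return w_reason * reason_score + w_ground * ground_score
```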

A structured reward model, inspired by verifiable reward systems for mathematical reasoning, guides learning in the absence of labelled data. During operation, the system detects relevant objects and verifies successful detections before proceeding, accurately identifying specific objects and their spatial relationships, such as a sofa to the right of a coffee table. The system computes pseudo-3D heights from 2D image data, integrating object depth information to overcome limitations of pixel-wise measurements, allowing accurate comparison of object sizes and spatial relationships. This framework surpasses both open-source and proprietary models in visual reasoning tasks, and the improved visual grounding outperforms recent text-only approaches.
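The pseudo-3D height idea can be illustrated with the standard pinhole-camera relation, metric height ≈ pixel height × depth / focal length. The sketch below assumes a dense depth map, a known focal length, and median depth within the box; these are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def pseudo_3d_height(box, depth_map: np.ndarray, focal_length_px: float) -> float:
    """Pseudo-metric object height from a 2D box, a depth map, and the focal length."""
    x1, y1, x2, y2 = (int(v) for v in box)
    pixel_height = y2 - y1
    # Median depth inside the box is more robust to background pixels than the mean.
    object_depth = float(np.median(depth_map[y1:y2, x1:x2]))
    # Pinhole-camera relation: metric height ~ pixel height * depth / focal length.
    return pixel_height * object_depth / focal_length_px

# Example: decide which of two detected objects is taller, regardless of distance.
# is_sofa_taller = pseudo_3d_height(sofa_box, depth, f) > pseudo_3d_height(table_box, depth, f)
```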

VALOR Boosts Visual Reasoning Without Annotations

Researchers present VALOR, a new training framework that significantly enhances visual reasoning capabilities in large language models without requiring manual annotations. This approach leverages AI-powered verifiers (a language model and a vision-language model) to refine reasoning and strengthen visual grounding, effectively creating a self-improving system. Experiments demonstrate that VALOR surpasses existing open-source and proprietary methods on diverse spatial reasoning tasks, achieving 44.0% on the OMNI3D-BENCH benchmark and outperforming the previous best result of 38.9%.

Further results on the GQA benchmark show VALOR achieving 63.0%, a substantial improvement over the 46.9% attained by a competing approach. Importantly, VALOR achieves these results using a relatively small Qwen3-8B language model, while many comparison methods rely on larger, proprietary models such as GPT-4o. On the ROBOSPATIAL benchmark, VALOR achieves 69.5%, exceeding the 60.9% achieved by Gemini-2.0-Flash. Analysis of visual grounding capabilities reveals that increasing the number of pseudo-annotations improves performance across multiple benchmarks, with strong results on VSR and COUNTBENCHQA at 30.8k pseudo-annotations.

VALOR consistently outperforms supervised fine-tuning on reasoning-heavy benchmarks, achieving 44.0% on OMNI3D-BENCH compared to 38.3% with supervised fine-tuning, confirming the effectiveness of the verifier-based reinforcement learning approach in improving both reasoning and visual grounding.

VALOR Enhances AI Visual Reasoning Performance

Scientists present VALOR, a new training framework designed to enhance visual reasoning capabilities in artificial intelligence systems. VALOR addresses limitations in existing methods, which either require extensive labelled data or struggle with logical consistency and accurate object identification. This framework employs AI-powered ‘verifiers’, both language and vision models, to refine reasoning and strengthen visual grounding without relying on ground truth labels. Evaluations across diverse spatial reasoning tasks demonstrate that VALOR surpasses both open-source and proprietary models, and outperforms recent text-only visual reasoning methods.

The success of VALOR stems from the insight that stronger AI models are more reliable as evaluators of reasoning and visual accuracy than as direct generators of answers. The researchers release code and models to encourage further development, and suggest that integrating the verifiers into reinforcement learning training could further improve performance, while exploring guided query generation could enhance reasoning capabilities. This research represents a significant step towards more robust and reliable visual reasoning in artificial intelligence, offering a promising pathway for systems that can accurately interpret and interact with the visual world.

👉 More information
🗞 No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers
🧠 ArXiv: https://arxiv.org/abs/2512.08889

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
