AI Verifiers Enable Label-Free Visual Reasoning

Visual reasoning, the ability to understand and interpret the relationships between objects in an image, presents a significant challenge for artificial intelligence systems. Damiano Marsili and Georgia Gkioxari of the California Institute of Technology lead a team that addresses this problem with a novel training framework that eliminates the need for extensive labelled datasets. Their approach leverages AI-driven ‘verifiers’ (large language models and vision-language models) to refine reasoning and strengthen visual grounding, effectively teaching the system to ‘check its work’. The method combines the strengths of advanced language processing with robust visual understanding, achieving superior performance on diverse spatial reasoning tasks, surpassing both open-source and proprietary models, and improving upon recent text-only visual reasoning techniques.

Visual Reasoning via Language Model Integration

Researchers developed a system that accurately answers complex questions about images requiring spatial reasoning, object recognition, and an understanding of relationships between objects. The system leverages large language models and vision-language models, employing a multi-stage verification process to ensure the quality of object detections: prompts sent to a vision-language model filter inaccurate detections and resolve duplicates, while separate, carefully crafted prompts generate challenging spatial reasoning questions and evaluate predicted answers. The generated questions demand multiple reasoning steps, clear spatial terms, and quantitative analysis, and are firmly grounded in image content; an illustrative question-generation prompt is sketched below.
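The following is a rough sketch of what such a question-generation prompt could look like, based only on the criteria described above; the wording and the placeholder name are assumptions for illustration, not the authors' exact prompt.

```python
# Illustrative question-generation prompt in the spirit of the description above;
# the wording and placeholder names are assumptions, not the authors' exact prompt.
QUESTION_GEN_PROMPT = """You are given an image and its verified object detections
with bounding boxes: {detections}.
Write one challenging spatial-reasoning question about this image that:
- requires multiple reasoning steps,
- uses clear spatial terms (left of, behind, closer to, taller than),
- calls for a quantitative comparison where possible,
- can be answered from the image content alone.
Return only the question."""
```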

A coarse filtering prompt initially removes inaccurate object detections based on label accuracy and bounding box quality. Subsequent prompts verify the accuracy of cropped images, confirming visible objects match predicted labels, and a deduplication prompt resolves remaining duplicates, including nested objects, ensuring only accurate and non-redundant detections remain. This rigorous verification process, combined with carefully engineered prompts, significantly improves the system’s ability to accurately interpret visual information.
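A minimal sketch of such a three-stage verification pass is shown below, assuming a generic `ask_vlm(image, prompt)` callable that returns a yes/no answer; the `Detection` structure, prompt wording, and IoU threshold are illustrative assumptions rather than the paper's exact implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Detection:
    label: str                          # predicted object label
    box: Tuple[int, int, int, int]      # (x1, y1, x2, y2) in pixels

def iou(a, b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def verify_detections(image, detections: List[Detection],
                      ask_vlm: Callable[[object, str], str]) -> List[Detection]:
    """Three-stage filter: coarse label/box check, crop check, deduplication."""
    kept = []
    for det in detections:
        # Stage 1: coarse filter on label accuracy and bounding-box quality.
        q1 = f"Does box {det.box} tightly enclose a {det.label}? Answer yes or no."
        if ask_vlm(image, q1).strip().lower() != "yes":
            continue
        # Stage 2: verify the cropped region actually shows the predicted label.
        crop = image.crop(det.box)      # assumes a PIL.Image-style interface
        q2 = f"Does this image show a {det.label}? Answer yes or no."
        if ask_vlm(crop, q2).strip().lower() != "yes":
            continue
        kept.append(det)
    # Stage 3: drop duplicate and nested detections of the same label.
    deduped: List[Detection] = []
    for det in kept:
        if not any(d.label == det.label and iou(d.box, det.box) > 0.5 for d in deduped):
            deduped.append(det)
    return deduped
```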

VALOR Framework Improves Reasoning and Visual Grounding

Scientists introduced VALOR, a novel training framework that enhances both reasoning and visual grounding in artificial intelligence systems tackling complex spatial reasoning tasks. Addressing limitations in existing methods, VALOR employs AI-powered verifiers (a language model and a vision-language model) to refine reasoning and strengthen visual grounding without requiring labelled data. The framework uses a reinforcement learning approach in which a language-model verifier improves reasoning through iterative refinement, while a vision-model verifier automatically identifies challenging cases. The system decomposes spatial queries into simpler subtasks, leveraging advanced language models and robust vision specialists.
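As a rough illustration of how verifier feedback can stand in for ground-truth labels during reinforcement learning, the sketch below combines an LLM-based reasoning check with a VLM-based grounding check into a single scalar reward; the function names, binary scoring interface, and equal weighting are assumptions made here for illustration, not the paper's exact reward.

```python
# A minimal sketch of a verifier-based reward, assuming two scoring callables:
# `llm_verifier(question, reasoning, answer)` judges logical consistency and
# `vlm_verifier(image, reasoning)` checks that referenced objects are visible.
# The interface, binary scores, and equal weighting are illustrative assumptions.
def verifier_reward(image, question, reasoning, answer,
                    llm_verifier, vlm_verifier,
                    w_reason: float = 0.5, w_ground: float = 0.5) -> float:
    reason_score = float(llm_verifier(question, reasoning, answer))  # 1.0 if reasoning holds, else 0.0
    ground_score = float(vlm_verifier(image, reasoning))             # 1.0 if grounding checks out, else 0.0
    return w_reason * reason_score + w_ground * ground_score
```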

A structured reward model, inspired by verifiable reward systems for mathematical reasoning, guides learning in the absence of labelled data. During operation, the system detects relevant objects and verifies successful detections before proceeding, accurately identifying specific objects and their spatial relationships, such as a sofa to the right of a coffee table. The system computes pseudo-3D heights from 2D image data, integrating object depth information to overcome limitations of pixel-wise measurements, allowing accurate comparison of object sizes and spatial relationships. This framework surpasses both open-source and proprietary models in visual reasoning tasks, and the improved visual grounding outperforms recent text-only approaches.
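The pseudo-3D height idea can be illustrated with the standard pinhole-camera relation, metric height ≈ pixel height × depth / focal length. The sketch below assumes a dense depth map, a known focal length, and median depth within the box; these are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def pseudo_3d_height(box, depth_map: np.ndarray, focal_length_px: float) -> float:
    """Pseudo-metric object height from a 2D box, a depth map, and the focal length."""
    x1, y1, x2, y2 = (int(v) for v in box)
    pixel_height = y2 - y1
    # Median depth inside the box is more robust to background pixels than the mean.
    object_depth = float(np.median(depth_map[y1:y2, x1:x2]))
    # Pinhole-camera relation: metric height ~ pixel height * depth / focal length.
    return pixel_height * object_depth / focal_length_px

# Example: decide which of two detected objects is taller, regardless of distance.
# is_sofa_taller = pseudo_3d_height(sofa_box, depth, f) > pseudo_3d_height(table_box, depth, f)
```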

VALOR Boosts Visual Reasoning Without Annotations

Researchers present VALOR, a new training framework that significantly enhances visual reasoning capabilities in large language models without requiring manual annotations. This approach leverages AI-powered verifiers (a language model and a vision-language model) to refine reasoning and strengthen visual grounding, effectively creating a self-improving system. Experiments demonstrate that VALOR surpasses existing open-source and proprietary methods on diverse spatial reasoning tasks, achieving 44.0% on the OMNI3D-BENCH benchmark and outperforming the previous best result of 38.9%.

Further results on the GQA benchmark show VALOR achieving 63.0%, a substantial improvement over the 46.9% attained by a competing approach. Importantly, VALOR achieves these results using a relatively small Qwen3-8B language model, while many comparison methods rely on larger, proprietary models such as GPT-4o. On the ROBOSPATIAL benchmark, VALOR achieves 69.5%, exceeding the 60.9% achieved by Gemini-2.0-Flash. Analysis of visual grounding capabilities reveals that increasing the number of pseudo-annotations improves performance across multiple benchmarks, with strong results on VSR and COUNTBENCHQA at 30.8k pseudo-annotations.

VALOR consistently outperforms supervised fine-tuning on reasoning-heavy benchmarks, achieving 44.0% on OMNI3D-BENCH compared to 38.3% with supervised fine-tuning, confirming the effectiveness of the verifier-based reinforcement learning approach in improving both reasoning and visual grounding.

VALOR Enhances AI Visual Reasoning Performance

Scientists present VALOR, a new training framework designed to enhance visual reasoning capabilities in artificial intelligence systems. VALOR addresses limitations in existing methods, which either require extensive labelled data or struggle with logical consistency and accurate object identification. This framework employs AI-powered ‘verifiers’, both language and vision models, to refine reasoning and strengthen visual grounding without relying on ground truth labels. Evaluations across diverse spatial reasoning tasks demonstrate that VALOR surpasses both open-source and proprietary models, and outperforms recent text-only visual reasoning methods.

The success of VALOR stems from the insight that stronger AI models are more reliable as evaluators of reasoning and visual accuracy than as direct generators of answers. The researchers release code and models to encourage further development, and suggest that integrating the verifiers into reinforcement learning training could further improve performance, while exploring guided query generation could enhance reasoning capabilities. This research represents a significant step towards more robust and reliable visual reasoning in artificial intelligence, offering a promising pathway for systems that can accurately interpret and interact with the visual world.

👉 More information
🗞 No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers
🧠 ArXiv: https://arxiv.org/abs/2512.08889

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
