AI Gains Enhanced Perception and Reasoning with New Critic Model

Researchers are addressing a critical gap in artificial intelligence by developing more robust and reliable critic models for evaluating complex AI systems. Tianyi Xiong, Shihao Wang, and Guilin Liu from the University of Maryland, College Park, together with Yi Dong, Ming Li, Heng Huang, Jan Kautz, and Zhiding Yu, present PhyCritic, a multimodal critic model designed specifically for physical AI tasks. The work is a significant advance because it moves beyond general visual domains to focus on areas requiring perception, causal reasoning, and planning. PhyCritic employs a two-stage reinforcement learning pipeline to strengthen both physics-oriented perception and judgement stability, measurably improving performance on multimodal judge benchmarks and bolstering perception and reasoning capabilities in grounded tasks.

This research introduces PhyCritic, a multimodal critic model optimised for physical AI through a two-stage reinforcement learning with verifiable rewards (RLVR) pipeline. Unlike traditional visual recognition tasks, physical AI demands that models interpret complex multi-view observations, understand object affordances, reason over causal dynamics, and assess how hypothetical actions unfold in real environments. This paradigm encompasses 3D perception and spatial grounding, robot-centric interaction understanding, and safety-critical domains such as autonomous driving. As these systems scale, multimodal evaluation becomes increasingly crucial for measuring whether a model's reasoning is physically correct, visually grounded, and human-aligned.

Despite progress in multimodal large language models (MLLMs), reliable multimodal critic models lag behind: existing reward or judge models predominantly target general domains such as captioning, STEM reasoning, and image question answering. Evaluation in physical AI differs fundamentally, requiring assessment of causal validity, adherence to physical configurations, and respect for temporal, spatial, and dynamical constraints. Recent work has extended multimodal judges and RL-based critic training to physical scenarios, with early efforts like DriveCritic highlighting the importance of domain-specific judgement capabilities. However, existing critics lack physics awareness, often failing to distinguish visually coherent but physically impossible reasoning, and their training data emphasises broad multimodal evaluation rather than physically grounded scenarios involving manipulation, affordance reasoning, or embodied 3D interaction. Nor do they ground their decisions in their own physical understanding of the question, which can lead to inconsistent verdicts.

This work aims to bridge that gap with a new class of multimodal critics designed for physical AI: models that evaluate multimodal responses involving physical perception, causal reasoning, and action or plan assessment in a way that is grounded, stable, and physically correct. PhyCritic is built on the principle that a strong physical critic should behave like an expert human judge, solving the problem itself before evaluating other models' responses; this motivates self-referential critic finetuning. The finetuning follows a two-stage RLVR framework. Stage one is a physical-skill warmup that applies standard group relative policy optimisation (GRPO) to a small set of physics-related question-answer pairs to strengthen core physical perception and reasoning abilities. Stage two trains the critic to generate its own internal reasoning and prediction for the question and then evaluate candidate responses with explicit reference to this self-prediction, using GRPO with both critic and self-prediction rewards to encourage stable behaviour and coherent, physics-aware reasoning.

To rigorously evaluate judgement performance in physical contexts, the researchers introduce PhyCritic-Bench, a novel benchmark constructed from diverse embodied datasets such as RoboVQA, BridgeData V2, HoloAssist, and AgiBot World. PhyCritic-Bench includes high-quality physical reasoning questions derived from Cosmos-Reason1 and paired candidate responses scored via verifiable ground truth, enabling fine-grained evaluation of reasoning correctness, visual grounding, and causal validity.
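GRPO, used in both stages, replaces a learned value function with group statistics: for each question the policy samples a group of rollouts, scores each against a verifiable answer, and normalises the rewards within the group to obtain advantages. The sketch below illustrates that computation for the stage-one warmup; the function names, exact-match scoring, and tensor shapes are assumptions made for illustration, not the paper's implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as in GRPO: each rollout's reward is
    normalised against the mean and standard deviation of its own group
    (the rollouts sampled for the same prompt), so no learned value
    function is required.

    rewards: tensor of shape (num_prompts, group_size).
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def warmup_reward(sampled_answer: str, ground_truth: str) -> float:
    """Hypothetical stage-one verifiable reward: 1.0 if the sampled answer
    matches the ground-truth answer for the physics question, else 0.0."""
    return 1.0 if sampled_answer.strip().lower() == ground_truth.strip().lower() else 0.0

# Example: 4 rollouts for one physics question, two of which are correct.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))  # correct rollouts receive positive advantage
```

Because advantages are relative within a group, rollouts that beat their peers are reinforced even when absolute rewards are sparse, which is what makes the small warmup set workable.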
The main contributions are threefold: a self-referential critic-learning framework that explicitly grounds evaluation in the model's own physical perception and reasoning, implemented as a two-stage RLVR plus GRPO pipeline; PhyCritic itself, a multimodal critic specialised for assessing perception, causal reasoning, and planning in physical AI scenarios; and a high-quality physical critic dataset spanning diverse embodied domains with paired candidate responses and verifiable preference labels. Across physical reasoning benchmarks (Cosmos-Reason1, CV-Bench, and EgoPlan-Bench2) and general reward benchmarks (VL-RewardBench and Multimodal RewardBench), PhyCritic outperforms all open-source 7B/8B baseline models. These results demonstrate that critic models benefit significantly from self-referential physical grounding, and that physical AI requires a new generation of physics-aware multimodal judge models.

The relentless pursuit of genuinely intelligent artificial systems demands more than just bigger models; it requires robust methods for evaluating their reasoning. PhyCritic represents a subtle but significant advance, moving beyond generic image assessment towards a critic tuned specifically for the complexities of physical understanding. For years, AI evaluation has relied on human judgement or on proxies easily fooled by superficial correlations, and building a critic that can discern how an AI arrives at an answer, including its causal reasoning and its grasp of physics, has proven remarkably difficult. PhyCritic's internal reference answer is a clever way to improve consistency and accuracy, addressing a key weakness in many current evaluation systems. Still, PhyCritic remains a learned model itself, susceptible to biases present in its training data, and the focus on physical AI narrows the scope: evaluating creativity, nuanced language, or ethical considerations still requires separate approaches. Looking ahead, the real potential lies in integrating such specialised critics into a broader evaluation framework, potentially yielding automated systems that not only score an AI's output but also diagnose its weaknesses, guiding further training and development. The ultimate goal is not simply to build AI that performs well, but AI that thinks well, and that requires a far more discerning eye than we currently possess.
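To make the self-referential objective concrete, here is a minimal sketch of how stage two's combined reward could be computed; the binary scoring, weights, and function signature are hypothetical, chosen to illustrate the idea rather than reproduce the paper's implementation.

```python
def stage2_reward(verdict: str,
                  preferred: str,
                  self_prediction: str,
                  ground_truth: str,
                  w_critic: float = 1.0,
                  w_self: float = 0.5) -> float:
    """Hypothetical combined stage-two reward: the critic is rewarded for
    choosing the candidate that matches the verifiable preference label
    (critic reward) and for its own internal answer being correct
    (self-prediction reward), tying verdicts to the model's own
    physical reasoning."""
    critic_reward = 1.0 if verdict == preferred else 0.0
    self_reward = 1.0 if self_prediction.strip() == ground_truth.strip() else 0.0
    return w_critic * critic_reward + w_self * self_reward

# Example: the critic prefers candidate "A" (the labelled preference)
# and its own self-prediction matches the verifiable answer.
print(stage2_reward(verdict="A", preferred="A",
                    self_prediction="the block slides left",
                    ground_truth="the block slides left"))  # 1.5
```

Coupling the two terms penalises a critic that reaches the right verdict for the wrong reason, which is the inconsistency the self-referential design is meant to remove.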

👉 More information
🗞 PhyCritic: Multimodal Critic Models for Physical AI
🧠 arXiv: https://arxiv.org/abs/2602.11124

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Quantum Computing Methods Overcome Hardware Limits with Hybrid Classical Approaches
February 13, 2026

Stronger Qubit Coupling Boosts Speed and Accuracy of Quantum Measurements
February 13, 2026

Quantum Systems Oscillate with Control Fields Exceeding Normal Frequency Limits
February 13, 2026