
PhyCritic: Multimodal Critic Models for Physical AI

¹NVIDIA · ²University of Maryland

* Work done during internship at NVIDIA  ·  † Corresponding author

PhyCritic first generates its own physics-aware reasoning and prediction, then uses it as a reference to judge a pair of responses. In this example, it infers that the oven is closed, allowing it to identify Response 1 as causally correct and Response 2 as proposing an unnecessary action. This self-referential process leads to more stable, physically correct judgments.

Abstract

With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained on general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage pipeline of reinforcement learning with verifiable rewards (RLVR): a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning on physically grounded tasks.

Method

Building Physical Critic Capacity via Two-Stage RL Finetuning

PhyCritic develops physical critic capacity through a two-stage RL training pipeline.

  • Stage 1: Physical Skill Warmup. The model is first trained on verifiable physical QA pairs with standard GRPO (Group Relative Policy Optimization) to acquire reliable perception and reasoning abilities, which form the foundation for subsequent critic training; a minimal GRPO sketch follows the pipeline figure below.
  • Stage 2: Self-Referential Critic Finetuning. The model then learns to judge pairs of responses by anchoring preference judgments to its own physical predictions, resulting in more grounded and interpretable critic behavior.
PhyCritic training pipeline. The model is first trained with GRPO on physical QA pairs to strengthen physical reasoning (left), and then finetuned via self-referential critic learning to further enhance critique capacity (right).
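
For concreteness, the group-relative advantage at the heart of GRPO can be sketched as follows, assuming a binary verifiable reward (1 if a rollout's answer matches the ground truth, 0 otherwise); the function name and toy scores are illustrative, not the exact Stage 1 setup.

import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO-style group-relative advantages: each rollout's reward for a
    single prompt is normalized by the mean and std of its sampled group,
    so no learned value function is needed."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts for one physical QA pair, scored by a binary verifier:
# correct rollouts receive positive advantage, incorrect ones negative.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [1, -1, -1, 1]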

Self-Referential Critic Finetuning

First Predict, Then Critique

During training, PhyCritic jointly performs two tasks:

  • Self-Prediction: The model first produces its own internal physical prediction for the given question.
  • Preference Judgment: Acting as a critic, the model then predicts its preference over a pair of candidate responses, explicitly grounding the judgment in its prior self-prediction.
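
This two-step format can be sketched as a tagged prompt plus a small parser; the tag names, prompt wording, and helper below are illustrative assumptions, not the exact training protocol.

import re

# Hypothetical prompt skeleton: request a self-prediction first, then a
# judgment that uses it as a reference (tag names are assumptions).
CRITIC_PROMPT = """You are evaluating two responses to a physical question.
First reason about the scene and give your own answer inside
<self_prediction>...</self_prediction>. Then, using that prediction as a
reference, output the better response as <judgment>1</judgment> or
<judgment>2</judgment>.

Question: {question}
Response 1: {response_1}
Response 2: {response_2}"""

def parse_critic_output(text):
    """Extract the self-prediction and the 1/2 preference from one generation;
    missing tags yield None, which a format reward can penalize."""
    pred = re.search(r"<self_prediction>(.*?)</self_prediction>", text, re.S)
    judge = re.search(r"<judgment>\s*([12])\s*</judgment>", text)
    return (pred.group(1).strip() if pred else None,
            int(judge.group(1)) if judge else None)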

Reward Design

  • Format Reward: Encourages structured outputs that expose both self-prediction and critique.
  • Self-Prediction Reward: Rewards correct physical reasoning by comparing the self-prediction with the ground-truth answer to the question.
  • Critic Reward: Rewards accurate preference judgments by matching the predicted preference to the golden preference label.
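
Combining the three terms, one plausible scalar reward is sketched below, reusing the hypothetical parse_critic_output helper from the previous sketch; the weights and the answers_match verifier are assumptions, not the paper's values.

def total_reward(output_text, gt_answer, gold_pref, answers_match,
                 w_fmt=0.1, w_pred=0.4, w_judge=0.5):
    """Weighted sum of format, self-prediction, and critic rewards.
    answers_match is any verifier comparing a free-form prediction
    against the ground-truth answer; the weights are illustrative."""
    self_pred, judgment = parse_critic_output(output_text)
    r_format = float(self_pred is not None and judgment is not None)
    r_pred = float(self_pred is not None and answers_match(self_pred, gt_answer))
    r_judge = float(judgment == gold_pref)
    return w_fmt * r_format + w_pred * r_pred + w_judge * r_judge
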
Critic prompt for self-referential critic finetuning. After presenting detailed evaluation criteria, it explicitly instructs the judge model to first generate its own reasoning and prediction for the given question, then use its self-prediction as a reference during the critique process.

Results

State-of-the-Art Physical Judging & Reasoning

PhyCritic achieves the best performance among open-source 7B/8B models on physical judgment, generalizes well to broader multimodal reward tasks, and enhances physical reasoning.


Improved physical and general critic. PhyCritic achieves gains of ≥12.0 points over leading general-purpose open-source VLMs and RL-finetuned physical models on PhyCritic-Bench, and generalizes to unseen subdomains (RoboFail, LingoQA) while transferring to general image-domain judging benchmarks (Multimodal RewardBench, VL-RewardBench).

Improved physical reasoning. As a policy on physical tasks, PhyCritic substantially outperforms the base Qwen2.5-VL-7B and surpasses physical reasoning–oriented VLMs, including Cosmos-Reason1 and RoboBrain-2.0, using far fewer training samples (4,058 in two-stage finetuning).

Ablation Studies and Analyses

Two-Stage RL Is Crucial

(i) The policy RL warmup builds foundational physical skills, and (ii) self-referential critic finetuning, leveraging diverse reasoning traces and high-quality preference signals, further consolidates judgment consistency, improves generalization, and reduces overfitting.

Data efficiency and training dynamics line plots.

Stage 1 improves physical reasoning (+7.5 on CosmosReason1-Bench) but brings limited judgment gains (+2.0 on PhyCritic-Bench), while Stage 2 mainly strengthens critic capability (+14.4 on PhyCritic-Bench) and further enhances reasoning performance (+2.1 on CosmosReason1-Bench).

Self-Referential Critic Drives the Gain

The full model (blue) consistently outperforms variants without the self-referential critic prompt or the self-prediction reward on both physical judgment and reasoning. By explicitly instructing the judge model to ground its evaluations in its own reasoning and problem-solving behavior, and by applying rewards that reinforce accurate self-reasoning, PhyCritic learns to produce generalizable, consistent judgments alongside faithful reasoning.

Removing the explicit self-referential process leads to clear performance drops across physical judgment, physical reasoning, and general judgment benchmarks. Retaining the self-reference prompt while removing the self-prediction reward results in a smaller yet noticeable decline.

PhyCritic for Test-time Scaling

By reliably identifying the highest-quality trajectory among multiple candidates in a pairwise knockout tournament, PhyCritic functions as a strong ensemble mechanism, substantially enhancing test-time performance on physical reasoning tasks.

PhyCritic achieves a +6.5-point improvement at N = 32 for best-of-N sampling on Qwen2.5-VL-7B-Instruct, consistently outperforming majority voting and other models used as judges.
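
A minimal sketch of this knockout selection, assuming a judge(question, a, b) callable (e.g., a thin wrapper around PhyCritic's pairwise preference) that returns 0 when the first response is preferred; the signature is illustrative.

def knockout_best_of_n(question, candidates, judge):
    """Reduce N candidate trajectories to a single winner by repeated
    pairwise comparisons; an unpaired survivor advances with a bye.
    judge(question, a, b) must return 0 if a is preferred, else 1."""
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if judge(question, a, b) == 0 else b)
        if len(pool) % 2 == 1:  # odd pool: last candidate gets a bye
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]

The tournament calls the judge exactly N − 1 times and, unlike majority voting, can rank free-form trajectories that never repeat verbatim.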

Qualitative Examples


Citation

Please cite PhyCritic if you find our paper useful.

@article{xiong2026phycritic,
  title   = {PhyCritic: Multimodal Critic Models for Physical AI},
  author  = {Tianyi Xiong and Shihao Wang and Guilin Liu and Yi Dong and Ming Li and Heng Huang and Jan Kautz and Zhiding Yu},
  journal = {arXiv preprint arXiv:2602.11124},
  year    = {2026},
}