Abstract
With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
Method
Building Physical Critic Capacity via Two-Stage RL Finetuning
PhyCritic develops physical critic capacity through a two-stage RL training pipeline.
- Stage 1: Physical Skill Warmup. The model is first trained on verifiable physical QA pairs with standard GRPO to acquire reliable perception and reasoning abilities, which form the foundation for subsequent critic training.
- Stage 2: Self-Referential Critic Finetuning. The model then learns to judge pairs of responses by anchoring preference judgments to its own physical predictions, resulting in more grounded and interpretable critic behavior.
PhyCritic training pipeline. The model is first trained with GRPO on physical QA pairs to strengthen physical reasoning (left), and then finetuned via self-referential critic learning to further enhance critique capacity (right).
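As a rough sketch of the Stage 1 setup, the snippet below shows the group-relative advantage computation at the core of GRPO, applied to a group of sampled answers scored with a verifiable exact-match reward. The reward definition and function names are illustrative assumptions, not the paper's implementation.

import numpy as np

def verifiable_reward(prediction: str, ground_truth: str) -> float:
    """Binary Stage 1 reward: 1.0 if the predicted answer matches the
    verifiable ground truth, else 0.0 (illustrative exact-match check)."""
    return float(prediction.strip().lower() == ground_truth.strip().lower())

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages used by GRPO: normalize each sampled
    response's reward by the mean and std of its sampling group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four rollouts for one physical QA item, two of them correct.
rollouts = ["B", "A", "B", "C"]
rewards = [verifiable_reward(ans, ground_truth="B") for ans in rollouts]
print(grpo_advantages(rewards))  # positive for correct answers, negative otherwise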
Self-Referential Critic Finetuning
First Predict, Then Critique
During training, PhyCritic jointly performs two tasks:
- Self-Prediction: The model first produces its own internal physical prediction for the given question.
- Preference Judgment: Acting as a critic, the model then predicts its preference over a pair of candidate responses, explicitly grounding the judgment in its prior self-prediction.
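A minimal sketch of this two-part rollout structure is shown below, assuming illustrative <prediction> and <judgment> tags; the released prompt and output format are not reproduced here.

import re

def parse_rollout(text: str) -> dict:
    """Split a critic rollout into its self-prediction and its preference
    judgment. The <prediction>/<judgment> tags are illustrative placeholders."""
    pred = re.search(r"<prediction>(.*?)</prediction>", text, re.S)
    verdict = re.search(r"<judgment>(.*?)</judgment>", text, re.S)
    return {
        "self_prediction": pred.group(1).strip() if pred else None,
        "preference": verdict.group(1).strip() if verdict else None,  # e.g. "A" or "B"
    }

rollout = """<prediction>The red block topples once its support is removed.</prediction>
<judgment>A</judgment>"""
print(parse_rollout(rollout))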
Reward Design
- Format Reward: Encourages structured outputs that expose both self-prediction and critique.
- Self-Prediction Reward: Rewards correct physical reasoning by comparing the self-prediction with the ground-truth answer to the question.
- Critic Reward: Rewards accurate preference judgments by matching the predicted preference to the golden preference label.
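A minimal sketch of how these three terms could be combined into a single scalar reward, again assuming the illustrative tags above; the weights and matching rules are placeholder assumptions, not the paper's exact values.

import re

def critic_reward(rollout: str, gt_answer: str, gt_preference: str,
                  w_fmt: float = 0.1, w_pred: float = 0.45, w_judge: float = 0.45) -> float:
    """Combine format, self-prediction, and critic rewards for one rollout.
    Tag names, matching rules, and weights are illustrative placeholders."""
    pred = re.search(r"<prediction>(.*?)</prediction>", rollout, re.S)
    verdict = re.search(r"<judgment>(.*?)</judgment>", rollout, re.S)

    # Format reward: the rollout must expose both the self-prediction and the critique.
    r_fmt = float(pred is not None and verdict is not None)

    # Self-prediction reward: the model's own answer must match the ground-truth answer.
    r_pred = float(pred is not None and gt_answer.lower() in pred.group(1).lower())

    # Critic reward: the predicted preference must match the golden preference label.
    r_judge = float(verdict is not None and verdict.group(1).strip() == gt_preference)

    return w_fmt * r_fmt + w_pred * r_pred + w_judge * r_judge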
Critic prompt for self-referential critic finetuning. After presenting detailed evaluation criteria, it explicitly instructs the judge model to first generate its own reasoning and prediction for the given question, then use its self-prediction as a reference during the critique process.
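An illustrative prompt skeleton following that structure (not the verbatim prompt used in the paper) might look like:

CRITIC_PROMPT = """You are a judge for physical AI tasks.
Evaluation criteria: physical correctness, grounding in the visual input,
completeness of reasoning, and clarity.

Question: {question}

Step 1 (self-prediction): first reason about the question yourself and give
your own answer inside <prediction>...</prediction>.

Step 2 (critique): using your self-prediction as a reference, compare
Response A and Response B and output the better one inside <judgment>...</judgment>.

Response A: {response_a}
Response B: {response_b}
"""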
Results
State-of-the-Art Physical Judging & Reasoning
PhyCritic achieves the best performance among open-source 7B/8B models on physical judgment, generalizes well to broader multimodal reward tasks, and enhances physical reasoning.
Improved physical and general critic. PhyCritic achieves gains of ≥12.0 points over leading general-purpose open-source VLMs and RL-finetuned physical models on PhyCritic-Bench, and generalizes to unseen subdomains (RoboFail, LingoQA) while transferring to general image-domain judging benchmarks (Multimodal RewardBench, VL-RewardBench).
Improved physical reasoning.
As a policy on physical tasks, PhyCritic substantially outperforms the base Qwen2.5-VL-7B and surpasses physical reasoning–oriented VLMs, including Cosmos-Reason1 and RoboBrain-2.0, using far fewer training samples (4,058 in two-stage finetuning).
Ablation Studies and Analyses
Two-Stage RL is Crucial
(i) Policy RL warmup builds foundational physical skills, and (ii) self-referential critic finetuning—leveraging diverse reasoning traces and high-quality preference signals—further consolidates judgment consistency, improves generalization, and reduces overfitting.
Stage 1 improves physical reasoning (+7.5 on CosmosReason1-Bench) but brings limited judgment gains (+2.0 on PhyCritic-Bench), while Stage 2 mainly strengthens critic capability (+14.4 on PhyCritic-Bench) and further enhances reasoning performance (+2.1 on CosmosReason1-Bench).
Self-Referential Critic Drives the Gain
The full model (blue) consistently outperforms variants without the self-referential critic prompt or the self-prediction reward on both physical judgment and reasoning. By explicitly instructing the judge model to ground its evaluations in its own reasoning and problem-solving behavior, and by applying rewards that reinforce accurate self-reasoning, PhyCritic simultaneously learns to produce generalized, consistent judgment and faithful reasoning.
Removing the explicit self-referential process leads to clear performance drops across physical judgment, physical reasoning, and general judgment benchmarks. Retaining the self-reference prompt while removing the self-prediction reward results in a smaller yet noticeable decline.
PhyCritic for Test-time Scaling
By reliably identifying the highest-quality trajectory among multiple candidates in a pairwise knockout tournament, PhyCritic functions as a strong ensemble mechanism, substantially enhancing test-time performance on physical reasoning tasks.
PhyCritic achieves a +6.5-point improvement at N = 32 for best-of-N sampling on Qwen2.5-VL-7B-Instruct, consistently outperforming majority voting and other models used as judges.
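A minimal sketch of the knockout-style selection, assuming a hypothetical judge_pair helper that returns whichever of the two candidates the critic prefers; the bracket handling below is our own illustrative choice.

from typing import Callable, List

def knockout_select(question: str, candidates: List[str],
                    judge_pair: Callable[[str, str, str], str]) -> str:
    """Pairwise knockout tournament over N sampled trajectories: in each round,
    adjacent candidates are compared by the critic and only the preferred one
    advances, until a single trajectory remains."""
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            winner = judge_pair(question, pool[i], pool[i + 1])
            next_round.append(winner)
        if len(pool) % 2 == 1:  # an odd candidate out gets a bye to the next round
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]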
Qualitative Examples
Citation
Please cite PhyCritic if you find our paper useful.
@article{xiong2026phycritic,
title = {PhyCritic: Multimodal Critic Models for Physical AI},
author = {Tianyi Xiong and Shihao Wang and Guilin Liu and Yi Dong and Ming Li and Heng Huang and Jan Kautz and Zhiding Yu},
journal = {arXiv:2602.11124},
year = {2026},
}