PBench Overview

How to measure the capabilities of world models remains an open research question. We built the PBench benchmark to quantitatively measure the progress of world models across several Physical AI target domains: autonomous vehicle (AV) driving, robotics, industry (smart spaces), physics, human, and common sense. For each domain, we select representative videos that cover typical challenging cases for world models. We use VLMs to generate detailed captions for each video and manually correct any mistakes in the generated captions. A key frame selected from each video serves as the conditioning image from which the world model generates future frames, with the manually corrected caption as the input text prompt.
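The structure of one benchmark sample described above can be sketched as a small record. This is an illustrative layout, not the dataset's actual schema; all field names here are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class PBenchSample:
    """One PBench sample: a conditioning image, a text prompt, and its QA pairs.

    Field names are hypothetical; they mirror the construction pipeline in the text.
    """
    domain: str                 # e.g. "AV", "Robot", "Industry", ...
    conditioning_image: str     # path to the key frame selected from the source video
    text_prompt: str            # the manually corrected VLM-generated caption
    qa_pairs: list = field(default_factory=list)  # binary QA pairs used for evaluation

# Hypothetical example sample
sample = PBenchSample(
    domain="Robot",
    conditioning_image="frames/robot_0001.png",
    text_prompt="A robot arm picks up a red cube from the table.",
    qa_pairs=[("Does the robot arm make contact with the cube?", "Yes")],
)
```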

To measure the quality of generated videos, we design a set of binary questions for each sample (conditioning image and text prompt pair) in PBench. The questions are designed based on the Physical AI ontology proposed in Cosmos-Reason1. The top level of the ontology includes three dimensions: Space, Time, and Fundamental Physics. We further break down each dimension into subcategories: Relationship, Interaction, and Geometry for Space; Actions, Order, and Camera for Time; and Attributes, States, Object Permanence, and Physical Laws for Fundamental Physics. For each video sample, we prompt VLMs to generate candidate question-answer (QA) pairs for each subcategory, then manually remove low-quality QA pairs and correct any remaining mistakes.
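The ontology above maps naturally onto a dimension-to-subcategory table. The dimension and subcategory names come from the text; the dictionary layout and loop are only an illustrative sketch of per-subcategory QA generation:

```python
# Cosmos-Reason1 Physical AI ontology used to organize PBench's QA pairs.
PHYSICAL_AI_ONTOLOGY = {
    "Space": ["Relationship", "Interaction", "Geometry"],
    "Time": ["Actions", "Order", "Camera"],
    "Fundamental Physics": ["Attributes", "States", "Object Permanence", "Physical Laws"],
}

# Candidate QA pairs are generated per subcategory, so generation walks every
# (dimension, subcategory) cell of the ontology.
for dimension, subcategories in PHYSICAL_AI_ONTOLOGY.items():
    for subcategory in subcategories:
        pass  # prompt a VLM for candidate binary QA pairs in this subcategory
```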

We report two scores in PBench: a Domain Score and a Quality Score. The Domain Score measures the domain-specific capabilities of the world model via the QA pairs, and the Quality Score measures the visual quality of the generated video. Given a generated video and its corresponding QA pairs, we employ a VLM (Qwen2.5-VL-72B-Instruct) as a judge and take its accuracy over a sample's QA pairs as that sample's domain score. In total, PBench includes 1,044 samples and 5,636 QA pairs across all Physical AI domains; each sample has a conditioning image, a text prompt, and an average of 5.4 QA pairs. The final Domain Score is computed by averaging over all 1,044 samples. For the Quality Score, we adopt 8 metrics from VBench. The final PBench score is the average of the Domain Score and the Quality Score.
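The scoring described above reduces to simple averaging. Here is a minimal sketch, assuming QA judge outcomes are encoded as 1 (correct) or 0 (incorrect) and quality metrics are already normalized; the function names are hypothetical:

```python
def domain_score(qa_results):
    """Per-sample domain score: accuracy of the VLM judge over the sample's QA pairs."""
    return sum(qa_results) / len(qa_results)

def pbench_overall(per_sample_qa, quality_metrics):
    """Final PBench score: average of the Domain Score and the Quality Score."""
    # Domain Score: mean per-sample QA accuracy over all samples.
    d = sum(domain_score(r) for r in per_sample_qa) / len(per_sample_qa)
    # Quality Score: mean of the VBench-derived quality metrics.
    q = sum(quality_metrics) / len(quality_metrics)
    return (d + q) / 2
```

For instance, two samples with judge outcomes `[1, 1, 0, 1]` and `[1, 1]` give a Domain Score of 0.875; averaged with a Quality Score of 0.7, the overall score is 0.7875.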

PBench Examples

Below are some PBench examples for each domain. On the left, we show the conditioning image and the text prompt. On the right, we show the generated video from a model and the QA pairs to evaluate the domain score. You can click the arrows to navigate through the examples.

Results

We evaluate the capabilities of the Cosmos-Predict2 world model and other open-source image-to-video models on PBench. We also provide the domain score breakdown and the quality score breakdown for each model.

PBench Results

| Model | PBench Overall Score | PBench Domain Score | PBench Quality Score |
| --- | --- | --- | --- |
| LTX-Video | 74.0 | 77.2 | 70.8 |
| HunyuanVideo-I2V | 74.0 | 77.4 | 70.6 |
| CogVideoX-5B-I2V | 74.2 | 79.5 | 69.0 |
| Wan2.1-I2V-14B-720P | 75.8 | 81.9 | 69.7 |
| Cosmos-Predict1-7B-Video2World | 73.2 | 77.4 | 69.0 |
| Cosmos-Predict1-14B-Video2World | 73.3 | 77.6 | 69.0 |
| Cosmos-Predict2-2B-Video2World | 77.2 | 84.8 | 69.6 |
| Cosmos-Predict2-14B-Video2World | 77.4 | 84.9 | 69.9 |

PBench Domain Score Breakdown

| Model | AV | Robot | Industry | Physics | Human | Common Sense | Avg. (Domain Score) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LTX-Video | 52.2 | 71.2 | 85.0 | 90.3 | 75.5 | 89.1 | 77.2 |
| HunyuanVideo-I2V | 57.0 | 67.2 | 86.6 | 89.1 | 75.2 | 89.1 | 77.4 |
| CogVideoX-5B-I2V | 57.9 | 74.2 | 86.6 | 89.2 | 79.1 | 89.8 | 79.5 |
| Wan2.1-I2V-14B-720P | 64.9 | 78.5 | 88.2 | 89.0 | 79.8 | 91.2 | 81.9 |
| Cosmos-Predict1-7B-Video2World | 61.2 | 72.8 | 83.6 | 82.2 | 77.3 | 88.6 | 77.6 |
| Cosmos-Predict1-14B-Video2World | 59.9 | 72.9 | 83.6 | 82.0 | 77.3 | 88.7 | 77.4 |
| Cosmos-Predict2-2B-Video2World | 69.1 | 82.8 | 89.7 | 91.1 | 83.0 | 92.9 | 84.8 |
| Cosmos-Predict2-14B-Video2World | 69.7 | 82.9 | 89.7 | 92.5 | 81.9 | 92.5 | 84.9 |

PBench Quality Score Breakdown

| Model | I2V-Bg. | I2V-Subj. | Aesthetic | Bg. Consistency | Imaging | Motion | Overall Consistency | Subj. Consistency | Avg. (Quality Score) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LTX-Video | 69.0 | 66.0 | 54.6 | 95.3 | 67.4 | 99.3 | 20.5 | 94.4 | 70.8 |
| HunyuanVideo-I2V | 69.2 | 66.5 | 52.7 | 94.7 | 66.8 | 99.5 | 20.4 | 94.6 | 70.6 |
| CogVideoX-5B-I2V | 66.7 | 63.3 | 53.5 | 93.4 | 66.0 | 97.9 | 20.4 | 91.0 | 69.0 |
| Wan2.1-I2V-14B-720P | 67.9 | 64.7 | 52.7 | 93.3 | 69.9 | 98.1 | 20.5 | 90.2 | 69.7 |
| Cosmos-Predict1-7B-Video2World | 68.4 | 64.5 | 52.3 | 91.6 | 68.2 | 98.6 | 20.7 | 87.5 | 69.0 |
| Cosmos-Predict1-14B-Video2World | 68.4 | 64.5 | 52.3 | 91.6 | 68.2 | 98.6 | 20.6 | 87.5 | 69.0 |
| Cosmos-Predict2-2B-Video2World | 68.1 | 64.2 | 52.4 | 93.5 | 69.2 | 98.3 | 20.8 | 90.1 | 69.6 |
| Cosmos-Predict2-14B-Video2World | 68.1 | 64.4 | 53.1 | 93.6 | 70.1 | 98.4 | 20.7 | 90.5 | 69.9 |

Citation

Please cite PBench using the following BibTeX:

@misc{nvidia2025pbench,
  title={PBench: A Physical AI Benchmark for World Models},
  author={NVIDIA},
  url={https://huggingface.co/datasets/nvidia/PBench},
  year={2025}
}