PBench Overview

How to measure the capabilities of world models remains an open research question. We built the PBench benchmark to quantitatively measure the progress of world models across several Physical AI target domains: autonomous vehicle (AV) driving, robotics, industry (smart spaces), physics, human, and common sense. For each domain, we select representative videos that cover typical challenging cases for world models. We use VLMs to generate detailed captions for each video and manually correct any mistakes in the generated captions. A key frame selected from each video serves as the conditioning image from which the world model generates future frames, with the manually corrected caption as the input text prompt.
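The structure of one benchmark sample described above can be sketched as a small record. This is an illustrative layout, not the dataset's actual schema; all field names here are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class PBenchSample:
    """One PBench sample: a conditioning image, a text prompt, and its QA pairs.

    Field names are hypothetical; they mirror the construction pipeline in the text.
    """
    domain: str                 # e.g. "AV", "Robot", "Industry", ...
    conditioning_image: str     # path to the key frame selected from the source video
    text_prompt: str            # the manually corrected VLM-generated caption
    qa_pairs: list = field(default_factory=list)  # binary QA pairs used for evaluation

# Hypothetical example sample
sample = PBenchSample(
    domain="Robot",
    conditioning_image="frames/robot_0001.png",
    text_prompt="A robot arm picks up a red cube from the table.",
    qa_pairs=[("Does the robot arm make contact with the cube?", "Yes")],
)
```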

To measure the quality of generated videos, we design a set of binary questions for each sample (conditioning image and text prompt pair) in PBench. The questions are designed based on the Physical AI ontology proposed in Cosmos-Reason1. The top level of the ontology includes three dimensions: Space, Time, and Fundamental Physics. We further break down each dimension into subcategories: Relationship, Interaction, and Geometry for Space; Actions, Order, and Camera for Time; and Attributes, States, Object Permanence, and Physical Laws for Fundamental Physics. For each video sample, we prompt VLMs to generate candidate question-answer (QA) pairs for each subcategory, then manually remove low-quality QA pairs and correct any remaining mistakes.
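The ontology above maps naturally onto a dimension-to-subcategory table. The dimension and subcategory names come from the text; the dictionary layout and loop are only an illustrative sketch of per-subcategory QA generation:

```python
# Cosmos-Reason1 Physical AI ontology used to organize PBench's QA pairs.
PHYSICAL_AI_ONTOLOGY = {
    "Space": ["Relationship", "Interaction", "Geometry"],
    "Time": ["Actions", "Order", "Camera"],
    "Fundamental Physics": ["Attributes", "States", "Object Permanence", "Physical Laws"],
}

# Candidate QA pairs are generated per subcategory, so generation walks every
# (dimension, subcategory) cell of the ontology.
for dimension, subcategories in PHYSICAL_AI_ONTOLOGY.items():
    for subcategory in subcategories:
        pass  # prompt a VLM for candidate binary QA pairs in this subcategory
```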

We report two scores in PBench: a Domain Score and a Quality Score. The Domain Score measures the domain-specific capabilities of the world model via the QA pairs, and the Quality Score measures the visual quality of the generated video. Given a generated video and its corresponding QA pairs, we employ a VLM (Qwen2.5-VL-72B-Instruct) as a judge and take its accuracy over a sample's QA pairs as that sample's domain score. In total, PBench includes 1,044 samples and 5,636 QA pairs across all Physical AI domains; each sample has a conditioning image, a text prompt, and an average of 5.4 QA pairs. The final Domain Score is computed by averaging over all 1,044 samples. For the Quality Score, we adopt 8 metrics from VBench. The final PBench score is the average of the Domain Score and the Quality Score.
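The scoring described above reduces to simple averaging. Here is a minimal sketch, assuming QA judge outcomes are encoded as 1 (correct) or 0 (incorrect) and quality metrics are already normalized; the function names are hypothetical:

```python
def domain_score(qa_results):
    """Per-sample domain score: accuracy of the VLM judge over the sample's QA pairs."""
    return sum(qa_results) / len(qa_results)

def pbench_overall(per_sample_qa, quality_metrics):
    """Final PBench score: average of the Domain Score and the Quality Score."""
    # Domain Score: mean per-sample QA accuracy over all samples.
    d = sum(domain_score(r) for r in per_sample_qa) / len(per_sample_qa)
    # Quality Score: mean of the VBench-derived quality metrics.
    q = sum(quality_metrics) / len(quality_metrics)
    return (d + q) / 2
```

For instance, two samples with judge outcomes `[1, 1, 0, 1]` and `[1, 1]` give a Domain Score of 0.875; averaged with a Quality Score of 0.7, the overall score is 0.7875.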

PBench Examples

Below are some PBench examples for each domain. On the left, we show the conditioning image and the text prompt. On the right, we show the generated video from a model and the QA pairs to evaluate the domain score. You can click the arrows to navigate through the examples.

Results

We evaluate the capabilities of the Cosmos-Predict2 world model and other open-source image-to-video models on PBench. We also provide the domain score breakdown and the quality score breakdown for each model.

PBench Results

| Model | PBench Overall Score | PBench Domain Score | PBench Quality Score |
| --- | --- | --- | --- |
| LTX-Video | 74.0 | 77.2 | 70.8 |
| HunyuanVideo-I2V | 74.0 | 77.4 | 70.6 |
| CogVideoX-5B-I2V | 74.2 | 79.5 | 69.0 |
| Wan2.1-I2V-14B-720P | 75.8 | 81.9 | 69.7 |
| Cosmos-Predict1-7B-Video2World | 73.2 | 77.4 | 69.0 |
| Cosmos-Predict1-14B-Video2World | 73.3 | 77.6 | 69.0 |
| Cosmos-Predict2-2B-Video2World | 77.2 | 84.8 | 69.6 |
| Cosmos-Predict2-14B-Video2World | 77.4 | 84.9 | 69.9 |

PBench Domain Score Breakdown

| Model | AV | Robot | Industry | Physics | Human | Common Sense | Avg. (Domain Score) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LTX-Video | 52.2 | 71.2 | 85.0 | 90.3 | 75.5 | 89.1 | 77.2 |
| HunyuanVideo-I2V | 57.0 | 67.2 | 86.6 | 89.1 | 75.2 | 89.1 | 77.4 |
| CogVideoX-5B-I2V | 57.9 | 74.2 | 86.6 | 89.2 | 79.1 | 89.8 | 79.5 |
| Wan2.1-I2V-14B-720P | 64.9 | 78.5 | 88.2 | 89.0 | 79.8 | 91.2 | 81.9 |
| Cosmos-Predict1-7B-Video2World | 61.2 | 72.8 | 83.6 | 82.2 | 77.3 | 88.6 | 77.6 |
| Cosmos-Predict1-14B-Video2World | 59.9 | 72.9 | 83.6 | 82.0 | 77.3 | 88.7 | 77.4 |
| Cosmos-Predict2-2B-Video2World | 69.1 | 82.8 | 89.7 | 91.1 | 83.0 | 92.9 | 84.8 |
| Cosmos-Predict2-14B-Video2World | 69.7 | 82.9 | 89.7 | 92.5 | 81.9 | 92.5 | 84.9 |

PBench Quality Score Breakdown

| Model | I2V-Bg. | I2V-Subj. | Aesthetic | Bg. Consistency | Imaging | Motion | Overall Consistency | Subj. Consistency | Avg. (Quality Score) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LTX-Video | 69.0 | 66.0 | 54.6 | 95.3 | 67.4 | 99.3 | 20.5 | 94.4 | 70.8 |
| HunyuanVideo-I2V | 69.2 | 66.5 | 52.7 | 94.7 | 66.8 | 99.5 | 20.4 | 94.6 | 70.6 |
| CogVideoX-5B-I2V | 66.7 | 63.3 | 53.5 | 93.4 | 66.0 | 97.9 | 20.4 | 91.0 | 69.0 |
| Wan2.1-I2V-14B-720P | 67.9 | 64.7 | 52.7 | 93.3 | 69.9 | 98.1 | 20.5 | 90.2 | 69.7 |
| Cosmos-Predict1-7B-Video2World | 68.4 | 64.5 | 52.3 | 91.6 | 68.2 | 98.6 | 20.7 | 87.5 | 69.0 |
| Cosmos-Predict1-14B-Video2World | 68.4 | 64.5 | 52.3 | 91.6 | 68.2 | 98.6 | 20.6 | 87.5 | 69.0 |
| Cosmos-Predict2-2B-Video2World | 68.1 | 64.2 | 52.4 | 93.5 | 69.2 | 98.3 | 20.8 | 90.1 | 69.6 |
| Cosmos-Predict2-14B-Video2World | 68.1 | 64.4 | 53.1 | 93.6 | 70.1 | 98.4 | 20.7 | 90.5 | 69.9 |

Citation

Please cite PBench using the following BibTeX:

@misc{nvidia2025pbench,
  title={PBench: A Physical AI Benchmark for World Models},
  author={NVIDIA},
  url={https://huggingface.co/datasets/nvidia/PBench},
  year={2025}
}