Abstract
Training Physical AI systems in digital environments requires a physical world simulator. Cosmos-Predict2 is the latest version of the Cosmos world model, designed for simulating and predicting the future state of the world as video. Cosmos-Predict2 features four models: Cosmos-Predict2-2B-Text2Image and Cosmos-Predict2-14B-Text2Image for text-to-image generation, and Cosmos-Predict2-2B-Video2World and Cosmos-Predict2-14B-Video2World for video-to-world generation.
Video-to-World Generation
Cosmos-Predict2-14B-Video2World
PBench evaluation on video-to-world generation. For all three scores, higher is better (↑).
| Model | PBench Score ↑ | Domain Score ↑ | Quality Score ↑ |
|---|---|---|---|
| LTX-Video | 74.0 | 77.2 | 70.8 |
| HunyuanVideo-I2V | 74.0 | 77.4 | 70.6 |
| CogVideoX-5B-I2V | 74.2 | 79.5 | 69.0 |
| Wan2.1-I2V-14B-720P | 75.8 | 81.9 | 69.7 |
| Cosmos-Predict1-7B-Video2World | 73.2 | 77.4 | 69.0 |
| Cosmos-Predict1-14B-Video2World | 73.3 | 77.6 | 69.0 |
| Cosmos-Predict2-2B-Video2World | 77.2 | 84.8 | 69.6 |
| Cosmos-Predict2-14B-Video2World | 77.4 | 84.9 | 69.9 |
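As a quick arithmetic check on the table above: in every row, the first numeric column equals the mean of the other two, to the one decimal place reported. This is an observation from the published numbers, not a documented definition of the metric. The Python sketch below verifies it; the tuple layout and the helper name `mean_matches` are illustrative assumptions, not part of any Cosmos or PBench tooling.

```python
# Rows copied from the table above: (model, first, second, third),
# in the same left-to-right order as the three numeric columns.
ROWS = [
    ("LTX-Video", 74.0, 77.2, 70.8),
    ("HunyuanVideo-I2V", 74.0, 77.4, 70.6),
    ("CogVideoX-5B-I2V", 74.2, 79.5, 69.0),
    ("Wan2.1-I2V-14B-720P", 75.8, 81.9, 69.7),
    ("Cosmos-Predict1-7B-Video2World", 73.2, 77.4, 69.0),
    ("Cosmos-Predict1-14B-Video2World", 73.3, 77.6, 69.0),
    ("Cosmos-Predict2-2B-Video2World", 77.2, 84.8, 69.6),
    ("Cosmos-Predict2-14B-Video2World", 77.4, 84.9, 69.9),
]


def mean_matches(first, second, third, tol=0.051):
    """Return True if `first` equals the mean of `second` and `third`.

    The tolerance covers rounding to one decimal place (up to 0.05)
    plus a small margin for floating-point error.
    """
    return abs((second + third) / 2 - first) <= tol


for model, first, second, third in ROWS:
    mean = (second + third) / 2
    print(f"{model}: mean of last two columns = {mean:.2f}, first column = {first}")

# Hypothetical summary flag: True only if every row passes the check.
all_consistent = all(mean_matches(a, b, c) for _, a, b, c in ROWS)
print("All rows consistent:", all_consistent)
```

The tolerance is deliberately loose because the table values are already rounded; a stricter check would need the unrounded scores, which are not given here.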
Text-to-Image Generation
Cosmos-Predict2-2B / Cosmos-Predict2-14B
GenEval benchmark results for text-to-image generation. Higher is better (↑).
| Model | Overall ↑ | Single Obj. ↑ | Two Obj. ↑ | Counting ↑ | Colors ↑ | Position ↑ | Color attribution ↑ |
|---|---|---|---|---|---|---|---|
| Stable Diffusion XL | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| DALL-E 3 | 0.67 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 |
| Flux 1-Dev | 0.66 | 0.98 | 0.79 | 0.73 | 0.77 | 0.22 | 0.45 |
| Cosmos-Predict2-2B | 0.83 | 1.00 | 0.99 | 0.73 | 0.89 | 0.65 | 0.73 |
| Cosmos-Predict2-14B | 0.84 | 1.00 | 0.98 | 0.79 | 0.90 | 0.64 | 0.72 |
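The Overall column in the table above is consistent with an unweighted mean of the six per-task scores. The sketch below recomputes that mean from the rounded table values. Because the inputs are already rounded to two decimals, the recomputed mean can differ slightly from the reported Overall. The dictionary layout and the name `recomputed_overall` are illustrative assumptions, not part of the GenEval codebase.

```python
# Values copied from the table above:
# model -> (reported Overall, [Single Obj., Two Obj., Counting,
#                              Colors, Position, Color attribution])
SCORES = {
    "Stable Diffusion XL": (0.55, [0.98, 0.74, 0.39, 0.85, 0.15, 0.23]),
    "DALL-E 3": (0.67, [0.96, 0.87, 0.47, 0.83, 0.43, 0.45]),
    "Flux 1-Dev": (0.66, [0.98, 0.79, 0.73, 0.77, 0.22, 0.45]),
    "Cosmos-Predict2-2B": (0.83, [1.00, 0.99, 0.73, 0.89, 0.65, 0.73]),
    "Cosmos-Predict2-14B": (0.84, [1.00, 0.98, 0.79, 0.90, 0.64, 0.72]),
}


def recomputed_overall(task_scores):
    """Unweighted mean of the per-task scores."""
    return sum(task_scores) / len(task_scores)


# Track the largest difference between reported and recomputed values.
max_gap = 0.0
for model, (overall, tasks) in SCORES.items():
    mean = recomputed_overall(tasks)
    max_gap = max(max_gap, abs(mean - overall))
    print(f"{model}: reported {overall:.2f}, recomputed {mean:.3f}")

print(f"Largest gap between reported and recomputed: {max_gap:.4f}")
```

Small gaps are expected here, since the reported Overall was presumably computed from unrounded per-task scores.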
Citation
Please cite as NVIDIA et al. using the following BibTeX:
@misc{nvidia2025cosmospredict2,
title={Cosmos-Predict2: World Simulation Model for Physical AI},
author={NVIDIA},
url={https://github.com/nvidia-cosmos/cosmos-predict2},
year={2025}
}