Abstract

Training Physical AI systems in digital environments requires a simulator of the physical world. Cosmos-Predict2 is the latest version of the Cosmos world model, designed to simulate and predict the future state of the world as video. It comprises four models: Cosmos-Predict2-2B-Text2Image and Cosmos-Predict2-14B-Text2Image for text-to-image generation, and Cosmos-Predict2-2B-Video2World and Cosmos-Predict2-14B-Video2World for video-to-world generation.

Video-to-World Generation

Cosmos-Predict2-14B-Video2World

PBench evaluation on video-to-world generation. For all three metrics (Domain Score, Quality Score, and PBench Score), higher is better (↑).

| Model | Domain Score ↑ | Quality Score ↑ | PBench Score ↑ |
|---|---|---|---|
| LTX-Video | 74.0 | 77.2 | 70.8 |
| HunyuanVideo-I2V | 74.0 | 77.4 | 70.6 |
| CogVideoX-5B-I2V | 74.2 | 79.5 | 69.0 |
| Wan2.1-I2V-14B-720P | 75.8 | 81.9 | 69.7 |
| Cosmos-Predict1-7B-Video2World | 73.2 | 77.4 | 69.0 |
| Cosmos-Predict1-14B-Video2World | 73.3 | 77.6 | 69.0 |
| Cosmos-Predict2-2B-Video2World | 77.2 | 84.8 | 69.6 |
| Cosmos-Predict2-14B-Video2World | 77.4 | 84.9 | 69.9 |

Text-to-Image Generation

Cosmos-Predict2-2B / Cosmos-Predict2-14B

GenEval benchmark results for text-to-image generation. For every column, higher is better (↑).

| Model | Overall ↑ | Single Obj. ↑ | Two Obj. ↑ | Counting ↑ | Colors ↑ | Position ↑ | Color attribution ↑ |
|---|---|---|---|---|---|---|---|
| Stable Diffusion XL | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| DALL-E 3 | 0.67 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 |
| Flux 1-Dev | 0.66 | 0.98 | 0.79 | 0.73 | 0.77 | 0.22 | 0.45 |
| Cosmos-Predict2-2B | 0.83 | 1.00 | 0.99 | 0.73 | 0.89 | 0.65 | 0.73 |
| Cosmos-Predict2-14B | 0.84 | 1.00 | 0.98 | 0.79 | 0.90 | 0.64 | 0.72 |
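
As a reading aid for the table above, the Overall column appears to be the unweighted mean of the six per-task columns, which is the usual GenEval convention. The sketch below checks this under that assumption. The names `TASK_SCORES`, `REPORTED_OVERALL`, and `mean_task_score` are hypothetical helpers introduced for illustration, and the inputs are the rounded values from the table, so small rounding differences are expected.

```python
# Per-task GenEval scores copied from the table above, in column order:
# Single Obj., Two Obj., Counting, Colors, Position, Color attribution.
TASK_SCORES = {
    "Stable Diffusion XL": [0.98, 0.74, 0.39, 0.85, 0.15, 0.23],
    "DALL-E 3": [0.96, 0.87, 0.47, 0.83, 0.43, 0.45],
    "Flux 1-Dev": [0.98, 0.79, 0.73, 0.77, 0.22, 0.45],
    "Cosmos-Predict2-2B": [1.00, 0.99, 0.73, 0.89, 0.65, 0.73],
    "Cosmos-Predict2-14B": [1.00, 0.98, 0.79, 0.90, 0.64, 0.72],
}

# Overall column as reported in the table.
REPORTED_OVERALL = {
    "Stable Diffusion XL": 0.55,
    "DALL-E 3": 0.67,
    "Flux 1-Dev": 0.66,
    "Cosmos-Predict2-2B": 0.83,
    "Cosmos-Predict2-14B": 0.84,
}


def mean_task_score(scores):
    """Unweighted mean of the per-task scores (assumed aggregation rule)."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for model, scores in TASK_SCORES.items():
        recomputed = mean_task_score(scores)
        reported = REPORTED_OVERALL[model]
        # Inputs are rounded to two decimals, so allow a small tolerance.
        print(f"{model}: recomputed={recomputed:.3f} reported={reported:.2f}")
```

Because the per-task values are already rounded to two decimals, a recomputed mean can differ from the reported Overall by up to about 0.01.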

Citation

Please cite as NVIDIA et al. using the following BibTeX:

@misc{nvidia2025cosmospredict2,
  title={Cosmos-Predict2: World Simulation Model for Physical AI},
  author={NVIDIA},
  url={https://github.com/nvidia-cosmos/cosmos-predict2},
  year={2025}
}