Abstract
We introduce Cosmos-Predict2.5, the latest version of the Cosmos World Foundation Models (WFMs) family, specialized for simulating and predicting the future state of the world in the form of video. Cosmos-Predict2.5 is a flow-based model that unifies Text2World, Image2World, and Video2World generation in a single model and uses Cosmos-Reason1, a vision-language model (VLM) for Physical AI reasoning, as its text encoder.
With an improved data pipeline, we curated 200 million high-quality pre-training video clips and post-training data covering various Physical AI domains. We further leverage model merging and a new reinforcement learning algorithm to boost model quality during post-training. Experimental results on various benchmark datasets show that Cosmos-Predict2.5 significantly improves upon Cosmos-Predict1 in both quality and prompt alignment. Cosmos-Predict2.5 comes in two model sizes, 2B and 14B, and we show how both can be adapted to various robotics and autonomous vehicle tasks through post-training. We further extend Cosmos-Predict2.5 into a broader family with Cosmos-Transfer2.5 ControlNet models for various sim2real and real2real applications. To facilitate the development of world models for Physical AI, we make our code, model weights, and benchmarks available under the NVIDIA Open Model License at Cosmos-Predict2.5 and Cosmos-Transfer2.5.
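Cosmos-Predict2.5 is flow-based. As a hedged illustration of what that means (a toy sketch of the standard flow-matching objective, not the actual Cosmos training code; the real model replaces `v_pred` with a learned video transformer over latents):

```python
import numpy as np

# Toy flow-matching step. A flow model learns the velocity field v(x_t, t)
# that transports a noise sample x0 toward a data sample x1 along the
# straight interpolation path x_t = (1 - t) * x0 + t * x1.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)      # noise sample
x1 = rng.standard_normal(8)      # "data" sample (stand-in for video latents)
t = 0.3
x_t = (1 - t) * x0 + t * x1      # point on the interpolation path
v_target = x1 - x0               # ground-truth velocity along this path

def v_pred(x, t):
    # Hypothetical stand-in for the network; always predicts zero velocity.
    return np.zeros_like(x)

# Training minimizes the regression loss between predicted and target velocity.
loss = np.mean((v_pred(x_t, t) - v_target) ** 2)
```

At sampling time, the learned velocity field is integrated from noise (t = 0) to data (t = 1) with an ODE solver.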
Models and Capabilities
| Model Name | Capability | Input |
|---|---|---|
| **Cosmos-Predict2.5 base** | | |
| Cosmos-Predict2.5-2B/pre-trained | pre-trained base | text + image or video |
| Cosmos-Predict2.5-14B/pre-trained | pre-trained base | text + image or video |
| Cosmos-Predict2.5-2B/post-trained | post-trained base | text + image or video |
| Cosmos-Predict2.5-14B/post-trained | post-trained base | text + image or video |
| **Cosmos-Predict2.5 domain specialized** | | |
| Cosmos-Predict2.5-2B/auto/multiview | driving, 7-camera view | text + image or video |
| Cosmos-Predict2.5-2B/robot/multiview | robotic, 3-camera view | text + third-person video |
| Cosmos-Predict2.5-2B/robot/multiview-agibot | robotic, AgiBot data, 3-camera view | text + head-view video |
| Cosmos-Predict2.5-2B/robot/action-cond | robotic, action-conditioned | action |
| Cosmos-Predict2.5-2B/robot/gr00tdream-gr1 | robotic, GR00T GR1 data | text + image or video |
General World Simulation
Cosmos-Predict2.5/pre-trained: the pre-trained base model is available in two sizes, Cosmos-Predict2.5-2B and Cosmos-Predict2.5-14B.
Despite its smaller size, post-trained Cosmos-Predict2.5-2B is on par with Wan2.2-5B and Wan2.1-14B on a diverse set of prompts. We evaluate on PAI-Bench's predict task and report a Domain Score (VQA-based, covering seven physical AI domains) and a Quality Score (eight T2V/I2V metrics adapted from VBench).
PAI-Bench Text2World
| Model | Domain Score | Quality Score | Overall Score |
|---|---|---|---|
| Predict2.5-2B [pre-train] | 0.782 | 0.720 | 0.751 |
| Predict2.5-2B [post-train] | 0.804 | 0.732 | 0.768 |
| Predict2.5-14B [pre-train] | 0.791 | 0.722 | 0.757 |
| Predict2.5-14B [post-train] | 0.803 | 0.732 | 0.768 |
| Wan2.1-1.3B | 0.786 | 0.726 | 0.756 |
| Wan2.1-14B | 0.794 | 0.727 | 0.761 |
| Wan2.2-5B | 0.797 | 0.730 | 0.764 |
| Wan2.2-A14B | 0.810 | 0.728 | 0.769 |
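From the numbers in the table above, the Overall Score appears to be the arithmetic mean of the Domain and Quality Scores (our inference from the reported values, not a formula stated in the benchmark):

```python
def overall(domain, quality):
    # Overall Score as the arithmetic mean of Domain and Quality Scores,
    # rounded to three decimals to match the table's precision.
    return round((domain + quality) / 2, 3)

# Predict2.5-2B [pre-train] row from the Text2World table:
print(overall(0.782, 0.720))  # → 0.751, matching the reported Overall Score
```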
PAI-Bench Image2World
Comparison of Cosmos-Predict2.5 with Wan2.1 and Wan2.2 I2V models.
| Model | Domain Score | Quality Score | Overall Score |
|---|---|---|---|
| Predict2.5-2B [pre-train] | 0.824 | 0.775 | 0.799 |
| Predict2.5-2B [post-train] | 0.840 | 0.779 | 0.810 |
| Predict2.5-14B [pre-train] | 0.835 | 0.777 | 0.806 |
| Predict2.5-14B [post-train] | 0.838 | 0.781 | 0.810 |
| Wan2.1-14B | 0.827 | 0.768 | 0.797 |
| Wan2.2-5B | 0.834 | 0.774 | 0.804 |
| Wan2.2-A14B | 0.841 | 0.772 | 0.806 |
Autonomous Driving
Cosmos-Predict2.5-2B/auto/multiview
7-camera multiview generation conditioned on text and image or video.
Visual Metrics on Generated Multi-View Videos
We evaluate on a dataset of 1,000 multi-view clips from RQS-HQ (Ren et al., 2025), with HD maps as well as human-labeled lanes and cuboids. Predict2.5-2B/auto/multiview improves substantially over Predict1-7B-Sample-AV in FVD and FID (up to 2.8× lower) while remaining competitive with real videos in temporal Sampson error (TSE) and cross-camera Sampson error (CSE).
| Model | FVD StyleGAN ↓ | FVD I3D ↓ | FID ↓ | TSE ↓ | CSE ↓ |
|---|---|---|---|---|---|
| Predict2.5-2B/auto/multiview | 23.060 | 25.308 | 12.095 | 0.948 | 1.903 |
| Predict1-7B-Sample-AV | 63.685 | 69.613 | 25.341 | 0.930 | 2.631 |
| Real Videos (Reference) | — | — | — | 1.193 | 1.832 |
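The TSE and CSE columns above are Sampson errors across time and across cameras. As a minimal sketch, the standard first-order Sampson distance for one point correspondence under a fundamental matrix `F` looks like the following (the exact correspondence extraction and aggregation used by the benchmark are assumptions not detailed here):

```python
import numpy as np

def sampson_error(F, x1, x2):
    """First-order Sampson distance for a correspondence x1 <-> x2.

    F  : 3x3 fundamental matrix relating the two views
    x1 : homogeneous point in view 1, shape (3,)
    x2 : homogeneous point in view 2, shape (3,)
    """
    Fx1 = F @ x1          # epipolar line of x1 in view 2
    Ftx2 = F.T @ x2       # epipolar line of x2 in view 1
    num = float(x2 @ F @ x1) ** 2
    den = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return num / den
```

A perfectly consistent correspondence satisfies the epipolar constraint and scores zero; larger values indicate geometric inconsistency between views (or between frames, for the temporal variant).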
Robotics
Multiview Generation
Cosmos-Predict2.5-2B/robot/multiview-agibot
Cosmos-Predict2.5-2B/robot/multiview-basic
Evaluation
We evaluate both our model and the baseline on 80 in-the-wild robotic manipulation videos across 16 diverse camera trajectories. Multiview post-training significantly improves cross-view synchronization (Sampson Error) with no cost to single-camera accuracy.
| Model | Camera TransErr ↓ | Camera RotErr (rad) ↓ | View Sync Sampson Error (px) ↓ |
|---|---|---|---|
| Predict2.5-2B/robot/singleview-basic | 0.08 | 0.19 | 26.61 |
| Predict2.5-2B/robot/multiview-basic | 0.08 | 0.20 | 19.73 |
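RotErr above is reported in radians; a common definition (assumed here, since the table does not spell it out) is the geodesic angle between the estimated and ground-truth camera rotations:

```python
import numpy as np

def rotation_error(R_est, R_gt):
    """Geodesic angle in radians between two 3x3 rotation matrices."""
    # trace(R_est^T R_gt) = 1 + 2*cos(theta) for the relative rotation angle theta.
    cos_theta = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
```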
Cosmos-Predict2.5-2B/robot/action-cond
Ground Truth vs Cosmos-Predict2.5-2B vs Cosmos-Predict1-7B
| Method | PSNR ↑ | SSIM ↑ | Latent L2 ↓ | FVD ↓ |
|---|---|---|---|---|
| Cosmos-Predict1-7B/robot/action-cond | 21.14 | 0.82 | 0.32 | 190 |
| Cosmos-Predict2.5-2B/robot/action-cond | 24.95 | 0.85 | 0.28 | 146 |
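PSNR in the table follows the standard definition from mean squared error; a minimal sketch, assuming pixel values normalized to [0, 1]:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two arrays in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Higher is better: halving the per-pixel error raises PSNR by about 6 dB, so the jump from 21.14 to 24.95 reflects a substantial reduction in reconstruction error.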
VLA Training
GPT denotes evaluation by GPT-4o and Qwen denotes evaluation by Qwen2.5-VL; -sft denotes a fine-tuned variant.
| Method | Object (GPT) | Object (Qwen) | Behavior (GPT) | Behavior (Qwen) | Env (GPT) | Env (Qwen) |
|---|---|---|---|---|---|---|
| Hunyuan-sft | 38.0 | 26.0 | 38.3 | 10.6 | 27.6 | 27.6 |
| CogVideoX-sft | 72.0 | 38.0 | 44.0 | 28.0 | 55.2 | 41.4 |
| WAN2.1-sft | 72.0 | 58.0 | 72.3 | 55.3 | 48.3 | 65.5 |
| Cosmos2-sft | 90.0 | 62.0 | 59.6 | 61.7 | 69.0 | 65.5 |
| Cosmos2.5-sft | 91.8 | 69.4 | 70.2 | 59.6 | 69.0 | 69.0 |
Citation
Please cite as NVIDIA et al. using the following BibTeX:
@article{nvidia2025worldsimulationvideofoundation,
title={World Simulation with Video Foundation Models for Physical AI},
author={NVIDIA and Ali, Arslan and Bai, Junjie and Bala, Maciej and Balaji, Yogesh and Blakeman, Aaron and Cai, Tiffany and Cao, Jiaxin and Cao, Tianshi and Cha, Elizabeth and Chao, Yu-Wei and Chattopadhyay, Prithvijit and Chen, Mike and Chen, Yongxin and Chen, Yu and Cheng, Shuai and Cui, Yin and Diamond, Jenna and Ding, Yifan and Fan, Jiaojiao and Fan, Linxi and Feng, Liang and Ferroni, Francesco and Fidler, Sanja and Fu, Xiao and Gao, Ruiyuan and Ge, Yunhao and Gu, Jinwei and Gupta, Aryaman and Gururani, Siddharth and El Hanafi, Imad and Hassani, Ali and Hao, Zekun and Huffman, Jacob and Jang, Joel and Jannaty, Pooya and Kautz, Jan and Lam, Grace and Li, Xuan and Li, Zhaoshuo and Liao, Maosheng and Lin, Chen-Hsuan and Lin, Tsung-Yi and Lin, Yen-Chen and Ling, Huan and Liu, Ming-Yu and Liu, Xian and Lu, Yifan and Luo, Alice and Ma, Qianli and Mao, Hanzi and Mo, Kaichun and Nah, Seungjun and Narang, Yashraj and Panaskar, Abhijeet and Pavao, Lindsey and Pham, Trung and Ramezanali, Morteza and Reda, Fitsum and Reed, Scott and Ren, Xuanchi and Shao, Haonan and Shen, Yue and Shi, Stella and Song, Shuran and Stefaniak, Bartosz and Sun, Shangkun and Tang, Shitao and Tasmeen, Sameena and Tchapmi, Lyne and Tseng, Wei-Cheng and Varghese, Jibin and Wang, Andrew Z. and Wang, Hao and Wang, Haoxiang and Wang, Heng and Wang, Ting-Chun and Wei, Fangyin and Xu, Jiashu and Yang, Dinghao and Yang, Xiaodong and Ye, Haotian and Ye, Seonghyeon and Zeng, Xiaohui and Zhang, Jing and Zhang, Qinsheng and Zheng, Kaiwen and Zhu, Andrew and Zhu, Yuke},
journal={arXiv preprint arXiv:2511.00062},
year={2025}
}