Abstract

Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make Cosmos open-source and our models open-weight with permissive licenses.

Pre-trained World Foundation Model

Pre-trained WFMs are world-model generalists trained on large-scale, diverse video datasets that capture different aspects of real-world physics. These pre-trained world foundation models can then be specialized to a target Physical AI setup through post-training. The post-training datasets are usually "prompt"–video pairs collected from the target Physical AI setup, where the prompt can take the form of action commands, trajectories, instructions, etc. Because the pre-trained WFM already provides a strong foundation, the post-training dataset can be much smaller. This pre-training-then-post-training paradigm is an efficient strategy for building Physical AI systems.
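
To make the paradigm concrete, here is a minimal PyTorch-style sketch of such a post-training loop on (prompt, video) pairs. The `WorldFoundationModel` class, the tensor shapes, and the regression loss are hypothetical placeholders for illustration, not the Cosmos training code.

```python
# Minimal post-training sketch: fine-tune a pre-trained WFM on a small
# dataset of (prompt, video) pairs from a target Physical AI setup.
# `WorldFoundationModel`, the tensor shapes, and the MSE loss are
# hypothetical placeholders -- not the actual Cosmos API.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class WorldFoundationModel(nn.Module):
    """Stand-in for a pre-trained WFM: maps a prompt embedding plus
    encoded past frames to predicted future-frame features."""
    def __init__(self, prompt_dim=64, frame_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(prompt_dim + frame_dim, 512),
            nn.GELU(),
            nn.Linear(512, frame_dim),
        )

    def forward(self, prompt_emb, past_frames):
        return self.backbone(torch.cat([prompt_emb, past_frames], dim=-1))

# Small post-training dataset: prompts (e.g. embedded action commands or
# instructions) paired with observed context and target frame features.
prompts = torch.randn(128, 64)
past = torch.randn(128, 256)
target = torch.randn(128, 256)
loader = DataLoader(TensorDataset(prompts, past, target), batch_size=16)

model = WorldFoundationModel()          # in practice: load pre-trained weights
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR: adapt, don't retrain

for epoch in range(3):
    for prompt_emb, past_frames, future_frames in loader:
        pred = model(prompt_emb, past_frames)
        loss = nn.functional.mse_loss(pred, future_frames)
        opt.zero_grad()
        loss.backward()
        opt.step()
```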

Tokenizer

Cosmos-0.1-Tokenizer vs. Cosmos-Tokenizer
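
For reference, here is a minimal encode/decode sketch in the style of the NVIDIA/Cosmos-Tokenizer repository README. The module path `cosmos_tokenizer.video_lib`, the `CausalVideoTokenizer` class, the model name, and the checkpoint layout are assumptions based on that README and may differ in your install.

```python
# Sketch of discrete video tokenization with Cosmos-Tokenizer.
# Module path, class name, and checkpoint layout follow the
# NVIDIA/Cosmos-Tokenizer README; treat them as assumptions and
# check the repo for the current API.
import torch
from cosmos_tokenizer.video_lib import CausalVideoTokenizer

model_name = "Cosmos-Tokenizer-DV4x8x8"   # discrete video, 4x8x8 compression
video = torch.randn(1, 3, 9, 512, 512, device="cuda", dtype=torch.bfloat16)

encoder = CausalVideoTokenizer(
    checkpoint_enc=f"pretrained_ckpts/{model_name}/encoder.jit")
indices, codes = encoder.encode(video)    # integer token indices + codes

decoder = CausalVideoTokenizer(
    checkpoint_dec=f"pretrained_ckpts/{model_name}/decoder.jit")
reconstructed = decoder.decode(indices)   # back to a (B, 3, T, H, W) video
```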

Video Generation

Cosmos-Predict1-4B / Cosmos-Predict1-12B

Text-to-World

Cosmos-Predict1-7B / Cosmos-Predict1-14B

Video-to-World (AR)

Cosmos-Predict1-5B / Cosmos-Predict1-13B

Video-to-World (Diffusion)

Cosmos-Predict1-7B / Cosmos-Predict1-14B
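
The difference between the conditioning modes can be summarized in a short, purely hypothetical sketch; `load_wfm`, `generate`, and the shapes below are illustrative placeholders, not the released inference API.

```python
# Hypothetical inference sketch contrasting Text-to-World and
# Video-to-World conditioning. `load_wfm` and `generate` are
# illustrative placeholders, not the actual Cosmos inference code.
import torch

def load_wfm(name):
    # Placeholder loader; in practice, load the released Cosmos weights.
    return None

def generate(model, prompt=None, context_frames=None, num_frames=33):
    # Placeholder sampler: in the real system, the text prompt and/or
    # the observed context frames condition a diffusion or autoregressive
    # rollout of future frames. Output shape here is deliberately small.
    return torch.zeros(1, 3, num_frames, 64, 64)

# Text-to-World: a text prompt alone conditions the generated video.
t2w = load_wfm("Cosmos-Predict1-7B")
video = generate(t2w, prompt="A robot arm stacks red blocks on a table.")

# Video-to-World: observed frames (optionally plus text) condition the future.
v2w = load_wfm("Cosmos-Predict1-5B")
context = torch.zeros(1, 3, 9, 64, 64)    # e.g. the last 9 observed frames
future = generate(v2w, context_frames=context,
                  prompt="Continue the manipulation sequence.")
```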

Post-trained World Foundation Model

Camera Control

Instruction-based Robotics Prediction

Organize books by placing them vertically on a shelf.
Fold a green fabric item on a table.
Sort items from a box in an open-plan office.
Clean up soda cans from the floor and place them into a trash can.
Prepare coffee using a machine in a kitchen.
Grip and elevate a green object from a box on a tidy worktable.
Pick up an electronic device from a table and place it in a bin.
Pick up a red geometric object and place it near or in the blue bowl on a table.
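
Captions like the ones above, paired with the corresponding robot videos, are exactly the "prompt"–video pairs used for post-training. A hypothetical record layout is sketched below; the file names and field names are illustrative only.

```python
# Hypothetical post-training record format pairing an instruction with
# a robot video clip; paths and field names are illustrative only.
import json

records = [
    {"video": "clips/fold_fabric_0001.mp4",
     "prompt": "Fold a green fabric item on a table."},
    {"video": "clips/sort_items_0002.mp4",
     "prompt": "Sort items from a box in an open-plan office."},
]
with open("instruction_video_pairs.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```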

Action-based Video Prediction for Robotics

Predicted vs. GT
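
In the action-based setting, the conditioning signal is a low-level robot action (e.g., an end-effector command) rather than text, and the predicted frames are evaluated against ground truth (GT). A minimal hypothetical sketch of action-conditioned next-frame prediction, with all names and shapes illustrative:

```python
# Hypothetical action-conditioned next-frame prediction: at each step the
# post-trained WFM takes the current frame features and a robot action
# vector and predicts the next frame's features. Names/shapes illustrative.
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    def __init__(self, action_dim=7, frame_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + action_dim, 512),
            nn.GELU(),
            nn.Linear(512, frame_dim),
        )

    def forward(self, frame_feat, action):
        return self.net(torch.cat([frame_feat, action], dim=-1))

model = ActionConditionedPredictor()
frame_feat = torch.randn(1, 256)   # encoded current frame (e.g. tokenizer latents)
action = torch.randn(1, 7)         # e.g. a 7-DoF end-effector command
next_feat = model(frame_feat, action)
# Evaluation: decode `next_feat` to pixels and compare with the GT frame.
```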

Multiview Generation for Automotive Driving

Citation

Please cite this work as NVIDIA et al. using the following BibTeX entry:

@article{nvidia2025cosmosworldfoundationmodel,
  title={Cosmos World Foundation Model Platform for Physical AI},
  author={NVIDIA and Agarwal, Niket and Ali, Arslan and Bala, Maciej and Balaji, Yogesh and Barker, Erik and Cai, Tiffany and Chattopadhyay, Prithvijit and Chen, Yongxin and Cui, Yin and Ding, Yifan and Dworakowski, Daniel and Fan, Jiaojiao and Fenzi, Michele and Ferroni, Francesco and Fidler, Sanja and Fox, Dieter and Ge, Songwei and Ge, Yunhao and Gu, Jinwei and Gururani, Siddharth and He, Ethan and Huang, Jiahui and Huffman, Jacob and Jannaty, Pooya and Jin, Jingyi and Kim, Seung Wook and Kl\'{a}r, Gergely and Lam, Grace and Lan, Shiyi and Leal-Taixe, Laura and Li, Anqi and Li, Zhaoshuo and Lin, Chen-Hsuan and Lin, Tsung-Yi and Ling, Huan and Liu, Ming-Yu and Liu, Xian and Luo, Alice and Ma, Qianli and Mao, Hanzi and Mo, Kaichun and Mousavian, Arsalan and Nah, Seungjun and Niverty, Sriharsha and Page, David and Paschalidou, Despoina and Patel, Zeeshan and Pavao, Lindsey and Ramezanali, Morteza and Reda, Fitsum and Ren, Xiaowei and Sabavat, Vasanth Rao Naik and Schmerling, Ed and Shi, Stella and Stefaniak, Bartosz and Tang, Shitao and Tchapmi, Lyne and Tredak, Przemek and Tseng, Wei-Cheng and Varghese, Jibin and Wang, Hao and Wang, Haoxiang and Wang, Heng and Wang, Ting-Chun and Wei, Fangyin and Wei, Xinyue and Wu, Jay Zhangjie and Xu, Jiashu and Yang, Wei and Yen-Chen, Lin and Zeng, Xiaohui and Zeng, Yu and Zhang, Jing and Zhang, Qinsheng and Zhang, Yuxuan and Zhao, Qingqing and Zolkowski, Artur},
  journal={arXiv preprint arXiv:2501.03575},
  year={2025}
}