Cosmos-Transfer1 — Cosmos Lab

Abstract

We introduce Cosmos-Transfer1, a conditional world generation model that generates world simulations from multiple spatial control inputs of various modalities, such as segmentation, depth, and edge. The spatial conditioning scheme is adaptive and customizable: different conditional inputs can be weighted differently at different spatial locations. This enables highly controllable world generation and supports a range of world-to-world transfer use cases, including Sim2Real.

We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack.

Adaptive MultiControl

Cosmos-Transfer1 contains multiple control branches to extract control information from different modality inputs such as segmentation, depth, and edge. A spatiotemporal control map weights the outputs of the control branches before channeling them back to the main generation branch, enabling the model to leverage the most relevant modalities in different regions for optimal output quality.
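The weighting step described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function name `fuse_control_branches`, the tensor shapes, and the rule of rescaling weights only where they sum to more than 1 are assumptions made for illustration. Each control branch is assumed to emit a feature tensor of shape (T, H, W, C), and each modality has a non-negative weight map of shape (T, H, W).

```python
import numpy as np


def fuse_control_branches(branch_outputs, weight_maps, normalize=True):
    """Blend per-modality control features using spatiotemporal weight maps.

    branch_outputs: dict of modality name -> array of shape (T, H, W, C)
    weight_maps:    dict of modality name -> array of shape (T, H, W)
    Returns one fused array of shape (T, H, W, C).
    """
    names = sorted(branch_outputs)
    feats = np.stack([branch_outputs[n] for n in names])   # (N, T, H, W, C)
    weights = np.stack([weight_maps[n] for n in names])    # (N, T, H, W)
    if normalize:
        # Assumption: rescale only where the weights sum to more than 1,
        # so sparse weight maps are left untouched.
        total = weights.sum(axis=0, keepdims=True)         # (1, T, H, W)
        weights = weights / np.maximum(total, 1.0)
    # Broadcast each weight over the channel axis, then sum over modalities.
    return (weights[..., None] * feats).sum(axis=0)


# Hypothetical example: rely on edge on the left half of the frame
# and on depth on the right half; segmentation is switched off.
T, H, W, C = 2, 4, 4, 3
rng = np.random.default_rng(0)
outputs = {m: rng.standard_normal((T, H, W, C))
           for m in ("depth", "edge", "segmentation")}
weights = {m: np.zeros((T, H, W)) for m in outputs}
weights["edge"][:, :, : W // 2] = 1.0
weights["depth"][:, :, W // 2:] = 1.0

fused = fuse_control_branches(outputs, weights)
# In the described design, the fused signal would then be fed back
# into the main generation branch.
```

Because the weight maps vary over both time and space, the same call covers uniform weighting (constant maps) and region-specific weighting (for example, masks derived from a segmentation).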

[Video examples, each showing control inputs and the generated output]

Example 1: HD Map, LiDAR, Output
Example 2: Input, Segmentation, Output
Examples 3-5: Depth, Edge, Segmentation, Blur, Output

Robotics Sim2Real Data Generation

Cosmos-Transfer1 enables high-fidelity sim-to-real transfer for robotics by converting simulated environments into photorealistic video, preserving the structural and motion properties needed for robot policy training.

[Video examples: five simulation inputs, each shown with its generated outputs. Example 2 has three outputs (Output 1 to Output 3); the others have four (Output 1 to Output 4).]

Autonomous Driving Data Enrichment

Cosmos-Transfer1 enriches autonomous driving datasets by applying multimodal world transfer control to generate diverse, photorealistic driving scenarios from structured simulation data, supporting scalable training for autonomous vehicle perception and planning systems.

[Video examples: five scenes, each shown with its LiDAR and HD Map inputs and the generated outputs. Example 1 has five outputs (Output 1 to Output 5); the others have four (Output 1 to Output 4).]

Citation

Please cite as NVIDIA et al. using the following BibTeX:

@article{nvidia2025cosmostransfer1conditionalworldgeneration,
  title={Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control},
  author={NVIDIA and Abu Alhaija, Hassan and Alvarez, Jose and Bala, Maciej and Cai, Tiffany and Cao, Tianshi and Cha, Liz and Chen, Joshua and Chen, Mike and Ferroni, Francesco and Fidler, Sanja and Fox, Dieter and Ge, Yunhao and Gu, Jinwei and Hassani, Ali and Isaev, Michael and Jannaty, Pooya and Lan, Shiyi and Lasser, Tobias and Ling, Huan and Liu, Ming-Yu and Liu, Xian and Lu, Yifan and Luo, Alice and Ma, Qianli and Mao, Hanzi and Ramos, Fabio and Ren, Xuanchi and Shen, Tianchang and Tang, Shitao and Wang, Ting-Chun and Wu, Jay and Xu, Jiashu and Xu, Stella and Xie, Kevin and Ye, Yuchong and Yang, Xiaodong and Zeng, Xiaohui and Zeng, Yu},
  journal={arXiv preprint arXiv:2503.14492},
  year={2025}
}