Abstract
Cosmos-Transfer2.5 is a conditional world generation model with adaptive multimodal control, built on top of Cosmos-Predict2.5, that produces high-quality world simulations conditioned on multiple control inputs. These inputs can take different modalities—including edges, blurred video, segmentation maps, and depth maps—and may originate either from a physics simulation engine such as NVIDIA IsaacSim or from real-world video data.
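For intuition, the lighter-weight control signals can be computed directly from RGB frames. The sketch below is illustrative only: the operators, thresholds, and the random stand-in frame are assumptions, not the model's actual preprocessing pipeline.

```python
import cv2
import numpy as np

def edge_control(frame: np.ndarray, lo: int = 100, hi: int = 200) -> np.ndarray:
    """Canny edge map as a single-channel control input (thresholds are illustrative)."""
    return cv2.Canny(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), lo, hi)

def blur_control(frame: np.ndarray, ksize: int = 31) -> np.ndarray:
    """Heavily blurred copy of the frame, keeping only coarse layout and color."""
    return cv2.GaussianBlur(frame, (ksize, ksize), 0)

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)  # stand-in video frame
controls = {"edge": edge_control(frame), "blur": blur_control(frame)}
```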
In terms of architecture, Cosmos-Transfer2.5 follows the general design of Cosmos-Transfer1, with one key modification. Whereas Cosmos-Transfer1 stacks its four control blocks sequentially at the start of the main branch, Cosmos-Transfer2.5 distributes them evenly, inserting one control block after every seven main-branch blocks. This preserves the total number of control blocks while integrating conditioning information more gradually throughout the network.
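A minimal sketch of this placement difference, assuming a 28-block main branch (so interleaved control blocks land after blocks 7, 14, 21, and 28) and purely additive conditioning; all class names and the conditioning mechanism here are hypothetical, not the released implementation:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for a main-branch transformer block."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, x):
        return self.net(x)

class ControlBlock(nn.Module):
    """Stand-in for a control block that mixes conditioning into features."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, x, cond):
        return self.net(x + cond)

class TransferBackbone(nn.Module):
    """Hypothetical sketch contrasting the two control-block placements."""
    def __init__(self, main_blocks: int = 28, interleaved: bool = True):
        super().__init__()
        self.blocks = nn.ModuleList(Block() for _ in range(main_blocks))
        self.controls = nn.ModuleList(ControlBlock() for _ in range(4))
        # Transfer1-style: all four control blocks act up front (after blocks 1-4);
        # Transfer2.5-style: one control block after every seventh main block.
        self.inject_at = {6, 13, 20, 27} if interleaved else {0, 1, 2, 3}

    def forward(self, x, cond):
        ctrl = iter(self.controls)
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i in self.inject_at:
                x = x + next(ctrl)(x, cond)  # additive injection, as a sketch
        return x

out = TransferBackbone()(torch.randn(2, 256), torch.randn(2, 256))
```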
At 2B parameters, Cosmos-Transfer2.5-2B is 3.5 times smaller than Cosmos-Transfer1-7B, yet achieves better prompt and physics alignment and exhibits less hallucination and error accumulation in long video generation.
Adaptive MultiControl Demonstrations
Adaptive MultiControl (Input/Output)
Comparison with Cosmos-Transfer1
Less Error Accumulation (Long Video Generation)
Policy Learning Effectiveness
Alignment Evaluation
We compare single-control models (each conditioned on one modality) against multimodal variants that use spatially uniform weights. For the multimodal cases, "Uniform Weights" denotes the full model that integrates all four control modalities, each weighted at 0.25 (a weighting sketch follows the table). Best results are in bold; second-best are underlined.
| Model | Blur SSIM ↑ | Edge F1 ↑ | Depth si-RMSE ↓ | Seg mIoU ↑ | Overall Quality ↑ |
|---|---|---|---|---|---|
| Cosmos-Transfer1-7B [Blur] | 0.89 | 0.20 | 0.66 | 0.73 | 6.56 |
| Cosmos-Transfer1-7B [Edge] | 0.77 | 0.38 | 0.85 | 0.73 | 6.76 |
| Cosmos-Transfer1-7B [Depth] | 0.67 | 0.15 | 0.76 | 0.71 | 6.89 |
| Cosmos-Transfer1-7B [Seg] | 0.62 | 0.11 | 1.13 | 0.70 | 6.02 |
| Cosmos-Transfer1-7B Uniform Weights | 0.82 | 0.26 | 0.70 | 0.74 | 9.24 |
| Cosmos-Transfer2.5-2B [Blur] | 0.90 | 0.26 | 0.59 | 0.75 | 9.75 |
| Cosmos-Transfer2.5-2B [Edge] | 0.79 | 0.49 | 0.76 | 0.75 | 8.73 |
| Cosmos-Transfer2.5-2B [Depth] | 0.71 | 0.19 | 0.70 | 0.73 | 8.85 |
| Cosmos-Transfer2.5-2B [Seg] | 0.68 | 0.14 | 1.02 | 0.71 | 8.81 |
| Cosmos-Transfer2.5-2B Uniform Weights | 0.87 | 0.41 | 0.67 | 0.76 | 9.31 |
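As promised above, the "Uniform Weights" configuration amounts to averaging the four control branches with a scalar weight of 0.25 each. A minimal sketch (hypothetical names and shapes), which also hints at how per-pixel weight maps could replace the scalars for adaptive control:

```python
import torch

def combine_controls(control_feats: dict[str, torch.Tensor],
                     weights: dict[str, float]) -> torch.Tensor:
    """Fuse per-modality control features with per-modality weights.
    Spatially uniform case: each weight is a scalar. For adaptive control,
    the scalars could be replaced by broadcastable per-pixel weight maps."""
    return sum(w * control_feats[name] for name, w in weights.items())

feats = {m: torch.randn(1, 256, 16, 16) for m in ("blur", "edge", "depth", "seg")}
fused = combine_controls(feats, {m: 0.25 for m in feats})  # uniform 0.25 each
```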
These plots show the Normalized Relative DOVER Score (NRDS) versus chunk index for auto-regressive, multi-chunk long video generation, where each chunk is 93 frames. For all four control modalities (edge/blur/depth/seg), Cosmos-Transfer2.5-2B (green curves) shows far smaller NRDS degradation along the chunk axis than Cosmos-Transfer1-7B (blue curves), indicating less hallucination and error accumulation in long videos.
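One plausible way to compute such a per-chunk score: evaluate DOVER on each 93-frame chunk and normalize by the first chunk, so 1.0 means no quality drop. The exact normalization behind the plots is not specified here, so treat this as an assumption:

```python
import numpy as np

def normalized_relative_dover(chunk_scores: np.ndarray) -> np.ndarray:
    """Scale each chunk's DOVER score by the first chunk's score, so 1.0
    means no perceptual-quality drop relative to the first chunk."""
    return chunk_scores / chunk_scores[0]

scores = np.array([0.82, 0.80, 0.74, 0.69])   # illustrative per-chunk DOVER scores
print(normalized_relative_dover(scores))       # ≈ [1.0, 0.976, 0.902, 0.841]
```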
Autonomous Driving Simulation
Comparison with Cosmos-Transfer1
Cosmos-Transfer2.5-2B/auto/multiview
Visual Metrics on Generated Multi-View Videos
We use a 1,000-clip multi-view dataset from RDS-HQ (Ren et al., 2025), with HD maps as well as human-labeled lanes and cuboids. We observe a significant improvement over Transfer1-7B-Sample-AV (up to 2.3×) in FVD/FID scores, while remaining competitive with real videos in temporal and cross-camera Sampson error.
| Model | FVD StyleGAN ↓ | FVD I3D ↓ | FID ↓ | TSE ↓ | CSE ↓ |
|---|---|---|---|---|---|
| Transfer2.5-2B/auto/multiview | 24.222 | 25.692 | 20.022 | 1.246 | 2.310 |
| Transfer1-7B-Sample-AV | 56.606 | 60.660 | 22.633 | 1.017 | 1.835 |
| Real Videos (Reference) | — | — | — | 1.193 | 1.832 |
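Both Sampson-error metrics (TSE and CSE) build on the Sampson distance for the epipolar constraint, applied temporally (consecutive frames of one camera) or across cameras (paired views at the same timestamp). A minimal sketch, assuming the fundamental matrix F and matched points are already available:

```python
import numpy as np

def sampson_error(F: np.ndarray, x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """First-order epipolar error for matched homogeneous points.
    F: (3, 3) fundamental matrix; x1, x2: (N, 3) points, x2 in the second view."""
    Fx1 = x1 @ F.T                              # epipolar lines F @ x1_i
    Ftx2 = x2 @ F                               # F^T @ x2_i
    num = np.einsum("ij,ij->i", x2, Fx1) ** 2   # (x2_i^T F x1_i)^2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / den
```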
Lane and Bounding Box Detection on Generated Multi-View Videos
To test adherence to the control signals, we measure the performance of 3D-cuboid and lane detection models on generated videos and compare against the ground-truth labels. Following the protocol described by Ren et al. (2025), we use a monocular 3D lane detector, LATR (Luo et al., 2023), to evaluate 3D lane detection, and a temporal 3D object detector, BEVFormer (Li et al., 2022), to evaluate 3D cuboid detection. We observe a substantial improvement (up to 60%) in detection metrics over Transfer1-7B-Sample-AV. A toy version of the matching step behind such metrics follows the table.
| Model | Cuboid LET-AP ↑ | Cuboid LET-APL ↑ | Cuboid LET-APH ↑ | Lane F1 ↑ | Lane x-error (far) ↓ | Lane Category Acc. ↑ |
|---|---|---|---|---|---|---|
| Transfer2.5-2B/auto/multiview | 0.394 | 0.254 | 0.383 | 0.637 | 0.487 | 0.904 |
| Transfer1-7B-Sample-AV | 0.243 | 0.154 | 0.236 | 0.604 | 0.524 | 0.899 |
| Real Videos (Reference) | 0.476 | 0.319 | 0.462 | 0.637 | 0.480 | 0.905 |
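The matching step behind such detection metrics can be illustrated with a toy greedy matcher on 3D centers; the real protocols (LET-AP, the lane F1 of Ren et al., 2025) use far more elaborate matching, so this is only a schematic:

```python
import numpy as np

def greedy_match_f1(pred: np.ndarray, gt: np.ndarray, thresh: float = 1.5) -> float:
    """Toy detection F1: greedily match predicted 3D centers (N, 3) to
    ground-truth centers (M, 3) within `thresh` meters, then score."""
    used = np.zeros(len(gt), dtype=bool)
    tp = 0
    for p in pred:
        d = np.linalg.norm(gt - p, axis=1)
        d[used] = np.inf                      # each GT matches at most once
        if len(d) and d.min() < thresh:
            used[d.argmin()] = True
            tp += 1
    precision = tp / max(len(pred), 1)
    recall = tp / max(len(gt), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

pred = np.array([[0.1, 0.0, 0.0], [5.0, 5.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.0]])
print(greedy_match_f1(pred, gt))  # 1 match over 2 preds / 1 GT -> F1 ≈ 0.667
```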
Robotics Sim2Real
Cosmos-Transfer2.5-2B
Cosmos-Transfer1-13B
Real-Robot Quantitative Evaluation
We train three robot policy models: the Base model without data augmentation, the Baseline model with standard random data augmentation, and the Proposed model with Cosmos-Transfer2.5-2B data augmentation. Experiments are conducted on a semi-humanoid robotic platform equipped with two 7-DoF Kinova Gen3 arms, each fitted with a Robotiq 2F-140 gripper. We roll out these policies on the robot under novel environment settings (e.g., changed objects, backgrounds, and lighting). As the table shows, the Proposed model achieves much higher success rates than the Base and Baseline models, demonstrating better generalization to new environments; a schematic of the augmentation step follows the table.
| Model | Base | Mangosteen | Orange Bowl | Beige Table | Black Table | Light On | Distractors | Black Cabinet | Open Drawers | Combo | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Base | 1/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 1/30 |
| Baseline | 3/3 | 0/3 | 2/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 5/30 |
| Proposed | 3/3 | 3/3 | 3/3 | 1/3 | 1/3 | 2/3 | 3/3 | 2/3 | 3/3 | 3/3 | 24/30 |
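A schematic of how such augmentation could slot into policy training; the `world_model` callable, prompt strings, and dataset layout are placeholders for illustration, not the actual pipeline:

```python
import random

def augment_rollouts(demos, world_model, variations_per_demo: int = 3):
    """Expand a demonstration set by re-rendering each episode's video under
    new appearances while keeping actions unchanged (schematic only;
    `world_model` stands in for a Cosmos-Transfer2.5 inference call)."""
    prompts = ["different table color", "new background", "dimmer lighting"]
    augmented = list(demos)
    for demo in demos:
        for _ in range(variations_per_demo):
            video = world_model(video=demo["video"],
                                prompt=random.choice(prompts))
            augmented.append({**demo, "video": video})  # same actions, new pixels
    return augmented
```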
Citation
Please cite as NVIDIA et al. using the following BibTeX:
```bibtex
@article{nvidia2025worldsimulationvideofoundation,
title={World Simulation with Video Foundation Models for Physical AI},
author={NVIDIA and Ali, Arslan and Bai, Junjie and Bala, Maciej and Balaji, Yogesh and Blakeman, Aaron and Cai, Tiffany and Cao, Jiaxin and Cao, Tianshi and Cha, Elizabeth and Chao, Yu-Wei and Chattopadhyay, Prithvijit and Chen, Mike and Chen, Yongxin and Chen, Yu and Cheng, Shuai and Cui, Yin and Diamond, Jenna and Ding, Yifan and Fan, Jiaojiao and Fan, Linxi and Feng, Liang and Ferroni, Francesco and Fidler, Sanja and Fu, Xiao and Gao, Ruiyuan and Ge, Yunhao and Gu, Jinwei and Gupta, Aryaman and Gururani, Siddharth and El Hanafi, Imad and Hassani, Ali and Hao, Zekun and Huffman, Jacob and Jang, Joel and Jannaty, Pooya and Kautz, Jan and Lam, Grace and Li, Xuan and Li, Zhaoshuo and Liao, Maosheng and Lin, Chen-Hsuan and Lin, Tsung-Yi and Lin, Yen-Chen and Ling, Huan and Liu, Ming-Yu and Liu, Xian and Lu, Yifan and Luo, Alice and Ma, Qianli and Mao, Hanzi and Mo, Kaichun and Nah, Seungjun and Narang, Yashraj and Panaskar, Abhijeet and Pavao, Lindsey and Pham, Trung and Ramezanali, Morteza and Reda, Fitsum and Reed, Scott and Ren, Xuanchi and Shao, Haonan and Shen, Yue and Shi, Stella and Song, Shuran and Stefaniak, Bartosz and Sun, Shangkun and Tang, Shitao and Tasmeen, Sameena and Tchapmi, Lyne and Tseng, Wei-Cheng and Varghese, Jibin and Wang, Andrew Z. and Wang, Hao and Wang, Haoxiang and Wang, Heng and Wang, Ting-Chun and Wei, Fangyin and Xu, Jiashu and Yang, Dinghao and Yang, Xiaodong and Ye, Haotian and Ye, Seonghyeon and Zeng, Xiaohui and Zhang, Jing and Zhang, Qinsheng and Zheng, Kaiwen and Zhu, Andrew and Zhu, Yuke},
journal={arXiv preprint arXiv:2511.00062},
year={2025}
}
```