Spatial Intelligence Lab

Cosmos-Drive-Dream: Scalable Synthetic Driving Data
Generation with World Foundation Models

NVIDIA
In Submission

Abstract


Collecting and annotating real-world data for safety-critical physical AI systems, such as autonomous vehicles (AVs), is time-consuming and costly. It is especially challenging to capture rare edge cases, which play a critical role in the training and testing of AV systems. To address this challenge, we introduce Cosmos-Drive-Dream, a synthetic data generation (SDG) pipeline that aims to generate challenging scenarios to facilitate downstream tasks such as perception and driving policy training. Powering this pipeline is Cosmos-Drive, a suite of models specialized from the NVIDIA Cosmos world foundation model for the driving domain, capable of controllable, high-fidelity, multi-view, and spatiotemporally consistent driving video generation. We showcase the utility of these models by applying Cosmos-Drive-Dream to scale the quantity and diversity of driving datasets with high-fidelity, challenging scenarios. Experimentally, we demonstrate that our generated data helps mitigate long-tail distribution problems and enhances generalization in downstream tasks such as 3D lane detection, 3D object detection, and driving policy learning. We open-source our model weights through NVIDIA's Cosmos platform, our pipeline toolkit, and a synthetic dataset consisting of 79,880 clips.
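To make the downstream usage concrete, below is a minimal sketch of how the released synthetic clips could be mixed with real clips when assembling a training set for a downstream task such as 3D lane detection. The directory layout, file extension, and mixing ratio are illustrative assumptions, not part of the released toolkit.

# Minimal sketch: mixing real clips with Cosmos-Drive-Dream synthetic clips
# for downstream training. Paths, extensions, and the ratio are assumptions.
import random
from pathlib import Path

def build_training_list(real_dir, synthetic_dir, synth_ratio=0.5, seed=0):
    """Return a shuffled list of clip paths, adding synthetic clips up to
    synth_ratio * (number of real clips)."""
    real_clips = sorted(Path(real_dir).glob("*.mp4"))
    synth_clips = sorted(Path(synthetic_dir).glob("*.mp4"))
    n_synth = min(len(synth_clips), int(len(real_clips) * synth_ratio))
    rng = random.Random(seed)
    mixed = list(real_clips) + rng.sample(synth_clips, n_synth)
    rng.shuffle(mixed)
    return mixed

clips = build_training_list("data/real_clips", "data/cosmos_drive_dream_clips")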


Pipeline diagram

Overview of our Cosmos-Drive-Dream pipeline. Starting from either structured labels or an in-the-wild video, we first generate a pixel-aligned HDMap condition video (Step ①). We then leverage a prompt rewriter to generate diverse prompts and synthesize single-view videos (Step ②). Each single-view video is then expanded into multiple views (Step ③). Finally, a Vision-Language Model (VLM) filter performs rejection sampling to automatically discard low-quality samples, yielding a high-quality, diverse SDG dataset (Step ④).
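As a reading aid, the four steps above can be summarized in the sketch below. All helper functions (render_hdmap_condition, rewrite_prompt, synthesize_single_view, expand_to_multiview, vlm_quality_filter) are hypothetical placeholders standing in for the pipeline stages described in the caption, not the released API.

# Hedged sketch of the four-step Cosmos-Drive-Dream flow; every helper called
# below is a hypothetical placeholder for the corresponding pipeline stage.
def generate_sdg_clips(source, base_prompt, num_variants=4):
    hdmap_video = render_hdmap_condition(source)                  # Step 1: labels or video -> HDMap condition video
    accepted = []
    for prompt in rewrite_prompt(base_prompt, n=num_variants):    # Step 2: prompt rewriting for diversity
        front_view = synthesize_single_view(hdmap_video, prompt)  # Step 2: single-view synthesis
        multi_view = expand_to_multiview(front_view)              # Step 3: single view -> consistent multi-view
        if vlm_quality_filter(multi_view):                        # Step 4: VLM rejection sampling
            accepted.append(multi_view)
    return accepted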


Model suite diagram

Cosmos-Drive's model suite. Top Left: We begin with a pretrained world foundation model (WFM) and post-train it on the RDS dataset to obtain a driving-specific WFM. This model is further post-trained into three models, which together constitute Cosmos-Drive. Top Right: the precise layout control model (Cosmos-Transfer1-7B-Sample-AV), which generates single-view driving videos from HDMap and optional LiDAR depth videos; Bottom Left: the multi-view expansion model (Cosmos-7B-Single2Multiview-Sample-AV), which synthesizes consistent multi-view videos from a single view; Bottom Right: the in-the-wild video annotation model (Cosmos-7B-Annotate-Sample-AV), which predicts HDMap and depth from in-the-wild driving videos.
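The inputs and outputs of the three models, as described in the caption, can be restated in the small descriptive listing below. This is purely a summary in code form and not an inference interface for the Cosmos platform.

# Descriptive summary of the Cosmos-Drive model suite (from the caption above);
# it records each model's inputs and output, it does not run the models.
from dataclasses import dataclass

@dataclass
class DriveModel:
    name: str
    inputs: tuple
    output: str

COSMOS_DRIVE_SUITE = [
    DriveModel("Cosmos-Transfer1-7B-Sample-AV",
               ("HDMap video", "optional LiDAR depth video"),
               "single-view driving video"),
    DriveModel("Cosmos-7B-Single2Multiview-Sample-AV",
               ("single-view driving video",),
               "consistent multi-view videos"),
    DriveModel("Cosmos-7B-Annotate-Sample-AV",
               ("in-the-wild driving video",),
               "HDMap and depth videos"),
]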

Generated Synthetic Videos


RDS-HQ Dataset (In-the-wild Prompts)

Waymo Open Dataset

RDS-HQ Dataset

In-the-wild Video Annotation Model


Cosmos-Drive-Dream Pipeline Evaluation


3D Lane Detection

Qualitative comparison videos: Baseline vs. Baseline + SDG, each shown alongside Ground Truth.

3D Object Detection

Qualitative comparison videos: Baseline vs. Baseline + SDG, each shown alongside Ground Truth.