Toronto AI Lab
InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models


1 NVIDIA
2 Shanghai Jiao Tong University
3 University of Toronto
4 Vector Institute
5 University of Southern California

* Equal Contribution

Large-scale, Dynamic, Controllable, High-fidelity 3D Gaussian Scene Generation

InfiniCube generates large-scale 3D scenes (300m \(\times\) 400m, \(\sim\)100,000 m\(^2\)) given HD maps, bounding boxes, and text prompts as controls.

InfiniCube enables the generation of fully controllable dynamic objects in the scene.

Abstract


We present InfiniCube, a scalable method for generating unbounded dynamic 3D driving scenes with high fidelity and controllability. We leverage the recent advancements in scalable 3D representation and video models to achieve large dynamic scene generation that allows flexible controls through HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned sparse-voxel-based 3D generative model to unleash its power for unbounded voxel world generation. Then, we re-purpose a video model and ground it on the voxel world through a set of carefully designed pixel-aligned guidance buffers, synthesizing a consistent appearance. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift the dynamic videos to dynamic 3D Gaussians with controllable objects.

Method Overview


3-Stage generation framework: (1) Conditioned on HD maps and bounding boxes, we first generate a 3D voxel world representation. (2) We then render the voxel world into several guidance buffers to guide video generation. (3) The generated video and voxel world are jointly fed into a feed-forward dynamic reconstruction module to obtain the final 3DGS representation.
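Below is a high-level sketch of this three-stage data flow. Every function body is a trivial stand-in, not the authors' code; all names, shapes, and signatures are illustrative assumptions.

import numpy as np


def generate_voxel_world(hd_map, boxes):                 # Stage 1 (placeholder)
    return {"semantics": np.zeros((64, 64, 16), dtype=np.int32), "boxes": boxes}


def render_guidance_buffers(voxel_world, camera_poses):  # Stage 2a (placeholder)
    return {"semantic": np.zeros((10, 90, 160)),
            "coordinate": np.zeros((10, 90, 160, 3))}


def generate_guided_video(buffers, text_prompt):         # Stage 2b (placeholder)
    return np.zeros((10, 90, 160, 3))


def reconstruct_dynamic_gaussians(voxel_world, video, boxes):  # Stage 3 (placeholder)
    return {"background": np.zeros((0, 14)), "objects": []}


def infinicube_pipeline(hd_map, boxes, text_prompt, camera_poses):
    voxel_world = generate_voxel_world(hd_map, boxes)             # Stage 1: voxel world
    buffers = render_guidance_buffers(voxel_world, camera_poses)  # Stage 2: guidance buffers
    video = generate_guided_video(buffers, text_prompt)           # Stage 2: guided video
    return reconstruct_dynamic_gaussians(voxel_world, video, boxes)  # Stage 3: dynamic 3DGS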

Stage 1: Unbounded Voxel World Generation


This step takes the HD map and the 3D bounding boxes as input and synthesizes a corresponding 3D voxel world with semantic labels. We extrapolate the voxel world in the latent space for unbounded generation.
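A minimal sketch of how latent-space extrapolation can grow an unbounded voxel world chunk by chunk, assuming a sliding-window scheme where each new chunk is conditioned on the HD map and an overlap slab of already-generated latents. The chunk sizes and the denoise_chunk stub are assumptions standing in for the actual map-conditioned sparse-voxel generative model.

import numpy as np

CHUNK, OVERLAP, LATENT_C = 64, 16, 8  # illustrative sizes, not the paper's


def denoise_chunk(latent_chunk: np.ndarray, map_chunk: np.ndarray) -> np.ndarray:
    """Placeholder for the map-conditioned 3D latent generative model."""
    return latent_chunk + 0.1 * map_chunk.mean()  # dummy update


def extrapolate_voxel_world(hd_map: np.ndarray, n_chunks_x: int) -> np.ndarray:
    """Grow the latent voxel world chunk by chunk along +x.

    Each chunk overlaps the previous one by OVERLAP latents; the known
    overlap region is kept fixed so neighbouring chunks stay consistent.
    """
    stride = CHUNK - OVERLAP
    world = np.zeros((LATENT_C, CHUNK + stride * (n_chunks_x - 1), CHUNK))
    for i in range(n_chunks_x):
        x0 = i * stride
        chunk = world[:, x0:x0 + CHUNK, :].copy()
        map_chunk = hd_map[x0:x0 + CHUNK, :CHUNK]
        generated = denoise_chunk(chunk, map_chunk)
        if i > 0:
            generated[:, :OVERLAP, :] = world[:, x0:x0 + OVERLAP, :]
        world[:, x0:x0 + CHUNK, :] = generated
    return world


if __name__ == "__main__":
    hd_map = np.random.rand(512, 64)
    latents = extrapolate_voxel_world(hd_map, n_chunks_x=4)
    print(latents.shape)  # latent voxel world extended along x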

Stage 2: Guidance Buffer Conditioned Video Generation


We use a ControlNet to generate the initial frame and train an image-to-video model conditioned on both semantic and coordinate buffers. For long videos, we employ an auto-regressive scheme that reuses the last-frame latent to seed the next generation step.
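A minimal sketch of the auto-regressive long-video rollout described above, where the last-frame latent of each clip seeds the first frame of the next. The video_model call is a placeholder for the guidance-buffer-conditioned image-to-video model; all names and tensor shapes are assumptions for illustration.

import numpy as np

FRAMES_PER_CLIP, H, W, C = 25, 32, 56, 4  # illustrative latent sizes


def video_model(first_frame_latent, semantic_buf, coord_buf, prompt):
    """Placeholder for the conditioned image-to-video model."""
    clip = np.repeat(first_frame_latent[None], FRAMES_PER_CLIP, axis=0)
    return clip + 0.01 * (semantic_buf + coord_buf)  # dummy frames


def rollout_long_video(init_latent, semantic_bufs, coord_bufs, prompt, n_clips):
    """Chain clips auto-regressively: last-frame latent -> next first frame."""
    frames, current = [], init_latent
    for k in range(n_clips):
        clip = video_model(current, semantic_bufs[k], coord_bufs[k], prompt)
        frames.append(clip if k == 0 else clip[1:])  # drop the duplicated seam frame
        current = clip[-1]                            # reuse last-frame latent
    return np.concatenate(frames, axis=0)


if __name__ == "__main__":
    init = np.random.randn(H, W, C)
    sem = np.random.randn(8, FRAMES_PER_CLIP, H, W, C)
    coo = np.random.randn(8, FRAMES_PER_CLIP, H, W, C)
    video_latents = rollout_long_video(init, sem, coo, "daytime-sunny", n_clips=8)
    print(video_latents.shape)  # ~200 latent frames, i.e. a 20 s video at 10 fps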

Video Generation (20-second videos at 10 fps)

Semantic buffers on the left, generated videos on the right.

Appearance Control

Using the text prompt, we can specify different weather and lighting appearances: daytime-sunny / daytime-foggy / nighttime-cloudless.
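Continuing the placeholder rollout sketch from Stage 2: only the text prompt changes between the three appearance settings, while the guidance buffers stay the same. rollout_long_video and its inputs refer to the hypothetical code above, not to the authors' API.

# Same geometry and guidance buffers, different appearance per prompt.
for prompt in ["daytime-sunny", "daytime-foggy", "nighttime-cloudless"]:
    video_latents = rollout_long_video(init, sem, coo, prompt, n_clips=8)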

Stage 3: Dynamic 3D Gaussians Scene Generation


Based on the generated voxel world and videos, we apply a fast feed-forward method to reconstruct the 3D Gaussian scene. The reconstruction combines a voxel branch for the static background with a per-frame pixel branch for dynamic objects.
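A minimal sketch of how the two branches can be combined: Gaussians anchored at occupied background voxels, plus per-object Gaussians lifted from pixels and expressed in each bounding-box frame so that moving the box moves the object. The parameter layout follows the usual 3DGS convention (position, scale, rotation, opacity, color); the prediction networks are omitted and replaced by dummy values, so this is an illustrative assumption rather than the authors' implementation.

import numpy as np


def background_gaussians(voxel_centers: np.ndarray) -> np.ndarray:
    """Voxel branch: one Gaussian per occupied background voxel.

    Returns an (N, 14) array: xyz(3) + scale(3) + quaternion(4) + opacity(1) + rgb(3).
    """
    n = voxel_centers.shape[0]
    params = np.zeros((n, 14))
    params[:, :3] = voxel_centers   # positions from the voxel grid
    params[:, 3:6] = 0.2            # dummy isotropic scales
    params[:, 6] = 1.0              # identity rotation (quaternion w=1)
    params[:, 10] = 0.8             # dummy opacity
    params[:, 11:] = 0.5            # dummy gray color
    return params


def object_gaussians(pixel_points: np.ndarray, box_pose: np.ndarray) -> np.ndarray:
    """Pixel branch: Gaussians for one dynamic object, stored in its box frame
    so that editing the box trajectory controls the object."""
    homog = np.concatenate([pixel_points, np.ones((len(pixel_points), 1))], axis=1)
    local = (np.linalg.inv(box_pose) @ homog.T).T[:, :3]
    params = np.zeros((len(local), 14))
    params[:, :3] = local
    params[:, 3:6], params[:, 6], params[:, 10], params[:, 11:] = 0.1, 1.0, 0.9, 0.6
    return params


if __name__ == "__main__":
    bg = background_gaussians(np.random.rand(1000, 3) * 100)
    car = object_gaussians(np.random.rand(200, 3), np.eye(4))
    print(bg.shape, car.shape)  # static background + one controllable object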


Citation



@misc{lu2024infinicube,
    title={InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models},
    author={Yifan Lu and Xuanchi Ren and Jiawei Yang and Tianchang Shen and Zhangjie Wu and Jun Gao and Yue Wang and Siheng Chen and Mike Chen and Sanja Fidler and Jiahui Huang},
    year={2024},
    eprint={2412.03934},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2412.03934},
}

Paper