InfiniCube generates large-scale 3D scenes (300m \(\times\) 400m, \(\sim\)100,000 m\(^2\)) given HD maps, bounding boxes, and text prompts as controls.
InfiniCube allows for the generation of fully controlled dynamic objects in the scene.
We present InfiniCube, a scalable method for generating unbounded dynamic 3D driving scenes with high fidelity and controllability. We leverage the recent advancements in scalable 3D representation and video models to achieve large dynamic scene generation that allows flexible controls through HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned sparse-voxel-based 3D generative model to unleash its power for unbounded voxel world generation. Then, we re-purpose a video model and ground it on the voxel world through a set of carefully designed pixel-aligned guidance buffers, synthesizing a consistent appearance. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift the dynamic videos to dynamic 3D Gaussians with controllable objects.
Three-stage generation framework: (1) Conditioned on HD maps and bounding boxes, we first generate a 3D voxel world representation. (2) We then render the voxel world into several guidance buffers that guide video generation. (3) The generated video and the voxel world are jointly fed into a feed-forward dynamic reconstruction module to obtain the final 3DGS representation.
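The following is a minimal sketch of how the three stages chain together. All function names are hypothetical placeholders for illustration, not the released InfiniCube API.

def generate_voxel_world(hd_map, boxes):
    """Stage 1 (placeholder): map- and box-conditioned generation of a
    semantically labeled sparse-voxel world."""
    raise NotImplementedError

def render_guidance_buffers(voxel_world, camera_poses):
    """Stage 2a (placeholder): render pixel-aligned semantic and coordinate
    buffers from the voxel world along the camera trajectory."""
    raise NotImplementedError

def generate_video(buffers, text_prompt):
    """Stage 2b (placeholder): video model conditioned on the guidance
    buffers and a text prompt."""
    raise NotImplementedError

def reconstruct_3dgs(voxel_world, video, boxes):
    """Stage 3 (placeholder): feed-forward lifting of the video and voxel
    world to dynamic 3D Gaussians with controllable objects."""
    raise NotImplementedError

def generate_scene(hd_map, boxes, camera_poses, text_prompt):
    voxel_world = generate_voxel_world(hd_map, boxes)
    buffers = render_guidance_buffers(voxel_world, camera_poses)
    video = generate_video(buffers, text_prompt)
    return reconstruct_3dgs(voxel_world, video, boxes)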
This step takes the HD map and the 3D bounding boxes as input and synthesizes a corresponding 3D voxel world with semantic labels. For unbounded generation, we extrapolate the voxel world in latent space.
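To make the idea of latent-space extrapolation concrete, here is a conceptual sketch in which each new latent tile is generated conditioned on the overlapping strip of the previously generated tile, a generic outpainting loop. The tile sizes, the sampler, and the exact conditioning mechanism here are assumptions for illustration and differ from the actual model.

import numpy as np

TILE, OVERLAP, CH = 32, 8, 16            # latent tile size, overlap width, channels

def sample_tile(known_strip):
    """Placeholder for the latent generative sampler; here it just draws noise
    and copies the known overlapping strip where one is provided."""
    tile = np.random.randn(TILE, TILE, CH)
    if known_strip is not None:
        tile[:, :OVERLAP] = known_strip    # keep the overlap region consistent
    return tile

def extrapolate(num_tiles):
    """Grow the latent voxel world tile by tile along one axis."""
    world = sample_tile(None)
    for _ in range(num_tiles - 1):
        strip = world[:, -OVERLAP:]        # condition on the trailing strip
        new_tile = sample_tile(strip)
        world = np.concatenate([world, new_tile[:, OVERLAP:]], axis=1)
    return world

latent_world = extrapolate(4)              # shape (32, 32 + 3*24, 16)
print(latent_world.shape)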
We use ControlNet to generate the initial-frame image and train an image-to-video model conditioned on both the semantic and the coordinate buffers. For long videos, we employ an auto-regressive scheme that reuses the last-frame latent to condition the next generation step.
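The loop below sketches this auto-regressive long-video scheme: a ControlNet-style model produces the first frame, then each video chunk is generated from the guidance buffers while being conditioned on the last-frame latent of the previous chunk. Function names and the chunk length are placeholders, not the released code.

from typing import List

def generate_initial_frame(buffers_t0, text_prompt):
    """Placeholder for the ControlNet-conditioned first-frame generator."""
    raise NotImplementedError

def generate_chunk(first_frame_latent, buffer_chunk, text_prompt):
    """Placeholder for the image-to-video model conditioned on semantic and
    coordinate buffers; returns a list of frame latents."""
    raise NotImplementedError

def generate_long_video(buffers: List, text_prompt: str, chunk_len: int = 16):
    frames = []
    latent = generate_initial_frame(buffers[0], text_prompt)
    for start in range(0, len(buffers), chunk_len):
        chunk = generate_chunk(latent, buffers[start:start + chunk_len], text_prompt)
        frames.extend(chunk)
        latent = chunk[-1]                 # reuse the last-frame latent for the next chunk
    return frames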
Semantic buffers (left); generated videos (right).
Via the text prompt, we can specify different weather and lighting appearances: daytime sunny / daytime foggy / nighttime cloudless.
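Illustrative only: the same scene conditions can be paired with different text prompts to vary weather and lighting. The prompt wording below is an assumption, not taken from the paper.

weather_prompts = [
    "a driving scene, daytime, sunny",
    "a driving scene, daytime, foggy",
    "a driving scene, nighttime, cloudless",
]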
Based on the generated voxel world and videos, we apply a fast feed-forward method to reconstruct the 3D Gaussian scene. The reconstruction uses a voxel branch for the background and a per-frame pixel branch for dynamic objects.
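A minimal sketch of this two-branch reconstruction is given below: the voxel branch predicts Gaussians anchored to the background voxels, while a per-frame pixel branch predicts Gaussians for the dynamic objects from the video frames. Names and the exact parameterization are hypothetical; the actual network heads follow the paper.

def voxel_branch(voxel_world, video):
    """Placeholder: predict static 3D Gaussian parameters (positions, scales,
    rotations, opacities, colors) for the background."""
    raise NotImplementedError

def pixel_branch(frame, boxes_at_frame):
    """Placeholder: predict per-frame Gaussians for the dynamic objects,
    expressed in each object's box frame so they can be re-posed."""
    raise NotImplementedError

def reconstruct_dynamic_3dgs(voxel_world, video, boxes_per_frame):
    background = voxel_branch(voxel_world, video)
    objects = [pixel_branch(frame, boxes)
               for frame, boxes in zip(video, boxes_per_frame)]
    return background, objects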
@misc{lu2024infinicube,
  title={InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models},
  author={Yifan Lu and Xuanchi Ren and Jiawei Yang and Tianchang Shen and Zhangjie Wu and Jun Gao and Yue Wang and Siheng Chen and Mike Chen and Sanja Fidler and Jiahui Huang},
  year={2024},
  eprint={2412.03934},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.03934},
}