InfiniCube generates large-scale 3D scenes (300m \(\times\) 400m, \(\sim\)100,000 m\(^2\)) given HD maps, bounding boxes, and text prompts as controls.
InfiniCube allows for the generation of fully controlled dynamic objects in the scene.
We present InfiniCube, a scalable method for generating unbounded dynamic 3D driving scenes with high fidelity and controllability. We leverage the recent advancements in scalable 3D representation and video models to achieve large dynamic scene generation that allows flexible controls through HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned sparse-voxel-based 3D generative model to unleash its power for unbounded voxel world generation. Then, we re-purpose a video model and ground it on the voxel world through a set of carefully designed pixel-aligned guidance buffers, synthesizing a consistent appearance. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift the dynamic videos to dynamic 3D Gaussians with controllable objects.
Three-stage generation framework: (1) Conditioned on HD maps and bounding boxes, we first generate a 3D voxel world representation. (2) We then render the voxel world into several guidance buffers that guide video generation. (3) The generated video and the voxel world are jointly fed into a feed-forward dynamic reconstruction module to obtain the final 3DGS representation.
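The following is a minimal sketch of how the three stages chain together. All function names are hypothetical placeholders for illustration, not the released InfiniCube API.

def generate_voxel_world(hd_map, boxes):
    """Stage 1 (placeholder): map- and box-conditioned generation of a
    semantically labeled sparse-voxel world."""
    raise NotImplementedError

def render_guidance_buffers(voxel_world, camera_poses):
    """Stage 2a (placeholder): render pixel-aligned semantic and coordinate
    buffers from the voxel world along the camera trajectory."""
    raise NotImplementedError

def generate_video(buffers, text_prompt):
    """Stage 2b (placeholder): video model conditioned on the guidance
    buffers and a text prompt."""
    raise NotImplementedError

def reconstruct_3dgs(voxel_world, video, boxes):
    """Stage 3 (placeholder): feed-forward lifting of the video and voxel
    world to dynamic 3D Gaussians with controllable objects."""
    raise NotImplementedError

def generate_scene(hd_map, boxes, camera_poses, text_prompt):
    voxel_world = generate_voxel_world(hd_map, boxes)
    buffers = render_guidance_buffers(voxel_world, camera_poses)
    video = generate_video(buffers, text_prompt)
    return reconstruct_3dgs(voxel_world, video, boxes)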
This step takes the HD map and the 3D bounding boxes as input and synthesizes a corresponding 3D voxel world with semantic labels. For unbounded generation, we extrapolate the voxel world in latent space.
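To make the idea of latent-space extrapolation concrete, here is a conceptual sketch in which each new latent tile is generated conditioned on the overlapping strip of the previously generated tile, a generic outpainting loop. The tile sizes, the sampler, and the exact conditioning mechanism here are assumptions for illustration and differ from the actual model.

import numpy as np

TILE, OVERLAP, CH = 32, 8, 16            # latent tile size, overlap width, channels

def sample_tile(known_strip):
    """Placeholder for the latent generative sampler; here it just draws noise
    and copies the known overlapping strip where one is provided."""
    tile = np.random.randn(TILE, TILE, CH)
    if known_strip is not None:
        tile[:, :OVERLAP] = known_strip    # keep the overlap region consistent
    return tile

def extrapolate(num_tiles):
    """Grow the latent voxel world tile by tile along one axis."""
    world = sample_tile(None)
    for _ in range(num_tiles - 1):
        strip = world[:, -OVERLAP:]        # condition on the trailing strip
        new_tile = sample_tile(strip)
        world = np.concatenate([world, new_tile[:, OVERLAP:]], axis=1)
    return world

latent_world = extrapolate(4)              # shape (32, 32 + 3*24, 16)
print(latent_world.shape)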
We use ControlNet to generate the initial-frame image and train an image-to-video model conditioned on both the semantic and the coordinate buffers. For long videos, we employ an auto-regressive scheme that reuses the last-frame latent to condition the next generation step.
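The loop below sketches this auto-regressive long-video scheme: a ControlNet-style model produces the first frame, then each video chunk is generated from the guidance buffers while being conditioned on the last-frame latent of the previous chunk. Function names and the chunk length are placeholders, not the released code.

from typing import List

def generate_initial_frame(buffers_t0, text_prompt):
    """Placeholder for the ControlNet-conditioned first-frame generator."""
    raise NotImplementedError

def generate_chunk(first_frame_latent, buffer_chunk, text_prompt):
    """Placeholder for the image-to-video model conditioned on semantic and
    coordinate buffers; returns a list of frame latents."""
    raise NotImplementedError

def generate_long_video(buffers: List, text_prompt: str, chunk_len: int = 16):
    frames = []
    latent = generate_initial_frame(buffers[0], text_prompt)
    for start in range(0, len(buffers), chunk_len):
        chunk = generate_chunk(latent, buffers[start:start + chunk_len], text_prompt)
        frames.extend(chunk)
        latent = chunk[-1]                 # reuse the last-frame latent for the next chunk
    return frames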
Semantic buffers (left); generated videos (right).
Via the text prompt, we can specify different weather and lighting appearances: daytime sunny / daytime foggy / nighttime cloudless.
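Illustrative only: the same scene conditions can be paired with different text prompts to vary weather and lighting. The prompt wording below is an assumption, not taken from the paper.

weather_prompts = [
    "a driving scene, daytime, sunny",
    "a driving scene, daytime, foggy",
    "a driving scene, nighttime, cloudless",
]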
Based on the generated voxel world and videos, we apply a fast feed-forward method to reconstruct the 3D Gaussian scene. The reconstruction uses a voxel branch for the background and a per-frame pixel branch for dynamic objects.
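A minimal sketch of this two-branch reconstruction is given below: the voxel branch predicts Gaussians anchored to the background voxels, while a per-frame pixel branch predicts Gaussians for the dynamic objects from the video frames. Names and the exact parameterization are hypothetical; the actual network heads follow the paper.

def voxel_branch(voxel_world, video):
    """Placeholder: predict static 3D Gaussian parameters (positions, scales,
    rotations, opacities, colors) for the background."""
    raise NotImplementedError

def pixel_branch(frame, boxes_at_frame):
    """Placeholder: predict per-frame Gaussians for the dynamic objects,
    expressed in each object's box frame so they can be re-posed."""
    raise NotImplementedError

def reconstruct_dynamic_3dgs(voxel_world, video, boxes_per_frame):
    background = voxel_branch(voxel_world, video)
    objects = [pixel_branch(frame, boxes)
               for frame, boxes in zip(video, boxes_per_frame)]
    return background, objects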
@misc{lu2024infinicube,
  title={InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models},
  author={Yifan Lu and Xuanchi Ren and Jiawei Yang and Tianchang Shen and Zhangjie Wu and Jun Gao and Yue Wang and Siheng Chen and Mike Chen and Sanja Fidler and Jiahui Huang},
  year={2024},
  eprint={2412.03934},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.03934},
}