We present InfiniCube, a scalable method for generating unbounded dynamic 3D driving scenes with high fidelity and controllability. We leverage recent advances in scalable 3D representations and video models to generate large dynamic scenes under flexible control from HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned sparse-voxel-based 3D generative model for unbounded voxel-world generation. Then, we re-purpose a video model and ground it on the voxel world through a set of carefully designed pixel-aligned guidance buffers, synthesizing consistent appearance. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift the dynamic videos to dynamic 3D Gaussians with controllable objects.
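To make the three-stage pipeline concrete, below is a minimal orchestration sketch. All names (`generate_voxel_world`, `render_guidance_buffers`, `generate_video`, `lift_to_gaussians`) and signatures are hypothetical placeholders for illustration; they are assumptions, not the released API.

```python
# Hypothetical sketch of the three InfiniCube stages described above.
# Names and signatures are illustrative assumptions, not the released code.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class SceneConditions:
    hd_map: np.ndarray          # rasterized HD-map layers (lanes, road edges, ...)
    vehicle_boxes: np.ndarray   # per-frame 3D bounding boxes of dynamic vehicles
    text_prompt: str            # free-form description, e.g. "rainy night, downtown"


def generate_voxel_world(cond: SceneConditions) -> np.ndarray:
    """Stage 1 (sketch): map-conditioned sparse-voxel generation of the static world."""
    raise NotImplementedError


def render_guidance_buffers(voxels: np.ndarray, cond: SceneConditions,
                            trajectory: np.ndarray) -> List[np.ndarray]:
    """Stage 2a (sketch): project the voxel world into pixel-aligned semantic and
    coordinate buffers along the camera trajectory."""
    raise NotImplementedError


def generate_video(buffers: List[np.ndarray], cond: SceneConditions) -> np.ndarray:
    """Stage 2b (sketch): video model grounded on the guidance buffers."""
    raise NotImplementedError


def lift_to_gaussians(video: np.ndarray, voxels: np.ndarray) -> dict:
    """Stage 3 (sketch): feed-forward voxel and pixel branches producing dynamic 3D Gaussians."""
    raise NotImplementedError


def generate_scene(cond: SceneConditions, trajectory: np.ndarray) -> dict:
    voxels = generate_voxel_world(cond)
    buffers = render_guidance_buffers(voxels, cond, trajectory)
    video = generate_video(buffers, cond)
    return lift_to_gaussians(video, voxels)
```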
We use ControlNet to generate the initial frame and train an image-to-video model conditioned on both semantic and coordinate buffers. For long videos, we employ an auto-regressive scheme that reuses the last-frame latent to condition the next generation step.
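A minimal sketch of this auto-regressive rollout is shown below. Here `video_model` is a hypothetical image-to-video wrapper; its interface (a conditioning latent plus per-chunk guidance buffers) and the latent layout are assumptions made for illustration.

```python
# Sketch of auto-regressive long-video generation: each chunk is conditioned on the
# last-frame latent of the previous chunk. Interfaces below are illustrative assumptions.
import torch


def rollout_long_video(video_model, first_frame_latent: torch.Tensor,
                       buffer_chunks: list, frames_per_chunk: int = 16) -> torch.Tensor:
    """Generate a long video chunk by chunk, reusing the last-frame latent of each
    chunk as the conditioning frame for the next one."""
    chunks = []
    cond_latent = first_frame_latent          # e.g. encoded from a ControlNet-generated image
    for buffers in buffer_chunks:             # pixel-aligned semantic/coordinate buffers per chunk
        latents = video_model(cond_latent, buffers, num_frames=frames_per_chunk)
        chunks.append(latents)                # latents assumed shaped (batch, time, ...)
        cond_latent = latents[:, -1]          # last frame conditions the next chunk
    return torch.cat(chunks, dim=1)           # concatenate chunks along the time axis
```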
@misc{lu2024infinicube,
  title={InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models},
  author={Yifan Lu and Xuanchi Ren and Jiawei Yang and Tianchang Shen and Zhangjie Wu and Jun Gao and Yue Wang and Siheng Chen and Mike Chen and Sanja Fidler and Jiahui Huang},
  year={2024},
  eprint={2412.03934},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.03934},
}