We present SCube, a novel method for reconstructing large-scale 3D scenes (geometry, appearance, and semantics) from a sparse set of posed images. Our method encodes reconstructed scenes using a novel representation VoxSplat, which is a set of 3D Gaussians supported on a high-resolution sparse-voxel scaffold. To reconstruct a VoxSplats from images, we employ a hierarchical voxel latent diffusion model conditioned on the input images followed by a feedforward appearance prediction model. The diffusion model generates high-resolution grids progressively in a coarse-to-fine manner, and the appearance network predicts a set of Gaussians within each voxel. From as few as 3 non-overlapping input images, SCube can generate millions of Gaussians with a \(1024^3\) voxel grid spanning hundreds of meters in 20 seconds. We show the superiority of SCube compared to prior art using the Waymo self-driving dataset on 3D reconstruction and demonstrate its applications, such as LiDAR simulation and text-to-scene generation.
@inproceedings{
ren2024scube,
title={SCube: Instant Large-Scale Scene Reconstruction using VoxSplats},
author={Ren, Xuanchi and Lu, Yifan and Liang, Hanxue and Wu, Jay Zhangjie and
Ling, Huan and Chen, Mike and Fidler, Sanja annd Williams, Francis and Huang, Jiahui},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
}