Toronto AI Lab
SCube: Instant Large-Scale Scene Reconstruction using VoxSplats

1 NVIDIA
2 University of Toronto
3 Vector Institute
4 Shanghai Jiao Tong University
5 University of Cambridge
6 National University of Singapore

* Equal Contribution

NeurIPS 2024

SCube can reconstruct millions of Gaussians covering an area of 102.4m \(\times\) 102.4m in 20 seconds from sparse views (only 3 images).

Abstract


We present SCube, a novel method for reconstructing large-scale 3D scenes (geometry, appearance, and semantics) from a sparse set of posed images. Our method encodes reconstructed scenes using a novel representation, VoxSplat, which is a set of 3D Gaussians supported on a high-resolution sparse-voxel scaffold. To reconstruct a VoxSplat from images, we employ a hierarchical voxel latent diffusion model conditioned on the input images, followed by a feedforward appearance prediction model. The diffusion model generates high-resolution grids progressively in a coarse-to-fine manner, and the appearance network predicts a set of Gaussians within each voxel. From as few as 3 non-overlapping input images, SCube can generate millions of Gaussians on a \(1024^3\) voxel grid spanning hundreds of meters in 20 seconds. We show the superiority of SCube over prior art on 3D reconstruction using the Waymo self-driving dataset, and demonstrate its applications, such as LiDAR simulation and text-to-scene generation.

Given sparse input images with little or no overlap, our model reconstructs a high-resolution and large-scale scene in 3D represented with VoxSplats, ready to be used for novel view synthesis or LiDAR simulation.

Method


Framework: SCube consists of two stages: (1) We reconstruct a sparse voxel grid with semantic logits, conditioned on the input images, using a conditional latent 3D diffusion model. (2) We predict the appearance of the foreground scene as voxel-bounded 3D Gaussians, and the background as a sky panorama, using a feedforward network. Our method allows us to synthesize novel views in a fast and accurate manner, along with many other applications.
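To make the idea of "voxel-bounded" Gaussians concrete, here is a minimal sketch of one way such a container could be organized. It is not the SCube implementation: the class names `Gaussian` and `VoxSplatSketch`, the per-voxel offset parametrization in \([0, 1)\), and the choice of attributes are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Gaussian:
    # Hypothetical parametrization: position stored as a fractional offset
    # inside the owning voxel, so the Gaussian centre cannot leave the voxel.
    offset: tuple   # (ox, oy, oz), each component in [0, 1)
    opacity: float
    color: tuple    # (r, g, b)


@dataclass
class VoxSplatSketch:
    voxel_size: float                           # voxel edge length in metres
    voxels: dict = field(default_factory=dict)  # sparse map: (i, j, k) -> [Gaussian, ...]

    def add(self, index, gaussian):
        """Attach a Gaussian to the occupied voxel at integer index (i, j, k)."""
        if not all(0.0 <= o < 1.0 for o in gaussian.offset):
            raise ValueError("offset must stay inside the voxel")
        self.voxels.setdefault(index, []).append(gaussian)

    def world_centers(self):
        """Return Gaussian centres in metres, relative to the grid origin."""
        centers = []
        for index, gaussians in self.voxels.items():
            for g in gaussians:
                centers.append(
                    tuple((i + o) * self.voxel_size for i, o in zip(index, g.offset))
                )
        return centers


scene = VoxSplatSketch(voxel_size=0.1)
scene.add((10, 0, 5), Gaussian(offset=(0.5, 0.5, 0.5), opacity=0.9, color=(1.0, 1.0, 1.0)))
centers = scene.world_centers()
```

Only occupied voxels are stored, which is the sense in which the scaffold is sparse; empty space costs no memory in this sketch.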

3D Scene Reconstruction


From 3 Few-Overlapping Images

The voxel size is (0.1m)\(^3\) and the scene spans 102.4m \(\times\) 102.4m. The reconstruction is done in 20 seconds.
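The stated extent follows from the grid resolution: 1024 voxels per axis at 0.1m per voxel gives 102.4m. The short sketch below checks that arithmetic and maps a metric point to a voxel index. The function names, the cubic grid, and the convention that the grid is centred on the origin are assumptions for illustration, not details from the paper.

```python
import math

VOXEL_SIZE_M = 0.1       # voxel edge length, from the text
GRID_RESOLUTION = 1024   # voxels per axis, from the 1024^3 grid


def scene_extent_m(resolution=GRID_RESOLUTION, voxel_size=VOXEL_SIZE_M):
    """Side length of the grid in metres."""
    return resolution * voxel_size


def point_to_voxel(x, y, z, voxel_size=VOXEL_SIZE_M, resolution=GRID_RESOLUTION):
    """Map a point in metres to an integer voxel index, or None if outside.

    Assumes (hypothetically) that the grid is centred on the origin.
    """
    half = resolution // 2
    index = tuple(math.floor(c / voxel_size) + half for c in (x, y, z))
    if all(0 <= i < resolution for i in index):
        return index
    return None


extent = scene_extent_m()
center_voxel = point_to_voxel(0.0, 0.0, 0.0)
outside = point_to_voxel(60.0, 0.0, 0.0)
```

A point 60m from the centre falls outside the grid, since the grid reaches only 51.2m in each direction under the centring assumption.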

More Results







Text-to-3D Scene Generation



LiDAR Simulation


LiDAR simulation based on the reconstructed 3D Gaussian scene.

SCube with Long Sequence Input. Top: reconstructed scene with appearance. Bottom: LiDAR simulation result. We chunk the long sequence into clips and apply our method iteratively.
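The chunk-and-iterate procedure can be sketched as follows. The clip length, whether consecutive clips overlap, and how per-clip results are merged are not stated here, so `clip_len`, `stride`, and the `reconstruct_clip` callback are hypothetical placeholders.

```python
def chunk_into_clips(frames, clip_len, stride=None):
    """Split a frame sequence into consecutive clips.

    stride defaults to clip_len (no overlap); a smaller stride makes
    neighbouring clips share frames. Both parameters are assumptions.
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    stride = stride or clip_len
    clips = []
    for start in range(0, len(frames), stride):
        clips.append(frames[start:start + clip_len])
        if start + clip_len >= len(frames):
            break  # the last clip already reaches the end of the sequence
    return clips


def reconstruct_sequence(frames, reconstruct_clip, clip_len, stride=None):
    """Apply a per-clip reconstruction function to each clip in order."""
    return [reconstruct_clip(clip) for clip in chunk_into_clips(frames, clip_len, stride)]


clips = chunk_into_clips(list(range(7)), clip_len=3)
overlapping = chunk_into_clips(list(range(5)), clip_len=3, stride=2)
sizes = reconstruct_sequence(list(range(7)), len, clip_len=3)
```

Here `len` stands in for the reconstruction step, so `sizes` just records how many frames each clip received.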

Citation



    @inproceedings{
      ren2024scube,
      title={SCube: Instant Large-Scale Scene Reconstruction using VoxSplats},
      author={Ren, Xuanchi and Lu, Yifan and Liang, Hanxue and Wu, Jay Zhangjie and 
        Ling, Huan and Chen, Mike and Fidler, Sanja and Williams, Francis and Huang, Jiahui},
      booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
      year={2024},
    }
  
