We present \(\mathcal{X}^3\) (pronounced XCube), a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels with a finest effective resolution of up to \(1024^3\) in a feed-forward fashion without time-consuming test-time optimization. To achieve this, we employ a hierarchical voxel latent diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner using a custom framework built on the highly efficient VDB data structure. Apart from generating high-resolution objects, we demonstrate the effectiveness of XCube on large outdoor scenes at scales of 100m\(\times\)100m with a voxel size as small as 10cm. We observe clear qualitative and quantitative improvements over past approaches. In addition to unconditional generation, we show that our model can be used to solve a variety of tasks such as user-guided editing, scene completion from a single scan, and text-to-3D.
A sparse voxel hierarchy is a sequence of coarse-to-fine 3D sparse voxel grids such that every fine voxel is contained within a coarser voxel.
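The containment property in this definition can be checked mechanically. The following is a minimal sketch, not the authors' implementation: it assumes each level is stored as a set of integer voxel coordinates and that each level refines the previous one by a fixed factor of 2 per axis. The refinement factor and the names `parent` and `is_valid_hierarchy` are illustrative assumptions.

```python
from typing import List, Set, Tuple

Voxel = Tuple[int, int, int]


def parent(voxel: Voxel, factor: int = 2) -> Voxel:
    """Coordinate of the coarser voxel that contains `voxel`.

    Assumes a fixed per-axis refinement factor (hypothetical default: 2).
    Floor division keeps the mapping correct for negative coordinates.
    """
    x, y, z = voxel
    return (x // factor, y // factor, z // factor)


def is_valid_hierarchy(levels: List[Set[Voxel]], factor: int = 2) -> bool:
    """True if every voxel at each finer level lies inside a voxel of the
    next coarser level. `levels` is ordered coarse to fine."""
    for coarse, fine in zip(levels, levels[1:]):
        if any(parent(v, factor) not in coarse for v in fine):
            return False
    return True


# Toy example: two coarse voxels and three fine voxels inside them.
coarse = {(0, 0, 0), (1, 0, 0)}
fine = {(0, 0, 0), (1, 1, 1), (2, 0, 1)}
# A fine level with a voxel whose parent (2, 0, 0) is missing from `coarse`.
orphan = {(0, 0, 0), (4, 0, 0)}

valid = is_valid_hierarchy([coarse, fine])
invalid = is_valid_hierarchy([coarse, orphan])
```

Here `valid` holds because every fine voxel maps to a stored coarse voxel, while `invalid` fails because `(4, 0, 0)` has no containing coarse voxel.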
Our method trains a hierarchy of latent diffusion models over the sparse voxel grids \(\mathcal{G} = \{G_1, ..., G_L\}\). Sparse voxel grids within the hierarchy are first encoded into compact latent representations using a sparse structure VAE. The hierarchical latent diffusion model then learns to generate each level of the latent representation conditioned on the coarser level in a cascaded fashion. The generated high-resolution voxel grids contain various attributes for different applications. Note that technically \(X_1\) is a dense latent grid, but it is illustrated as a sparse one for clarity.
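The control flow of the cascade can be illustrated with a toy sketch. It is not the XCube model: `toy_sample_level` is a hypothetical stand-in for "sample a latent with the diffusion model, then decode it with the VAE", and it simply keeps a random subset of the children of the coarser voxels. The factor-of-2 subdivision, the dense \(2^3\) coarsest grid, and all function names are assumptions made for illustration.

```python
import random
from typing import Callable, List, Optional, Set, Tuple

Voxel = Tuple[int, int, int]


def children(voxel: Voxel, factor: int = 2) -> List[Voxel]:
    """All finer voxels contained in `voxel` (assumed factor-of-2 subdivision)."""
    x, y, z = voxel
    return [
        (x * factor + dx, y * factor + dy, z * factor + dz)
        for dx in range(factor)
        for dy in range(factor)
        for dz in range(factor)
    ]


def toy_sample_level(coarser: Optional[Set[Voxel]], rng: random.Random) -> Set[Voxel]:
    """Hypothetical placeholder for one level of the cascade.

    In the described pipeline this step would be a conditional latent
    diffusion sample followed by VAE decoding; here it is random pruning.
    """
    if coarser is None:
        # The coarsest level is dense: every voxel of a small 2^3 grid.
        return {(x, y, z) for x in range(2) for y in range(2) for z in range(2)}
    candidates = [c for v in sorted(coarser) for c in children(v)]
    kept = {c for c in candidates if rng.random() < 0.5}
    return kept or {candidates[0]}  # never return an empty level


def generate_hierarchy(
    num_levels: int,
    sample_level: Callable[[Optional[Set[Voxel]], random.Random], Set[Voxel]],
    seed: int = 0,
) -> List[Set[Voxel]]:
    """Generate grids coarse to fine, conditioning each level on the previous one."""
    rng = random.Random(seed)
    grids: List[Set[Voxel]] = []
    coarser: Optional[Set[Voxel]] = None
    for _ in range(num_levels):
        grid = sample_level(coarser, rng)
        grids.append(grid)
        coarser = grid
    return grids


grids = generate_hierarchy(num_levels=3, sample_level=toy_sample_level)
```

Because each level only proposes voxels inside the previous level's occupied voxels, the output is a sparse voxel hierarchy by construction, and work at fine resolutions scales with the number of occupied voxels rather than with the full dense grid.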
Our method is capable of generating intricate geometry and thin structures.
We generate high-resolution shapes from text. Textures are generated by a separate model as a post-processing step.
Unconditional scene generation with semantics trained on autonomous driving data.
The results on the synthetic dataset mainly highlight the spatial resolution at which our model can operate.
We condition our model on single LiDAR scans and accumulate the results to reconstruct large drives.
@article{ren2023xcube,
title={XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies},
author={Xuanchi Ren and Jiahui Huang and Xiaohui Zeng and Ken Museth
and Sanja Fidler and Francis Williams},
journal={arXiv preprint},
year={2023}
}
The authors appreciate the feedback received from James Lucas and Jun Gao during the project. We also thank Jonah Philion for his generous help on computing.