We present \(\mathcal{X}^3\) (pronounced XCube), a novel generative model for high-resolution
sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels
with a finest effective resolution of up to \(1024^3\) in a feed-forward fashion without
time-consuming test-time optimization. To achieve this, we employ a hierarchical voxel latent
diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner
using a custom framework built on the highly efficient VDB data structure. Apart from generating
high-resolution objects, we demonstrate the effectiveness of XCube on large outdoor scenes at
scales of 100m\(\times\)100m with a voxel size as small as 10cm. We observe clear qualitative and
quantitative improvements over past approaches. In addition to unconditional generation, we show
that our model can be used to solve a variety of tasks such as user-guided editing, scene
completion from a single scan, and text-to-3D.
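For intuition, the sketch below mimics the coarse-to-fine sampling loop described above in plain NumPy: each level denoises per-voxel latents on the currently active voxels, thresholds them into an occupancy mask, and subdivides the surviving voxels to seed the next, finer level. This is only an illustrative sketch; the denoise_level stub, the subdivide rule, the thresholding, and the reduced resolutions are hypothetical placeholders, whereas the actual model runs sparse 3D diffusion networks on VDB-based grids rather than dense coordinate arrays.

import numpy as np

# Reduced coarse-to-fine resolutions for illustration (the paper reports up to 1024^3).
RESOLUTIONS = [8, 32, 128]


def denoise_level(coords, rng, steps=8):
    """Hypothetical stand-in for one voxel latent diffusion stage: given the
    active voxel coordinates of the current level, produce per-voxel occupancy
    logits. A real model would run a sparse 3D denoising network here."""
    x = rng.normal(size=len(coords))
    for _ in range(steps):  # pretend iterative denoising
        x = 0.7 * x + 0.3 * rng.normal(size=len(coords))
    return x


def subdivide(coords, keep, factor):
    """Refine each kept coarse voxel into its factor**3 children."""
    kept = coords[keep]
    offsets = np.stack(
        np.meshgrid(*([np.arange(factor)] * 3), indexing="ij"), axis=-1
    ).reshape(-1, 3)
    return (kept[:, None, :] * factor + offsets[None, :, :]).reshape(-1, 3)


def sample_hierarchy(seed=0):
    rng = np.random.default_rng(seed)
    # Level 0 starts from a dense coarse candidate grid; later levels are sparse.
    r0 = RESOLUTIONS[0]
    axes = [np.arange(r0)] * 3
    coords = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)

    for i, res in enumerate(RESOLUTIONS):
        logits = denoise_level(coords, rng)
        keep = logits > 0.0  # keep voxels predicted as occupied
        print(f"level {i} ({res}^3): kept {keep.sum()} of {len(coords)} voxels")
        if i + 1 < len(RESOLUTIONS):
            coords = subdivide(coords, keep, RESOLUTIONS[i + 1] // res)
        else:
            coords = coords[keep]
    return coords  # active voxel coordinates at the finest level


if __name__ == "__main__":
    print("finest-level active voxels:", len(sample_hierarchy()))

Representing each level as a set of active voxel coordinates is what keeps memory proportional to occupied space rather than the full dense volume; in the actual system this role is played by the efficient VDB data structure.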
Video
Method
Object-level Generation
Single Category Generation on ShapeNet
Our method generates intricate geometry and thin structures.
Text-to-3D Results on Objaverse
A campfire.
An eagle head.
A 3D model of skull.
A 3D model of strawberry.
A chair that looks like a root.
A small cactus planted in a clay pot.
We generate high-resolution shapes from text. Textures are generated by a separate model as a post-processing step.
Large-scale Scene-level Generation
Unconditional Generation
Unconditional generation of scenes with semantic labels, trained on autonomous driving data.
Spatial Representation Power
Our model, trained on a small synthetic dataset, highlights the detail and spatial resolution XCube can achieve.
Single-scan Conditional Generation
Citation
@inproceedings{ren2024xcube,
  title={XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies},
  author={Ren, Xuanchi and Huang, Jiahui and Zeng, Xiaohui and Museth, Ken and Fidler, Sanja and Williams, Francis},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
Paper
Acknowledgment
We thank James Lucas and Jun Gao for their help in proofreading the paper and for their feedback during the project. We also thank
Jonah Philion for generously sharing his compute quota, which allowed us to train our model before the deadline.