Toronto AI Lab
NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models

1NVIDIA,  2University of Toronto,  3Vector Institute,  4CSAIL, MIT,  5University of Waterloo,  6University of Tübingen, Tübingen AI Center,  7LMU Munich

* Equal Contribution
† Work done during an internship at NVIDIA

CVPR 2023

Overview. We introduce NeuralField-LDM (NF-LDM), a generative model for complex open-world 3D scenes. To tackle this challenging task, we develop a three-stage pipeline: 1) we first train a scene auto-encoder that encodes RGB images with camera poses into a neural field, represented as density and feature voxel grids; 2) we then train a latent auto-encoder that compresses the neural field into several smaller latent spaces suitable for fitting a generative model; and 3) given the latent representations of the training dataset, we fit a hierarchical latent diffusion model on them. Finally, we synthesize new scenes by sampling latents with the diffusion model and decoding them into a neural field that can be rendered from a given viewpoint.
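The three-stage pipeline above can be sketched as a data-flow skeleton. This is a minimal illustrative sketch, not the paper's implementation: the function names, grid resolution, feature dimension, and the random stand-ins for the learned encoders and the diffusion sampler are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (scene auto-encoder): posed RGB images -> density + feature voxel grids.
# The real model learns this mapping; random grids stand in here.
def scene_encode(images, poses, grid=16, feat_dim=8):
    density = rng.random((grid, grid, grid))             # per-voxel density
    features = rng.random((grid, grid, grid, feat_dim))  # per-voxel appearance features
    return density, features

# Stage 2 (latent auto-encoder): compress the voxel grids into a small latent.
def latent_encode(density, features, latent_dim=32):
    # A real model uses learned encoders; here we flatten and randomly project.
    x = np.concatenate([density.ravel(), features.ravel()])
    W = rng.standard_normal((latent_dim, x.size)) / np.sqrt(x.size)
    return W @ x

# Stage 3 (hierarchical LDM): sample a latent, later decoded back to a neural
# field and rendered from a chosen viewpoint.
def sample_scene(latent_dim=32):
    return rng.standard_normal(latent_dim)  # stand-in for the diffusion sampler

density, features = scene_encode(images=None, poses=None)
z = latent_encode(density, features)
z_new = sample_scene()
```

At generation time only stage 3 and the decoders run: a sampled latent replaces the encoded one, and the decoded voxel grids are volume-rendered into images.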

Samples from NF-LDM trained on real outdoor in-the-wild driving scenes: we visualize the synthesized scenes by moving a camera along a trajectory through the generated voxel grid.



Automatically generating high-quality real-world 3D scenes is of enormous interest for applications such as virtual reality and robotics simulation. Towards this goal, we introduce NeuralField-LDM, a generative model capable of synthesizing complex 3D environments. We leverage Latent Diffusion Models that have been successfully utilized for efficient high-quality 2D content creation. We first train a scene auto-encoder to express a set of image and pose pairs as a neural field, represented as density and feature voxel grids that can be projected to produce novel views of the scene. To further compress this representation, we train a latent auto-encoder that maps the voxel grids to a set of latent representations. A hierarchical diffusion model is then fit to the latents to complete the scene generation pipeline. We achieve a substantial improvement over existing state-of-the-art scene generation models. Additionally, we show how NeuralField-LDM can be used for a variety of 3D content creation applications, including conditional scene generation, scene inpainting and scene style manipulation.


Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin Rombach, Antonio Torralba, Sanja Fidler

PDF
arXiv version
BibTeX

3D Scene Generation

We show synthesized samples from AVD (real-world driving) and Carla [1] datasets. For AVD, we first obtain a coarse voxel representation with reasonable geometry and texture, and then refine it with the post-optimization step described in Section 3.4 of the paper. In particular, we leverage Score Distillation Sampling (SDS) [2] and devise a negative guidance term to improve scene quality.

[1] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. CoRL, 2017.
[2] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. ICLR, 2023.
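The post-optimization step can be illustrated with a toy SDS loop that includes a negative-guidance term. This is an assumption-laden sketch, not the paper's exact formulation: the toy denoiser, the guidance weights, and the specific way the negative prediction is subtracted are all hypothetical stand-ins for a pretrained image diffusion model and the paper's Section 3.4 details.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x_t, cond):
    # Stand-in epsilon-prediction network: a fixed affine map per condition.
    # (A real setup queries a pretrained image diffusion model here.)
    bias = {"uncond": 0.0, "pos": 0.3, "neg": -0.3}[cond]
    return 0.9 * x_t + bias

def sds_grad(params, t_weight=1.0, guidance=7.5, neg_guidance=2.0):
    """One SDS step with an extra negative-guidance term (illustrative form)."""
    noise = rng.standard_normal(params.shape)
    x_t = params + noise                    # noised rendering (toy forward process)
    e_u = toy_denoiser(x_t, "uncond")
    e_p = toy_denoiser(x_t, "pos")          # concept to pull toward
    e_n = toy_denoiser(x_t, "neg")          # concept to push away from
    # Guided prediction: strengthen the positive direction, subtract the negative one.
    e_hat = e_u + guidance * (e_p - e_u) - neg_guidance * (e_n - e_u)
    return t_weight * (e_hat - noise)       # SDS gradient w.r.t. the rendering

# In NF-LDM this gradient would flow back into the voxel grids through the
# renderer; here we update a dummy parameter vector directly.
params = np.zeros(4)
for _ in range(50):
    params -= 0.05 * sds_grad(params)
```

The key design point is that the negative branch reuses the same denoiser with an undesired condition, so low-quality or off-style modes are actively pushed away rather than merely not encouraged.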


In addition to improving quality, we can also modify scene properties, such as time of day and weather, with the same post-optimization technique. We use an image-finetuned Stable Diffusion model [3,4] for SDS and feed randomly sampled images matching the given style query during the post-optimization step.

[3] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. CVPR, 2022.

Scene Editing

Our neural field is based on a voxel-grid representation, which allows easy editing of scenes. Here, we show examples of combining two scenes by directly swapping the center region of the bottom voxel grid with that of the top one.
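Because the representation is an explicit grid, the swap is just array slicing. The helper below is a minimal sketch under assumed conventions: grids of shape (X, Y, Z, C) and a centered cubic region covering a fraction of each axis; the real editing interface may differ.

```python
import numpy as np

def swap_center(dst, src, frac=0.5):
    """Replace the central region of voxel grid `dst` with that of `src`.

    Both grids are (X, Y, Z, C) feature volumes of identical shape; `frac`
    is the fraction of each spatial axis covered by the swapped region.
    """
    assert dst.shape == src.shape
    out = dst.copy()
    slices = []
    for n in out.shape[:3]:
        half = int(n * frac) // 2
        slices.append(slice(n // 2 - half, n // 2 + half))
    out[tuple(slices)] = src[tuple(slices)]  # copy the centered block across
    return out

a = np.zeros((8, 8, 8, 4))   # stand-in for the bottom scene's voxel grid
b = np.ones((8, 8, 8, 4))    # stand-in for the top scene's voxel grid
merged = swap_center(a, b)
```

In practice the same slicing would be applied to both the density and the feature grids so that geometry and appearance are swapped together.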

Bird's Eye View Conditioned Samples

When additional information is associated with the training data, the diffusion models can be trained conditionally. We leverage the Bird's Eye View (BEV) segmentation map in Carla and train the model to generate scenes conditioned on a given BEV map. In BEV maps, each color denotes a different semantic class - dark blue: vehicles, pink: sidewalk, purple: drivable road, green: vegetation, and blue: water.
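Conditional training amounts to feeding the BEV embedding into the denoiser alongside the noised latent. The toy loop below is an illustrative sketch only: the linear "denoiser", the concatenation-based conditioning, and the simple noising schedule are assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(z_t, t, bev_emb, W):
    # Toy conditional epsilon-predictor: conditioning enters by concatenation.
    inp = np.concatenate([z_t, [t], bev_emb])
    return W @ inp

def train_step(z0, bev_emb, W, lr=1e-3):
    t = rng.random()                               # random diffusion time
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(1 - t) * z0 + np.sqrt(t) * eps   # simple noising schedule
    err = denoiser(z_t, t, bev_emb, W) - eps       # epsilon-prediction residual
    inp = np.concatenate([z_t, [t], bev_emb])
    grad = np.outer(err, inp)                      # grad of 0.5*||err||^2 w.r.t. W
    return W - lr * grad, float((err ** 2).mean())

d, c = 8, 6                                        # latent dim, BEV embedding dim
W = rng.standard_normal((d, d + 1 + c)) * 0.01
z0, bev = rng.standard_normal(d), rng.standard_normal(c)
for _ in range(200):
    W, loss = train_step(z0, bev, W)
```

At sampling time the same BEV embedding is held fixed across denoising steps, so the generated scene layout follows the given map.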

@inproceedings{kim2023neuralfieldldm,
      title={NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models},
      author={Kim, Seung Wook and Brown, Bradley and Yin, Kangxue and Kreis, Karsten and
          Schwarz, Katja and Li, Daiqing and Rombach, Robin and Torralba, Antonio and Fidler, Sanja},
      booktitle={IEEE Conference on Computer Vision and Pattern Recognition ({CVPR})},
      year={2023}
}

Business Inquiries

For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing