NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models

Automatically generating high-quality, real-world 3D scenes is of enormous interest for applications such as virtual reality and robotics simulation. Towards this goal, we introduce NeuralField-LDM, a generative model capable of synthesizing complex 3D environments. We leverage Latent Diffusion Models that have been successfully utilized for efficient, high-quality 2D content creation. We first train a scene auto-encoder to express a set of image and pose pairs as a neural field, represented as density and feature voxel grids that can be projected to produce novel views of the scene.
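Projecting a density-and-feature voxel grid to a novel view typically relies on standard NeRF-style volume rendering: samples along each camera ray are alpha-composited using their densities. The sketch below illustrates that compositing step in isolation (function name and toy values are illustrative, not taken from the paper):

```python
import numpy as np

def composite_along_ray(densities, features, deltas):
    """Alpha-composite per-sample densities and features along one ray.
    alpha_i = 1 - exp(-sigma_i * delta_i); T_i = prod_{j<i}(1 - alpha_j)."""
    alphas = 1.0 - np.exp(-densities * deltas)          # per-sample opacity
    # transmittance: how much light survives up to each sample
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas)))[:-1]
    weights = trans * alphas                            # compositing weights
    return weights @ features                           # rendered feature

# Toy ray: 4 samples with 2-D features, uniform step size 0.1
densities = np.array([0.0, 1.5, 3.0, 0.5])
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
deltas = np.full(4, 0.1)
rendered = composite_along_ray(densities, features, deltas)
```

Repeating this per pixel yields a rendered feature image that a decoder can map to RGB, which is how voxel-grid neural fields are usually supervised from posed images.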

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos.
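A common way to "introduce a temporal dimension" to a pre-trained image model is to interleave spatial and temporal layers: spatial layers process frames as an independent image batch, while inserted temporal layers regroup the same latents into per-pixel sequences over time. The helper names and shapes below are illustrative of that reshaping pattern, not the paper's exact implementation:

```python
import numpy as np

def to_spatial(latents):
    """Flatten frames into an image batch so pre-trained spatial layers
    can process each frame independently: (B, T, C, H, W) -> (B*T, C, H, W)."""
    B, T, C, H, W = latents.shape
    return latents.reshape(B * T, C, H, W)

def to_temporal(latents):
    """Regroup the same latents into one length-T sequence per spatial
    location, for an inserted temporal layer: -> (B*H*W, T, C)."""
    B, T, C, H, W = latents.shape
    return latents.transpose(0, 3, 4, 1, 2).reshape(B * H * W, T, C)

# Toy video latents: batch 2, 8 frames, 4 channels, 16x16 spatial grid
video_latents = np.random.randn(2, 8, 4, 16, 16)
spatial = to_spatial(video_latents)     # shape (16, 4, 16, 16)
temporal = to_temporal(video_latents)   # shape (512, 8, 4)
```

Only the temporal layers need to be trained (or fine-tuned) on videos; the spatial layers can keep their image-pretrained weights, which is what makes this paradigm compute-efficient.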

Karsten Kreis

Karsten Kreis is a Senior Research Scientist at NVIDIA Research focusing on generative AI.

Generalizable One-shot 3D Neural Head Avatar

We present a method that reconstructs and animates a 3D head avatar from a single-view portrait image. Existing methods either involve time-consuming optimization for a specific person with multiple images, or they struggle to synthesize intricate appearance details beyond the facial region. To address these limitations, we propose a framework that not only generalizes to unseen identities based on a single-view image without requiring person-specific optimization, but also captures characteristic details within and beyond the face area (e.g., hairstyle and accessories).

Convolutional State Space Models for Long-Range Spatiotemporal Modeling

Effectively modeling long spatiotemporal sequences is challenging due to the need to model complex spatial correlations and long-range temporal dependencies simultaneously. ConvLSTMs attempt to address this by updating tensor-valued states with recurrent neural networks, but their sequential computation makes them slow to train. In contrast, Transformers can process an entire spatiotemporal sequence, compressed into tokens, in parallel. However, the cost of attention scales quadratically with sequence length, limiting their scalability to longer sequences.

Huck Yang

I am a Sr. Research Scientist at NV Research, mainly based in Taipei and occasionally in Sunnyvale. I work closely with NV Research US, the Applied Research teams, and university collaborations in the US and Asia-Pacific.

I obtained my Ph.D. and M.Sc. from the Georgia Institute of Technology, USA, with a Wallace H. Coulter fellowship, and my B.Sc. from National Taiwan University. My dissertation is titled "A Perturbation Approach to Differential Privacy for Deep Learning based Speech Processing."

Dale Durran

Durran has a 25% appointment as a Principal Research Scientist in Climate Modeling at NVIDIA and a 60% appointment as a Professor of Atmospheric Sciences at the University of Washington. At NVIDIA his research focuses on deep-learning earth-system modeling for sub-seasonal and seasonal forecasting, forecast ensembles, and generative methods for fine-scale modeling of convective precipitation and other mesoscale fields.