Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos.
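
As a rough illustration of the "introducing a temporal dimension" step, the PyTorch sketch below interleaves a temporal attention layer into an otherwise pretrained image backbone. It attends across frames only, treating each spatial location as a batch entry, so the pretrained spatial weights are left untouched. This is a minimal sketch under our own assumptions, not the paper's code: the class name, the zero-initialized residual gate, and the tensor layout are all illustrative.

```python
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Attention over the frame axis only; each spatial location is
    treated as a separate batch entry, leaving spatial layers intact.
    Illustrative sketch; `channels` must be divisible by `num_heads`."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Zero-initialized gate: the video model starts out identical to
        # the pretrained image model and learns to mix in temporal context.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width) latent features
        b, t, c, h, w = x.shape
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        q = self.norm(seq)
        out, _ = self.attn(q, q, q)
        out = out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        return x + self.alpha * out
```

In such a setup, fine-tuning on encoded video clips could update only these temporal layers while the image backbone stays frozen, which is one way to preserve the image model's quality.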

Karsten Kreis

Karsten Kreis is a Principal Research Scientist at NVIDIA Research focusing on generative AI.

Karsten's research interests span both the development of foundational generative AI algorithms and their application across scientific and creative domains. Recently, he has been focusing on generative learning for molecular modeling and is leading efforts in generative modeling for protein design.

Generalizable One-shot 3D Neural Head Avatar

We present a method that reconstructs and animates a 3D head avatar from a single-view portrait image. Existing methods either involve time-consuming optimization for a specific person with multiple images, or they struggle to synthesize intricate appearance details beyond the facial region. To address these limitations, we propose a framework that not only generalizes to unseen identities from a single-view image without person-specific optimization, but also captures characteristic details within and beyond the face area (e.g., hairstyle and accessories).
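
The abstract does not spell out the architecture, so the following is a purely hypothetical sketch of the feed-forward, optimization-free idea: a convolutional encoder lifts the single portrait into a volumetric representation (here, tri-plane-style feature planes, a common choice we assume for illustration), and a small MLP decodes sampled features into per-point density and color that a volume renderer could integrate under a novel driving pose. All class and variable names are our inventions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OneShotAvatar(nn.Module):
    """Hypothetical sketch: one feed-forward pass lifts a single portrait
    into three axis-aligned feature planes; no per-person optimization."""

    def __init__(self, feat_dim: int = 32):
        super().__init__()
        # Image encoder -> three feature planes (XY, XZ, YZ), stacked in channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3 * feat_dim, 3, padding=1),
        )
        # Tiny MLP decodes a sampled feature into (density, R, G, B).
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 4),
        )

    def forward(self, portrait: torch.Tensor, query_points: torch.Tensor):
        # portrait: (B, 3, H, W); query_points: (B, N, 3) in [-1, 1]^3
        planes = self.encoder(portrait)            # (B, 3*C, h, w)
        b, _, h, w = planes.shape
        planes = planes.view(b, 3, -1, h, w)       # (B, 3, C, h, w)
        feats = 0.0
        # Project each 3D query onto the three planes and sum the features.
        for i, dims in enumerate([(0, 1), (0, 2), (1, 2)]):
            grid = query_points[..., list(dims)].unsqueeze(1)    # (B, 1, N, 2)
            sampled = F.grid_sample(planes[:, i], grid,
                                    align_corners=False)         # (B, C, 1, N)
            feats = feats + sampled.squeeze(2).permute(0, 2, 1)  # (B, N, C)
        return self.decoder(feats)                 # per-point (density, RGB)
```

Animation would then amount to conditioning the planes or the queries on a driving pose and expression before volume rendering; the abstract does not say how the actual method handles this.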

Convolutional State Space Models for Long-Range Spatiotemporal Modeling

Effectively modeling long spatiotemporal sequences is challenging due to the need to model complex spatial correlations and long-range temporal dependencies simultaneously. ConvLSTMs attempt to address this by updating tensor-valued states with recurrent neural networks, but their sequential computation makes them slow to train. In contrast, Transformers can process an entire spatiotemporal sequence, compressed into tokens, in parallel. However, the cost of attention scales quadratically in length, limiting their scalability to longer sequences.
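
To make the contrast concrete, here is a minimal sketch, our illustration rather than the paper's exact formulation, of a convolutional state-space cell: the tensor-valued state evolves by the linear recurrence x_t = A*x_{t-1} + B*u_t, where A and B are convolutions. The reference loop below is sequential, but because the update is linear, unlike a gated ConvLSTM, it can in principle be evaluated with a parallel scan over time, sidestepping both the sequential training bottleneck and the quadratic cost of attention.

```python
import torch
import torch.nn as nn


class ConvSSMCell(nn.Module):
    """Sketch of a convolutional state-space update: the tensor state is
    advanced by convolutions instead of gated ConvLSTM operations."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.state_conv = nn.Conv2d(channels, channels, kernel_size, padding=pad)  # A
        self.input_conv = nn.Conv2d(channels, channels, kernel_size, padding=pad)  # B
        self.output_conv = nn.Conv2d(channels, channels, 1)                        # readout

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (batch, time, channels, height, width)
        b, t, c, h, w = inputs.shape
        state = torch.zeros(b, c, h, w, device=inputs.device, dtype=inputs.dtype)
        outputs = []
        # Sequential reference loop; the linearity of the update is what a
        # parallel scan would exploit to process all timesteps at once.
        for step in range(t):
            state = self.state_conv(state) + self.input_conv(inputs[:, step])
            outputs.append(self.output_conv(state))
        return torch.stack(outputs, dim=1)
```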

Huck Yang

I am a Senior Research Scientist at NVIDIA Research.

I obtained my Ph.D. and M.Sc. from the Georgia Institute of Technology, USA, supported by a Wallace H. Coulter Fellowship, and my B.Sc. from National Taiwan University.

My primary research lies in the areas of multilingual model alignment and speech-language modeling.

Dale Durran

Durran has a 25% appointment as a Principal Research Scientist in Climate Modeling at NVIDIA and a 60% appointment as a Professor of Atmospheric Sciences at the University of Washington. At NVIDIA, his research focuses on deep-learning earth-system modeling for sub-seasonal and seasonal forecasting, forecast ensembles, and generative methods for fine-scale modeling of convective precipitation and other mesoscale fields.

Constant Field of View Display Size Effects on First-Person Aiming Time

Under a constant display field of view, aiming performance in first-person shooter (FPS) games improves with display size: aiming times were 3% faster on a 26-inch diagonal display than on a 13-inch one.