PYoCo
Songwei Ge*  Seungjun Nah  Guilin Liu  Tyler Poon🦅   Andrew Tao 
Bryan Catanzaro  David Jacobs  Jia-Bin Huang  Ming-Yu Liu  Yogesh Balaji
University of Maryland, College Park   NVIDIA Corporation   🦅University of Chicago
* Work done during the internship at NVIDIA.

We present PYoCo, a large-scale text-to-video diffusion model finetuned from a state-of-the-art image generation model, eDiff-I, with a novel video noise prior. Combined with several design choices from prior work, including temporal attention, joint image-video finetuning, a cascaded generation architecture, and an ensemble of expert denoisers, PYoCo establishes a new state of the art for video generation, outperforming competing methods on several benchmark datasets. PYoCo achieves high-quality zero-shot video synthesis with superior photorealism and temporal consistency.

For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing.

PYoCo can generate compositional videos

PYoCo can generate videos of different styles

PYoCo Introduction

State-of-the-art on the video generation benchmarks

Unconditional video generation results on the UCF-101 dataset.
Model                                   Inception Score ↑
TGAN (Saito et al., 2017)               15.83 ± .18
LDVD-GAN (Kahembwe et al., 2020)        22.91 ± .19
VideoGPT (Yan et al., 2021)             24.69 ± .30
MoCoGAN-HD (Tian et al., 2021)          32.36 ± .00
DIGAN (Yu et al., 2021)                 29.71 ± .53
CCVS (Le et al., 2021)                  24.47 ± .13
StyleGAN-V (Skorokhodov et al., 2021)   23.94 ± .73
VDM (Ho et al., 2022)                   57.00 ± .62
TATS (Ge et al., 2022)                  57.63 ± .73
PYoCo (112M)                            57.93 ± .24
PYoCo (253M)                            60.01 ± .51
  • We propose a video diffusion noise prior tailored for finetuning text-to-image diffusion models for text-to-video synthesis.
  • We show that finetuning a text-to-image diffusion model with this prior leads to better knowledge transfer and efficient training.
  • On the small-scale unconditional generation benchmark, we achieve a new state-of-the-art with a 10× smaller model and 14× less training time.
  • On the zero-shot MSR-VTT evaluation, our model achieves a new state-of-the-art FID of 9.73.
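
To make the idea of a video noise prior concrete, here is a minimal sketch of one way to sample noise that is correlated across frames rather than independent per frame. This page does not spell out the exact form of the prior, so the mixing rule, the function name `mixed_video_noise`, and the parameter `alpha` are illustrative assumptions, not PYoCo's released implementation. The sketch adds one shared Gaussian draw to an independent per-frame draw, scaled so each frame's noise keeps unit variance.

```python
import numpy as np

def mixed_video_noise(num_frames, frame_shape, alpha=1.0, rng=None):
    """Sample per-frame Gaussian noise with a component shared by all frames.

    Hypothetical helper for illustration only.
    alpha = 0 gives fully independent frames; larger alpha gives more
    correlation. The frame-to-frame correlation is alpha^2 / (1 + alpha^2).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Split unit variance between the shared and the independent parts.
    shared_std = np.sqrt(alpha**2 / (1.0 + alpha**2))
    indep_std = np.sqrt(1.0 / (1.0 + alpha**2))
    # One draw reused by every frame.
    shared = shared_std * rng.standard_normal(frame_shape)
    # A fresh draw for each frame.
    indep = indep_std * rng.standard_normal((num_frames, *frame_shape))
    return shared[None] + indep

# Example: 8 frames of 64x64 noise with half the variance shared.
noise = mixed_video_noise(8, (64, 64), alpha=1.0, rng=np.random.default_rng(0))
print(noise.shape)
```

Because the two variances sum to one, each frame is still marginally standard normal, which is what a pretrained image diffusion model expects as input noise; only the relationship between frames changes.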
Zero-shot text-to-video generation on the MSR-VTT dataset.
Model                                   Zero-Shot FID ↓
NUWA-Chinese (Wu et al., 2022)          47.68
CogVideo-Chinese (Hong et al., 2022)    24.78
CogVideo-English (Hong et al., 2022)    23.59
Make-A-Video (Singer et al., 2022)      13.17
PYoCo (Config-A)                        10.21
PYoCo (Config-B)                        9.95
PYoCo (Config-C)                        9.91
PYoCo (Config-D)                        9.73