PYoCo: A Noise Prior for Video Diffusion Models

Songwei Ge^*, Seungjun Nah, Guilin Liu, Tyler Poon^🦅, Andrew Tao,
Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji

University of Maryland, College Park; NVIDIA Corporation; ^🦅University of Chicago

^* Work done during the internship at NVIDIA.

ICCV 2023

We present PYoCo, a large-scale text-to-video diffusion model that is finetuned from a state-of-the-art image generation model, eDiff-I, with a novel video noise prior. Combined with several design choices from prior work, including temporal attention, joint image-video finetuning, a cascaded generation architecture, and an ensemble of expert denoisers, PYoCo establishes a new state-of-the-art for video generation, outperforming competing methods on several benchmark datasets. PYoCo achieves high-quality zero-shot video synthesis with superior photorealism and temporal consistency.

For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing.

PYoCo can generate compositional videos

[Video gallery: selectable by protagonist, action, and location]

PYoCo can generate videos of different styles

[Video gallery: selectable by prompt and style]

PYoCo Introduction

State-of-the-art on the video generation benchmarks

Unconditional video generation results on the UCF-101 dataset.
Model	Inception Score ↑
TGAN (Saito et al., 2017)	15.83 ± .18
LDVD-GAN (Kahembwe et al., 2020)	22.91 ± .19
VideoGPT (Yan et al., 2021)	24.69 ± .30
MoCoGAN-HD (Tian et al., 2021)	32.36 ± .00
DIGAN (Yu et al., 2021)	29.71 ± .53
CCVS (Le et al., 2021)	24.47 ± .13
StyleGAN-V (Skorokhodov et al., 2021)	23.94 ± .73
VDM (Ho et al., 2022)	57.00 ± .62
TATS (Ge et al., 2022)	57.63 ± .73
PYoCo (112M)	57.93 ± .24
PYoCo (253M)	60.01 ± .51
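
For readers unfamiliar with the metric in the table above, the Inception Score is the exponential of the average KL divergence between each sample's predicted class distribution and the marginal class distribution over all samples. The sketch below is a minimal illustration of that formula only. The function name and the input format (a list of per-sample class-probability lists) are assumptions for illustration, and the classifier that would produce those probabilities for videos is not specified on this page.

```python
import math


def inception_score(probs):
    """Compute exp(mean_x KL(p(y|x) || p(y))) from per-sample class probabilities.

    probs: list of lists, one probability vector per generated sample
           (hypothetical input format; each vector should sum to 1).
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all samples.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    kl_sum = 0.0
    for p in probs:
        # Terms with p_j == 0 contribute nothing to the KL divergence.
        kl_sum += sum(
            pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0
        )
    return math.exp(kl_sum / n)


# Confident predictions spread evenly over two classes give the maximum score
# for two classes; identical uncertain predictions give the minimum score of 1.
diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
collapsed = inception_score([[0.5, 0.5], [0.5, 0.5]])
```

Higher is better because the score rewards samples that are individually recognizable (confident predictions) and collectively diverse (a flat marginal).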

We propose a video diffusion noise prior tailored for finetuning text-to-image diffusion models for text-to-video synthesis.
We show that finetuning a text-to-image diffusion model with this prior leads to better knowledge transfer and more efficient training.
On the small-scale unconditional generation benchmark (UCF-101), we achieve a new state-of-the-art with a 10× smaller model and 14× less training time.
On the zero-shot MSR-VTT evaluation, our model achieves a new state-of-the-art FID of 9.73.
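
This page does not spell out the form of the noise prior. PYoCo abbreviates "Preserve Your Own Correlation", the title of the accompanying paper, so the sketch below illustrates one simple way to draw diffusion noise that is correlated across video frames: each frame's noise is a component shared by all frames plus a component drawn independently per frame. The function names, the `alpha` parameter, and the variance split are assumptions for illustration, not a reproduction of the authors' implementation.

```python
import math
import random


def mixed_video_noise(num_frames, num_pixels, alpha, rng):
    """Sample per-frame noise that is correlated across frames.

    Each entry is shared + independent, with variances
    alpha^2 / (1 + alpha^2) and 1 / (1 + alpha^2), so every entry keeps
    unit variance. alpha = 0 gives fully independent frames.
    """
    shared_std = math.sqrt(alpha**2 / (1.0 + alpha**2))
    indep_std = math.sqrt(1.0 / (1.0 + alpha**2))
    # One shared noise map, reused by every frame.
    shared = [rng.gauss(0.0, shared_std) for _ in range(num_pixels)]
    return [
        [s + rng.gauss(0.0, indep_std) for s in shared]
        for _ in range(num_frames)
    ]


def correlation(a, b):
    """Pearson correlation between two equally long lists of floats."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# With alpha = 2, the expected correlation between any two frames is
# alpha^2 / (1 + alpha^2) = 0.8; the sampled value fluctuates around that.
rng = random.Random(0)
noise = mixed_video_noise(num_frames=4, num_pixels=20000, alpha=2.0, rng=rng)
frame_corr = correlation(noise[0], noise[1])
```

Keeping unit variance per entry means the noise still matches the marginal distribution an image diffusion model was trained on, while the shared component ties the frames together.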

Zero-shot text-to-video generation on the MSR-VTT dataset.
Model	Zero-Shot FID ↓
NUWA-Chinese (Wu et al., 2022)	47.68
CogVideo-Chinese (Hong et al., 2022)	24.78
CogVideo-English (Hong et al., 2022)	23.59
Make-A-Video (Singer et al., 2022)	13.17
PYoCo (Config-A)	10.21
PYoCo (Config-B)	9.95
PYoCo (Config-C)	9.91
PYoCo (Config-D)	9.73