We present PYoCo, a large-scale text-to-video diffusion model finetuned from eDiff-I, a state-of-the-art text-to-image generation model, with a novel video noise prior. Together with several design choices from prior work, including temporal attention, joint image-video finetuning, a cascaded generation architecture, and an ensemble of expert denoisers, PYoCo establishes a new state of the art for video generation, outperforming competing methods on several benchmark datasets. PYoCo achieves high-quality zero-shot video synthesis with superior photorealism and temporal consistency.
Gallery
Compositional Generation
Videos generated from prompts composed of a protagonist, an action, and a location.
Style Generation
Videos generated from a prompt rendered in different styles.
Quantitative Benchmarks
Unconditional video generation on UCF-101.

Model                Inception Score ↑
TGAN                 15.83 ± .18
LDVD-GAN             22.91 ± .19
VideoGPT             24.69 ± .30
MoCoGAN-HD           32.36 ± .00
DIGAN                29.71 ± .53
CCVS                 24.47 ± .13
StyleGAN-V           23.94 ± .73
VDM                  57.00 ± .62
TATS                 57.63 ± .73
PYoCo (112M)         57.93 ± .24
PYoCo (253M)         60.01 ± .51
We propose a video diffusion noise prior tailored to finetuning text-to-image diffusion models for text-to-video synthesis (see the sketch below).
We show that finetuning a text-to-image diffusion model with this prior leads to better knowledge transfer and more efficient training.
On the small-scale UCF-101 unconditional generation benchmark, we achieve a new state of the art with a 10× smaller model and 14× less training time.
On the zero-shot MSR-VTT evaluation, our model achieves a new state-of-the-art FID of 9.73.
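To make the idea concrete, below is a minimal PyTorch sketch of a "mixed" correlated noise prior for video diffusion: every frame of a clip receives the sum of a clip-level shared noise component and a per-frame independent component, so each frame's noise remains marginally standard Gaussian while frames of the same clip are positively correlated. The function name mixed_video_noise, the tensor layout (batch, frames, channels, height, width), and the default value of the correlation parameter alpha are illustrative assumptions, not the exact implementation used for PYoCo.

import torch

def mixed_video_noise(batch, frames, channels, height, width, alpha=1.0, device="cpu"):
    # Sketch of a mixed video noise prior (illustrative, not the official code).
    # Shared component: one sample per clip, variance alpha^2 / (1 + alpha^2).
    shared = torch.randn(batch, 1, channels, height, width, device=device)
    shared = shared * (alpha**2 / (1.0 + alpha**2)) ** 0.5
    # Independent component: one sample per frame, variance 1 / (1 + alpha^2).
    ind = torch.randn(batch, frames, channels, height, width, device=device)
    ind = ind * (1.0 / (1.0 + alpha**2)) ** 0.5
    # Each frame's noise is marginally N(0, I); any two frames of the same clip
    # share the clip-level component, giving correlation alpha^2 / (1 + alpha^2).
    return shared + ind

# Example: noise for a batch of 4 clips, each with 8 frames of 3x64x64 latents.
eps = mixed_video_noise(batch=4, frames=8, channels=3, height=64, width=64, alpha=2.0)
print(eps.shape)  # torch.Size([4, 8, 3, 64, 64])

Setting alpha to 0 in this sketch recovers the i.i.d. per-frame noise used for image diffusion, while larger values of alpha increase the temporal correlation of the sampled noise.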
Zero-shot text-to-video on MSR-VTT.

Model                Zero-Shot FID ↓
NUWA-Chinese         47.68
CogVideo-Chinese     24.78
CogVideo-English     23.59
Make-A-Video         13.17
PYoCo (Config-A)     10.21
PYoCo (Config-B)      9.95
PYoCo (Config-C)      9.91
PYoCo (Config-D)      9.73
Citation
@inproceedings{ge2023pyoco,
title={Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models},
author={Ge, Songwei and Nah, Seungjun and Liu, Guilin and Poon, Tyler and Tao, Andrew and Catanzaro, Bryan and Jacobs, David and Huang, Jia-Bin and Liu, Ming-Yu and Balaji, Yogesh},
booktitle={IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2023}
}