Abstract

We present PYoCo, a large-scale text-to-video diffusion model finetuned from a state-of-the-art image generation model, eDiff-I, with a novel video noise prior. Together with several design choices from prior work, including temporal attention, joint image-video finetuning, a cascaded generation architecture, and an ensemble of expert denoisers, PYoCo establishes a new state of the art for video generation, outperforming competing methods on several benchmark datasets. PYoCo achieves high-quality zero-shot video synthesis with superior photorealism and temporal consistency.
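One of the design choices named above is temporal attention, which adapts a pretrained image backbone to video. As a rough illustration only (this is a generic pattern, not the actual PYoCo/eDiff-I layer; the module name, sizes, and zero-initialization detail are assumptions), such a block attends across frames independently at each spatial location:

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the frame axis at every spatial location.

    A minimal sketch of the 'temporal attention' design choice named in
    the abstract; layer sizes and initialization are assumptions, not the
    exact PYoCo architecture.
    """

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Zero-init the output projection so, at the start of finetuning,
        # the layer is an identity and the model matches the image backbone.
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # Fold spatial positions into the batch; attend over the frame axis.
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        q = self.norm(seq)
        out, _ = self.attn(q, q, q)
        out = out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
        return x + out  # residual connection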

Method

Results

Compositional Generation

Style Generation

Quantitative Benchmarks

Unconditional video generation on UCF-101.
Model             Inception Score ↑
TGAN              15.83 ± .18
LDVD-GAN          22.91 ± .19
VideoGPT          24.69 ± .30
MoCoGAN-HD        32.36 ± .00
DIGAN             29.71 ± .53
CCVS              24.47 ± .13
StyleGAN-V        23.94 ± .73
VDM               57.00 ± .62
TATS              57.63 ± .73
PYoCo (112M)      57.93 ± .24
PYoCo (253M)      60.01 ± .51
  • We propose a video diffusion noise prior tailored to finetuning text-to-image diffusion models for text-to-video synthesis (see the sketch after this list).
  • We show that finetuning a text-to-image diffusion model with this prior yields better knowledge transfer and more efficient training.
  • On the small-scale unconditional UCF-101 benchmark, we achieve a new state of the art with a 10× smaller model and 14× less training time.
  • On the zero-shot MSR-VTT evaluation, our model achieves a new state-of-the-art FID of 9.73.
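The noise prior in the first bullet is the paper's central idea: instead of sampling i.i.d. Gaussian noise per frame, noise is sampled with correlations across frames. Below is a minimal sketch of the two variants the paper describes, a mixed model (shared base noise plus per-frame noise) and a progressive model (autoregressive perturbation of the previous frame's noise). The alpha hyperparameter and the exact unit-variance scaling are our reading of the construction, so treat this as illustrative rather than the reference implementation.

import torch

def mixed_noise(shape, alpha: float, generator=None):
    """Mixed noise prior sketch: every frame shares one base noise plus an
    independent per-frame component; alpha controls inter-frame correlation.
    shape = (batch, channels, frames, height, width).
    """
    b, c, t, h, w = shape
    shared = torch.randn(b, c, 1, h, w, generator=generator)  # broadcast over frames
    ind = torch.randn(b, c, t, h, w, generator=generator)
    # Scale so each frame's marginal distribution stays N(0, I).
    return (alpha * shared + ind) / (1.0 + alpha**2) ** 0.5

def progressive_noise(shape, alpha: float, generator=None):
    """Progressive noise prior sketch: frame i's noise is an AR(1)
    perturbation of frame i-1's noise, normalized to unit marginal variance.
    """
    b, c, t, h, w = shape
    frames = [torch.randn(b, c, h, w, generator=generator)]
    scale = (1.0 + alpha**2) ** 0.5
    for _ in range(t - 1):
        eps = torch.randn(b, c, h, w, generator=generator)
        frames.append((alpha * frames[-1] + eps) / scale)
    return torch.stack(frames, dim=2)  # (batch, channels, frames, height, width)

Because both samplers keep each frame's marginal distribution N(0, I), they can drop into a standard diffusion training or sampling loop wherever per-frame torch.randn noise would otherwise be used.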
Zero-shot text-to-video on MSR-VTT.
Model               Zero-Shot FID ↓
NUWA-Chinese        47.68
CogVideo-Chinese    24.78
CogVideo-English    23.59
Make-A-Video        13.17
PYoCo (Config-A)    10.21
PYoCo (Config-B)     9.95
PYoCo (Config-C)     9.91
PYoCo (Config-D)     9.73

Citation

@inproceedings{ge2023pyoco,
  title={Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models},
  author={Ge, Songwei and Nah, Seungjun and Liu, Guilin and Poon, Tyler and Tao, Andrew and Catanzaro, Bryan and Jacobs, David and Huang, Jia-Bin and Liu, Ming-Yu and Balaji, Yogesh},
  booktitle={IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2023}
}