Abstract

We present PYoCo, a large-scale text-to-video diffusion model finetuned from a state-of-the-art image generation model, eDiff-I, with a novel video noise prior. Together with several design choices from prior work, including temporal attention, joint image-video finetuning, a cascaded generation architecture, and an ensemble of expert denoisers, PYoCo establishes a new state of the art for video generation, outperforming competing methods on several benchmark datasets. PYoCo achieves high-quality zero-shot video synthesis with superior photorealism and temporal consistency.
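One of the design choices named above is temporal attention, which adapts a pretrained image backbone to video. As a rough illustration only (this is a generic pattern, not the actual PYoCo/eDiff-I layer; the module name, sizes, and zero-initialization detail are assumptions), such a block attends across frames independently at each spatial location:

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the frame axis at every spatial location.

    A minimal sketch of the 'temporal attention' design choice named in
    the abstract; layer sizes and initialization are assumptions, not the
    exact PYoCo architecture.
    """

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Zero-init the output projection so, at the start of finetuning,
        # the layer is an identity and the model matches the image backbone.
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # Fold spatial positions into the batch; attend over the frame axis.
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        q = self.norm(seq)
        out, _ = self.attn(q, q, q)
        out = out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
        return x + out  # residual connection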

Method

Results

Compositional Generation

Style Generation

Quantitative Benchmarks

Unconditional video generation on UCF-101.
Model             Inception Score ↑
TGAN              15.83 ± .18
LDVD-GAN          22.91 ± .19
VideoGPT          24.69 ± .30
MoCoGAN-HD        32.36 ± .00
DIGAN             29.71 ± .53
CCVS              24.47 ± .13
StyleGAN-V        23.94 ± .73
VDM               57.00 ± .62
TATS              57.63 ± .73
PYoCo (112M)      57.93 ± .24
PYoCo (253M)      60.01 ± .51
  • We propose a video diffusion noise prior tailored to finetuning text-to-image diffusion models for text-to-video synthesis (see the sketch after this list).
  • We show that finetuning a text-to-image diffusion model with this prior yields better knowledge transfer and more efficient training.
  • On the small-scale unconditional UCF-101 benchmark, we achieve a new state of the art with a 10× smaller model and 14× less training time.
  • On the zero-shot MSR-VTT evaluation, our model achieves a new state-of-the-art FID of 9.73.
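The noise prior in the first bullet is the paper's central idea: instead of sampling i.i.d. Gaussian noise per frame, noise is sampled with correlations across frames. Below is a minimal sketch of the two variants the paper describes, a mixed model (shared base noise plus per-frame noise) and a progressive model (autoregressive perturbation of the previous frame's noise). The alpha hyperparameter and the exact unit-variance scaling are our reading of the construction, so treat this as illustrative rather than the reference implementation.

import torch

def mixed_noise(shape, alpha: float, generator=None):
    """Mixed noise prior sketch: every frame shares one base noise plus an
    independent per-frame component; alpha controls inter-frame correlation.
    shape = (batch, channels, frames, height, width).
    """
    b, c, t, h, w = shape
    shared = torch.randn(b, c, 1, h, w, generator=generator)  # broadcast over frames
    ind = torch.randn(b, c, t, h, w, generator=generator)
    # Scale so each frame's marginal distribution stays N(0, I).
    return (alpha * shared + ind) / (1.0 + alpha**2) ** 0.5

def progressive_noise(shape, alpha: float, generator=None):
    """Progressive noise prior sketch: frame i's noise is an AR(1)
    perturbation of frame i-1's noise, normalized to unit marginal variance.
    """
    b, c, t, h, w = shape
    frames = [torch.randn(b, c, h, w, generator=generator)]
    scale = (1.0 + alpha**2) ** 0.5
    for _ in range(t - 1):
        eps = torch.randn(b, c, h, w, generator=generator)
        frames.append((alpha * frames[-1] + eps) / scale)
    return torch.stack(frames, dim=2)  # (batch, channels, frames, height, width)

Because both samplers keep each frame's marginal distribution N(0, I), they can drop into a standard diffusion training or sampling loop wherever per-frame torch.randn noise would otherwise be used.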
Zero-shot text-to-video on MSR-VTT.
Model               Zero-Shot FID ↓
NUWA-Chinese        47.68
CogVideo-Chinese    24.78
CogVideo-English    23.59
Make-A-Video        13.17
PYoCo (Config-A)    10.21
PYoCo (Config-B)     9.95
PYoCo (Config-C)     9.91
PYoCo (Config-D)     9.73

Citation

@inproceedings{ge2023pyoco,
  title={Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models},
  author={Ge, Songwei and Nah, Seungjun and Liu, Guilin and Poon, Tyler and Tao, Andrew and Catanzaro, Bryan and Jacobs, David and Huang, Jia-Bin and Liu, Ming-Yu and Balaji, Yogesh},
  booktitle={IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2023}
}