We present PYoCo, a large-scale text-to-video diffusion model finetuned from a state-of-the-art image generation model, eDiff-I, with a novel video noise prior. Combined with several design choices from prior work, including temporal attention, joint image-video finetuning, a cascaded generation architecture, and an ensemble of expert denoisers, PYoCo establishes a new state of the art for video generation, outperforming competing methods on several benchmark datasets. PYoCo achieves high-quality zero-shot video synthesis with superior photorealism and temporal consistency.
For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing.