EquiVDM: Equivariant Video Diffusion Models with Temporally Consistent Noise

Temporally consistent video-to-video generation is essential for applications of video diffusion models in areas such as sim-to-real, style transfer, and video upsampling. In this paper, we propose a video diffusion framework that leverages temporally consistent noise to generate coherent video frames without specialized modules or additional constraints. We show that the standard training objective of diffusion models, when applied with temporally consistent noise, encourages the model to be equivariant to spatial transformations of the input video and noise. This enables our model to better follow motion patterns from the input video, producing aligned motion and high-fidelity frames. Furthermore, we extend our approach to 3D-consistent video generation by attaching noise as textures on 3D meshes, ensuring 3D consistency in sim-to-real applications. Experimental results demonstrate that our method surpasses state-of-the-art baselines in motion alignment, 3D consistency, and video quality while requiring only a few sampling steps in practice.
Recent advancements in video diffusion models (VDMs) have significantly improved the quality and photo-realism of generated videos. To achieve temporal consistency in generated videos, 3D convolution or attention layers are introduced to better capture and propagate spatiotemporal information. While these designs can improve temporal consistency, they typically rely on extensive training on large-scale, high-quality video datasets to learn to generate realistic frames with natural, coherent motion patterns from random noise.
An alternative line of work aims to generate temporally consistent frames by directly sampling from temporally correlated noise. This is particularly appealing for video-to-video applications, where an input video can be used to drive the coherent noise. However, standard diffusion networks are not intrinsically equivariant to noise-warping transformations, due to their highly nonlinear layers. Thus, sampling-time guidance or regularization strategies are needed to achieve approximate equivariance, which can introduce additional hyperparameters and complexity into the generation pipeline.
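To make the idea of temporally correlated noise concrete, the sketch below propagates a single Gaussian noise frame through a video by warping it along optical flow. The function `warp_noise_with_flow`, the nearest-neighbor sampling choice, and the random stand-in flows are our own simplifications for illustration, not the exact noise-warping procedure of EquiVDM or of prior work.

```python
# Minimal sketch: temporally correlated noise obtained by warping one Gaussian
# noise frame along (backward) optical flow. Nearest-neighbor sampling is a
# crude way to roughly preserve the marginal noise distribution.
import torch
import torch.nn.functional as F


def warp_noise_with_flow(noise, flow):
    """Warp a noise frame (1, C, H, W) with backward optical flow (1, 2, H, W)."""
    _, _, h, w = noise.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=noise.dtype),
        torch.arange(w, dtype=noise.dtype),
        indexing="ij",
    )
    grid_x = xs + flow[:, 0]  # (1, H, W)
    grid_y = ys + flow[:, 1]
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack(
        [2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0], dim=-1
    )  # (1, H, W, 2)
    return F.grid_sample(
        noise, grid, mode="nearest", padding_mode="border", align_corners=True
    )


# Toy usage: propagate one noise frame through a short flow sequence.
T, C, H, W = 8, 4, 64, 64
flows = torch.randn(T - 1, 2, H, W) * 2.0  # stand-in for estimated optical flow
noise_frames = [torch.randn(1, C, H, W)]
for t in range(T - 1):
    noise_frames.append(warp_noise_with_flow(noise_frames[-1], flows[t : t + 1]))
warped_noise = torch.cat(noise_frames, dim=0)  # (T, C, H, W), temporally correlated
```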
In this work, we introduce EquiVDM, a video-to-video diffusion model that is equivariant to spatial warping transformations of the input noise. We show that the standard training objective of diffusion models, when applied with temporally consistent noise, encourages the model to be equivariant to spatial transformations in the input. Our approach makes equivariance an inherent property of the VDM itself during training, so temporal coherence comes at no extra cost in terms of model complexity or runtime overhead.
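The sketch below illustrates our reading of this point: the loss is the standard epsilon-prediction objective, and the only change is that the noise added to the video is temporally consistent (warped) rather than i.i.d. across frames. The `model`, `alphas_cumprod`, and tensor shapes are placeholder assumptions, not EquiVDM's actual interfaces.

```python
# Sketch: standard denoising objective, with warped noise substituted for
# independent per-frame noise. Because the target noise is shared structure
# across frames, predicting it pushes the model towards equivariance:
#   model(T(x_t)) ~ T(model(x_t)) for a spatial warp T applied to both inputs.
import torch
import torch.nn.functional as F


def diffusion_loss(model, video, warped_noise, alphas_cumprod):
    """Epsilon-prediction loss; video, warped_noise: (B, T, C, H, W)."""
    b = video.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=video.device)
    a = alphas_cumprod[t].view(b, 1, 1, 1, 1)
    noisy = a.sqrt() * video + (1.0 - a).sqrt() * warped_noise
    pred = model(noisy, t)
    return F.mse_loss(pred, warped_noise)
```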
Additionally, we propose a novel approach to construct 3D-consistent noise for 3D-consistent video generation. We attach Gaussian noise as textures to 3D meshes and render the resulting noise images from various camera viewpoints as input to the diffusion model. We train EquiVDM on 2D videos without any 3D information using motion-based warped noise, and switch to the 3D-consistent noise at inference time. We demonstrate that EquiVDM leverages its equivariance properties to align the generated video frames faithfully with the underlying 3D geometry and camera poses, in applications such as sim-to-real where a 3D mesh of the input scene is available. Please see the qualitative results demonstrating 3D-consistent video generation here.
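A minimal sketch of this rendering step is shown below, assuming a rasterizer provides, for each pixel, the index of the mesh texel visible in that view; all names and shapes are hypothetical stand-ins for illustration, not the actual implementation.

```python
# Sketch: 3D-consistent noise by attaching i.i.d. Gaussian values to mesh
# texels once, then "rendering" them per view via a per-pixel texel lookup.
# The texel index map would come from rasterizing the mesh (visible face +
# UV coordinates); here it is a random stand-in.
import torch


def render_noise_texture(noise_texture, texel_index_map):
    """noise_texture: (N_texels, C); texel_index_map: (H, W) long -> (C, H, W)."""
    h, w = texel_index_map.shape
    pixels = noise_texture[texel_index_map.reshape(-1)]  # (H*W, C)
    return pixels.reshape(h, w, -1).permute(2, 0, 1)


# Toy usage: the same noise texture rendered under two "camera poses" agrees
# wherever the two views see the same surface points.
n_texels, c, h, w = 10_000, 4, 64, 64
noise_texture = torch.randn(n_texels, c)
texel_map_view1 = torch.randint(0, n_texels, (h, w))
texel_map_view2 = texel_map_view1.roll(shifts=5, dims=1)  # fake camera shift
noise_view1 = render_noise_texture(noise_texture, texel_map_view1)
noise_view2 = render_noise_texture(noise_texture, texel_map_view2)
```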
The table below shows the performance comparison between EquiVDM and other methods. EquiVDM-base generates videos from warped noise using text-only prompts, while EquiVDM-full adds conditioning modules fine-tuned from CtrlAdapter [1]. Even without the additional control modules, EquiVDM-base already performs on par with or better than the compared methods that use dedicated control modules. This demonstrates that EquiVDM learns to generate better videos by taking advantage of the temporal correlation in the warped noise input, and it indicates that this correlation can serve as a strong prior for both motion patterns and semantic information. With the additional ControlNet modules, the quality and prompt-following of the videos generated by EquiVDM-full improve further, indicating that the benefit of equivariance is complementary to the additional conditioning modules. As a result, for video-to-video generation tasks, we can improve performance by making the full model noise-equivariant without any architectural modifications.
Reference:
[1] Lin et al. Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model. ICLR 2025.
Since the motion information of the video is already contained in the warped noise, fewer sampling steps are needed than when sampling from independent noise, where both the motion and the appearance have to be generated from scratch, as shown in the videos below.
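For concreteness, the sketch below shows a few-step DDIM-style sampler initialized from the warped noise; the `model`, noise schedule, and step count are placeholder assumptions rather than EquiVDM's exact sampler.

```python
# Sketch: deterministic few-step sampling starting from warped noise. Because
# the warped noise already carries the input video's motion, a short step
# schedule can suffice in practice.
import torch


@torch.no_grad()
def ddim_sample(model, warped_noise, alphas_cumprod, num_steps=4):
    """warped_noise: (B, T, C, H, W); returns a denoised video of the same shape."""
    n_train = alphas_cumprod.shape[0]
    timesteps = torch.linspace(n_train - 1, 0, num_steps).long()
    x = warped_noise
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        eps = model(x, t.repeat(x.shape[0]))
        x0 = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted clean video
        x = a_prev.sqrt() * x0 + (1.0 - a_prev).sqrt() * eps   # deterministic DDIM step
    return x
```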
We compare our method with video-to-video diffusion models that use per-frame dense conditioning. In the results below, CtrlAdapter [1] and EquiVDM-full used a soft-edge map as the control signal for each frame along with the text prompt, whereas EquiVDM-base used only the text prompt. Our base model (EquiVDM-base), without additional control modules, outperforms models that employ dense frame-level conditioning. Moreover, when EquiVDM is paired with ControlNet (EquiVDM-full), its performance improves even further. Please see the comparisons with more methods.
Qualitative comparison (left to right): CtrlAdapter | EquiVDM-base | EquiVDM-full | Ground Truth | Noise