This research, called Motion by Queries, reveals differences in how self-attention queries (Q) behave in video versus image generation. While Q features in images mainly affect structure, in video they encode both motion and identity information. Understanding this dual role enabled us to develop a zero-shot motion transfer method that is 20 times more efficient than existing approaches, and a training-free technique for consistent multi-shot video generation.
The methods and results in Section 5, "Consistent multi-shot video generation", are based on arXiv version 1 (v1) of this work. Version 2 (v2) extends and further analyzes those findings, applying them to efficient motion transfer.
Analysis of Q-injection behavior: (a) Unlike text-to-image (T2I) models, where Q mainly affects structure, in text-to-video (T2V) models Q vectors encode both motion and identity information. (b) When source and target share the same subject (purple), both motion and appearance transfer; with different subjects (green), primarily motion transfers. (c) Extended attention across shots requires longer Q-injection to recover motion, which increases identity leakage.
Text-to-video diffusion models have shown remarkable progress in generating coherent video clips from textual descriptions. However, the interplay between motion, structure, and identity representations in these models remains under-explored. Here, we investigate how self-attention query features (a.k.a. Q features) simultaneously govern motion, structure, and identity, and examine the challenges that arise when these representations interact. Our analysis reveals that Q affects more than layout: during denoising, Q also has a strong effect on subject identity, making it hard to transfer motion without also transferring identity. Understanding this dual role enabled us to control query feature injection (Q injection) and demonstrate two applications: (1) a zero-shot motion transfer method that is 20 times more efficient than existing approaches, and (2) a training-free technique for consistent multi-shot video generation, where characters maintain identity across multiple video shots while Q injection enhances motion fidelity.
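To make the idea of Q injection concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. It assumes that injection means replacing the target's self-attention queries with queries taken from a source generation, and that the injection is limited to a number of denoising steps. The function names, the `inject_until` parameter, and the choice to inject during the early steps are all hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(k[0]))
    out = []
    for q_tok in q:
        # Similarity of this query token to every key token.
        scores = [scale * sum(a * b for a, b in zip(q_tok, k_tok)) for k_tok in k]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([
            sum(w * v_tok[d] for w, v_tok in zip(weights, v))
            for d in range(len(v[0]))
        ])
    return out


def attention_with_q_injection(q_target, k_target, v_target, q_source,
                               step, inject_until):
    """Hypothetical helper: use source queries while step < inject_until,
    and the target's own queries afterwards. Keys and values always come
    from the target generation."""
    q = q_source if step < inject_until else q_target
    return attention(q, k_target, v_target)


# Toy example: two tokens with two-dimensional features.
q_src = [[1.0, 0.0], [0.0, 1.0]]
q_tgt = [[0.0, 1.0], [1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

early = attention_with_q_injection(q_tgt, k, v, q_src, step=0, inject_until=5)
late = attention_with_q_injection(q_tgt, k, v, q_src, step=10, inject_until=5)
```

In this sketch, a larger `inject_until` corresponds to a longer injection window. Under the analysis above, a longer window would be expected to carry over more of the source motion, and also more of the source identity.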
| Method | Result on the prompts "do aerobics", "magic, pull scarves", and "skate, wobbling" |
|---|---|
| No Q Intervention | Keeps characters consistent but compromises motion quality: frozen body with displaced legs, repetitive swaying, static camera. |
| VideoCrafter2 | Offers diverse motion but doesn't allow for character consistency. |
| Full Q Preservation | Identity leaks from VideoCrafter2, but maintains motion. |
| Ours | Maintains consistent character identity across shots while preserving VideoCrafter2 motion without leaking identity from VideoCrafter2. |
@article{atzmon2025motion,
  title   = {Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation},
  author  = {Atzmon, Yuval and Gal, Rinon and Tewel, Yoad and Kasten, Yoni and Chechik, Gal},
  journal = {arXiv preprint arXiv:2412.07750v2},
  year    = {2024},
}