This research, called Motion by Queries, reveals differences in how self-attention queries (Q) behave in video versus image generation. While Q features in images mainly affect structure, in video they encode both motion and identity information. Understanding this dual role enabled us to develop a zero-shot motion transfer method that is 10 times more efficient than existing approaches, and a training-free technique for consistent multi-shot video generation.
The methods and results in Section 5, "Consistent multi-shot video generation", are based on arXiv version 1 (v1) of this work. Here, in version 3 (v3), we extend those findings with further analysis and apply them to efficient motion transfer using VideoCrafter2 and WAN2.1 (1.3B).
Analysis of Q-injection behavior: (a) Unlike T2I models where Q mainly affects structure, in T2V models Q vectors encode both motion and identity information. (b) When source and target share the same subject (purple), both motion and appearance transfer; with different subjects (green), primarily motion transfers. (c) Extended attention across shots requires longer Q-injection to recover motion, increasing identity leakage.
Text-to-video diffusion models have shown remarkable progress in generating coherent video clips from textual descriptions. However, the interplay between motion, structure, and identity representations in these models remains under-explored. Here, we investigate how self-attention query features (Q features) simultaneously govern motion, structure, and identity, and we examine the challenges that arise when these representations interact. Our analysis reveals that Q affects more than layout: during denoising, Q also has a strong effect on subject identity, which makes it hard to transfer motion without also transferring identity. Understanding this dual role enabled us to control query feature injection (Q injection) and demonstrate two applications: (1) a zero-shot motion transfer method that is 10 times more efficient than existing approaches, and (2) a training-free technique for consistent multi-shot video generation, where characters maintain identity across multiple video shots while Q injection enhances motion fidelity.
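To make the idea of Q injection concrete, here is a minimal NumPy sketch of a single-head self-attention layer whose queries can be replaced by queries taken from a source pass. This is an illustration under stated assumptions, not the method's actual implementation: the names `self_attention`, `q_override`, `should_inject`, and `inject_fraction` are hypothetical, the loop is a toy stand-in for a denoising loop, and the rule of injecting only during an early fraction of steps is an assumed schedule chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v, q_override=None):
    """Single-head self-attention over tokens x of shape (n_tokens, dim).

    If q_override is given, it replaces the queries computed from x.
    This is the hypothetical "Q injection" hook. Keys and values always
    come from x itself. Returns (output, queries_used).
    """
    q = x @ w_q if q_override is None else q_override
    k = x @ w_k
    v = x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v, q


def should_inject(step, num_steps, inject_fraction):
    """Assumed schedule: inject only in the first fraction of steps."""
    return step < int(round(inject_fraction * num_steps))


# Toy setup: random features stand in for source and target video tokens.
rng = np.random.default_rng(0)
n_tokens, dim, num_steps = 6, 8, 10
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
source = rng.standard_normal((n_tokens, dim))
target = rng.standard_normal((n_tokens, dim))

injected_steps = []
for step in range(num_steps):
    # Source pass: record the queries the source features produce.
    _, q_src = self_attention(source, w_q, w_k, w_v)
    # Target pass: reuse the source queries only while the schedule allows.
    inject = should_inject(step, num_steps, inject_fraction=0.3)
    out, q_used = self_attention(
        target, w_q, w_k, w_v, q_override=q_src if inject else None
    )
    if inject:
        injected_steps.append(step)
```

In this sketch, `inject_fraction` is the knob that trades motion fidelity against identity leakage: a larger value reuses the source queries for more steps, which, per the analysis above, would be expected to carry over more of the source identity along with its motion.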
Source | A rabbit in a field, low poly, orbit shot | Source | A Roman soldier standing in front of the Colosseum |
---|---|---|---|
Source | A cartoon of a dancing sloth | Source | A goose walking in a puddle |
Source | A horse crossing a wide river in a meadow | Source | A ship sailing on the sea |
Source | A cat is chasing a mouse in a beach | Source | A monkey climbing on a wall |
---|---|---|---|
Source | A penguin is sliding on an icy slope | Source | A teddy bear on a scooter, viewed from behind, a smooth tracking shot |
Source | A lion sitting on top of a cliff captured with a zoom out | Source | A tiger in a push-up position performing dumbbell rows in the forest |
Source | A goalkeeper in a red jersey leaps skyward to catch the ball | Source | A dolphin jumping into the ocean |
Shot prompts: "do aerobics"; "magic, pull scarves"; "skate, wobbling".

Method | Result |
---|---|
No Q Intervention | Keeps characters consistent but compromises motion quality: frozen body with displaced legs, repetitive swaying, static camera. |
VideoCrafter2 | Offers diverse motion but does not keep characters consistent. |
Full Q Preservation | Maintains motion, but identity leaks from VideoCrafter2. |
Ours | Maintains consistent character identity across shots while preserving VideoCrafter2 motion, without leaking identity from VideoCrafter2. |
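
The analysis caption near the top of this page refers to extended attention across shots. As a rough illustration of that idea, the sketch below assumes one common formulation, in which each shot's queries attend to the keys and values of all shots concatenated together. The exact variant used in this work is not specified on this page, and the function name `extended_attention` and the tensor layout are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def extended_attention(q_shots, k_shots, v_shots):
    """Assumed form of extended attention across shots.

    All inputs have shape (n_shots, n_tokens, dim). Every shot keeps its
    own queries, but attends to the keys and values of all shots.
    Returns an array of shape (n_shots, n_tokens, dim).
    """
    n_shots, n_tokens, dim = q_shots.shape
    # Pool keys and values from every shot into one shared sequence.
    k_all = k_shots.reshape(n_shots * n_tokens, dim)
    v_all = v_shots.reshape(n_shots * n_tokens, dim)
    # scores has shape (n_shots, n_tokens, n_shots * n_tokens).
    scores = q_shots @ k_all.T / np.sqrt(dim)
    return softmax(scores) @ v_all


# Toy usage: three shots, four tokens each, random features.
rng = np.random.default_rng(1)
q = rng.standard_normal((3, 4, 8))
k = rng.standard_normal((3, 4, 8))
v = rng.standard_normal((3, 4, 8))
shared = extended_attention(q, k, v)
```

Because keys and values are pooled while queries stay per shot, the queries are the natural place to intervene on each shot's motion in a setup like this, which matches the roles of the "No Q Intervention" and "Full Q Preservation" rows in the comparison above.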
@article{atzmon2025motion,
  title   = {Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation},
  author  = {Atzmon, Yuval and Gal, Rinon and Tewel, Yoad and Kasten, Yoni and Chechik, Gal},
  journal = {arXiv preprint arXiv:2412.07750v3},
  year    = {2024},
}