NVIDIA Research
Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation.

This research, called Motion by Queries, reveals differences in how self-attention queries (Q) behave in video versus image generation. While Q features in images mainly affect structure, in video they encode both motion and identity information. Understanding this dual role enabled us to develop a zero-shot motion transfer method that is 10 times more efficient than existing approaches, and a training-free technique for consistent multi-shot video generation.

The methods and results in Section 5, "Consistent multi-shot video generation", are based on arXiv version 1 (v1) of this work. Version 3 (v3), presented here, extends and further analyzes those findings, applying them to efficient motion transfer with VideoCrafter2 and WAN 2.1 (1.3B).

[Video grid. Columns: Source | ✔ Motion, structure; ✖ Identity leak | ✔ Motion, structure; ✔ Different identity. Rows: WAN 2.1 (1.3B); VideoCrafter2; T2V-Turbo-V2 (8 steps); LTX-Video.]
Fig. Analysis

Analysis of Q-injection behavior: (a) Unlike T2I models where Q mainly affects structure, in T2V models Q vectors encode both motion and identity information. (b) When source and target share the same subject (purple), both motion and appearance transfer; with different subjects (green), primarily motion transfers. (c) Extended attention across shots requires longer Q-injection to recover motion, increasing identity leakage.
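
To make the mechanism in the caption concrete, the following is a minimal sketch of Q injection in a single self-attention layer, written in plain Python for readability. It is not the authors' implementation: the function names (`attention`, `attention_with_q_injection`) and the toy two-token inputs are assumptions made for illustration. The only idea it shows is that the target generation keeps its own keys and values while its queries are replaced by queries recorded from the source video.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


def attention_with_q_injection(q_target, k_target, v_target, q_source=None):
    """Hypothetical helper: use source queries when given, else target queries.

    Keys and values always come from the target generation (an assumption
    of this sketch about what "Q injection" replaces).
    """
    queries = q_target if q_source is None else q_source
    return attention(queries, k_target, v_target)


# Toy example: two tokens with two channels each (illustrative values only).
q_src = [[1.0, 0.0], [0.0, 1.0]]
q_tgt = [[0.0, 1.0], [1.0, 0.0]]
k_tgt = [[1.0, 0.0], [0.0, 1.0]]
v_tgt = [[1.0, 0.0], [0.0, 1.0]]

plain = attention_with_q_injection(q_tgt, k_tgt, v_tgt)
injected = attention_with_q_injection(q_tgt, k_tgt, v_tgt, q_source=q_src)
```

In the toy example, swapping in the source queries changes which target tokens each position attends to, while the values being mixed still come from the target.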


Abstract

Text-to-video diffusion models have shown remarkable progress in generating coherent video clips from textual descriptions. However, the interplay between motion, structure, and identity representations in these models remains under-explored. Here, we investigate how self-attention query features (a.k.a. Q features) simultaneously govern motion, structure, and identity and examine the challenges arising when these representations interact. Our analysis reveals that Q affects not only layout, but that during denoising Q also has a strong effect on subject identity, making it hard to transfer motion without the side-effect of transferring identity. Understanding this dual role enabled us to control query feature injection (Q injection) and demonstrate two applications: (1) a zero-shot motion transfer method that is 10 times more efficient than existing approaches, and (2) a training-free technique for consistent multi-shot video generation, where characters maintain identity across multiple video shots while Q injection enhances motion fidelity.

Motion Transfer with VideoCrafter2

[Video grid: each example shows a source video next to the video generated for one of the prompts below.]
A rabbit in a field, low poly, orbit shot
A Roman soldier standing in front of the Colosseum
A cartoon of a dancing sloth
A goose walking in a puddle
A horse crossing a wide river in a meadow
A ship sailing on the sea

Motion Transfer with WAN 2.1 (1.3B)

[Video grid: each example shows a source video next to the video generated for one of the prompts below.]
A cat is chasing a mouse in a beach
A monkey climbing on a wall
A penguin is sliding on an icy slope
A teddy bear on a scooter, viewed from behind, a smooth tracking shot
A lion sitting on top of a cliff captured with a zoom out
A tiger in a push-up position performing dumbbell rows in the forest
A goalkeeper in a red jersey leaps skyward to catch the ball
A dolphin jumping into the ocean

Comparisons

"Ours" uses VideoCrafter2.

[Video grid. Columns: Source | Ours | Motion Clone | Motion Inversion | DMT. One row per prompt:]
Two monkeys playing with coconuts
A boy walking in a field
A train riding on rails in autumn view
A fox sitting in a snowy mountain

[Video grid. Columns: Source | Ours | Motion Clone | Motion Inversion. One row per prompt:]
A penguin is sliding on an icy slope
A Roman soldier standing in front of the Colosseum
Microchip (Custom Shot)
A rabbit in a field, low poly, orbit shot

Multi-Shot Subject Generation - Q injection Analysis

[Video grid. Columns (prompts): do aerobics | magic, pull scarves | skate, wobbling. One row per method:]
No Q Intervention
Keeps characters consistent but compromises motion quality: frozen body with displaced legs, repetitive swaying, static camera.
VideoCrafter2
Offers diverse motion but does not keep characters consistent.
Full Q Preservation
Maintains motion, but identity leaks from VideoCrafter2.
Ours
Maintains consistent character identity across shots while preserving VideoCrafter2 motion, without leaking identity from VideoCrafter2.
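
One way to picture the settings compared above is as a per-step injection schedule. The sketch below is hedged: it assumes, for illustration only, that the settings differ in how many denoising steps receive injected Q and that injection happens in the earliest steps. The function name `q_injection_schedule`, the step count, and the partial value of 3 are hypothetical and are not taken from this page; the actual method may control injection differently.

```python
def q_injection_schedule(num_steps, inject_steps):
    """Return one boolean per denoising step: True where Q is injected.

    Assumption of this sketch: injection covers the first `inject_steps`
    steps and is switched off for the remaining ones.
    """
    if not 0 <= inject_steps <= num_steps:
        raise ValueError("inject_steps must be between 0 and num_steps")
    return [step < inject_steps for step in range(num_steps)]


NUM_STEPS = 8  # hypothetical number of denoising steps

# No injection at any step (analogous to "No Q Intervention").
no_intervention = q_injection_schedule(NUM_STEPS, 0)

# Injection at every step (analogous to "Full Q Preservation").
full_preservation = q_injection_schedule(NUM_STEPS, NUM_STEPS)

# Injection for a limited number of steps; the value 3 is arbitrary.
partial = q_injection_schedule(NUM_STEPS, 3)
```

Under these assumptions, a longer schedule corresponds to stronger motion preservation at the cost of more identity leakage, which is the trade-off the analysis describes.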

BibTeX Citation

@article{atzmon2025motion,
    title = {Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation.},
    author = {Atzmon, Yuval and Gal, Rinon and Tewel, Yoad and Kasten, Yoni and Chechik, Gal},
    journal = {arXiv preprint arXiv:2412.07750v3},
    year = {2024},
}