NVIDIA Research
Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation.

This research, called Motion by Queries, reveals differences in how self-attention queries (Q) behave in video versus image generation. While Q features in images mainly affect structure, in video they encode both motion and identity information. Understanding this dual role enabled us to develop a zero-shot motion transfer method that is 20 times more efficient than existing approaches, and a training-free technique for consistent multi-shot video generation.

The methods and results in section 5, "Consistent multi-shot video generation", are based on arXiv version 1 (v1) of this work. Here, in version 2 (v2), we analyze those findings further and extend them to efficient motion transfer.

Video grid. Columns: Source; ✔ Motion, structure, ✖ Identity leak; ✔ Motion, structure, ✔ Different identity.
Rows: LTX-Video; VideoCrafter2; T2V-Turbo-V2 (8 steps).

Fig. Analysis

Analysis of Q-injection behavior: (a) Unlike T2I models where Q mainly affects structure, in T2V models Q vectors encode both motion and identity information. (b) When source and target share the same subject (purple), both motion and appearance transfer; with different subjects (green), primarily motion transfers. (c) Extended attention across shots requires longer Q-injection to recover motion, increasing identity leakage.
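
To make the term concrete, here is a minimal NumPy sketch of Q injection in a single self-attention layer: the target generation keeps its own keys and values but uses the queries recorded from the source video. This is an illustration under stated assumptions, not the released implementation; the function names, tensor shapes, and random inputs are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    """Scaled dot-product attention for arrays of shape (tokens, dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def attention_with_q_injection(q_target, k_target, v_target, q_source=None):
    """Hypothetical helper: swap in source queries when q_source is given."""
    q = q_target if q_source is None else q_source
    return self_attention(q, k_target, v_target)

# Toy features standing in for one attention layer at one denoising step.
rng = np.random.default_rng(0)
tokens, dim = 6, 4
q_src = rng.normal(size=(tokens, dim))
q_tgt, k_tgt, v_tgt = (rng.normal(size=(tokens, dim)) for _ in range(3))

plain = attention_with_q_injection(q_tgt, k_tgt, v_tgt)
injected = attention_with_q_injection(q_tgt, k_tgt, v_tgt, q_source=q_src)
```

Because only Q is swapped, the output is still a weighted mix of the target's own values; the source influences which tokens attend to which.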


Abstract

Text-to-video diffusion models have shown remarkable progress in generating coherent video clips from textual descriptions. However, the interplay between motion, structure, and identity representations in these models remains under-explored. Here, we investigate how self-attention query features (a.k.a. Q features) simultaneously govern motion, structure, and identity and examine the challenges arising when these representations interact. Our analysis reveals that Q affects not only layout, but that during denoising Q also has a strong effect on subject identity, making it hard to transfer motion without the side-effect of transferring identity. Understanding this dual role enabled us to control query feature injection (Q injection) and demonstrate two applications: (1) a zero-shot motion transfer method that is 20 times more efficient than existing approaches, and (2) a training-free technique for consistent multi-shot video generation, where characters maintain identity across multiple video shots while Q injection enhances motion fidelity.
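
One way to picture "controlling" Q injection is as a per-step schedule: inject source queries for only part of the denoising process, so that motion is picked up while identity leakage is limited. The sketch below assumes injection is applied during an early fraction of the steps; that choice, the `inject_fraction` parameter, and the function name are illustrative assumptions rather than details stated on this page.

```python
def q_injection_mask(num_steps, inject_fraction):
    """Return one boolean per denoising step; True means inject source Q.

    Assumption for illustration: injection covers the first
    `inject_fraction` of the steps and is switched off afterwards.
    """
    if not 0.0 <= inject_fraction <= 1.0:
        raise ValueError("inject_fraction must be in [0, 1]")
    num_inject = round(num_steps * inject_fraction)
    return [step < num_inject for step in range(num_steps)]

# Example: an 8-step sampler with injection during the first quarter of steps.
mask = q_injection_mask(num_steps=8, inject_fraction=0.25)
```

A larger fraction corresponds to longer Q injection, which the analysis above associates with stronger motion recovery and more identity leakage.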

Motion Transfer Results

Video grid; each entry pairs a source video with a generated video captioned as follows:
A rabbit in a field, low poly, orbit shot
A lion sitting on top of a cliff
A cartoon of a dancing sloth
A penguin is sliding on an icy slope
A robot walking in a puddle
A ship sailing on the sea

More Results

Video grid; each entry pairs a source video with a generated video captioned as follows:
A pirate ship sailing in a river
Dining room captured with a pan right
A Roman soldier standing in front of the Colosseum
A fish swimming in the lake
Sea ice during sunset (Crane up shot)
A motorbike driving in a forest
Forest captured with a tilt up; Mars captured with a tilt up; Snowy field captured with a tilt up (shared source)

Comparisons

Video grid. Columns: Source; Ours; Motion Inversion; DMT. One row per prompt:
Two monkeys playing with coconuts
A train riding on rails in autumn view
A boy walking in a field
A fox sitting in a snowy mountain
 
Additional Comparisons

Video grid. Columns: Source; Ours; Motion Inversion. One row per prompt:
A Roman soldier standing in front of the Colosseum
Microchip (Custom Shot)
A turtle plods in the sea
A penguin is sliding on an icy slope
A rabbit in a field, low poly, orbit shot

Multi-Shot Subject Generation - Q injection Analysis

Shot prompts (one per column): "do aerobics"; "magic, pull scarves"; "skate, wobbling"
No Q Intervention
Keeps characters consistent but compromises motion quality: frozen body with displaced legs, repetitive swaying, static camera.
VideoCrafter2
Offers diverse motion but does not allow for character consistency.
Full Q Preservation
Identity leaks from VideoCrafter2, but motion is maintained.
Ours
Maintains consistent character identity across shots while preserving VideoCrafter2 motion, without leaking identity from VideoCrafter2.
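
The multi-shot setting can be sketched as extended attention, where the queries of each shot attend to keys and values pooled from all shots, combined with queries taken from a per-shot motion reference (for example, features recorded from an unmodified VideoCrafter2 run). The NumPy code below is a toy illustration under those assumptions; the function name, array shapes, and random inputs are hypothetical and do not reproduce the actual pipeline.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def extended_attention(q_shots, k_shots, v_shots):
    """Each shot's queries attend to the keys/values of all shots.

    All inputs have shape (shots, tokens, dim).
    """
    shots, tokens, dim = k_shots.shape
    k_all = k_shots.reshape(shots * tokens, dim)  # pool keys across shots
    v_all = v_shots.reshape(shots * tokens, dim)  # pool values across shots
    scores = q_shots @ k_all.T / np.sqrt(dim)     # (shots, tokens, shots*tokens)
    return softmax(scores, axis=-1) @ v_all       # (shots, tokens, dim)

rng = np.random.default_rng(0)
shots, tokens, dim = 3, 5, 4
k = rng.normal(size=(shots, tokens, dim))
v = rng.normal(size=(shots, tokens, dim))
q_consistent = rng.normal(size=(shots, tokens, dim))  # queries of the multi-shot run
q_reference = rng.normal(size=(shots, tokens, dim))   # assumed per-shot motion reference

no_intervention = extended_attention(q_consistent, k, v)
with_q_injection = extended_attention(q_reference, k, v)
```

Pooling keys and values lets shots share appearance, while the choice of queries determines how much of the reference motion, and potentially its identity, carries over.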

BibTeX Citation

@article{atzmon2025motion,
    title = {Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation},
    author = {Atzmon, Yuval and Gal, Rinon and Tewel, Yoad and Kasten, Yoni and Chechik, Gal},
    journal = {arXiv preprint arXiv:2412.07750v2},
    year = {2024},
}