This research, called Motion by Queries, reveals differences in how self-attention queries (Q) behave in video versus image generation. While Q features in images mainly affect structure, in video they encode both motion and identity information. Understanding this dual role enabled us to develop a zero-shot motion transfer method that is 10 times more efficient than existing approaches, and a training-free technique for consistent multi-shot video generation.

The methods and results in Section 5, "Consistent multi-shot video generation", are based on arXiv version 1 (v1) of this work. In version 3 (v3), we further analyze those findings and extend them to efficient motion transfer using VideoCrafter2 and WAN 2.1 (1.3B).

Analysis of Q-injection behavior: (a) Unlike T2I models where Q mainly affects structure, in T2V models Q vectors encode both motion and identity information. (b) When source and target share the same subject (purple), both motion and appearance transfer; with different subjects (green), primarily motion transfers. (c) Extended attention across shots requires longer Q-injection to recover motion, increasing identity leakage.
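As a rough illustration of panel (c), the sketch below shows one common way to implement extended attention across shots: the queries of each shot attend over the keys and values of all shots, concatenated along the token axis. This formulation is an assumption made for illustration and may differ from the one used in this work. The function name `extended_attention` and the `(shots, tokens, dim)` layout are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def extended_attention(q_shots, k_shots, v_shots):
    """Attention where every shot sees the keys/values of all shots.

    All inputs have shape (shots, tokens, dim). This layout is an
    illustrative assumption, not taken from the source.
    """
    shots, tokens, dim = q_shots.shape
    # Pool keys and values from all shots into one shared sequence.
    k_all = k_shots.reshape(shots * tokens, dim)
    v_all = v_shots.reshape(shots * tokens, dim)
    # (shots, tokens, dim) @ (dim, shots*tokens) -> (shots, tokens, shots*tokens)
    scores = q_shots @ k_all.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)
    # Each output token is a mixture of values drawn from every shot.
    return weights @ v_all

rng = np.random.default_rng(0)
shots, tokens, dim = 3, 4, 8
q, k, v = (rng.normal(size=(shots, tokens, dim)) for _ in range(3))
out = extended_attention(q, k, v)
print(out.shape)  # (3, 4, 8)
```

Because values are pooled across shots, appearance can be shared between shots. That sharing is what keeps a character consistent, but it also mixes per-shot information, which fits the caption's observation that motion then needs longer Q injection to recover.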

Abstract

Text-to-video diffusion models have shown remarkable progress in generating coherent video clips from textual descriptions. However, the interplay between motion, structure, and identity representations in these models remains under-explored. Here, we investigate how self-attention query features (a.k.a. Q features) simultaneously govern motion, structure, and identity and examine the challenges arising when these representations interact. Our analysis reveals that Q affects not only layout, but that during denoising Q also has a strong effect on subject identity, making it hard to transfer motion without the side-effect of transferring identity. Understanding this dual role enabled us to control query feature injection (Q injection) and demonstrate two applications: (1) a zero-shot motion transfer method that is 10 times more efficient than existing approaches, and (2) a training-free technique for consistent multi-shot video generation, where characters maintain identity across multiple video shots while Q injection enhances motion fidelity.
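Q injection, as described above, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes injection means replacing the target video's self-attention queries with those of the source video during the first `inject_until` denoising steps, while keys and values always come from the target. The names `attention_with_q_injection` and `inject_until`, and the choice to inject in early steps, are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    """Scaled dot-product attention over (tokens, dim) arrays."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def attention_with_q_injection(q_tgt, k_tgt, v_tgt, q_src, step, inject_until):
    """Use source queries while step < inject_until, target queries afterwards.

    The step-based schedule is an assumption for illustration only.
    """
    q = q_src if step < inject_until else q_tgt
    return self_attention(q, k_tgt, v_tgt)

rng = np.random.default_rng(0)
tokens, dim = 6, 8
q_src = rng.normal(size=(tokens, dim))
q_tgt, k_tgt, v_tgt = (rng.normal(size=(tokens, dim)) for _ in range(3))

# Step 2 falls inside the injection window, step 7 falls outside it.
early = attention_with_q_injection(q_tgt, k_tgt, v_tgt, q_src, step=2, inject_until=5)
late = attention_with_q_injection(q_tgt, k_tgt, v_tgt, q_src, step=7, inject_until=5)
```

In this sketch, `inject_until` is the knob behind the trade-off: a longer window passes more of the source motion to the target, but since Q also carries identity, it passes more of the source subject's appearance as well.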

Motion Transfer with VideoCrafter2

Motion Transfer with WAN 2.1 (1.3B)

Comparisons

Multi-Shot Subject Generation - Q injection Analysis


Source	A rabbit in a field, low poly, orbit shot	Source	A Roman soldier standing in front of the Colosseum

Source	A cartoon of a dancing sloth	Source	A goose walking in a puddle

Source	A horse crossing a wide river in a meadow	Source	A ship sailing on the sea


Source	A cat is chasing a mouse in a beach	Source	A monkey climbing on a wall

Source	A penguin is sliding on an icy slope	Source	A teddy bear on a scooter, viewed from behind, a smooth tracking shot

Source	A lion sitting on top of a cliff captured with a zoom out	Source	A tiger in a push-up position performing dumbbell rows in the forest

Source	A goalkeeper in a red jersey leaps skyward to catch the ball	Source	A dolphin jumping into the ocean


Source	Ours	Motion Clone	Motion Inversion	DMT
	Two monkeys playing with coconuts
Source	Ours	Motion Clone	Motion Inversion	DMT
	A boy walking in a field
Source	Ours	Motion Clone	Motion Inversion	DMT
	A train riding on rails in autumn view
Source	Ours	Motion Clone	Motion Inversion	DMT
	A fox sitting in a snowy mountain

Source	Ours	Motion Clone	Motion Inversion
	A penguin is sliding on an icy slope
Source	Ours	Motion Clone	Motion Inversion
	A Roman soldier standing in front of the Colosseum
Source	Ours	Motion Clone	Motion Inversion
	Microchip (Custom Shot)
Source	Ours	Motion Clone	Motion Inversion
	A rabbit in a field, low poly, orbit shot

BibTeX Citation

@article{atzmon2025motion,
    title = {Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation},
    author = {Atzmon, Yuval and Gal, Rinon and Tewel, Yoad and Kasten, Yoni and Chechik, Gal},
    journal = {arXiv preprint arXiv:2412.07750v3},
    year = {2024},
}

Source	✔ Motion, structure ✖ Identity leak	✔ Motion, structure ✔ Different Identity
WAN 2.1 (1.3B)
VideoCrafter2
8 Steps, T2V-Turbo-V2
LTX-Video

	do aerobics	magic, pull scarves	skate, wobbling
No Q Intervention: Keeps characters consistent but compromises motion quality (frozen body with displaced legs, repetitive swaying, static camera).
VideoCrafter2: Offers diverse motion but does not allow for character consistency.
Full Q Preservation: Maintains motion, but identity leaks from VideoCrafter2.
Ours: Maintains consistent character identity across shots while preserving VideoCrafter2 motion, without leaking identity from VideoCrafter2.

Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation.