Spatial Intelligence Lab
Motive
Motion Attribution for Video Generation

Xindi Wu^{1,2}, Despoina Paschalidou^1, Jun Gao^1, Antonio Torralba^3, Laura Leal-Taixé^1, Olga Russakovsky^2, Sanja Fidler^1, Jonathan Lorraine^1

^1 NVIDIA   ^2 Princeton University   ^3 MIT CSAIL
arXiv · Paper · Slides · Code (coming soon)


TL;DR: We propose MOTIVE, a scalable, motion-centric data attribution framework for video generation that identifies which training clips improve or degrade motion dynamics, enabling targeted data curation.

Abstract: Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.




Motion Attribution Examples

MOTIVE identifies influential training clips across different motion types. We demonstrate how our method attributes motion in video generation by showing positive and negative influential samples for different query motions.







MOTIVE Overview

Our proposed framework, MOTIVE, attributes motion in video diffusion models and uses the resulting attributions to curate finetuning data. The method has three key components: scalable gradient computation, a frame-length bias correction, and motion-aware weighting. Below, we outline the key steps:


Motion Attribution

Motion-gradient computation has three steps: (1) detect motion with AllTracker; (2) compute motion-magnitude patches; (3) apply loss-space motion masks to focus gradients on dynamic regions.


Efficient Motion Gradient Computation

We make attribution scalable with a single-sample estimator that shares randomness across pairs and projects gradients to a low dimension. Pairwise scores are computed for every training–query pair, aggregated into a final ranking, and then used to select finetuning subsets.



Problem Formulation. We study data attribution for motion in the finetuning setting. Given a query video and a finetuning dataset, we assign each training clip a motion-aware influence score that quantifies how much it contributes to the dynamics observed in the query. Our framework satisfies two key criteria: (i) predictivity: rankings correlate with observed changes when finetuning on the most influential subsets; and (ii) efficiency: the method scales to modern video generators without expensive Hessian inversion or per-data-point integration.
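Concretely, under the approximations described in the Method section below (identity preconditioner, a shared timestep–noise pair, and a random projection P), the motion-aware influence of a training clip z_i on a query q reduces to a projected gradient inner product. The notation here is illustrative rather than the paper's exact formulation:

\tau_m(z_i, q) \;=\; \left\langle\, P\,\nabla_\theta \mathcal{L}_m(z_i; \theta, t, \epsilon),\; P\,\nabla_\theta \mathcal{L}_m(q; \theta, t, \epsilon) \,\right\rangle

where \mathcal{L}_m is the motion-weighted diffusion loss, \theta are the model parameters, (t, \epsilon) is the common timestep–noise draw shared across all pairs, and P is the gradient projection.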




Method


1. Scalable Gradient-Based Attribution

We make attribution practical for billion-parameter models through several approximations (a minimal code sketch follows this list):

  • Inverse-Hessian Approximation: Use gradient similarity with identity preconditioner instead of computing exact inverse-Hessian-vector products.
  • Common Randomness: Evaluate training and test gradients under the same (timestep, noise) pairs to reduce variance and stabilize rankings.
  • Single-Sample Estimator: Fix a single timestep and shared noise draw for all train-test pairs, reducing compute from O(|dataset| · |timesteps| · cost) to O(|dataset| · cost).
  • Fastfood Projection: Apply structured Johnson-Lindenstrauss projection to compress gradients, reducing storage from O(|dataset| · dim) to O(|dataset| · projected_dim), making it tractable for modern models.
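
A minimal PyTorch-style sketch of these approximations is below; it substitutes a dense Gaussian random projection for the Fastfood transform, and names such as motion_weighted_loss and projected_gradient are illustrative placeholders rather than the released API.

import torch

def flatten_grads(loss, params):
    # Gradient of the (motion-weighted) loss w.r.t. the chosen parameters,
    # flattened into a single vector.
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def projected_gradient(model, latents, text_emb, t, noise, proj, motion_weighted_loss):
    # Common randomness: the same (timestep, noise) pair is reused for every clip,
    # so gradient dot products compare like with like.
    params = [p for p in model.parameters() if p.requires_grad]
    loss = motion_weighted_loss(model, latents, text_emb, t, noise)
    g = flatten_grads(loss, params)
    # Random projection (dense Gaussian stand-in for Fastfood) compresses the gradient,
    # so storage is O(|dataset| * projected_dim) rather than O(|dataset| * dim).
    return proj @ g  # proj has shape (projected_dim, dim)

def influence_scores(train_feats, query_feats):
    # Identity-preconditioner influence: plain inner products between projected
    # training and query gradients; rows index training clips, columns index queries.
    return train_feats @ query_feats.T

Ranking training clips for a given query then amounts to sorting one column of influence_scores.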

2. Motion Attribution via Motion Masking

To isolate temporal dynamics from static appearance, we introduce motion-weighted gradients (a sketch follows this list):

  • Motion Detection: Use AllTracker to extract optical flow and motion magnitudes in pixel space.
  • Motion Weighting: Compute motion magnitude at each location, min-max normalize to [0,1], and bilinearly downsample to match latent space dimensions.
  • Loss-Space Masking: Reweight per-location gradients by motion masks, emphasizing dynamic regions and de-emphasizing static backgrounds.
  • Motion-Aware Influence: Compute influence scores using motion-weighted gradients, so rankings identify training clips that shape motion rather than appearance.
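
Below is a sketch of the motion weighting, assuming per-frame optical flow is already available (from AllTracker or any flow estimator); the exact normalization and loss layout in the paper may differ.

import torch
import torch.nn.functional as F

def motion_mask(flow, latent_hw):
    # flow: (T, 2, H, W) pixel-space optical flow for T frames.
    mag = torch.linalg.norm(flow, dim=1)                          # (T, H, W) motion magnitude
    mag = (mag - mag.amin()) / (mag.amax() - mag.amin() + 1e-8)   # min-max normalize to [0, 1]
    # Bilinearly downsample so the mask aligns with the latent-space loss map.
    mask = F.interpolate(mag.unsqueeze(1), size=latent_hw,
                         mode="bilinear", align_corners=False)
    return mask.squeeze(1)                                        # (T, h, w)

def masked_diffusion_loss(pred_noise, target_noise, mask):
    # Per-location diffusion loss, reweighted so gradients come mostly from
    # dynamic regions rather than static backgrounds.
    per_loc = (pred_noise - target_noise).pow(2).mean(dim=1)      # (T, h, w), mean over channels
    return (mask * per_loc).sum() / (mask.sum() + 1e-8)

In the earlier sketch, the motion_weighted_loss placeholder would run the model to obtain pred_noise and then apply a weighting of this form.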

3. Influential Subset Selection

Single-query: We select the highest-scoring examples based on motion-aware influence scores.

Multi-query aggregation: For multiple query motions, we adopt majority voting: a training sample receives a vote if its score exceeds a percentile threshold for that query. We rank samples by total votes and select the top examples to form the finetuning subset, emphasizing samples consistently influential across queries.
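
A compact sketch of this voting scheme is below; the percentile threshold is an illustrative choice rather than the paper's exact value.

import numpy as np

def select_by_votes(scores, k, percentile=90.0):
    # scores: (num_train, num_queries) matrix of motion-aware influence scores.
    thresholds = np.percentile(scores, percentile, axis=0)   # one threshold per query
    votes = (scores > thresholds[None, :]).sum(axis=1)       # votes received by each clip
    return np.argsort(-votes)[:k]                            # indices of the top-k clips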

Data Selection Strategies

MOTIVE supports two data selection strategies, depending on whether you want to optimize for a single target motion or multiple diverse motions.


Single-Query Selection (optimize for one specific motion): given a query such as "Roll", training clips are ranked by their motion-aware influence scores (e.g., I = +0.89, +0.86, +0.84) and the top-K are selected.

Multi-Query Aggregation (vote if the score exceeds a threshold): each query (e.g., "Roll", "Spin", "Slide") casts a vote for every clip whose score passes its threshold; clips are ranked by total votes and the top-K are selected.



Motion Visualization

MOTIVE uses optical flow to detect and visualize motion in videos. Below we show how our method identifies dynamic regions by comparing original videos with motion visualizations that isolate and highlight moving regions.

Each example shows the original video alongside its motion-highlighted counterpart.

Interpretation: The motion visualization clearly isolates and highlights dynamic regions (people, moving objects, camera motion) while filtering out static backgrounds. This motion-aware weighting enables MOTIVE to focus attribution on temporal patterns rather than appearance, leading to better data selection for improving motion quality.
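
As a rough illustration of how such an overlay can be produced, the sketch below uses OpenCV's Farneback optical flow as a lightweight stand-in for AllTracker; it is not the visualization code used for the figures above.

import cv2
import numpy as np

def motion_overlay(prev_gray, next_gray, frame_bgr, alpha=0.6):
    # Dense optical flow between consecutive grayscale frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # Normalize the magnitude and render it as a heatmap.
    mag = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    heat = cv2.applyColorMap(mag, cv2.COLORMAP_JET)
    # Blend with the original frame so static regions stay visible but dimmed.
    return cv2.addWeighted(frame_bgr, 1.0 - alpha, heat, alpha, 0.0)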




Motion Quality Improvements with MOTIVE-Curated Data

We use MOTIVE to curate finetuning data for video generation models. Below, we compare videos from models finetuned with randomly selected data versus MOTIVE-selected high-influence data.


Each row compares the base model, a model finetuned on randomly selected data, and a model finetuned on Motive-selected data for the following motion types and prompts:

  • Compress: "A rubber ball being compressed under a flat press"
  • Spin: "A single coin spins quickly on a polished glass surface"
  • Slide: "A white mug slid across a wooden kitchen counter"
  • Free Fall: "A red ball drops vertically from above onto wooden surface"




Quantitative Results

We evaluate our motion attribution framework on VBench and through human evaluation. On VBench, we compare data selection methods on subject consistency, background consistency, motion smoothness, dynamic degree, aesthetic quality, and imaging quality. The human evaluation uses pairwise comparisons across 10 motion categories with 17 participants. Random selection and MOTIVE both select 10% of the training data, with our method using majority-vote aggregation across all motion queries.


VBench Evaluation

Table 1: Performance comparison across data selection methods (all values in %, higher is better). Random selection and MOTIVE both use 10% of the training data.

Human Evaluation
Method                             Win (%)   Tie (%)   Loss (%)
vs. Base                              74.1      12.3       13.6
vs. Random                            58.9      12.1       29.0
vs. Full Finetuning                   53.1      14.8       32.1
vs. Ours without motion masking       46.9      20.0       33.1

Table 2: Pairwise comparisons across 50 videos with 17 participants (850 comparisons in total). Win, tie, and loss rates show where our method is preferred, rated equal, or outperformed.


Why Motion Masking Matters

Without motion masking, attribution focuses on static appearance rather than dynamics. Our motion-aware weighting isolates temporal patterns. Below, we show example training clips whose influence scores changed significantly with motion masking, leading to better data selection.

Without Motion Masking: attribution is driven by appearance similarity. Problem: it identifies training clips with similar colors, objects, or scenes, which may have conflicting motion patterns. Result: VBench Dynamic Degree of 43.8%.

With Motion Masking (Ours): attribution is driven by motion patterns. Solution: optical-flow masks focus the scores on dynamic regions, identifying clips that improve temporal dynamics. Result: VBench Dynamic Degree of 47.6% (+3.8 points).


Key Results
  • 74.1% human preference win rate (vs. the pretrained base model)
  • +8.0% improvement in dynamic degree (on VBench)
  • Capability: better motion quality in generated videos
  • Scalability: scales to large, billion-parameter video models



Conclusion



Impact

MOTIVE-curated data yields substantial improvements in motion quality, achieving a 74.1% human preference win rate over baseline models while maintaining computational efficiency. Our work enables researchers and practitioners to build better models through principled data curation.





Citation

@article{wu2026motion,
  title={Motion Attribution for Video Generation},
  author={Xindi Wu and Despoina Paschalidou and Jun Gao and Antonio Torralba and Laura Leal-Taixé and Olga Russakovsky and Sanja Fidler and Jonathan Lorraine},
  journal={Preprint},
  year={2025},
}

Acknowledgements

We thank the following people (listed alphabetically by last name) for their helpful discussions, feedback, or participation in human studies: Allison Chen, Sanghyuk Chun, Amaya Dharmasiri, Xingyu Fu, Will Hwang, Yifeng Jiang, Amogh Joshi, Chen-Hsuan Lin, Huan Ling, Tiffany Ling, Shaowei Liu, Zhengyi Luo, Rafid Mahmood, Kaleb S. Newman, Julian Ost, Zeeshan Patel, Davis Rempe, Anya Tsvetkov, Esin Tureci, Sheng-Yu Wang, Tingwu Wang, Zian Wang, Hongyu Wen, Jon Williams, Donglai Xiang, Yilun Xu, William Yang, and Haotian Zhang.