Spatial Intelligence Lab
Motive
Motion Attribution for Video Generation

Xindi Wu^{1,2}, Despoina Paschalidou^1, Jun Gao^1, Antonio Torralba^3, Laura Leal-Taixé^1, Olga Russakovsky^2, Sanja Fidler^1, Jonathan Lorraine^1

^1 NVIDIA   ^2 Princeton University   ^3 MIT CSAIL
arXiv · Paper · Slides · Code (coming soon)


TL;DR: We propose MOTIVE, a scalable, motion-centric data attribution framework for video generation that identifies which training clips improve or degrade motion dynamics, enabling targeted data curation.

Abstract: Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.




Motion Attribution Examples

MOTIVE identifies influential training clips across different motion types. We demonstrate how our method attributes motion in video generation by showing positive and negative influential samples for different query motions.







MOTIVE Overview

Our proposed framework, MOTIVE, attributes motion in video diffusion models and uses the resulting attributions to curate finetuning data. The method has three key components: scalable gradient computation, a frame-length bias correction, and motion-aware weighting. Below, we outline the key steps:


Motion Attribution

Motion-gradient computation has three steps: (1) detect motion with AllTracker; (2) compute motion-magnitude patches; (3) apply loss-space motion masks to focus gradients on dynamic regions.


Efficient Motion Gradient Computation

We make attribution scalable with a single-sample estimator that shares randomness across pairs and projects gradients to a low dimension. Pairwise scores are computed for every training–query pair, aggregated into a final ranking, and then used to select finetuning subsets.



Problem Formulation. We study data attribution for motion in the finetuning setting. Given a query video and a finetuning dataset, we assign each training clip a motion-aware influence score that quantifies how much it contributes to the dynamics observed in the query. Our framework satisfies two key criteria: (i) predictivity: rankings correlate with observed changes when finetuning on the most influential subsets; and (ii) efficiency: the method scales to modern video generators without expensive Hessian inversion or per-data-point integration.
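Concretely, under the approximations described in the Method section below (identity preconditioner, a shared timestep–noise pair, and a random projection P), the motion-aware influence of a training clip z_i on a query q reduces to a projected gradient inner product. The notation here is illustrative rather than the paper's exact formulation:

\tau_m(z_i, q) \;=\; \left\langle\, P\,\nabla_\theta \mathcal{L}_m(z_i; \theta, t, \epsilon),\; P\,\nabla_\theta \mathcal{L}_m(q; \theta, t, \epsilon) \,\right\rangle

where \mathcal{L}_m is the motion-weighted diffusion loss, \theta are the model parameters, (t, \epsilon) is the common timestep–noise draw shared across all pairs, and P is the gradient projection.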




Method


1. Scalable Gradient-Based Attribution

We make attribution practical for billion-parameter models through several approximations (a minimal code sketch follows this list):

  • Inverse-Hessian Approximation: Use gradient similarity with identity preconditioner instead of computing exact inverse-Hessian-vector products.
  • Common Randomness: Evaluate training and test gradients under the same (timestep, noise) pairs to reduce variance and stabilize rankings.
  • Single-Sample Estimator: Fix a single timestep and shared noise draw for all train-test pairs, reducing compute from O(|dataset| · |timesteps| · cost) to O(|dataset| · cost).
  • Fastfood Projection: Apply structured Johnson-Lindenstrauss projection to compress gradients, reducing storage from O(|dataset| · dim) to O(|dataset| · projected_dim), making it tractable for modern models.
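
A minimal PyTorch-style sketch of these approximations is below; it substitutes a dense Gaussian random projection for the Fastfood transform, and names such as motion_weighted_loss and projected_gradient are illustrative placeholders rather than the released API.

import torch

def flatten_grads(loss, params):
    # Gradient of the (motion-weighted) loss w.r.t. the chosen parameters,
    # flattened into a single vector.
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def projected_gradient(model, latents, text_emb, t, noise, proj, motion_weighted_loss):
    # Common randomness: the same (timestep, noise) pair is reused for every clip,
    # so gradient dot products compare like with like.
    params = [p for p in model.parameters() if p.requires_grad]
    loss = motion_weighted_loss(model, latents, text_emb, t, noise)
    g = flatten_grads(loss, params)
    # Random projection (dense Gaussian stand-in for Fastfood) compresses the gradient,
    # so storage is O(|dataset| * projected_dim) rather than O(|dataset| * dim).
    return proj @ g  # proj has shape (projected_dim, dim)

def influence_scores(train_feats, query_feats):
    # Identity-preconditioner influence: plain inner products between projected
    # training and query gradients; rows index training clips, columns index queries.
    return train_feats @ query_feats.T

Ranking training clips for a given query then amounts to sorting one column of influence_scores.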

2. Motion Attribution via Motion Masking

To isolate temporal dynamics from static appearance, we introduce motion-weighted gradients (a sketch follows this list):

  • Motion Detection: Use AllTracker to extract optical flow and motion magnitudes in pixel space.
  • Motion Weighting: Compute motion magnitude at each location, min-max normalize to [0,1], and bilinearly downsample to match latent space dimensions.
  • Loss-Space Masking: Reweight per-location gradients by motion masks, emphasizing dynamic regions and de-emphasizing static backgrounds.
  • Motion-Aware Influence: Compute influence scores using motion-weighted gradients, so rankings identify training clips that shape motion rather than appearance.
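
Below is a sketch of the motion weighting, assuming per-frame optical flow is already available (from AllTracker or any flow estimator); the exact normalization and loss layout in the paper may differ.

import torch
import torch.nn.functional as F

def motion_mask(flow, latent_hw):
    # flow: (T, 2, H, W) pixel-space optical flow for T frames.
    mag = torch.linalg.norm(flow, dim=1)                          # (T, H, W) motion magnitude
    mag = (mag - mag.amin()) / (mag.amax() - mag.amin() + 1e-8)   # min-max normalize to [0, 1]
    # Bilinearly downsample so the mask aligns with the latent-space loss map.
    mask = F.interpolate(mag.unsqueeze(1), size=latent_hw,
                         mode="bilinear", align_corners=False)
    return mask.squeeze(1)                                        # (T, h, w)

def masked_diffusion_loss(pred_noise, target_noise, mask):
    # Per-location diffusion loss, reweighted so gradients come mostly from
    # dynamic regions rather than static backgrounds.
    per_loc = (pred_noise - target_noise).pow(2).mean(dim=1)      # (T, h, w), mean over channels
    return (mask * per_loc).sum() / (mask.sum() + 1e-8)

In the earlier sketch, the motion_weighted_loss placeholder would run the model to obtain pred_noise and then apply a weighting of this form.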

3. Influential Subset Selection

Single-query: We select the highest-scoring examples based on motion-aware influence scores.

Multi-query aggregation: For multiple query motions, we adopt majority voting: a training sample receives a vote if its score exceeds a percentile threshold for that query. We rank samples by total votes and select the top examples to form the finetuning subset, emphasizing samples consistently influential across queries.
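
A compact sketch of this voting scheme is below; the percentile threshold is an illustrative choice rather than the paper's exact value.

import numpy as np

def select_by_votes(scores, k, percentile=90.0):
    # scores: (num_train, num_queries) matrix of motion-aware influence scores.
    thresholds = np.percentile(scores, percentile, axis=0)   # one threshold per query
    votes = (scores > thresholds[None, :]).sum(axis=1)       # votes received by each clip
    return np.argsort(-votes)[:k]                            # indices of the top-k clips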

Data Selection Strategies

MOTIVE supports two data selection strategies, depending on whether you want to optimize for a single target motion or multiple diverse motions.


Single-Query Selection (optimize for one specific motion): given a query such as "Roll", training clips are ranked by their motion-aware influence scores (e.g., I = +0.89, +0.86, +0.84) and the top-K are selected.

Multi-Query Aggregation (vote if the score exceeds a threshold): each query (e.g., "Roll", "Spin", "Slide") casts a vote for every clip whose score passes its threshold; clips are ranked by total votes and the top-K are selected.



Motion Visualization

MOTIVE uses optical flow to detect and visualize motion in videos. Below we show how our method identifies dynamic regions by comparing original videos with motion visualizations that isolate and highlight moving regions.

Each example shows the original video alongside its motion-highlighted counterpart.

Interpretation: The motion visualization clearly isolates and highlights dynamic regions (people, moving objects, camera motion) while filtering out static backgrounds. This motion-aware weighting enables MOTIVE to focus attribution on temporal patterns rather than appearance, leading to better data selection for improving motion quality.
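
As a rough illustration of how such an overlay can be produced, the sketch below uses OpenCV's Farneback optical flow as a lightweight stand-in for AllTracker; it is not the visualization code used for the figures above.

import cv2
import numpy as np

def motion_overlay(prev_gray, next_gray, frame_bgr, alpha=0.6):
    # Dense optical flow between consecutive grayscale frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # Normalize the magnitude and render it as a heatmap.
    mag = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    heat = cv2.applyColorMap(mag, cv2.COLORMAP_JET)
    # Blend with the original frame so static regions stay visible but dimmed.
    return cv2.addWeighted(frame_bgr, 1.0 - alpha, heat, alpha, 0.0)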




Motion Quality Improvements with MOTIVE-Curated Data

We use MOTIVE to curate finetuning data for video generation models. Below, we compare videos from models finetuned with randomly selected data versus MOTIVE-selected high-influence data.


Each row compares the base model, a model finetuned on randomly selected data, and a model finetuned on Motive-selected data for the following motion types and prompts:

  • Compress: "A rubber ball being compressed under a flat press"
  • Spin: "A single coin spins quickly on a polished glass surface"
  • Slide: "A white mug slid across a wooden kitchen counter"
  • Free Fall: "A red ball drops vertically from above onto wooden surface"




Quantitative Results

We evaluate our motion attribution framework on VBench and through human evaluation. On VBench, we compare data selection methods on subject consistency, background consistency, motion smoothness, dynamic degree, aesthetic quality, and imaging quality. The human evaluation uses pairwise comparisons across 10 motion categories with 17 participants. Random selection and MOTIVE both select 10% of the training data, with our method using majority-vote aggregation across all motion queries.


VBench Evaluation

Table 1: Performance comparison across data selection methods (all values in %, higher is better). Random selection and MOTIVE both use 10% of the training data.

Human Evaluation
Method                             Win (%)   Tie (%)   Loss (%)
vs. Base                              74.1      12.3       13.6
vs. Random                            58.9      12.1       29.0
vs. Full Finetuning                   53.1      14.8       32.1
vs. Ours without motion masking       46.9      20.0       33.1

Table 2: Pairwise comparisons across 50 videos with 17 participants (850 comparisons in total). Win, tie, and loss rates show where our method is preferred, rated equal, or outperformed.


Why Motion Masking Matters

Without motion masking, attribution focuses on static appearance rather than dynamics. Our motion-aware weighting isolates temporal patterns. Below, we show example training clips whose influence scores changed significantly with motion masking, leading to better data selection.

Without Motion Masking: attribution is driven by appearance similarity. Problem: it identifies training clips with similar colors, objects, or scenes, which may have conflicting motion patterns. Result: VBench Dynamic Degree of 43.8%.

With Motion Masking (Ours): attribution is driven by motion patterns. Solution: optical-flow masks focus the scores on dynamic regions, identifying clips that improve temporal dynamics. Result: VBench Dynamic Degree of 47.6% (+3.8 points).


Key Results
  • 74.1% human preference win rate (vs. the pretrained base model)
  • +8.0% improvement in dynamic degree (on VBench)
  • Capability: better motion quality in generated videos
  • Scalability: scales to large, billion-parameter video models



Conclusion



Impact

MOTIVE-curated data yields substantial improvements in motion quality, achieving a 74.1% human preference win rate over baseline models while maintaining computational efficiency. Our work enables researchers and practitioners to build better models through principled data curation.





Citation

@article{wu2026motion,
  title={Motion Attribution for Video Generation},
  author={Xindi Wu and Despoina Paschalidou and Jun Gao and Antonio Torralba and Laura Leal-Taixé and Olga Russakovsky and Sanja Fidler and Jonathan Lorraine},
  journal={Preprint},
  year={2025},
}

Acknowledgements

We thank the following people (listed alphabetically by last name) for their helpful discussions, feedback, or participation in human studies: Allison Chen, Sanghyuk Chun, Amaya Dharmasiri, Xingyu Fu, Will Hwang, Yifeng Jiang, Amogh Joshi, Chen-Hsuan Lin, Huan Ling, Tiffany Ling, Shaowei Liu, Zhengyi Luo, Rafid Mahmood, Kaleb S. Newman, Julian Ost, Zeeshan Patel, Davis Rempe, Anya Tsvetkov, Esin Tureci, Sheng-Yu Wang, Tingwu Wang, Zian Wang, Hongyu Wen, Jon Williams, Donglai Xiang, Yilun Xu, William Yang, and Haotian Zhang.