TL;DR: We propose MOTIVE, a scalable, motion-centric data attribution framework for video generation that identifies which training clips improve or degrade motion dynamics, enabling motion-targeted data curation and other downstream uses.

Abstract: Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.
MOTIVE identifies influential training clips across different motion types. We demonstrate how our method attributes motion in video generation by showing positively and negatively influential samples for different query motions.
Our proposed framework, MOTIVE, attributes motion in video diffusion models and uses these attributions to curate finetuning data. The method has three key components: scalable gradient computation, a frame-length bias fix, and motion-aware weighting. Below, we overview the key steps:
Motion-gradient computation has three steps: (1) detect motion with AllTracker; (2) compute motion-magnitude patches; (3) apply loss-space motion masks to focus gradients on dynamic regions.
We make the method scalable with a single-sample variant that uses common randomness and a projection; influence scores are computed for each training-query pair, aggregated into a final ranking, and ultimately used to select finetuning subsets.
Problem Formulation. We study data attribution for motion in the finetuning setting. Given a query video and a finetuning dataset, we assign each training clip a motion-aware influence score that quantifies how much it contributes to the dynamics observed in the query. Our framework satisfies two key criteria: (i) predictivity, meaning rankings correlate with the changes observed when finetuning on the most influential subsets, and (ii) efficiency, meaning the method scales to modern video generators without expensive Hessian inversion or per-sample integration.
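As a rough formalization (our notation, not necessarily the paper's exact estimator), a Hessian-free influence score of this kind can be written as a similarity between projected gradients of the motion-weighted loss, computed with shared randomness:

$$
\mathcal{I}(z, q) \;=\; \big\langle\, P\,\nabla_\theta \mathcal{L}_{\mathrm{motion}}(z;\theta,\epsilon,t),\;\; P\,\nabla_\theta \mathcal{L}_{\mathrm{motion}}(q;\theta,\epsilon,t) \,\big\rangle,
$$

where $z$ is a training clip, $q$ is the query video, $P$ is a projection, and $(\epsilon, t)$ are the noise and timestep shared across clips (the common randomness described in Section 1 below).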
1. Scalable Gradient-Based Attribution
We make attribution practical for billion-parameter models through several approximations: a single-sample variant with common randomness and a gradient projection, as sketched below.
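The sketch below is a minimal, hypothetical instantiation of these approximations, not the exact implementation: it draws one noise sample and timestep, reuses them for every clip (common randomness), computes the gradient of a motion-weighted loss for a single clip at a time, projects that gradient to a low dimension, and scores influence as a dot product with the query's projected gradient. The names `loss_fn`, `latents`, and `text_emb`, as well as the dense projection matrix, are assumptions made for illustration.

```python
import torch

def projected_grad(model, clip, noise, t, proj, loss_fn):
    """One clip's gradient feature: single-sample loss with shared (noise, t),
    flattened over trainable parameters and projected to a low dimension."""
    model.zero_grad(set_to_none=True)
    loss = loss_fn(model, clip["latents"], clip["text_emb"], noise, t)  # motion-weighted loss
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    flat = torch.cat([g.reshape(-1) for g in grads])
    return proj @ flat                                       # (k,) projected gradient

def influence_scores(model, train_clips, query_clip, loss_fn, k=4096, seed=0):
    g = torch.Generator().manual_seed(seed)
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    # A dense random projection is shown for clarity; at billion-parameter scale a
    # memory-efficient (chunked or sketched) projection would be used instead.
    proj = torch.randn(k, n_params, generator=g) / k ** 0.5
    # Common randomness: the same noise and timestep for the query and every training clip
    # (assumes all clips share the same latent shape).
    noise = torch.randn(query_clip["latents"].shape, generator=g)
    t = torch.randint(0, 1000, (1,), generator=g)
    q_feat = projected_grad(model, query_clip, noise, t, proj, loss_fn)
    return [torch.dot(projected_grad(model, c, noise, t, proj, loss_fn), q_feat).item()
            for c in train_clips]                            # higher = more helpful for the query motion
```

Per-pair scores from a routine like this are then aggregated into the final ranking described in the overview.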
2. Motion Attribution via Motion Masking
To isolate temporal dynamics from static appearance, we introduce motion-weighted gradients: loss-space motion masks, built from estimated motion magnitudes, concentrate the training loss, and hence its gradients, on dynamic regions (sketched below).
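A minimal sketch of this weighting, under assumed shapes and names (the exact pooling and normalization may differ): per-pixel motion magnitudes from frame-to-frame displacements are averaged over latent patches, normalized, and used to reweight the per-element denoising loss.

```python
import torch
import torch.nn.functional as F

def motion_weights(flow, latent_hw, eps=1e-6):
    """flow: (T, 2, H, W) frame-to-frame displacements (e.g., from a point tracker or
    optical flow). Returns per-patch weights on the latent grid, normalized to [0, 1]."""
    mag = flow.norm(dim=1, keepdim=True)                # (T, 1, H, W) motion magnitude
    patch = F.adaptive_avg_pool2d(mag, latent_hw)       # pool to the latent patch grid
    return patch / (patch.amax(dim=(-2, -1), keepdim=True) + eps)

def motion_weighted_loss(pred_noise, target_noise, weights, floor=0.1):
    """Per-element MSE reweighted toward dynamic regions. The `floor` term (an assumption)
    keeps a small appearance contribution so static patches are not zeroed out entirely."""
    w = floor + (1.0 - floor) * weights                 # (T, 1, h, w), broadcasts over channels
    per_elem = (pred_noise - target_noise) ** 2         # (T, C, h, w)
    return (w * per_elem).sum() / (w.sum() * per_elem.shape[1] + 1e-8)
```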
3. Influential Subset Selection
MOTIVE supports two data selection strategies, depending on whether you want to optimize for a single target motion or for multiple diverse motions (both sketched below).
Single-query: Optimize for one specific motion by selecting the training clips with the highest motion-aware influence scores for that query.
Multi-query: Optimize for multiple motions by letting each query vote for the clips whose scores exceed a threshold, then selecting the clips with the most votes across queries.
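Both strategies reduce to a few lines; the sketch below is illustrative, and the vote threshold and top-k budget are assumed parameters.

```python
import numpy as np

def select_single_query(scores, k):
    """scores: (n_train,) motion-aware influence scores for one query motion."""
    return np.argsort(scores)[::-1][:k]                 # indices of the top-k clips

def select_multi_query(score_matrix, k, threshold=0.0):
    """score_matrix: (n_queries, n_train). Each query votes for clips whose score
    exceeds the threshold; clips are ranked by vote count (majority-vote aggregation)."""
    votes = (score_matrix > threshold).sum(axis=0)      # (n_train,) votes per clip
    return np.argsort(votes)[::-1][:k]
```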
MOTIVE uses optical flow to detect and visualize motion in videos. Below we show how our method identifies dynamic regions by comparing original videos with motion visualizations that isolate and highlight moving regions.
Interpretation: The motion visualization clearly isolates and highlights dynamic regions (people, moving objects, camera motion) while filtering out static backgrounds. This motion-aware weighting enables MOTIVE to focus attribution on temporal patterns rather than appearance, leading to better data selection for improving motion quality.
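For reference, one simple way to produce such a visualization (our illustrative choice of flow estimator and colormap, not necessarily what the project uses) is to compute dense optical flow between consecutive frames, take its magnitude, and overlay it as a heatmap:

```python
import cv2
import numpy as np

def motion_heatmap(prev_frame, next_frame, alpha=0.6):
    """Dense optical flow (Farneback) -> flow magnitude -> heatmap overlaid on the frame."""
    g0 = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=-1)                     # per-pixel motion magnitude
    mag = (255 * mag / (mag.max() + 1e-6)).astype(np.uint8)
    heat = cv2.applyColorMap(mag, cv2.COLORMAP_JET)
    return cv2.addWeighted(next_frame, 1 - alpha, heat, alpha, 0)
```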
We use MOTIVE to curate finetuning data for video generation models. Below, we compare videos from models finetuned with randomly selected data versus MOTIVE-selected high-influence data.
We evaluate our motion attribution framework on VBench and through human evaluation. The VBench comparison covers subject consistency, background consistency, motion smoothness, dynamic degree, aesthetic quality, and imaging quality across different data selection methods. The human evaluation consists of pairwise comparisons across 10 motion categories with 17 participants. Random selection and MOTIVE both select 10% of the training data, with our method using majority-vote aggregation across all motion queries.
Table 1: Performance comparison across data selection methods (all values in %, higher is better). Random selection and MOTIVE both use 10% of the training data.
Table 2: Pairwise comparisons across 50 videos with 17 participants (850 total). Win, tie, and loss rates show where our method is preferred, rated equal, or outperformed.
Without motion masking, attribution focuses on static appearance rather than dynamics. Our motion-aware weighting isolates temporal patterns. Below, we show example training clips whose influence scores changed significantly with motion masking, leading to better data selection.
| Without Motion Masking | With Motion Masking (Ours) |
|---|---|
| Attributes based on appearance similarity | Attributes based on motion patterns |
| Problem: Identifies training clips with similar colors, objects, or scenes, but may have conflicting motion patterns. | Solution: Focuses on dynamic regions using optical flow masks, identifying clips that improve temporal dynamics. |
MOTIVE-curated data yields substantial improvements in motion quality, achieving a 74.1% human preference win rate over baseline models while maintaining computational efficiency. Our work enables researchers and practitioners to build better models through principled data curation.
@article{wu2026motion,
We thank the following people (listed alphabetically by last name) for their helpful discussions, feedback, or participation in human studies: Allison Chen, Sanghyuk Chun, Amaya Dharmasiri, Xingyu Fu, Will Hwang, Yifeng Jiang, Amogh Joshi, Chen-Hsuan Lin, Huan Ling, Tiffany Ling, Shaowei Liu, Zhengyi Luo, Rafid Mahmood, Kaleb S. Newman, Julian Ost, Zeeshan Patel, Davis Rempe, Anya Tsvetkov, Esin Tureci, Sheng-Yu Wang, Tingwu Wang, Zian Wang, Hongyu Wen, Jon Williams, Donglai Xiang, Yilun Xu, William Yang, and Haotian Zhang.