We present MoRight, a unified motion-controllable video generation model that: (1) disentangles camera and motion control, allowing users to control each independently, and (2) models the causal relationships between motions, ensuring that user-driven actions trigger coherent reactions from other objects rather than merely displacing pixels. At inference, users can either supply an active motion, from which MoRight predicts its consequences, or specify a desired passive outcome, from which MoRight recovers a plausible driving action, all while freely adjusting the camera viewpoint.
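To make the two inference modes concrete, here is a minimal usage sketch. Every name in it (the moright module, the MoRight class, the generate signature, and the assumed track and camera formats) is a hypothetical illustration of the interface described above, not a released API.

import numpy as np
from moright import MoRight  # hypothetical package and class names

model = MoRight.from_pretrained("moright-base")  # hypothetical checkpoint id

# Assumed formats: point tracks as (num_points, num_frames, 2) pixel
# coordinates; camera trajectory as per-frame 4x4 extrinsic matrices.
tong_tracks = np.zeros((8, 49, 2), dtype=np.float32)  # active (driving) motion
camera_traj = np.tile(np.eye(4, dtype=np.float32), (49, 1, 1))

# Mode 1: supply the active motion; the model predicts the passive
# consequences (e.g., how the butter moves when the tongs move).
video = model.generate(image="scene.png",
                       active_tracks=tong_tracks,
                       camera=camera_traj)

# Mode 2: specify the desired passive outcome; the model recovers a
# plausible driving action. The camera remains freely adjustable.
butter_tracks = np.zeros((8, 49, 2), dtype=np.float32)  # desired outcome
video = model.generate(image="scene.png",
                       passive_tracks=butter_tracks,
                       camera=camera_traj)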
Visualization: Drag the slider to compare the input with track visualization (left) against the generated video from MoRight (right).
Visualization: Each row shows the same user action (e.g., moving the tongs) under different viewpoints. Each column shares the same viewpoint but shows different motions. Note that we only provide the active motion (the motion of the tongs), and the model reasons about the consequences (e.g., the motion of the butter).
① Incorrect interaction reasoning may lead to implausible outcomes (kabobs merging).
② Unnatural motion may occur when input tracks become temporally sparse due to occlusion (hand).
③ Physically unrealistic dynamics may appear, such as objects disappearing during motion (soccer ball).
④ Hallucinated content may emerge in later frames (extra hand).
@article{liu2025moright,
title = {MoRight: Motion Control Done Right},
author = {Liu, Shaowei and Ren, Xuanchi and Shen, Tianchang and Ling, Huan and
Gupta, Saurabh and Wang, Shenlong and Fidler, Sanja and Gao, Jun},
journal = {arXiv preprint arXiv:2604.07348},
year = {2025}
}