Spatial Intelligence Lab NVIDIA Research

MoRight: Motion Control Done Right

1 NVIDIA
2 University of Illinois Urbana-Champaign

We present MoRight, a unified motion-controllable video generation model that: (1) disentangles camera and object motion control, allowing users to control each independently, and (2) models the causal relationships between motions, ensuring that user-driven actions trigger coherent reactions from other objects rather than merely displacing pixels. At inference, users can either supply an active motion and let MoRight predict its consequences, or specify a desired passive outcome and let MoRight recover a plausible driving action, all while freely adjusting the camera viewpoint.

Forward Motion Causality Reasoning


The user specifies an action (e.g., action of hand), and the model “simulates” the consequences of the action (e.g. clothes move), predicting what happens next.

Visualization: Drag the slider to compare the input with track visualization (left) against the video generated by MoRight (right).

Inverse Motion Causality Reasoning


The user specifies an outcome (e.g., a ball rolling), and the model “reasons” what action caused the outcome (e.g. the human moves), effectively planning a plausible action.

Visualization: Drag the slider to compare the input with track visualization (left) against the video generated by MoRight (right).

Disentangled Camera–Object Control


The user specifies object motion and camera viewpoint independently. MoRight generates the same action across different viewpoints. Subtle differences may appear due to the stochastic nature of generation.

Visualization: Each row shows the same user action (e.g., moving the tongs) under different viewpoints; each column shows the same viewpoint with different motion. Note that we provide only the active motion (the motion of the tongs), and the model reasons about the consequences (e.g., the motion of the butter).

Method


Active vs. passive motion

We use a dual-stream architecture to separate motion and viewpoint: one stream captures object motion in the canonical frame, while the other models camera motion. The model learns to transfer object motion across views via cross-view self-attention. We further decompose motion into active (actions) and passive (responses), enabling the model to simulate cause and effect: predicting consequences from actions and inferring actions from outcomes.
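The cross-view transfer can be sketched as joint self-attention over the concatenated token sequences of the two streams, so that target-view tokens can attend to motion tokens defined in the canonical frame. The sketch below is illustrative only: the function names, the single-head formulation, and the identity query/key/value projections are simplifying assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_self_attention(canonical_tokens, target_tokens):
    """Joint attention over tokens from both streams.

    canonical_tokens: (Nc, d) motion tokens in the canonical frame.
    target_tokens:    (Nt, d) tokens of the target (camera) view.
    Concatenating both sequences lets every target-view token attend
    to canonical-frame motion tokens, transferring object motion
    across views in a single attention pass.
    """
    x = np.concatenate([canonical_tokens, target_tokens], axis=0)  # (Nc+Nt, d)
    d = x.shape[-1]
    # Single head with identity Q/K/V projections, for clarity.
    scores = x @ x.T / np.sqrt(d)        # (Nc+Nt, Nc+Nt)
    attn = softmax(scores, axis=-1)      # rows sum to 1
    out = attn @ x                       # mixed features per token
    n_canon = canonical_tokens.shape[0]
    return out[:n_canon], out[n_canon:]  # updated tokens per stream
```

In a full model each stream would apply its own learned projections and multiple heads; the key design point shown here is simply that attention spans both views' token sequences rather than each view in isolation.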

Method overview

Qualitative Comparison


Each row shows the same object motion under two camera controls (columns). ATI and WanMove are given privileged depth and the full active+passive tracks; MoRight uses only the active tracks, without privileged information.

Limitations


Each example compares MoRight with the ground truth.

① Incorrect interaction reasoning may lead to implausible outcomes (kabobs merging).

② Unnatural motion may occur when input tracks become temporally sparse due to occlusion (hand).

③ Physically unrealistic dynamics may appear, such as objects disappearing during motion (soccer ball).

④ Hallucinated content may emerge in later frames (extra hand).

Citation


@article{liu2025moright,
  title   = {MoRight: Motion Control Done Right},
  author  = {Liu, Shaowei and Ren, Xuanchi and Shen, Tianchang and Ling, Huan and
             Gupta, Saurabh and Wang, Shenlong and Fidler, Sanja and Gao, Jun},
  journal = {arXiv preprint arXiv:2604.07348},
  year    = {2025}
}