Spatial Intelligence Lab NVIDIA Research

MoRight: Motion Control Done Right

1 NVIDIA
2 University of Illinois Urbana-Champaign

We present MoRight, a unified motion-controllable video generation model that: (1) disentangles camera and object motion control, allowing users to control each independently, and (2) models the causal relationships between motions, ensuring that user-driven actions trigger coherent reactions from other objects rather than merely displacing pixels. At inference, users can either supply an active motion and let MoRight predict its consequences, or specify a desired passive outcome and let MoRight recover a plausible driving action, all while freely adjusting the camera viewpoint.

Forward Motion Causality Reasoning


The user specifies an action (e.g., action of hand), and the model “simulates” the consequences of the action (e.g. clothes move), predicting what happens next.

Visualization: Drag the slider to compare the input with track visualization (left) against the video generated by MoRight (right).

Inverse Motion Causality Reasoning


The user specifies an outcome (e.g., a ball rolling), and the model “reasons” what action caused the outcome (e.g. the human moves), effectively planning a plausible action.

Visualization: Drag the slider to compare the input with track visualization (left) against the video generated by MoRight (right).

Disentangled Camera–Object Control


The user specifies object motion and camera viewpoint independently. MoRight generates the same action across different viewpoints. Subtle differences may appear due to the stochastic nature of generation.

Visualization: Each row shows the same user action (e.g., moving the tongs) under different viewpoints; each column shows the same viewpoint with different motion. Note that we provide only the active motion (the motion of the tongs), and the model reasons about the consequences (e.g., the motion of the butter).

Method


Active vs. passive motion

We use a dual-stream architecture to separate motion and viewpoint: one stream captures object motion in the canonical frame, while the other models camera motion. The model learns to transfer object motion across views via cross-view self-attention. We further decompose motion into active (actions) and passive (responses), enabling the model to simulate cause and effect: predicting consequences from actions and inferring actions from outcomes.
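The cross-view transfer can be sketched as joint self-attention over the concatenated token sequences of the two streams, so that target-view tokens can attend to motion tokens defined in the canonical frame. The sketch below is illustrative only: the function names, the single-head formulation, and the identity query/key/value projections are simplifying assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_self_attention(canonical_tokens, target_tokens):
    """Joint attention over tokens from both streams.

    canonical_tokens: (Nc, d) motion tokens in the canonical frame.
    target_tokens:    (Nt, d) tokens of the target (camera) view.
    Concatenating both sequences lets every target-view token attend
    to canonical-frame motion tokens, transferring object motion
    across views in a single attention pass.
    """
    x = np.concatenate([canonical_tokens, target_tokens], axis=0)  # (Nc+Nt, d)
    d = x.shape[-1]
    # Single head with identity Q/K/V projections, for clarity.
    scores = x @ x.T / np.sqrt(d)        # (Nc+Nt, Nc+Nt)
    attn = softmax(scores, axis=-1)      # rows sum to 1
    out = attn @ x                       # mixed features per token
    n_canon = canonical_tokens.shape[0]
    return out[:n_canon], out[n_canon:]  # updated tokens per stream
```

In a full model each stream would apply its own learned projections and multiple heads; the key design point shown here is simply that attention spans both views' token sequences rather than each view in isolation.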

Method overview

Qualitative Comparison


Each row shows the same object motion under two camera controls (columns). ATI and WanMove are given privileged depth and the full active+passive tracks; MoRight uses only the active tracks, without privileged information.

Limitations


Each example compares MoRight with the ground truth.

① Incorrect interaction reasoning may lead to implausible outcomes (kabobs merging).

② Unnatural motion may occur when input tracks become temporally sparse due to occlusion (hand).

③ Physically unrealistic dynamics may appear, such as objects disappearing during motion (soccer ball).

④ Hallucinated content may emerge in later frames (extra hand).

Citation


@article{liu2025moright,
  title   = {MoRight: Motion Control Done Right},
  author  = {Liu, Shaowei and Ren, Xuanchi and Shen, Tianchang and Ling, Huan and
             Gupta, Saurabh and Wang, Shenlong and Fidler, Sanja and Gao, Jun},
  journal = {arXiv preprint arXiv:2604.07348},
  year    = {2025}
}