GENMO: A generalist model for human motion that handles multiple tasks with a single model, supporting diverse conditioning signals including video, keypoints, text, audio, and 3D keyframes.
Human motion modeling traditionally separates motion generation and estimation into distinct tasks with specialized models. Motion generation models focus on creating diverse, realistic motions from inputs like text, audio, or keyframes, while motion estimation models aim to reconstruct accurate motion trajectories from observations like videos. Despite sharing underlying representations of temporal dynamics and kinematics, this separation limits knowledge transfer between tasks and requires maintaining separate models. We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Leveraging the synergy between regression and diffusion, GENMO achieves accurate global motion estimation while enabling diverse motion generation. We also introduce an estimation-guided training objective that exploits in-the-wild videos with 2D annotations and text descriptions to enhance generative diversity. Furthermore, our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control. This unified approach creates synergistic benefits: generative priors improve estimated motions under challenging conditions like occlusions, while diverse video data enhances generation capabilities. Extensive experiments demonstrate GENMO's effectiveness as a generalist framework that successfully handles multiple human motion tasks within a single model.
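The core idea of reformulating estimation as constrained generation can be illustrated with a toy sketch. This is not the paper's actual architecture or training procedure: the denoiser below is a hypothetical stand-in for a learned model, and the clamping step simply shows the general inpainting-style pattern of re-imposing observed conditioning signals at every denoising step so that the output exactly satisfies them while unobserved frames are generated freely.

```python
# Toy illustration (assumptions, not the GENMO implementation):
# a diffusion-style sampler where observed frames are clamped to the
# conditioning signal at each step, so estimation becomes constrained generation.
import numpy as np

rng = np.random.default_rng(0)

T, D = 16, 3  # frames, per-frame motion feature dimension (toy sizes)
target = np.sin(np.linspace(0, 2 * np.pi, T))[:, None] * np.ones((1, D))

observed_mask = np.zeros(T, dtype=bool)
observed_mask[:4] = True           # pretend the first 4 frames are observed
observation = target[observed_mask]

def toy_denoiser(x, t):
    # Hypothetical stand-in for a learned denoiser:
    # each step pulls the sample toward a clean motion.
    return x + 0.5 * (target - x)

x = rng.normal(size=(T, D))        # start from pure noise
for t in range(50, 0, -1):
    x = toy_denoiser(x, t)
    # Constrained generation: observed frames must match the conditioning
    # signal exactly, while the rest of the sequence is generated.
    x[observed_mask] = observation

print(np.allclose(x[observed_mask], observation))  # → True
```

The same clamping pattern extends conceptually to the mixed multimodal setting described above: different time intervals can carry different constraints (video-derived poses, keyframes) while unconstrained intervals are filled in by the generative prior.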
All results are generated using a single unified model.
GENMO can generate motions from a sequence of conditions: starting with a video, following a text prompt, and returning to a video.
GENMO lets users change the text prompt or provide 3D keyframes to further control the generation.
GENMO also lets users swap out the first video.
Users can also add further conditioning signals, such as music.
GENMO seamlessly generates diverse human motions from multiple text prompts across editable time intervals, allowing for precise creative control.
GENMO achieves state-of-the-art performance on in-the-wild global human motion estimation.
Users can also use various text prompts to seamlessly connect the two videos above.
Here is another example with the same text prompt but two different videos.
GENMO can generate arbitrary-length motions in a single diffusion forward pass, without complex post-processing.
Here is an arbitrary-length motion generation example with multiple interleaved videos and text prompts.
Examples of motions generated from music inputs.
@article{genmo2025,
title={GENMO: A Generalist Model for Human Motion},
author={Li, Jiefeng and Cao, Jinkun and Zhang, Haotian and Rempe, Davis and Kautz, Jan and Iqbal, Umar and Yuan, Ye},
journal={arXiv preprint arXiv:2505.01425},
year={2025}
}
Template adapted from GLAMR.