GEM Logo

A Generalist Model for Human Motion

Announcement: GENMO has been renamed to GEM.

GEM Teaser

GEM: A generalist model for human motion that handles multiple tasks with a single model, supporting diverse conditioning signals including video, keypoints, text, audio, and 3D keyframes.

Abstract

Human motion modeling traditionally separates motion generation and estimation into distinct tasks with specialized models. Motion generation models focus on creating diverse, realistic motions from inputs like text, audio, or keyframes, while motion estimation models aim to reconstruct accurate motion trajectories from observations like videos. Despite sharing underlying representations of temporal dynamics and kinematics, this separation limits knowledge transfer between tasks and requires maintaining separate models. We present GEM, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Leveraging the synergy between regression and diffusion, GEM achieves accurate global motion estimation while enabling diverse motion generation. We also introduce an estimation-guided training objective that exploits in-the-wild videos with 2D annotations and text descriptions to enhance generative diversity. Furthermore, our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control. This unified approach creates synergistic benefits: generative priors improve estimated motions under challenging conditions like occlusions, while diverse video data enhances generation capabilities. Extensive experiments demonstrate GEM's effectiveness as a generalist framework that successfully handles multiple human motion tasks within a single model.
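The abstract describes conditioning the model on different modalities (video, text, audio, 3D keyframes) over different time intervals of a single motion sequence. As a minimal illustration of how such per-interval conditions can be laid out on a shared timeline, here is a hedged sketch; the `Condition` structure, function names, and interval layout below are illustrative assumptions, not GEM's actual API or implementation.

```python
# Illustrative sketch: mapping multimodal conditions onto a motion timeline.
# Assumes a fixed frame rate and contiguous condition spans (an assumption,
# not a claim about GEM's internals).
from dataclasses import dataclass


@dataclass
class Condition:
    modality: str   # e.g. "video", "text", "audio", "keyframes"
    start: int      # first frame the condition governs (inclusive)
    end: int        # last frame (exclusive)
    payload: object # raw signal or its embedding


def build_condition_masks(conditions, num_frames):
    """Return, per modality, a boolean mask over the timeline marking
    which frames each condition governs. Overlaps are allowed, since
    conditions such as text and music can apply simultaneously."""
    masks = {}
    for cond in conditions:
        mask = masks.setdefault(cond.modality, [False] * num_frames)
        for t in range(cond.start, min(cond.end, num_frames)):
            mask[t] = True
    return masks


# A timeline like the demos below: video, then text, with music overlapping.
conds = [
    Condition("video", 0, 60, "clip_a"),
    Condition("text", 60, 150, "a person starts dancing"),
    Condition("audio", 90, 150, "music.wav"),
]
masks = build_condition_masks(conds, num_frames=150)
```

The point of the sketch is that each frame can carry zero, one, or several active conditions, which is what lets a single sequence start from a video, follow a text prompt, and pick up music midway.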

All results are generated using a single unified model.

Motion Generation with Mixed Conditions — Video, Text, Music, and 3D Keyframes

GEM can generate motion from a sequence of conditions: starting from a video, following a text prompt, and returning to a video.

GEM allows users to swap the text prompt or provide 3D keyframes to further control the generation.

GEM also allows users to swap the first video.

Users can also add further conditioning signals, such as music.

Motion Generation with Multiple Texts

GEM seamlessly generates diverse human motions from multiple text prompts across editable time intervals, allowing for precise creative control.

In-the-Wild Global Human Motion Estimation

GEM achieves state-of-the-art performance on in-the-wild global human motion estimation.

Users can also supply different text prompts to seamlessly connect the two videos above.

Here is another example with the same text prompt but two different videos.

Arbitrary Length Motion Generation

GEM can generate motions of arbitrary length in a single diffusion forward pass, without complex post-processing.
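One common ingredient for handling variable-length sequences in a single batched forward pass is a padding mask that tells the network which frames are real and which are filler. The sketch below shows that idea in isolation; it is a generic illustration under that assumption, not GEM's actual batching or diffusion code.

```python
# Hedged sketch: batching motions of different lengths with a padding mask,
# so one forward pass can process all of them. Function and argument names
# are illustrative assumptions.
import numpy as np


def pad_batch(lengths, feat_dim, rng):
    """Pack variable-length noise sequences into one padded array.

    Returns:
        batch: (B, max_len, feat_dim) array, zero-padded past each length.
        mask:  (B, max_len) boolean array, True on real (unpadded) frames.
    """
    max_len = max(lengths)
    batch = np.zeros((len(lengths), max_len, feat_dim))
    mask = np.zeros((len(lengths), max_len), dtype=bool)
    for i, length in enumerate(lengths):
        # Diffusion sampling starts from Gaussian noise on the real frames.
        batch[i, :length] = rng.standard_normal((length, feat_dim))
        mask[i, :length] = True
    return batch, mask


# Three motions of different target lengths, one padded batch.
batch, mask = pad_batch([30, 90, 150], feat_dim=24, rng=np.random.default_rng(0))
```

A denoising network would then attend only where `mask` is True, so padded frames never influence the generated motion.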

Here is an arbitrary-length motion generation example with multiple interleaved videos and text prompts.

Music-to-Dance Generation

Examples of motions generated from music inputs.

Citation

@article{genmo2025,
  title={GENMO: A Generalist Model for Human Motion},
  author={Li, Jiefeng and Cao, Jinkun and Zhang, Haotian and Rempe, Davis and Kautz, Jan and Iqbal, Umar and Yuan, Ye},
  journal={arXiv preprint arXiv:2505.01425},
  year={2025}
}

Template adapted from GLAMR.