Kimodo:
Scaling Controllable Human
Motion Generation
Kimodo is a kinematic motion diffusion model trained on large-scale optical mocap data. It is controlled through text and constraints to generate high-quality 3D human and robot motions.
High-quality human motion data is becoming increasingly important for applications in robotics, simulation, and entertainment. Recent generative models offer a potential data source, enabling human motion synthesis through intuitive inputs like text prompts or kinematic constraints on poses. However, the small scale of public mocap datasets has limited the motion quality, control accuracy, and generalization of these models. In this work, we introduce Kimodo, an expressive and controllable kinematic motion diffusion model trained on 700 hours of optical motion capture data. Our model generates high-quality motions while being easily controlled through text and a comprehensive suite of kinematic constraints, including full-body keyframes, sparse joint positions/rotations, 2D waypoints, and dense 2D paths. This is enabled by a carefully designed motion representation and a two-stage denoiser architecture that decomposes root and body prediction to minimize motion artifacts while allowing flexible constraint conditioning. Experiments on our large-scale mocap dataset justify key design decisions and analyze how scaling the dataset and model size affects performance.
Key Capabilities of Kimodo
Text-to-Motion Generation
Kimodo can be intuitively controlled through text prompts to generate a wide range of behaviors.
- Locomotion
- Compositional Locomotion
- Object Interactions
- Dancing
- Stunts
- Gestures
- Stylized Motion
- Diverse Samples
- Sequence of Prompts
Full Body Constraints
The model can be conditioned on kinematic pose constraints, for example by constraining the full-body joint positions at specific frames. Constraints are visualized as red skeletons.
End-Effector Constraints
Various combinations of hands and feet (end-effectors) can also be constrained through joint positions and rotations. In these videos, the red color indicates the constrained joints.
Root Constraints
The global translation of the character can be controlled through 2D waypoints and dense paths. Our smoothed root motion representation enables closely following straight and curved paths while preserving natural pelvis motion.
Kimodo Code and Authoring Demo
Our code includes a motion authoring demo as a reference implementation for using the model in practice. The demo supports local generation on SOMA and G1 skeletons and allows for easy authoring with constraints and prompt sequences. We provide simple examples of how to use the Python API to generate and export motions for downstream applications.
Application: Data for Robotics
Kimodo trained on the G1 skeleton can generate humanoid demonstration data more quickly and easily than teleoperation. Motions can be exported to formats compatible with ProtoMotions and Mujoco for training physics-based policies.
Kimodo: Kinematic Motion Diffusion
Kimodo is an explicit motion diffusion model that generates 3D human motion by denoising a sequence of skeleton poses.
The model operates on a carefully designed motion representation that enables precise control over generated motion while minimizing
common artifacts, such as floating and foot skating. The motion representation features a smoothed root that
emulates paths drawn in practical animation tools, along with global joint rotations and positions amenable to sparse keyframe constraints.
At each step of the denoising process, the model takes in an embedding of the text prompt, a set of kinematic constraints, and
the current noisy motion. Constraints are specified using the same motion representation as the input motion, and are used to overwrite
the corresponding values in the noisy motion. Additionally, a mask indicating which elements are constrained is concatenated to the input motion.
Given these inputs, the two-stage transformer denoiser predicts a clean motion that aligns with the text and constraints.
The two-stage denoiser decomposes root and body motion prediction: the root denoiser first predicts global root motion, which is
transformed into a local representation as input to the body denoiser. The final output is the concatenation of the two stages.
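The constraint-conditioning step described above can be sketched as follows. This is a minimal illustration of the overwrite-and-mask idea, not the released implementation; the function name, array shapes, and feature layout are assumptions.

```python
import numpy as np

def apply_constraints(noisy_motion, constraints, mask):
    """Overwrite constrained elements of the noisy motion, then append the mask.

    noisy_motion, constraints: (frames, features) arrays in the same motion
    representation; mask: (frames, features) binary array where 1 marks a
    constrained element. All names and shapes here are illustrative.
    """
    # Constrained values replace the corresponding noisy values.
    conditioned = np.where(mask == 1.0, constraints, noisy_motion)
    # The mask is concatenated along the feature axis so the denoiser
    # can tell which inputs are hard constraints.
    return np.concatenate([conditioned, mask], axis=-1)

# Toy example: 4 frames, 3 features, with frame 0 fully keyframed.
noisy = np.zeros((4, 3))
keyframe = np.ones((4, 3))
mask = np.zeros((4, 3))
mask[0] = 1.0
model_input = apply_constraints(noisy, keyframe, mask)  # shape (4, 6)
```

In the full model, this conditioned input is fed to the root denoiser first; its predicted global root motion is converted to a local frame before the body denoiser runs, and the two outputs are concatenated.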
A key component in training Kimodo effectively is the Bones Rigplay dataset, a large studio mocap dataset containing over 700 hours
of production-quality human motion with corresponding text descriptions.
This large dataset also enables the construction of a comprehensive benchmark to evaluate various design decisions
on a wide range of behaviors and scenarios. Please see the paper for more details.
Humanoid Motion at NVIDIA
Kimodo is a part of a larger research effort at NVIDIA to support humanoid robotics and Physical AI by developing models and tools for 3D humanoid motion. It relies on and interoperates with the projects shown below.
SOMA Body Model
Kimodo is trained on the skeleton of the SOMA parametric body model.
BONES-SEED Dataset
Some versions of Kimodo are trained on the publicly available BONES-SEED dataset, which contains hundreds of hours of production-quality motion capture data on SOMA and G1.
ProtoMotions
The generated motions from Kimodo can be exported to the ProtoMotions framework for training physics-based policies for humanoids.
SOMA Retargeter
The G1 training data for Kimodo was retargeted using the Newton-based SOMA retargeter.
GEM
In a similar spirit to Kimodo, GEM is a motion diffusion model that reconstructs motion from monocular videos.
GEAR SONIC
Motions generated by Kimodo can be used as demonstrations for training robot tracking policies such as GEAR-SONIC.