Model#
Core model architecture, diffusion logic, and text encoders.
Kimodo Model#
Kimodo model: denoiser, text encoder, diffusion sampling, and post-processing.
- class kimodo.model.kimodo_model.Kimodo(
- denoiser,
- text_encoder,
- num_base_steps,
- device=None,
- cfg_type='separated',
Helper class for test time.
- property output_skeleton#
Skeleton used for model output (somaskel77 for SOMA, else unchanged).
- denoising_step(
- motion,
- pad_mask,
- text_feat,
- text_pad_mask,
- t,
- first_heading_angle,
- motion_mask,
- observed_motion,
- num_denoising_steps,
- cfg_weight,
- guide_masks=None,
- cfg_type=None,
Single denoising step.
- Returns:
[B, T, D] noisy motion used as input at step t-1
- Return type:
torch.Tensor
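The role of denoising_step in sampling can be sketched as a simple reverse loop. Here dummy_denoising_step is a hypothetical stand-in (the real method takes the full argument list above), and all shapes are illustrative only:

```python
import torch

# Hypothetical stand-in for the real denoising step; a real step would
# predict and remove noise, here we just scale toward zero for illustration.
def dummy_denoising_step(motion, t, num_denoising_steps):
    return motion * (t / num_denoising_steps)

B, T, D = 2, 60, 135               # batch, frames, motion feature dim (assumed)
motion = torch.randn(B, T, D)      # start from Gaussian noise
num_denoising_steps = 8

# Iterate from t = num_denoising_steps down to 1; each step's output is
# the [B, T, D] noisy motion fed into step t-1.
for t in range(num_denoising_steps, 0, -1):
    motion = dummy_denoising_step(motion, t, num_denoising_steps)

print(motion.shape)
```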
- __call__(
- prompts,
- num_frames,
- num_denoising_steps,
- multi_prompt=False,
- constraint_lst=[],
- cfg_weight=[2.0, 2.0],
- num_samples=None,
- cfg_type=None,
- return_numpy=False,
- first_heading_angle=None,
- num_transition_frames=5,
- share_transition=True,
- percentage_transition_override=0.1,
- post_processing=False,
- root_margin=0.04,
- progress_bar=tqdm.auto.tqdm,
Generate motion from text prompts and optional kinematic constraints.
When a single prompt/num_frames pair is given, one motion is generated. Passing lists of prompts and/or num_frames produces a batch of independent motions. With multi_prompt=True, the prompts are treated as sequential segments that are generated and stitched together with smooth transitions.
- Parameters:
prompts – One or more text descriptions of the desired motion. A single string generates one sample; a list generates a batch (or sequential segments when multi_prompt=True).
num_frames – Duration of the generated motion in frames. Can be a single int applied to every prompt or a per-prompt list.
num_denoising_steps – Number of DDIM denoising steps. More steps generally improve quality at the cost of speed.
multi_prompt – If True, treat prompts as an ordered sequence of segments and concatenate them with transitions.
constraint_lst – Per-sample list of kinematic constraints (e.g. keyframe poses, end-effector targets, 2-D paths). Pass an empty list for unconstrained generation.
cfg_weight – Classifier-free guidance scale(s). A two-element list [text_cfg, constraint_cfg] controls text and constraint guidance independently.
num_samples – Number of samples to generate.
cfg_type – Override the default CFG strategy set at init (e.g. "separated").
return_numpy – If True, convert all output tensors to numpy arrays.
first_heading_angle – Initial body heading in radians. Shape (B,) or scalar. Defaults to 0 (facing +Z).
num_transition_frames – Number of overlapping frames used to blend consecutive segments in multi-prompt mode.
share_transition – If True, transition frames are shared between adjacent segments rather than appended.
percentage_transition_override – Fraction of each segment’s length that may be overridden by the transition blend.
post_processing – If True, apply post-processing (foot-skate cleanup and constraint enforcement).
root_margin – Horizontal margin (in meters) used by the post-processor to decide when to correct root motion; when the root deviates from the constraint by more than this margin, it is corrected.
progress_bar – Callable wrapping an iterable to display progress (default: tqdm). Pass a no-op to silence output.
- Returns:
A dictionary of motion tensors (or numpy arrays if return_numpy=True) with the following keys:
local_rot_mats – Local joint rotations as rotation matrices.
global_rot_mats – Global joint rotations as rotation matrices.
posed_joints – Joint positions in world space.
root_positions – Root joint positions.
smooth_root_pos – Smoothed root trajectory.
foot_contacts – Boolean foot-contact labels [left heel, left toe, right heel, right toe].
global_root_heading – Root heading angle over time.
- Return type:
dict
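The structure of the returned dictionary can be illustrated with a stand-in built from dummy tensors. The key names follow the docs above; batch size, frame count, and joint count are assumptions for illustration, and the commented call is hypothetical since it needs a loaded checkpoint:

```python
import torch

# Hypothetical call (requires a loaded model, so shown only as a comment):
#   output = model("a person walks forward and waves", num_frames=120,
#                  num_denoising_steps=50)
# Stand-in for the documented return dictionary, with assumed shapes:
B, T, J = 1, 120, 24
output = {
    "local_rot_mats": torch.eye(3).expand(B, T, J, 3, 3),
    "global_rot_mats": torch.eye(3).expand(B, T, J, 3, 3),
    "posed_joints": torch.zeros(B, T, J, 3),
    "root_positions": torch.zeros(B, T, 3),
    "smooth_root_pos": torch.zeros(B, T, 3),
    "foot_contacts": torch.zeros(B, T, 4, dtype=torch.bool),
    "global_root_heading": torch.zeros(B, T),
}

# Downstream code typically reads joint positions and contact labels:
joints = output["posed_joints"]                       # [B, T, J, 3] world space
left_heel_contact = output["foot_contacts"][..., 0]   # [B, T] bool
print(joints.shape, left_heel_contact.dtype)
```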
Denoiser and Backbone#
Two-stage transformer denoiser: root stage then body stage for motion diffusion.
- class kimodo.model.twostage_denoiser.TwostageDenoiser(
- motion_rep,
- motion_mask_mode,
- ckpt_path=None,
- **kwargs,
Two-stage denoiser: first predicts global root features, then body features conditioned on the local root.
- __init__(
- motion_rep,
- motion_mask_mode,
- ckpt_path=None,
- **kwargs,
Build root and body transformer blocks; optionally load checkpoint from ckpt_path.
- load_ckpt(ckpt_path)[source]#
Load a checkpoint from path; state-dict keys are stripped of the ‘denoiser.backbone.’ prefix.
- forward(
- x,
- x_pad_mask,
- text_feat,
- text_feat_pad_mask,
- timesteps,
- first_heading_angle=None,
- motion_mask=None,
- observed_motion=None,
- Parameters:
x (torch.Tensor) – [B, T, dim_motion] current noisy motion
x_pad_mask (torch.Tensor) – [B, T] attention mask, positions with True are allowed to attend, False are not
text_feat (torch.Tensor) – [B, max_text_len, llm_dim] embedded text prompts
text_feat_pad_mask (torch.Tensor) – [B, max_text_len] attention mask, positions with True are allowed to attend, False are not
timesteps (torch.Tensor) – [B,] current denoising step
motion_mask
observed_motion
- Returns:
predicted motion, same shape as the input x
- Return type:
torch.Tensor
Transformer backbone: padding, masking, and encoder stack for the denoiser.
- kimodo.model.backbone.pad_x_and_mask_to_fixed_size(x, mask, size)[source]#
Pad a feature tensor x and its mask along the time axis to a fixed size.
- Parameters:
x (torch.Tensor) – [B, T, D]
mask (torch.Tensor) – [B, T]
size (int)
- Returns:
padded x of shape [B, size, D] and padded mask of shape [B, size]
- Return type:
Tuple[torch.Tensor, torch.Tensor]
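What this padding contract implies can be sketched as follows, assuming right-padding with zeros for the features and False (not allowed to attend) for the mask; the real implementation may differ:

```python
import torch

# Minimal sketch of fixed-size padding under the documented contract:
# x [B, T, D] and mask [B, T] are right-padded up to `size` frames.
def pad_to_fixed_size(x: torch.Tensor, mask: torch.Tensor, size: int):
    B, T, D = x.shape
    pad = size - T
    x_out = torch.cat([x, x.new_zeros(B, pad, D)], dim=1)
    # Padded positions get False, i.e. they are masked out of attention.
    mask_out = torch.cat([mask, mask.new_zeros(B, pad, dtype=torch.bool)], dim=1)
    return x_out, mask_out

x = torch.randn(2, 7, 4)
mask = torch.ones(2, 7, dtype=torch.bool)
x_p, m_p = pad_to_fixed_size(x, mask, size=10)
print(x_p.shape, m_p.shape)
```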
- class kimodo.model.backbone.TransformerEncoderBlock(conf)[source]#
- forward(
- x,
- x_pad_mask,
- text_feat,
- text_feat_pad_mask,
- timesteps,
- first_heading_angle=None,
- Parameters:
x (torch.Tensor) – [B, T, dim_motion] current noisy motion
x_pad_mask (torch.Tensor) – [B, T] attention mask, positions with True are allowed to attend, False are not
text_feat (torch.Tensor) – [B, max_text_len, llm_dim] embedded text prompts
text_feat_pad_mask (torch.Tensor) – [B, max_text_len] attention mask, positions with True are allowed to attend, False are not
timesteps (torch.Tensor) – [B,] current denoising step
- Returns:
[B, T, output_dim]
- Return type:
torch.Tensor
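Note the mask convention here: True marks positions allowed to attend. PyTorch's built-in attention modules use the opposite convention for key padding masks (True means the position is ignored), so a backbone like this one presumably inverts the mask before handing it to nn.MultiheadAttention or nn.TransformerEncoder; the inversion itself is just:

```python
import torch

# Doc convention: True = "allowed to attend"; here 3 valid frames, 2 padding.
x_pad_mask = torch.tensor([[True, True, True, False, False]])

# PyTorch key_padding_mask convention: True = "ignore this position".
key_padding_mask = ~x_pad_mask

print(key_padding_mask)
```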
- class kimodo.model.backbone.PositionalEncoding(d_model, dropout=0.1, max_len=5000)[source]#
Non-learned positional encoding.
- forward(x)[source]#
Apply positional encoding to input sequence.
- Parameters:
x (torch.Tensor) – [B, T, D] input motion sequence
- Returns:
[B, T, D] input motion with PE added to it (and optionally dropout)
- Return type:
torch.Tensor
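A sketch of the standard non-learned sinusoidal encoding this class presumably implements (the exact table construction is an assumption):

```python
import math
import torch

# Standard Transformer sinusoidal table: sin on even dims, cos on odd dims.
def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    position = torch.arange(max_len).unsqueeze(1)                     # [max_len, 1]
    div_term = torch.exp(torch.arange(0, d_model, 2)
                         * (-math.log(10000.0) / d_model))            # [d_model/2]
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_pe(max_len=5000, d_model=64)
x = torch.randn(2, 16, 64)     # [B, T, D] motion sequence
x = x + pe[: x.size(1)]        # add PE over the time axis (dropout omitted)
print(x.shape)
```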
- class kimodo.model.backbone.TimestepEmbedder(latent_dim, sequence_pos_encoder)[source]#
Encoder for the diffusion step.
- __init__(latent_dim, sequence_pos_encoder)[source]#
- Parameters:
latent_dim (int) – dim to encode to
sequence_pos_encoder (PositionalEncoding) – the PE to use on timesteps
- forward(timesteps)[source]#
Embed timesteps by applying the positional encoding, then passing the result through linear layers.
- Parameters:
timesteps (torch.Tensor) – [B]
- Returns:
[B, 1, D]
- Return type:
torch.Tensor
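The documented behavior, indexing the positional-encoding table at the diffusion step and passing the result through linear layers, can be sketched as below; the two-layer MLP and its activation are assumptions, not the actual architecture:

```python
import torch
import torch.nn as nn

class TimestepEmbedderSketch(nn.Module):
    """Hypothetical sketch: PE lookup at the timestep, then an MLP."""

    def __init__(self, latent_dim: int, pe_table: torch.Tensor):
        super().__init__()
        self.register_buffer("pe", pe_table)      # [max_len, latent_dim]
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.SiLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, timesteps: torch.Tensor) -> torch.Tensor:
        emb = self.pe[timesteps]                  # [B, latent_dim]
        return self.mlp(emb).unsqueeze(1)         # [B, 1, latent_dim]

pe_table = torch.randn(1000, 128)
embedder = TimestepEmbedderSketch(128, pe_table)
out = embedder(torch.tensor([10, 500]))
print(out.shape)
```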
Classifier-Free Guidance#
Classifier-free guidance wrapper for the denoiser at sampling time.
- class kimodo.model.cfg.ClassifierFreeGuidedModel(model, cfg_type='separated')[source]#
Wrapper around the denoiser to apply classifier-free guidance at sampling time.
- __init__(model, cfg_type='separated')[source]#
Wrap the denoiser for classifier-free guidance; cfg_type in CFG_TYPES (e.g. ‘regular’, ‘nocfg’).
- forward(
- cfg_weight,
- x,
- x_pad_mask,
- text_feat,
- text_feat_pad_mask,
- timesteps,
- first_heading_angle=None,
- motion_mask=None,
- observed_motion=None,
- cfg_type=None,
- Parameters:
cfg_weight (float or tuple) – guidance weight; a single float, or a tuple of (text, constraint) weights when using separated CFG
x (torch.Tensor) – [B, T, dim_motion] current noisy motion
x_pad_mask (torch.Tensor) – [B, T] attention mask, positions with True are allowed to attend, False are not
text_feat (torch.Tensor) – [B, max_text_len, llm_dim] embedded text prompts
text_feat_pad_mask (torch.Tensor) – [B, max_text_len] attention mask, positions with True are allowed to attend, False are not
timesteps (torch.Tensor) – [B,] current denoising step
motion_mask
observed_motion
neutral_joints (torch.Tensor) – [B, nbjoints] The neutral joints of the motions
- Returns:
guided prediction, same shape as the input x
- Return type:
torch.Tensor
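The guidance combination itself follows the standard classifier-free formulation. A sketch of how the "separated" variant might weight the text and constraint branches independently follows; the exact formula used by this class is an assumption:

```python
import torch

# Standard CFG update, extended with two independent weights as the
# "separated" mode's (text, constraint) tuple suggests.
def cfg_combine(pred_uncond, pred_text, pred_constraint, w_text, w_constraint):
    return (pred_uncond
            + w_text * (pred_text - pred_uncond)
            + w_constraint * (pred_constraint - pred_uncond))

B, T, D = 2, 60, 135
pred_u = torch.zeros(B, T, D)   # unconditional branch
pred_t = torch.ones(B, T, D)    # text-conditioned branch
pred_c = torch.ones(B, T, D)    # constraint-conditioned branch
out = cfg_combine(pred_u, pred_t, pred_c, w_text=2.0, w_constraint=2.0)
print(out.mean().item())
```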
Model Loading#
Load Kimodo diffusion models from local checkpoints or Hugging Face.
- kimodo.model.load_model.load_model(
- modelname=None,
- device=None,
- eval_mode=True,
- default_family='Kimodo',
- return_resolved_name=False,
Load a kimodo model by name (e.g. ‘g1’, ‘soma’).
Partial and full names (e.g. Kimodo-SOMA-RP-v1, SOMA) are resolved inside this function using default_family when the name is not a known short key.
- Parameters:
modelname – Model identifier; uses DEFAULT_MODEL if None. Can be a short key, a full name (e.g. Kimodo-SOMA-RP-v1), or a partial name; unknown names are resolved via resolve_model_name using default_family.
device – Target device for the model (e.g. ‘cuda’, ‘cpu’).
eval_mode – If True, set model to eval mode.
default_family – Used when modelname is not in AVAILABLE_MODELS to resolve partial names (“Kimodo” for demo/generation, “TMR” for embed script). Default “Kimodo”.
return_resolved_name – If True, return (model, resolved_short_key). If False, return only the model.
- Returns:
Loaded model in eval mode, or (model, resolved short key) if return_resolved_name is True.
- Raises:
ValueError – If modelname is not in AVAILABLE_MODELS and cannot be resolved.
FileNotFoundError – If config.yaml is missing in the checkpoint folder.
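The resolution rules described above can be sketched with a hypothetical registry; the real resolve_model_name logic and the contents of AVAILABLE_MODELS differ:

```python
# Hypothetical registry; the actual AVAILABLE_MODELS mapping differs.
AVAILABLE_MODELS = {"soma": "Kimodo-SOMA-RP-v1", "g1": "Kimodo-G1-RP-v1"}

def resolve_model_name(modelname: str, default_family: str = "Kimodo") -> str:
    """Assumed resolution order: short key, then full name, then partial."""
    if modelname in AVAILABLE_MODELS:            # known short key
        return AVAILABLE_MODELS[modelname]
    if modelname.startswith(default_family):     # already a full name
        return modelname
    return f"{default_family}-{modelname}"       # partial: prepend the family

print(resolve_model_name("soma"))                # short key -> full name
print(resolve_model_name("Kimodo-SOMA-RP-v1"))   # full name passes through
print(resolve_model_name("SOMA"))                # partial name gets the family
```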
Text Encoder#
Remote text encoder API client (Gradio) for motion generation.