Model#

Core model architecture, diffusion logic, and text encoders.

Kimodo Model#

Kimodo model: denoiser, text encoder, diffusion sampling, and post-processing.

class kimodo.model.kimodo_model.Kimodo(
denoiser,
text_encoder,
num_base_steps,
device=None,
cfg_type='separated',
)[source]#


Helper class for test-time sampling (inference).

property output_skeleton#

Skeleton used for model output (somaskel77 for SOMA, else unchanged).

train(mode)[source]#
eval()[source]#
denoising_step(
motion,
pad_mask,
text_feat,
text_pad_mask,
t,
first_heading_angle,
motion_mask,
observed_motion,
num_denoising_steps,
cfg_weight,
guide_masks=None,
cfg_type=None,
)[source]#

Single denoising step.

Returns:

[B, T, D] noisy motion to be used as input at step t-1

Return type:

torch.Tensor
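As a rough illustration of how a sampler drives `denoising_step` (each step's output becomes the noisy input for step t-1), here is a toy loop with a stand-in step function. `dummy_denoising_step` and `sample` are hypothetical sketches, not part of the kimodo API, and the toy update bears no relation to the real denoiser:

```python
import numpy as np

def dummy_denoising_step(motion, t):
    # Stand-in for Kimodo.denoising_step: nudge the sample toward a
    # toy clean target. The real step runs the CFG-wrapped denoiser.
    target = np.zeros_like(motion)
    alpha = 1.0 / (t + 1)
    return motion + alpha * (target - motion)

def sample(shape, num_denoising_steps, rng):
    # Start from Gaussian noise and iterate t = T-1 ... 0, feeding each
    # step's [B, T, D] output back in as the noisy motion for t-1.
    motion = rng.standard_normal(shape)
    for t in reversed(range(num_denoising_steps)):
        motion = dummy_denoising_step(motion, t)
    return motion

rng = np.random.default_rng(0)
out = sample((1, 16, 8), num_denoising_steps=10, rng=rng)
```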

__call__(
prompts,
num_frames,
num_denoising_steps,
multi_prompt=False,
constraint_lst=[],
cfg_weight=[2.0, 2.0],
num_samples=None,
cfg_type=None,
return_numpy=False,
first_heading_angle=None,
num_transition_frames=5,
share_transition=True,
percentage_transition_override=0.1,
post_processing=False,
root_margin=0.04,
progress_bar=tqdm.auto.tqdm,
)[source]#

Generate motion from text prompts and optional kinematic constraints.

When a single prompt/num_frames pair is given, one motion is generated. Passing lists of prompts and/or num_frames produces a batch of independent motions. With multi_prompt=True, the prompts are treated as sequential segments that are generated and stitched together with smooth transitions.

Parameters:
  • prompts – One or more text descriptions of the desired motion. A single string generates one sample; a list generates a batch (or sequential segments when multi_prompt=True).

  • num_frames – Duration of the generated motion in frames. Can be a single int applied to every prompt or a per-prompt list.

  • num_denoising_steps – Number of DDIM denoising steps. More steps generally improve quality at the cost of speed.

  • multi_prompt – If True, treat prompts as an ordered sequence of segments and concatenate them with transitions.

  • constraint_lst – Per-sample list of kinematic constraints (e.g. keyframe poses, end-effector targets, 2-D paths). Pass an empty list for unconstrained generation.

  • cfg_weight – Classifier-free guidance scale(s). A two-element list [text_cfg, constraint_cfg] controls text and constraint guidance independently.

  • num_samples – Number of samples to generate.

  • cfg_type – Override the default CFG strategy set at init (e.g. "separated").

  • return_numpy – If True, convert all output tensors to numpy arrays.

  • first_heading_angle – Initial body heading in radians. Shape (B,) or scalar. Defaults to 0 (facing +Z).

  • num_transition_frames – Number of overlapping frames used to blend consecutive segments in multi-prompt mode.

  • share_transition – If True, transition frames are shared between adjacent segments rather than appended.

  • percentage_transition_override – Fraction of each segment’s length that may be overridden by the transition blend.

  • post_processing – If True, apply post-processing (foot-skate cleanup and constraint enforcement).

  • root_margin – Horizontal margin (in meters) used by the post-processor to decide when to correct root motion. When the root deviates from the constraint by more than this margin, the post-processor corrects it.

  • progress_bar – Callable wrapping an iterable to display progress (default: tqdm). Pass a no-op to silence output.

Returns:

A dictionary of motion tensors (or numpy arrays if return_numpy=True) with the following keys:

  • local_rot_mats – Local joint rotations as rotation matrices.

  • global_rot_mats – Global joint rotations as rotation matrices.

  • posed_joints – Joint positions in world space.

  • root_positions – Root joint positions.

  • smooth_root_pos – Smoothed root trajectory.

  • foot_contacts – Boolean foot-contact labels [left heel, left toe, right heel, right toe].

  • global_root_heading – Root heading angle over time.

Return type:

dict
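To illustrate consuming the returned dictionary, the sketch below builds a dummy result with a few of the documented keys (shapes other than the leading [B, T] dims are illustrative, and the joint count `J` is an assumption) and reads off foot contacts and root travel:

```python
import numpy as np

B, T, J = 1, 8, 24  # batch, frames, joints (J is illustrative)
rng = np.random.default_rng(0)
output = {
    "posed_joints": rng.standard_normal((B, T, J, 3)),
    "root_positions": rng.standard_normal((B, T, 3)),
    # columns: [left heel, left toe, right heel, right toe]
    "foot_contacts": rng.random((B, T, 4)) > 0.5,
}

# Frames where either left-foot marker is in contact.
left_contact = output["foot_contacts"][..., :2].any(axis=-1)   # [B, T]

# Horizontal (x/z plane) distance travelled by the root over the clip.
root_xz = output["root_positions"][:, :, [0, 2]]
travel = np.linalg.norm(root_xz[:, -1] - root_xz[:, 0], axis=-1)
```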

Denoiser and Backbone#

Two-stage transformer denoiser: root stage then body stage for motion diffusion.

class kimodo.model.twostage_denoiser.TwostageDenoiser(
motion_rep,
motion_mask_mode,
ckpt_path=None,
**kwargs,
)[source]#


Two-stage denoiser: first predicts global root features, then body features conditioned on local root.

__init__(
motion_rep,
motion_mask_mode,
ckpt_path=None,
**kwargs,
)[source]#

Build root and body transformer blocks; optionally load checkpoint from ckpt_path.

load_ckpt(ckpt_path)[source]#

Load checkpoint from path; state dict keys are stripped of ‘denoiser.backbone.’ prefix.
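The key-stripping behaviour described above can be sketched as a plain dictionary transform (`strip_prefix` is a hypothetical helper, not the actual implementation):

```python
def strip_prefix(state_dict, prefix="denoiser.backbone."):
    # Remove the training-time wrapper prefix from checkpoint keys so
    # they match the bare backbone module's parameter names.
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }

ckpt = {"denoiser.backbone.layer.weight": 1, "other.bias": 2}
clean = strip_prefix(ckpt)
```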

forward(
x,
x_pad_mask,
text_feat,
text_feat_pad_mask,
timesteps,
first_heading_angle=None,
motion_mask=None,
observed_motion=None,
)[source]#
Parameters:
  • x (torch.Tensor) – [B, T, dim_motion] current noisy motion

  • x_pad_mask (torch.Tensor) – [B, T] attention mask, positions with True are allowed to attend, False are not

  • text_feat (torch.Tensor) – [B, max_text_len, llm_dim] embedded text prompts

  • text_feat_pad_mask (torch.Tensor) – [B, max_text_len] attention mask, positions with True are allowed to attend, False are not

  • timesteps (torch.Tensor) – [B,] current denoising step

  • motion_mask

  • observed_motion

Returns:

same size as input x

Return type:

torch.Tensor
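The padding-mask convention above (True = position may attend, False = padding) is the usual lengths-to-mask pattern. A minimal numpy sketch, assuming masks are built from per-sample sequence lengths:

```python
import numpy as np

def lengths_to_pad_mask(lengths, max_len):
    # True marks real positions that may attend; False marks padding,
    # matching the convention documented for x_pad_mask.
    idx = np.arange(max_len)[None, :]          # [1, max_len]
    return idx < np.asarray(lengths)[:, None]  # [B, max_len]

mask = lengths_to_pad_mask([3, 5], max_len=5)
```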

Transformer backbone: padding, masking, and encoder stack for the denoiser.

kimodo.model.backbone.pad_x_and_mask_to_fixed_size(x, mask, size)[source]#

Pad a feature tensor x and its mask along the time dimension so they always have the given size.

Parameters:
  • x (torch.Tensor) – [B, T, D] input features

  • mask (torch.Tensor) – [B, T] padding mask

  • size (int) – target sequence length

Returns:

Padded features of shape [B, size, D] and padded mask of shape [B, size]

Return type:

tuple[torch.Tensor, torch.Tensor]
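A minimal numpy sketch of this padding behaviour, assuming features are zero-padded and mask positions beyond the original length are False (i.e. not attended):

```python
import numpy as np

def pad_x_and_mask_sketch(x, mask, size):
    # Zero-pad features and extend the mask with False up to `size`
    # along the time axis.
    B, T, D = x.shape
    x_pad = np.zeros((B, size, D), dtype=x.dtype)
    x_pad[:, :T] = x
    m_pad = np.zeros((B, size), dtype=bool)
    m_pad[:, :T] = mask
    return x_pad, m_pad

x = np.ones((2, 3, 4))
mask = np.ones((2, 3), dtype=bool)
x_pad, m_pad = pad_x_and_mask_sketch(x, mask, size=6)
```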

class kimodo.model.backbone.TransformerEncoderBlock(conf)[source]#


__init__(conf)[source]#
forward(
x,
x_pad_mask,
text_feat,
text_feat_pad_mask,
timesteps,
first_heading_angle=None,
)[source]#
Parameters:
  • x (torch.Tensor) – [B, T, dim_motion] current noisy motion

  • x_pad_mask (torch.Tensor) – [B, T] attention mask, positions with True are allowed to attend, False are not

  • text_feat (torch.Tensor) – [B, max_text_len, llm_dim] embedded text prompts

  • text_feat_pad_mask (torch.Tensor) – [B, max_text_len] attention mask, positions with True are allowed to attend, False are not

  • timesteps (torch.Tensor) – [B,] current denoising step

Returns:

[B, T, output_dim]

Return type:

torch.Tensor

class kimodo.model.backbone.PositionalEncoding(d_model, dropout=0.1, max_len=5000)[source]#


Non-learned positional encoding.

__init__(d_model, dropout=0.1, max_len=5000)[source]#
Parameters:
  • d_model (int) – input dim

  • dropout (Optional[float] = 0.1) – dropout probability on output

  • max_len (Optional[int] = 5000) – maximum sequence length

forward(x)[source]#

Apply positional encoding to input sequence.

Parameters:

x (torch.Tensor) – [B, T, D] input motion sequence

Returns:

[B, T, D] input motion with PE added to it (and optionally dropout)

Return type:

torch.Tensor
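Since the encoding is described as non-learned, it is presumably the classic sinusoidal table. A numpy sketch under that assumption (the actual table construction may differ):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    # Classic sin/cos positional-encoding table: even columns get
    # sin, odd columns get cos, with geometrically spaced frequencies.
    pos = np.arange(max_len)[:, None]
    div = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * div)
    pe[:, 1::2] = np.cos(pos * div)
    return pe

pe = sinusoidal_pe(max_len=50, d_model=8)
x = np.zeros((2, 10, 8))
y = x + pe[None, :10]   # forward(): add PE to a [B, T, D] input
```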

class kimodo.model.backbone.TimestepEmbedder(latent_dim, sequence_pos_encoder)[source]#


Encoder for diffusion step.

__init__(latent_dim, sequence_pos_encoder)[source]#
Parameters:
  • latent_dim (int) – dim to encode to

  • sequence_pos_encoder (PositionalEncoding) – the PE to use on timesteps

forward(timesteps)[source]#

Embed timesteps by looking up their positional encoding and passing the result through linear layers.

Parameters:

timesteps (torch.Tensor) – [B]

Returns:

[B, 1, D]

Return type:

torch.Tensor
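A numpy sketch of that flow: index the PE table at each timestep, then apply a small MLP. The layer count, activation, and layer sizes here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, max_len = 16, 1000
pe = rng.standard_normal((max_len, latent_dim))  # stands in for the PE table
W1 = rng.standard_normal((latent_dim, latent_dim))
W2 = rng.standard_normal((latent_dim, latent_dim))

def embed_timesteps(timesteps):
    # Look up the positional encoding at each timestep index, then pass
    # it through two linear layers (layer shapes are assumptions).
    h = pe[timesteps]              # [B, latent_dim]
    h = np.maximum(h @ W1, 0.0)    # linear + ReLU
    return (h @ W2)[:, None, :]    # [B, 1, latent_dim]

emb = embed_timesteps(np.array([0, 7, 42]))
```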

Classifier-Free Guidance#

Classifier-free guidance wrapper for the denoiser at sampling time.

class kimodo.model.cfg.ClassifierFreeGuidedModel(model, cfg_type='separated')[source]#


Wrapper around denoiser to use classifier-free guidance at sampling time.

__init__(model, cfg_type='separated')[source]#

Wrap the denoiser for classifier-free guidance; cfg_type in CFG_TYPES (e.g. ‘regular’, ‘nocfg’).

forward(
cfg_weight,
x,
x_pad_mask,
text_feat,
text_feat_pad_mask,
timesteps,
first_heading_angle=None,
motion_mask=None,
observed_motion=None,
cfg_type=None,
)[source]#
Parameters:
  • cfg_weight (float | tuple[float, float]) – guidance weight; a single float, or a (text, constraint) pair of weights when using separated CFG

  • x (torch.Tensor) – [B, T, dim_motion] current noisy motion

  • x_pad_mask (torch.Tensor) – [B, T] attention mask, positions with True are allowed to attend, False are not

  • text_feat (torch.Tensor) – [B, max_text_len, llm_dim] embedded text prompts

  • text_feat_pad_mask (torch.Tensor) – [B, max_text_len] attention mask, positions with True are allowed to attend, False are not

  • timesteps (torch.Tensor) – [B,] current denoising step

  • motion_mask

  • observed_motion

  • neutral_joints (torch.Tensor) – [B, nbjoints] The neutral joints of the motions

Returns:

same size as input x

Return type:

torch.Tensor
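To make "separated" guidance concrete, here is one plausible combination rule: the text-conditional and constraint-conditional directions each get their own weight. The exact formula Kimodo uses is not shown in these docs, so this sketch is an assumption:

```python
import numpy as np

def separated_cfg(pred_uncond, pred_text, pred_full, w_text, w_constraint):
    # Hypothetical separated guidance: weight the text direction and
    # the constraint direction independently. With both weights at 1.0
    # this reduces to the fully conditioned prediction.
    return (pred_uncond
            + w_text * (pred_text - pred_uncond)
            + w_constraint * (pred_full - pred_text))

rng = np.random.default_rng(0)
u, t, f = (rng.standard_normal((1, 4, 8)) for _ in range(3))
out = separated_cfg(u, t, f, w_text=2.0, w_constraint=2.0)
```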

Model Loading#

Load Kimodo diffusion models from local checkpoints or Hugging Face.

kimodo.model.load_model.load_model(
modelname=None,
device=None,
eval_mode=True,
default_family='Kimodo',
return_resolved_name=False,
)[source]#

Load a kimodo model by name (e.g. ‘g1’, ‘soma’).

Resolution of partial/full names (e.g. Kimodo-SOMA-RP-v1, SOMA) is done inside this function using default_family when the name is not a known short key.

Parameters:
  • modelname – Model identifier; uses DEFAULT_MODEL if None. Can be a short key, a full name (e.g. Kimodo-SOMA-RP-v1), or a partial name; unknown names are resolved via resolve_model_name using default_family.

  • device – Target device for the model (e.g. ‘cuda’, ‘cpu’).

  • eval_mode – If True, set model to eval mode.

  • default_family – Used when modelname is not in AVAILABLE_MODELS to resolve partial names (“Kimodo” for demo/generation, “TMR” for embed script). Default “Kimodo”.

  • return_resolved_name – If True, return (model, resolved_short_key). If False, return only the model.

Returns:

Loaded model in eval mode, or (model, resolved short key) if return_resolved_name is True.

Raises:
  • ValueError – If modelname is not in AVAILABLE_MODELS and cannot be resolved.

  • FileNotFoundError – If config.yaml is missing in the checkpoint folder.
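The name-resolution behaviour described above can be sketched as follows. The registry contents here are placeholders (only Kimodo-SOMA-RP-v1 appears in these docs), and `resolve_model_name_sketch` is a hypothetical stand-in for the real resolve_model_name:

```python
AVAILABLE_MODELS = {"g1": "Kimodo-G1-v1", "soma": "Kimodo-SOMA-RP-v1"}

def resolve_model_name_sketch(modelname, default_family="Kimodo"):
    # Short keys resolve directly; otherwise try to match a partial or
    # full name within the default family, as the docs describe.
    if modelname in AVAILABLE_MODELS:
        return modelname
    for key, full in AVAILABLE_MODELS.items():
        if full.startswith(default_family) and modelname.lower() in full.lower():
            return key
    raise ValueError(f"cannot resolve model name: {modelname!r}")

key = resolve_model_name_sketch("SOMA")
```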

Text Encoder#

Remote text encoder API client (Gradio) for motion generation.

class kimodo.model.text_encoder_api.TextEncoderAPI(url)[source]#

Bases: object

Text encoder API client for motion generation.

to(device=None, dtype=None)[source]#
__call__(texts)[source]#

Encode text prompts into tensors.

Parameters:

texts (str | list[str]) – text prompts to encode

Returns:

encoded text tensors and their lengths

Return type:

tuple[torch.Tensor, list[int]]
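The returned (features, lengths) pair plugs into the [B, max_text_len, D] / pad-mask convention used by the denoiser. A numpy sketch of batching variable-length per-prompt features into that shape (`batch_text_features` is a hypothetical helper, not part of the API):

```python
import numpy as np

def batch_text_features(feats):
    # Zero-pad variable-length per-prompt features to [B, max_text_len, D]
    # and return the original lengths alongside.
    lengths = [f.shape[0] for f in feats]
    B, L, D = len(feats), max(lengths), feats[0].shape[1]
    out = np.zeros((B, L, D), dtype=feats[0].dtype)
    for i, f in enumerate(feats):
        out[i, : f.shape[0]] = f
    return out, lengths

feats = [np.ones((3, 4)), np.ones((5, 4))]
batched, lengths = batch_text_features(feats)
```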