Model#
Core model architecture, diffusion logic, and text encoders.
Kimodo Model#
Kimodo model: denoiser, text encoder, diffusion sampling, and post-processing.
- class kimodo.model.kimodo_model.Kimodo(
- denoiser,
- text_encoder,
- num_base_steps,
- device=None,
- cfg_type='separated',
Helper class for test time.
- property output_skeleton#
Skeleton used for model output (somaskel77 for SOMA, else unchanged).
- denoising_step(
- motion,
- pad_mask,
- text_feat,
- text_pad_mask,
- t,
- first_heading_angle,
- motion_mask,
- observed_motion,
- num_denoising_steps,
- cfg_weight,
- guide_masks=None,
- cfg_type=None,
Single denoising step.
- Returns:
[B, T, D] noisy motion used as input at step t-1
- Return type:
torch.Tensor
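The role of denoising_step in sampling can be sketched as a simple reverse loop. Here dummy_denoising_step is a hypothetical stand-in (the real method takes the full argument list above), and all shapes are illustrative only:

```python
import torch

# Hypothetical stand-in for the real denoising step; a real step would
# predict and remove noise, here we just scale toward zero for illustration.
def dummy_denoising_step(motion, t, num_denoising_steps):
    return motion * (t / num_denoising_steps)

B, T, D = 2, 60, 135               # batch, frames, motion feature dim (assumed)
motion = torch.randn(B, T, D)      # start from Gaussian noise
num_denoising_steps = 8

# Iterate from t = num_denoising_steps down to 1; each step's output is
# the [B, T, D] noisy motion fed into step t-1.
for t in range(num_denoising_steps, 0, -1):
    motion = dummy_denoising_step(motion, t, num_denoising_steps)

print(motion.shape)
```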
- __call__(
- prompts,
- num_frames,
- num_denoising_steps,
- multi_prompt=False,
- constraint_lst=[],
- cfg_weight=[2.0, 2.0],
- num_samples=None,
- cfg_type=None,
- return_numpy=False,
- first_heading_angle=None,
- num_transition_frames=5,
- share_transition=True,
- percentage_transition_override=0.1,
- post_processing=False,
- root_margin=0.04,
- progress_bar=tqdm.auto.tqdm,
Generate motion from text prompts and optional kinematic constraints.
When a single prompt/num_frames pair is given, one motion is generated. Passing lists of prompts and/or num_frames produces a batch of independent motions. With multi_prompt=True, the prompts are treated as sequential segments that are generated and stitched together with smooth transitions.
- Parameters:
prompts – One or more text descriptions of the desired motion. A single string generates one sample; a list generates a batch (or sequential segments when multi_prompt=True).
num_frames – Duration of the generated motion in frames. Can be a single int applied to every prompt or a per-prompt list.
num_denoising_steps – Number of DDIM denoising steps. More steps generally improve quality at the cost of speed.
multi_prompt – If True, treat prompts as an ordered sequence of segments and concatenate them with transitions.
constraint_lst – Per-sample list of kinematic constraints (e.g. keyframe poses, end-effector targets, 2-D paths). Pass an empty list for unconstrained generation.
cfg_weight – Classifier-free guidance scale(s). A two-element list [text_cfg, constraint_cfg] controls text and constraint guidance independently.
num_samples – Number of samples to generate.
cfg_type – Override the default CFG strategy set at init (e.g. "separated").
return_numpy – If True, convert all output tensors to numpy arrays.
first_heading_angle – Initial body heading in radians. Shape (B,) or scalar. Defaults to 0 (facing +Z).
num_transition_frames – Number of overlapping frames used to blend consecutive segments in multi-prompt mode.
share_transition – If True, transition frames are shared between adjacent segments rather than appended.
percentage_transition_override – Fraction of each segment’s length that may be overridden by the transition blend.
post_processing – If True, apply post-processing (foot-skate cleanup and constraint enforcement).
root_margin – Horizontal margin (in meters) used by the post-processor to decide when to correct root motion; when the root deviates from the constraint by more than this margin, it is corrected.
progress_bar – Callable wrapping an iterable to display progress (default: tqdm). Pass a no-op to silence output.
- Returns:
A dictionary of motion tensors (or numpy arrays if return_numpy=True) with the following keys:
local_rot_mats – Local joint rotations as rotation matrices.
global_rot_mats – Global joint rotations as rotation matrices.
posed_joints – Joint positions in world space.
root_positions – Root joint positions.
smooth_root_pos – Smoothed root trajectory.
foot_contacts – Boolean foot-contact labels [left heel, left toe, right heel, right toe].
global_root_heading – Root heading angle over time.
- Return type:
dict
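The structure of the returned dictionary can be illustrated with a stand-in built from dummy tensors. The key names follow the docs above; batch size, frame count, and joint count are assumptions for illustration, and the commented call is hypothetical since it needs a loaded checkpoint:

```python
import torch

# Hypothetical call (requires a loaded model, so shown only as a comment):
#   output = model("a person walks forward and waves", num_frames=120,
#                  num_denoising_steps=50)
# Stand-in for the documented return dictionary, with assumed shapes:
B, T, J = 1, 120, 24
output = {
    "local_rot_mats": torch.eye(3).expand(B, T, J, 3, 3),
    "global_rot_mats": torch.eye(3).expand(B, T, J, 3, 3),
    "posed_joints": torch.zeros(B, T, J, 3),
    "root_positions": torch.zeros(B, T, 3),
    "smooth_root_pos": torch.zeros(B, T, 3),
    "foot_contacts": torch.zeros(B, T, 4, dtype=torch.bool),
    "global_root_heading": torch.zeros(B, T),
}

# Downstream code typically reads joint positions and contact labels:
joints = output["posed_joints"]                       # [B, T, J, 3] world space
left_heel_contact = output["foot_contacts"][..., 0]   # [B, T] bool
print(joints.shape, left_heel_contact.dtype)
```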
Denoiser and Backbone#
Two-stage transformer denoiser: root stage then body stage for motion diffusion.
- class kimodo.model.twostage_denoiser.TwostageDenoiser(
- motion_rep,
- motion_mask_mode,
- ckpt_path=None,
- **kwargs,
Two-stage denoiser: first predicts global root features, then body features conditioned on the local root.
- __init__(
- motion_rep,
- motion_mask_mode,
- ckpt_path=None,
- **kwargs,
Build root and body transformer blocks; optionally load checkpoint from ckpt_path.
- load_ckpt(ckpt_path)[source]#
Load a checkpoint from path; state-dict keys are stripped of the ‘denoiser.backbone.’ prefix.
- forward(
- x,
- x_pad_mask,
- text_feat,
- text_feat_pad_mask,
- timesteps,
- first_heading_angle=None,
- motion_mask=None,
- observed_motion=None,
- Parameters:
x (torch.Tensor) – [B, T, dim_motion] current noisy motion
x_pad_mask (torch.Tensor) – [B, T] attention mask, positions with True are allowed to attend, False are not
text_feat (torch.Tensor) – [B, max_text_len, llm_dim] embedded text prompts
text_feat_pad_mask (torch.Tensor) – [B, max_text_len] attention mask, positions with True are allowed to attend, False are not
timesteps (torch.Tensor) – [B,] current denoising step
motion_mask
observed_motion
- Returns:
predicted motion, same shape as the input x
- Return type:
torch.Tensor
Transformer backbone: padding, masking, and encoder stack for the denoiser.
- kimodo.model.backbone.pad_x_and_mask_to_fixed_size(x, mask, size)[source]#
Pad a feature tensor x and its mask along the time axis to a fixed size.
- Parameters:
x (torch.Tensor) – [B, T, D]
mask (torch.Tensor) – [B, T]
size (int)
- Returns:
padded x of shape [B, size, D] and padded mask of shape [B, size]
- Return type:
Tuple[torch.Tensor, torch.Tensor]
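What this padding contract implies can be sketched as follows, assuming right-padding with zeros for the features and False (not allowed to attend) for the mask; the real implementation may differ:

```python
import torch

# Minimal sketch of fixed-size padding under the documented contract:
# x [B, T, D] and mask [B, T] are right-padded up to `size` frames.
def pad_to_fixed_size(x: torch.Tensor, mask: torch.Tensor, size: int):
    B, T, D = x.shape
    pad = size - T
    x_out = torch.cat([x, x.new_zeros(B, pad, D)], dim=1)
    # Padded positions get False, i.e. they are masked out of attention.
    mask_out = torch.cat([mask, mask.new_zeros(B, pad, dtype=torch.bool)], dim=1)
    return x_out, mask_out

x = torch.randn(2, 7, 4)
mask = torch.ones(2, 7, dtype=torch.bool)
x_p, m_p = pad_to_fixed_size(x, mask, size=10)
print(x_p.shape, m_p.shape)
```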
- class kimodo.model.backbone.TransformerEncoderBlock(conf)[source]#
- forward(
- x,
- x_pad_mask,
- text_feat,
- text_feat_pad_mask,
- timesteps,
- first_heading_angle=None,
- Parameters:
x (torch.Tensor) – [B, T, dim_motion] current noisy motion
x_pad_mask (torch.Tensor) – [B, T] attention mask, positions with True are allowed to attend, False are not
text_feat (torch.Tensor) – [B, max_text_len, llm_dim] embedded text prompts
text_feat_pad_mask (torch.Tensor) – [B, max_text_len] attention mask, positions with True are allowed to attend, False are not
timesteps (torch.Tensor) – [B,] current denoising step
- Returns:
[B, T, output_dim]
- Return type:
torch.Tensor
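Note the mask convention here: True marks positions allowed to attend. PyTorch's built-in attention modules use the opposite convention for key padding masks (True means the position is ignored), so a backbone like this one presumably inverts the mask before handing it to nn.MultiheadAttention or nn.TransformerEncoder; the inversion itself is just:

```python
import torch

# Doc convention: True = "allowed to attend"; here 3 valid frames, 2 padding.
x_pad_mask = torch.tensor([[True, True, True, False, False]])

# PyTorch key_padding_mask convention: True = "ignore this position".
key_padding_mask = ~x_pad_mask

print(key_padding_mask)
```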
- class kimodo.model.backbone.PositionalEncoding(d_model, dropout=0.1, max_len=5000)[source]#
Non-learned positional encoding.
- forward(x)[source]#
Apply positional encoding to input sequence.
- Parameters:
x (torch.Tensor) – [B, T, D] input motion sequence
- Returns:
[B, T, D] input motion with PE added to it (and optionally dropout)
- Return type:
torch.Tensor
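A sketch of the standard non-learned sinusoidal encoding this class presumably implements (the exact table construction is an assumption):

```python
import math
import torch

# Standard Transformer sinusoidal table: sin on even dims, cos on odd dims.
def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    position = torch.arange(max_len).unsqueeze(1)                     # [max_len, 1]
    div_term = torch.exp(torch.arange(0, d_model, 2)
                         * (-math.log(10000.0) / d_model))            # [d_model/2]
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_pe(max_len=5000, d_model=64)
x = torch.randn(2, 16, 64)     # [B, T, D] motion sequence
x = x + pe[: x.size(1)]        # add PE over the time axis (dropout omitted)
print(x.shape)
```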
- class kimodo.model.backbone.TimestepEmbedder(latent_dim, sequence_pos_encoder)[source]#
Encoder for the diffusion step.
- __init__(latent_dim, sequence_pos_encoder)[source]#
- Parameters:
latent_dim (int) – dim to encode to
sequence_pos_encoder (PositionalEncoding) – the PE to use on timesteps
- forward(timesteps)[source]#
Embed timesteps by applying the positional encoding, then passing the result through linear layers.
- Parameters:
timesteps (torch.Tensor) – [B]
- Returns:
[B, 1, D]
- Return type:
torch.Tensor
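The documented behavior, indexing the positional-encoding table at the diffusion step and passing the result through linear layers, can be sketched as below; the two-layer MLP and its activation are assumptions, not the actual architecture:

```python
import torch
import torch.nn as nn

class TimestepEmbedderSketch(nn.Module):
    """Hypothetical sketch: PE lookup at the timestep, then an MLP."""

    def __init__(self, latent_dim: int, pe_table: torch.Tensor):
        super().__init__()
        self.register_buffer("pe", pe_table)      # [max_len, latent_dim]
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.SiLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, timesteps: torch.Tensor) -> torch.Tensor:
        emb = self.pe[timesteps]                  # [B, latent_dim]
        return self.mlp(emb).unsqueeze(1)         # [B, 1, latent_dim]

pe_table = torch.randn(1000, 128)
embedder = TimestepEmbedderSketch(128, pe_table)
out = embedder(torch.tensor([10, 500]))
print(out.shape)
```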
Classifier-Free Guidance#
Classifier-free guidance wrapper for the denoiser at sampling time.
- class kimodo.model.cfg.ClassifierFreeGuidedModel(model, cfg_type='separated')[source]#
Wrapper around the denoiser to apply classifier-free guidance at sampling time.
- __init__(model, cfg_type='separated')[source]#
Wrap the denoiser for classifier-free guidance; cfg_type in CFG_TYPES (e.g. ‘regular’, ‘nocfg’).
- forward(
- cfg_weight,
- x,
- x_pad_mask,
- text_feat,
- text_feat_pad_mask,
- timesteps,
- first_heading_angle=None,
- motion_mask=None,
- observed_motion=None,
- cfg_type=None,
- Parameters:
cfg_weight (float or tuple) – guidance weight; a single float, or a tuple of (text, constraint) weights when using separated CFG
x (torch.Tensor) – [B, T, dim_motion] current noisy motion
x_pad_mask (torch.Tensor) – [B, T] attention mask, positions with True are allowed to attend, False are not
text_feat (torch.Tensor) – [B, max_text_len, llm_dim] embedded text prompts
text_feat_pad_mask (torch.Tensor) – [B, max_text_len] attention mask, positions with True are allowed to attend, False are not
timesteps (torch.Tensor) – [B,] current denoising step
motion_mask
observed_motion
neutral_joints (torch.Tensor) – [B, nbjoints] The neutral joints of the motions
- Returns:
guided prediction, same shape as the input x
- Return type:
torch.Tensor
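The guidance combination itself follows the standard classifier-free formulation. A sketch of how the "separated" variant might weight the text and constraint branches independently follows; the exact formula used by this class is an assumption:

```python
import torch

# Standard CFG update, extended with two independent weights as the
# "separated" mode's (text, constraint) tuple suggests.
def cfg_combine(pred_uncond, pred_text, pred_constraint, w_text, w_constraint):
    return (pred_uncond
            + w_text * (pred_text - pred_uncond)
            + w_constraint * (pred_constraint - pred_uncond))

B, T, D = 2, 60, 135
pred_u = torch.zeros(B, T, D)   # unconditional branch
pred_t = torch.ones(B, T, D)    # text-conditioned branch
pred_c = torch.ones(B, T, D)    # constraint-conditioned branch
out = cfg_combine(pred_u, pred_t, pred_c, w_text=2.0, w_constraint=2.0)
print(out.mean().item())
```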
Model Loading#
Load Kimodo diffusion models from local checkpoints or Hugging Face.
- kimodo.model.load_model.load_model(
- modelname=None,
- device=None,
- eval_mode=True,
- default_family='Kimodo',
- return_resolved_name=False,
Load a kimodo model by name (e.g. ‘g1’, ‘soma’).
Partial and full names (e.g. Kimodo-SOMA-RP-v1, SOMA) are resolved inside this function using default_family when the name is not a known short key.
- Parameters:
modelname – Model identifier; uses DEFAULT_MODEL if None. Can be a short key, a full name (e.g. Kimodo-SOMA-RP-v1), or a partial name; unknown names are resolved via resolve_model_name using default_family.
device – Target device for the model (e.g. ‘cuda’, ‘cpu’).
eval_mode – If True, set model to eval mode.
default_family – Used when modelname is not in AVAILABLE_MODELS to resolve partial names (“Kimodo” for demo/generation, “TMR” for embed script). Default “Kimodo”.
return_resolved_name – If True, return (model, resolved_short_key). If False, return only the model.
- Returns:
Loaded model in eval mode, or (model, resolved short key) if return_resolved_name is True.
- Raises:
ValueError – If modelname is not in AVAILABLE_MODELS and cannot be resolved.
FileNotFoundError – If config.yaml is missing in the checkpoint folder.
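The resolution rules described above can be sketched with a hypothetical registry; the real resolve_model_name logic and the contents of AVAILABLE_MODELS differ:

```python
# Hypothetical registry; the actual AVAILABLE_MODELS mapping differs.
AVAILABLE_MODELS = {"soma": "Kimodo-SOMA-RP-v1", "g1": "Kimodo-G1-RP-v1"}

def resolve_model_name(modelname: str, default_family: str = "Kimodo") -> str:
    """Assumed resolution order: short key, then full name, then partial."""
    if modelname in AVAILABLE_MODELS:            # known short key
        return AVAILABLE_MODELS[modelname]
    if modelname.startswith(default_family):     # already a full name
        return modelname
    return f"{default_family}-{modelname}"       # partial: prepend the family

print(resolve_model_name("soma"))                # short key -> full name
print(resolve_model_name("Kimodo-SOMA-RP-v1"))   # full name passes through
print(resolve_model_name("SOMA"))                # partial name gets the family
```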
Text Encoder#
Remote text encoder API client (Gradio) for motion generation.