NVIDIA Research
CALM: Conditional Adversarial Latent Models for Directable Virtual Characters


2 Technion - Israel Institute of Technology
3 Bar-Ilan University
4 Simon Fraser University

Our framework enables users to direct the behavior of a physically simulated character using demonstrations encoded in the form of low-dimensional latent embeddings of motion capture data. In this example, the character is instructed to crouch-walk towards a target, kick when within range, and finally raise its arms and celebrate.


In this work, we present Conditional Adversarial Latent Models (CALM), an approach for generating diverse and directable behaviors for user-controlled interactive virtual characters. Using imitation learning, CALM learns a representation of movement that captures the complexity and diversity of human motion, and enables direct control over character movements. The approach jointly learns a control policy and a motion encoder that reconstructs key characteristics of a given motion without merely replicating it. The results show that CALM learns a semantic motion representation, enabling control over the generated motions and style-conditioning for higher-level task training. Once trained, the character can be controlled using intuitive interfaces, akin to those found in video games.


To solve tasks zero-shot, CALM proceeds in three phases. (1) A motion encoder and a low-level policy (decoder) are jointly trained to map a motion capture sequence to actions that control the simulated character. (2) A high-level policy is trained using latent-space conditioning to control the direction in which a motion is performed while retaining the requested style. (3) The models from phases 1 and 2 are combined using a simple finite-state machine to solve tasks without further training and without meticulous reward or termination design.

Phase 1: Low-level Training

During low-level training, CALM jointly learns an encoder and a decoder. The encoder takes a motion from a reference dataset, represented as a time series of joint locations, and maps it to a low-dimensional latent representation. The decoder is a low-level policy that interacts with the simulator and generates motions similar to those in the reference dataset. This policy produces a variety of behaviors on demand, but is not conditioned on the direction of the motion. For example, it can be instructed to walk, but it does not provide intuitive control over the walking direction.
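The data flow described above can be sketched in a few lines of Python. This is only an illustration of the interfaces, not the trained models: the dimensions, the random linear weights, the mean-pooling over frames, and the unit-norm projection of the latent are all assumptions made for the example, and the adversarial imitation training itself is not shown.

```python
import math
import random

# All sizes below are hypothetical and chosen only for illustration.
LATENT_DIM = 4   # size of the latent motion embedding
FRAME_DIM = 6    # joint coordinates per motion frame
STATE_DIM = 5    # simulator state observed by the policy
ACTION_DIM = 3   # actuated degrees of freedom


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights: a rows x cols matrix of random values."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def encode_motion(motion, weights):
    """Encoder sketch: map a motion clip (list of frames) to a latent vector.

    Frames are mean-pooled, projected linearly, and scaled to unit norm.
    The unit-norm projection is an assumption of this sketch.
    """
    n_frames = len(motion)
    pooled = [sum(frame[i] for frame in motion) / n_frames
              for i in range(len(motion[0]))]
    z = [sum(w * x for w, x in zip(row, pooled)) for row in weights]
    norm = math.sqrt(sum(v * v for v in z)) or 1.0
    return [v / norm for v in z]


def low_level_policy(state, latent, weights):
    """Decoder sketch: map (simulator state, latent) to a bounded action."""
    features = list(state) + list(latent)
    return [math.tanh(sum(w * x for w, x in zip(row, features)))
            for row in weights]


rng = random.Random(0)
encoder_weights = random_matrix(LATENT_DIM, FRAME_DIM, rng)
policy_weights = random_matrix(ACTION_DIM, STATE_DIM + LATENT_DIM, rng)

# A fake 10-frame motion clip standing in for motion capture data.
motion_clip = [[rng.uniform(-1.0, 1.0) for _ in range(FRAME_DIM)]
               for _ in range(10)]
latent = encode_motion(motion_clip, encoder_weights)
action = low_level_policy([0.0] * STATE_DIM, latent, policy_weights)
```

Note that the policy receives only the state and the latent, so nothing in its input specifies a heading, which is why a separate directional controller is needed later.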

Meaningful Motion Representations

To evaluate the learned motion representation, we test the ability to interpolate between motions in the latent space. Here, the initial latent is the representation of sprinting and the final latent is that of crouching idle. Throughout the episode, the latent is linearly interpolated over time, going from sprint towards crouch-idle. The character moves smoothly through semantically meaningful intermediate motions, gradually reducing speed and tilting the upper body.
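A minimal sketch of such an interpolation schedule follows. The latent values are made up, and re-normalizing the blended vector to unit length is an assumption of this example rather than something stated above; the text only specifies that the latent is interpolated linearly over the episode.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def interpolate_latent(z_start, z_end, t):
    """Linearly blend two latents at t in [0, 1].

    The result is re-projected to unit length (an assumption of this sketch).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    blended = [(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)]
    return normalize(blended)


# Hypothetical 3-dimensional latents standing in for encoder outputs.
z_sprint = normalize([1.0, 0.2, 0.0])
z_crouch_idle = normalize([0.0, 0.3, 1.0])

# One latent per control step, moving from sprint to crouch-idle.
episode_steps = 5
schedule = [
    interpolate_latent(z_sprint, z_crouch_idle, step / (episode_steps - 1))
    for step in range(episode_steps)
]
```

Each entry of `schedule` would be passed to the low-level policy at the corresponding step of the episode.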

Phase 2: Directionality Control

To control motion direction, we train a high-level task-driven policy to select latent variables. These latents are provided to the low-level policy, which generates the requested motion. Here, the learned motion representation enables a form of style-conditioning. To achieve this, the motion encoder is used to obtain the latent representation of the requested motion. The high-level policy is then provided an additional reward proportional to the cosine distance between the selected latents and the latent representing the requested style, thus guiding the high-level policy to adopt a desired behavioral style. For example, here a directionality controller is trained to enable control over both the form of locomotion and the direction in which the character performs it: crouch-walk, walk shield-up, and run.
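One way to picture the style term is the sketch below. The text describes the reward in terms of the cosine distance between latents without giving a formula, so the exact form here is an assumption: the example rewards alignment, mapping cosine similarity from [-1, 1] to [0, 1], and adds it to a task reward with a hypothetical weight `style_weight`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def style_reward(selected_latent, style_latent):
    """Assumed form: 1 when the latents align, 0 when they are opposite."""
    return 0.5 * (1.0 + cosine_similarity(selected_latent, style_latent))


def high_level_reward(task_reward, selected_latent, style_latent,
                      style_weight=0.5):
    """Task reward plus a weighted style term (the weight is hypothetical)."""
    return task_reward + style_weight * style_reward(selected_latent,
                                                     style_latent)


# The style latent would come from encoding the requested motion clip.
z_style = [1.0, 0.0]
aligned = high_level_reward(1.0, [2.0, 0.0], z_style)
opposed = high_level_reward(1.0, [-1.0, 0.0], z_style)
```

With this form, a high-level policy that reaches the target equally well with two different latents is paid more for the one closer to the requested style.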

Phase 3: Inference

Finally, the previously trained models (low-level policy and directional controller) are combined to compose complex movements without additional training. To do so, the user produces a finite-state machine (FSM) containing standard rules and commands. These determine which motion to perform, similar to how a user controls a video game character. For example, they determine whether the character should perform a simple motion, performed directly using the low-level policy, or a directed motion requiring high-level control. As an example, one may construct an FSM like (a) "crouch-walk towards the target, until distance < 1m", then (b) "kick", and finally (c) "celebrate".
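The example FSM above can be written down as a small table of states and transition rules. The sketch below is illustrative only: the observation keys `distance_to_target` and `motion_finished`, the state names, and the controller labels are hypothetical, and the actual policies are replaced by strings naming which one would be invoked.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FsmState:
    motion: str                     # requested motion style
    directed: bool                  # True if the motion needs high-level control
    done: Callable[[dict], bool]    # rule that triggers the transition
    next_state: Optional[str]       # None marks the final state


# (a) crouch-walk towards the target until distance < 1 m, (b) kick, (c) celebrate.
FSM = {
    "approach": FsmState("crouch_walk", True,
                         lambda obs: obs["distance_to_target"] < 1.0, "strike"),
    "strike": FsmState("kick", False,
                       lambda obs: obs["motion_finished"], "celebrate"),
    "celebrate": FsmState("celebrate", False, lambda obs: False, None),
}


def step_fsm(current, obs):
    """Return the next FSM state name given the current observation."""
    state = FSM[current]
    if state.next_state is not None and state.done(obs):
        return state.next_state
    return current


def select_controller(current):
    """Directed motions use the high-level policy; simple ones go direct."""
    return "high_level_policy" if FSM[current].directed else "low_level_policy"


def run(observations, start="approach"):
    """Advance the FSM over a sequence of observations and record each step."""
    current = start
    trace = []
    for obs in observations:
        current = step_fsm(current, obs)
        trace.append((current, FSM[current].motion, select_controller(current)))
    return trace


observations = [
    {"distance_to_target": 3.0, "motion_finished": False},
    {"distance_to_target": 0.8, "motion_finished": False},
    {"distance_to_target": 0.8, "motion_finished": False},
    {"distance_to_target": 0.8, "motion_finished": True},
    {"distance_to_target": 0.8, "motion_finished": True},
]
trace = run(observations)
```

Because the rules only pick which latent and which controller to use, changing the behavior means editing the table rather than retraining a model.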


                author = {
                        Tessler, Chen
                        and Kasten, Yoni
                        and Guo, Yunrong
                        and Mannor, Shie
                        and Chechik, Gal
                        and Peng, Xue Bin},
                title = {CALM: Conditional Adversarial Latent Models for Directable Virtual Characters},
                year = {2023},
                isbn = {9798400701597},
                publisher = {Association for Computing Machinery},
                address = {New York, NY, USA},
                url = {https://doi.org/10.1145/3588432.3591541},
                doi = {10.1145/3588432.3591541},
                booktitle = {ACM SIGGRAPH 2023 Conference Proceedings},
                keywords = {
                    reinforcement learning,
                    animated character control,
                    adversarial training,
                    motion capture data},
                location = {Los Angeles, CA, USA},
                series = {SIGGRAPH '23}



Chen Tessler, Yoni Kasten, Yunrong Guo, Shie Mannor, Gal Chechik, and Xue Bin Peng
