DyaPlex: Full-Duplex Speech-Motion Model for Dyadic Interaction

Koki Nagano; Hongyu Liu; Seonwook Park; Tianye Li; Amrita Mazumdar; Christian Jacobsen; Shengze Wang; Michael Stengel; Rajarshi Roy; Ka Chun Cheung; Simon See; Shalini De Mello

A streaming AI interaction model that listens, speaks, perceives partner motion, and generates synchronized motion in full-duplex dyadic interaction.

Koki Nagano¹†, Hongyu Liu¹,²*†, Seonwook Park¹, Tianye Li¹, Amrita Mazumdar¹, Christian Jacobsen¹, Shengze Wang¹, Michael Stengel¹, Rajarshi Roy¹, Ka Chun Cheung¹, Simon See¹, Shalini De Mello¹

¹NVIDIA ²HKUST

* Part of the work was done during an internship at NVIDIA. † Joint first authors.

Figure: DyaPlex overview, showing partner speech and motion as input; the agent listening, backchanneling, and responding with synchronized speech and motion; and application scenarios for human-agent/robot interaction and synthetic dyadic interaction generation. **DyaPlex** is a causal, full-duplex speech-motion model that simultaneously speaks and listens to a partner while perceiving the partner's motion and generating the agent's motion. It can be applied to dyadic interactions between a human user and an agent or robot (right), as well as to generating synthetic speech-motion dyadic interaction data.

Video

A narrated walkthrough of DyaPlex. Contains audio; headphones recommended.

Overview

DyaPlex targets natural human-AI interaction, including agents and robots, where speech and body motion are perceived and generated continuously, rather than handled as delayed turn-taking events.

Full-Duplex: Simultaneous speech-motion perception and generation

Streaming: Causal generation for responsive dyadic interaction

Dual-Tower: Frozen speech model coupled with a trainable motion pathway

Dyadic: Trained on 4,000 hours of the Seamless Interaction dataset

Abstract

We present DyaPlex, a streaming, full-duplex speech-and-motion model designed for dyadic interaction. To capture the continuous and reciprocal nature of human communication, this full-duplex capability empowers the agent to simultaneously perceive and generate both speech and physical motion in a streaming fashion. At its core, our method leverages the strong priors of a foundational full-duplex speech model and integrates a novel motion pathway, thereby achieving fully synchronized multi-modal interaction.

Specifically, we design a dual-tower Transformer architecture that preserves the zero-shot conversational reasoning of a frozen base speech model while constructing a deeply coupled, streaming motion pathway. By introducing a unified dyadic token interleaving mechanism and guiding cross-attention via a time-aligned speech-motion RoPE, our model effectively aligns autoregressive motions with rich latent speech features. Trained on the 4,000-hour Seamless Interaction dataset, DyaPlex captures cross-speaker dependencies and establishes new state-of-the-art performance across both monadic and dyadic human interaction benchmarks.

Method

Speech-Motion Full Duplex Model

DyaPlex uses a frozen full-duplex speech tower (PersonaPlex) to preserve strong conversational priors, while an RVQ-VAE motion tokenizer and a trainable causal motion tower model both participants in a unified autoregressive sequence.
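As a rough illustration of the two ideas in this paragraph, the sketch below shows generic residual vector quantization (the quantizer family behind an RVQ-VAE) and one possible way to lay out both participants' motion tokens in a single causal sequence. It is not the authors' implementation: the function names `rvq_encode` and `interleave_dyadic`, the toy codebooks, the token tuple layout, and the partner-before-agent ordering within each frame are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Codebook = Sequence[Vector]
# Hypothetical token layout: (frame index, speaker tag, RVQ level, code index).
Token = Tuple[int, str, int, int]


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Residual VQ: each level quantizes what the previous levels left over."""
    residual = list(x)
    indices: List[int] = []
    for codebook in codebooks:
        # Pick the codeword closest to the current residual (squared L2).
        best = min(
            range(len(codebook)),
            key=lambda j: sum((r - c) ** 2 for r, c in zip(residual, codebook[j])),
        )
        indices.append(best)
        # Subtract the chosen codeword; the next level refines the remainder.
        residual = [r - c for r, c in zip(residual, codebook[best])]
    return indices


def interleave_dyadic(
    partner_frames: Sequence[Vector],
    agent_frames: Sequence[Vector],
    codebooks: Sequence[Codebook],
) -> List[Token]:
    """Merge two time-aligned motion streams into one frame-ordered sequence.

    Assumed ordering: within each frame, partner tokens precede agent tokens,
    so an autoregressive model predicting agent tokens only sees the past and
    the partner's current frame.
    """
    if len(partner_frames) != len(agent_frames):
        raise ValueError("both streams must cover the same number of frames")
    sequence: List[Token] = []
    for t, (p, a) in enumerate(zip(partner_frames, agent_frames)):
        for speaker, frame in (("partner", p), ("agent", a)):
            for level, code in enumerate(rvq_encode(frame, codebooks)):
                sequence.append((t, speaker, level, code))
    return sequence
```

In a trained RVQ-VAE the codebooks are learned jointly with an encoder and decoder and the inputs are latent vectors rather than raw poses; the sketch only shows the quantization step and the sequence layout.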

A time-aligned speech-motion RoPE guides cross-attention so motion tokens attend to the corresponding speech features with explicit temporal structure, enabling streaming full-duplex speech-motion generation without future context.
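The page does not give the exact formulation, but one plausible reading of "time-aligned" is that rotary positions are derived from wall-clock timestamps rather than from sequence indices, so that speech and motion tokens running at different frame rates share one time axis. The sketch below applies standard RoPE rotations with a timestamp as the position. The helper names `frame_time`, `rope`, and `attention_score`, and the 25 Hz and 12.5 Hz frame rates used in the usage note, are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def frame_time(index: int, rate_hz: float) -> float:
    """Convert a frame index to a timestamp in seconds."""
    return index / rate_hz


def rope(vec: Sequence[float], t_seconds: float, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs of `vec` by angles proportional to the timestamp."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out: List[float] = []
    for i in range(0, d, 2):
        # Standard RoPE frequency for pair i/2: base^(-2*(i/2)/d) = base^(-i/d).
        theta = t_seconds * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def attention_score(q: Sequence[float], t_q: float, k: Sequence[float], t_k: float) -> float:
    """Unnormalized cross-attention logit between a rotated query and key."""
    return sum(a * b for a, b in zip(rope(q, t_q), rope(k, t_k)))
```

Under these assumptions, a motion query at frame 4 of a 25 Hz stream and a speech key at frame 2 of a 12.5 Hz stream both map to 0.16 s, so `attention_score(q, frame_time(4, 25.0), k, frame_time(2, 12.5))` reduces to the plain dot product of `q` and `k`. More generally, the score depends only on the time difference between the two tokens, which is the property that lets cross-attention align modalities by time rather than by token count.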

Results

Applications

A single streaming model can be used in multiple full-duplex interaction applications.

Full-Duplex Human-Robot Interaction

Our streaming architecture enables full-duplex human-robot interaction, where a robot responds to a partner with synchronized speech and gestures.

Full-Duplex Human-Agent Interaction

Given the partner's speech, DyaPlex generates the agent's speech and motion in a causal, streaming fashion.

Synthetic Speech-Motion Interaction Data Generation

DyaPlex also enables scalable synthetic interaction data generation for training interactive robots and virtual agents. Current video and motion generation methods rely on scripted text and produce only single-person actions, so they cannot capture the dynamic, reciprocal nature of real conversations. Our model generates fully synchronized multi-person speech and motion, filling a critical data gap for interactive AI training.

Comparisons

DyaPlex achieves state-of-the-art results on both monadic and dyadic human interaction metrics. We show qualitative comparisons and ablation studies on body motion using the Seamless Interaction dataset.

Qualitative Comparison

DyaPlex vs. Baselines

Comparisons between DyaPlex and baselines retrained on the Seamless Interaction body motion dataset.

Ablation Study

Effect of Partner Motion Perception

A key advantage of DyaPlex is perceiving the partner's motion. Without it, the model fails to produce reciprocal behaviors like gesture mirroring and socially appropriate visual backchanneling.

Citation

@article{nagano2026dyaplex,
  title={DyaPlex: Full-Duplex Speech-Motion Model for Dyadic Interaction},
  author={Nagano, Koki and Liu, Hongyu and Park, Seonwook and Li, Tianye and
          Mazumdar, Amrita and Jacobsen, Christian and Wang, Shengze and
          Stengel, Michael and Roy, Rajarshi and Cheung, Ka Chun and
          See, Simon and De Mello, Shalini},
  journal={arXiv preprint arXiv:2606.03874},
  year={2026}
}