Video
A narrated walkthrough of DyaPlex. Contains audio; headphones recommended.
Overview
DyaPlex targets natural human-AI interaction, including agents and robots, where speech and body motion are perceived and generated continuously, rather than handled as delayed turn-taking events.
Abstract
We present DyaPlex, a streaming, full-duplex speech-and-motion model designed for dyadic interaction. To capture the continuous and reciprocal nature of human communication, this full-duplex capability empowers the agent to simultaneously perceive and generate both speech and physical motion in a streaming fashion. At its core, our method leverages the strong priors of a foundational full-duplex speech model and integrates a novel motion pathway, thereby achieving fully synchronized multi-modal interaction.
Specifically, we design a dual-tower Transformer architecture that preserves the zero-shot conversational reasoning of a frozen base speech model while constructing a deeply coupled, streaming motion pathway. By introducing a unified dyadic token interleaving mechanism and guiding cross-attention via a time-aligned speech-motion RoPE, our model effectively aligns autoregressive motions with rich latent speech features. Trained on the 4,000-hour Seamless Interaction dataset, DyaPlex captures cross-speaker dependencies and establishes new state-of-the-art performance across both monadic and dyadic human interaction benchmarks.
Method
Speech-Motion Full Duplex Model
DyaPlex uses a frozen full-duplex speech tower (PersonaPlex) to preserve strong conversational priors, while an RVQ-VAE motion tokenizer and a trainable causal motion tower model both participants in a unified autoregressive sequence.
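As a rough illustration of what modeling both participants in one autoregressive sequence can look like, the sketch below merges two frame-aligned motion token streams into a single ordered list. The function name `interleave_dyadic`, the token tuple layout, and the partner-before-agent ordering within a frame are all assumptions made for illustration; the text above does not specify the actual interleaving scheme.

```python
from typing import List, Tuple

# (speaker, frame index, motion token id); a hypothetical layout for illustration.
Token = Tuple[str, int, int]


def interleave_dyadic(agent: List[int], partner: List[int]) -> List[Token]:
    """Merge two frame-aligned token streams into one autoregressive sequence."""
    if len(agent) != len(partner):
        raise ValueError("streams must be frame-aligned (equal length)")
    sequence: List[Token] = []
    for frame, (agent_tok, partner_tok) in enumerate(zip(agent, partner)):
        # Assumed ordering: the observed partner token comes first, so the
        # agent token of the same frame can condition on it causally.
        sequence.append(("partner", frame, partner_tok))
        sequence.append(("agent", frame, agent_tok))
    return sequence


seq = interleave_dyadic(agent=[11, 12, 13], partner=[21, 22, 23])
for token in seq:
    print(token)
```

With this layout, a causal attention mask over the merged sequence lets each agent token see every earlier token from both speakers, which is one simple way to expose cross-speaker dependencies to a single decoder.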
A time-aligned speech-motion RoPE guides cross-attention so motion tokens attend to the corresponding speech features with explicit temporal structure, enabling streaming full-duplex speech-motion generation without future context.
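The idea of time-aligned rotary positions can be sketched with standard RoPE in which the position is a shared timestamp rather than a per-stream token index. The token rates (`SPEECH_HZ`, `MOTION_HZ`), the helper names, and the choice of seconds as the shared unit are assumptions for illustration and are not taken from the text above. The sketch shows only the generic RoPE property that the query-key score depends on the time offset between a motion token and a speech token.

```python
import math
from typing import List

SPEECH_HZ = 12.5  # assumed speech token rate, not stated in the text
MOTION_HZ = 25.0  # assumed motion token rate, not stated in the text


def timestamp(index: int, rate_hz: float) -> float:
    """Map a token index to seconds on a clock shared by both modalities."""
    return index / rate_hz


def rope_rotate(vec: List[float], position: float, base: float = 10000.0) -> List[float]:
    """Apply rotary position embedding at a (possibly fractional) position."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out: List[float] = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def score(query: List[float], q_time: float, key: List[float], k_time: float) -> float:
    """Attention logit between a rotated motion query and a rotated speech key."""
    return dot(rope_rotate(query, q_time), rope_rotate(key, k_time))


q = [0.3, -1.2, 0.8, 0.5]   # toy motion query
k = [1.0, 0.4, -0.7, 0.2]   # toy speech key

# Motion token 4 and speech token 2 both fall at 0.16 s: zero time offset.
aligned = score(q, timestamp(4, MOTION_HZ), k, timestamp(2, SPEECH_HZ))
# The same 0.08 s offset at two different points in the conversation.
early = score(q, timestamp(4, MOTION_HZ), k, timestamp(1, SPEECH_HZ))
late = score(q, timestamp(54, MOTION_HZ), k, timestamp(26, SPEECH_HZ))
print(aligned, early, late)
```

Because both streams are placed on one clock, tokens that occur at the same moment get a zero relative rotation even though their indices differ, and equal time offsets yield equal scores regardless of where they occur in the stream.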
Results
Applications
A single streaming model can be used in multiple full-duplex interaction applications.
Full-Duplex Human-Robot Interaction
Our streaming architecture enables full-duplex human-robot interaction, where a robot responds to a partner with synchronized speech and gestures.
Full-Duplex Human-Agent Interaction
Given the partner's speech, DyaPlex generates the agent's speech and motion in a causal, streaming fashion.
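A minimal sketch of what "causal, streaming" means in practice: the output emitted at each step may depend only on the partner input received so far. Everything here is hypothetical, including `stream_agent`, the `Frame` layout, and `toy_step`, which is a deterministic stand-in and not the DyaPlex model.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# (agent speech token, agent motion token); a hypothetical output layout.
Frame = Tuple[int, int]


def stream_agent(
    partner_speech: Iterable[int],
    step: Callable[[Tuple[int, ...]], Frame],
) -> Iterator[Frame]:
    """Emit one agent frame per incoming partner token, using only the past."""
    history: List[int] = []
    for token in partner_speech:
        history.append(token)
        # The step function sees an immutable snapshot of the past only.
        yield step(tuple(history))


def toy_step(history: Tuple[int, ...]) -> Frame:
    """Stand-in for a model: any deterministic function of the history."""
    speech = sum(history) % 100
    motion = (history[-1] * 7 + len(history)) % 100
    return speech, motion


partner = [5, 9, 2, 14]
full = list(stream_agent(partner, toy_step))
prefix = list(stream_agent(partner[:2], toy_step))
print(full[:2] == prefix)
```

The final comparison checks the causality property: the first two outputs are identical whether or not the later partner tokens ever arrive, so frames can be emitted as soon as their inputs are available.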
Synthetic Speech-Motion Interaction Data Generation
DyaPlex also unlocks scalable synthetic interaction data generation for training interactive robots and virtual agents. Current video and motion generation methods rely on scripted text and produce only single-person actions — they cannot capture the dynamic, reciprocal nature of real conversations. Our model generates fully synchronized multi-person speech and motion, filling a critical data gap for interactive AI training.
Comparisons
DyaPlex achieves state-of-the-art results on both monadic and dyadic human interaction metrics. We show qualitative comparisons and ablation studies on body motion using the Seamless Interaction dataset.
Qualitative Comparison
DyaPlex vs. Baselines
Comparisons between DyaPlex and baselines retrained on the Seamless Interaction body motion dataset.
Ablation Study
Effect of Partner Motion Perception
A key advantage of DyaPlex is perceiving the partner's motion. Without it, the model fails to produce reciprocal behaviors like gesture mirroring and socially appropriate visual backchanneling.
Citation
@article{nagano2026dyaplex,
  title={DyaPlex: Full-Duplex Speech-Motion Model for Dyadic Interaction},
  author={Nagano, Koki and Liu, Hongyu and Park, Seonwook and Li, Tianye and
          Mazumdar, Amrita and Jacobsen, Christian and Wang, Shengze and
          Stengel, Michael and Roy, Rajarshi and Cheung, Ka Chun and
          See, Simon and De Mello, Shalini},
  journal={arXiv preprint arXiv:2606.03874},
  year={2026}
}