NVIDIA PersonaPlex: Natural Conversational AI With Any Role and Voice


Authors

Rajarshi Roy, Jonathan Raiman, Sang-gil Lee, Teodor-Dumitru Ene, Robert Kirby, Sungwon Kim, Jaehyeon Kim, Bryan Catanzaro.

PersonaPlex and Rajarshi Roy sharing jokes.

Conversational AI has forced an impossible choice. Traditional systems (ASR→LLM→TTS cascades) let you customize the voice and role, but conversations feel robotic with awkward pauses, no interruptions, and unnatural turn-taking. Full-duplex models like Moshi finally made AI conversations feel natural with real-time listening and speaking, but locked you into a single fixed voice and role. NVIDIA PersonaPlex breaks this trade-off. Select from a diverse range of voices and define any role through text prompts. Need a wise assistant, a customer service agent, a fantasy character, or just someone to talk to? PersonaPlex delivers truly natural conversations while maintaining your chosen persona throughout. It handles interruptions, backchannels, and authentic conversational rhythm. For the first time, you get both the customization you need and the naturalness that makes conversations feel genuinely human.

Capabilities

Full Duplex

PersonaPlex is a full-duplex model: it listens and speaks at the same time. This capability, first introduced with Moshi, lets PersonaPlex learn not only the contents of its speech but also the behavior associated with speech, such as when to pause, interrupt, or backchannel (“uh-huh”, “oh”, etc.). We achieve low-latency interaction by eliminating the delays associated with cascaded systems that use separate models for listening (Automatic Speech Recognition), language production (Language Model), and speaking (Text to Speech). Our approach uses a single model that updates its internal state as the user speaks and streams a response back immediately.
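To make the full-duplex loop concrete, here is a minimal sketch of what a single-model streaming interaction might look like. The `model.step` interface, frame size, and queue-based audio I/O are assumptions for illustration, not PersonaPlex's actual API.

```python
FRAME_SAMPLES = 1920  # hypothetical 80 ms frame at a 24 kHz sample rate

def full_duplex_loop(model, mic_stream, speaker_stream):
    """Single-model full-duplex loop (illustrative): every incoming user frame
    updates the model state and immediately yields an outgoing agent frame,
    so listening and speaking overlap with no explicit turn boundaries."""
    state = model.init_state()        # assumed: holds streaming transformer caches
    for user_frame in mic_stream:     # iterable of FRAME_SAMPLES float samples
        # One step consumes user audio and emits agent audio for the same time
        # slice; the agent frame may be silence, a backchannel, or speech.
        agent_frame, state = model.step(user_frame, state)
        speaker_stream.write(agent_frame)
```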

Enriching PersonaPlex’s output with non-verbal aspects creates an important qualitative difference relative to systems without this dimension: PersonaPlex now recreates some of the same cues humans use to read intent, emotions, or comprehension.

Examples

The following examples showcase PersonaPlex’s behavior across different scenarios. In all audio files, you can hear the user speaking in the left channel and PersonaPlex in the right channel (shown in green).

  1. Assistant

    Prompt: You are a wise and friendly teacher. Answer questions or provide advice in a clear and engaging way.

    In this example from FullDuplexBench's interruption evaluation, PersonaPlex demonstrates general knowledge, interruptibility, and natural turn-taking.

  2. Customer Service - Banking

    Prompt: You work for First Neuron Bank which is a bank and your name is Sanni Virtanen. Information: The customer's transaction for $1,200 at Home Depot was declined. Verify customer identity. The transaction was flagged due to an unusual location (transaction attempted in Miami, FL; customer normally transacts in Seattle, WA).

    PersonaPlex demonstrates instruction following from the text prompt, empathy, listening while talking, and accent control through voice prompting.

  3. Customer Service - Medical Office Reception

    Prompt: You work for Dr. Jones's medical office, and you are receiving calls to record information for new patients. Information: Record full name, date of birth, any medication allergies, tobacco smoking history, alcohol consumption history, and any prior medical conditions. Assure the patient that this information will be confidential, if they ask.

    PersonaPlex demonstrates instruction following from the text prompt and accurate registration of important details from the user's speech.

  4. Natural Backchanneling

    Prompt: You enjoy having a good conversation.

    In this example from FullDuplexBench's backchanneling evaluation, PersonaPlex produces a variety of conversational backchannels like "oh okay", "okay", "yeah", "yeah, I think they do" that signal active listening without interrupting the speaker's flow. The backchannels are contextual in content and tone.

  5. Space Emergency Scenario

    Prompt: You enjoy having a good conversation. Have a technical discussion about fixing a reactor core on a spaceship to Mars. You are an astronaut on a Mars mission. Your name is Alex. You are already dealing with a reactor core meltdown on a Mars mission. Several ship systems are failing, and continued instability will lead to catastrophic failure. You explain what is happening and you urgently ask for help thinking through how to stabilize the reactor.

    PersonaPlex demonstrates strong generalization to text prompts well outside its training distribution of assistant, customer service, and open-ended casual conversation roles. It maintains a persona coherent with the text prompt throughout the extended interaction while exhibiting appropriate tones of stress and urgency befitting the emergency scenario.

Architecture

PersonaPlex uses two inputs to define conversational behavior:

  • Voice prompt: An audio embedding that captures vocal characteristics, speaking style, and prosody.
  • Text prompt: Natural language describing the role, background information, and conversation context.

These inputs are processed jointly to create a coherent persona, as sketched below.

Hybrid Prompting Architecture.
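Below is a minimal sketch of how such a hybrid system prompt might be assembled, assuming hypothetical helpers for encoding a reference voice clip and tokenizing the text prompt; the actual prefix layout in PersonaPlex may differ.

```python
import torch

def build_hybrid_prompt(voice_sample_wav, text_prompt, mimi_encoder, tokenizer):
    """Assemble the persona-conditioning prefix fed to the model before any
    user audio arrives. Helper names and the concatenation order are
    illustrative assumptions, not the released implementation."""
    # Voice prompt: encode a short reference clip into audio tokens that carry
    # vocal timbre, speaking style, and prosody.
    voice_tokens = mimi_encoder.encode(voice_sample_wav)            # assumed API
    # Text prompt: natural-language role description wrapped in system markers.
    text_tokens = tokenizer.encode(f"<system> {text_prompt} <system>")
    # The two streams are presented jointly so the model binds the role
    # description to the target voice when generating its own speech.
    return {"audio_prefix": voice_tokens,
            "text_prefix": torch.tensor(text_tokens)}
```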

PersonaPlex is built on the Moshi architecture from Kyutai, with 7 billion parameters:

  • Mimi speech encoder (ConvNet + Transformer) converts audio to tokens
  • Temporal and depth transformers process the conversation
  • Mimi speech decoder (Transformer + ConvNet) generates output speech

Audio operates at a 24 kHz sample rate. The dual-stream configuration allows listening and speaking to occur concurrently, enabling natural conversational dynamics. The underlying language model is Helium, which provides semantic understanding and enables generalization to out-of-distribution scenarios.
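The following sketch illustrates how a temporal/depth transformer factorization of this kind can generate one frame of output; module interfaces, method names, and shapes are placeholders, not the released model.

```python
import torch

def generate_frame(temporal_tf, depth_tf, mimi, user_frame_tokens, history):
    """One decoding step in a Moshi-style dual-stream stack (illustrative)."""
    # Temporal transformer: runs once per audio frame over the combined
    # user/agent streams and summarizes everything heard and said so far.
    context = temporal_tf(history, user_frame_tokens)             # assumed API
    # Depth transformer: autoregressively emits the agent's text token and then
    # the per-codebook audio tokens for this frame, conditioned on the context.
    agent_tokens = [depth_tf.sample_text(context)]
    for codebook in range(mimi.num_codebooks):                    # assumed attribute
        agent_tokens.append(depth_tf.sample_audio(context, agent_tokens, codebook))
    # The Mimi decoder converts this frame's audio tokens back into waveform
    # samples at 24 kHz, which are streamed to the speaker.
    return mimi.decode(torch.tensor(agent_tokens[1:]))
```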

Training data

A challenge faced during the design of PersonaPlex is the lack of conversational speech data that covers a broad range of topics and emotions and contains a wide range of non-verbal behavior such as interruptions, backchannels, and pauses. Supervising PersonaPlex's full-duplex behavior poses another difficulty: training data must contain multiple speakers talking, and each speaker's audio must be separated from the rest.

To address this, we found that a limited set of unscripted human conversations from the Fisher English corpus can be transformed into persona-supervised data by using an LLM to retrospectively generate contextual and personality descriptors for each speaker. To further expand coverage across scenarios and topics, we also use language models to generate dialogues and personality prompts, which are then synthesized into audio using Chatterbox TTS. PersonaPlex trains on a blend of these conversations in a single stage.

Real conversations

In order to learn natural backchanneling, expressions, and emotional responses, PersonaPlex trains on 7,303 real conversations (1,217 hours) from the Fisher English corpus. The conversations are back-annotated with prompts using GPT-OSS-120B. The prompts have varying levels of detail in order to balance generalization against instruction-following ability, as showcased in the examples below:

  • “You enjoy having a good conversation.”
  • “You enjoy having a good conversation. Have a casual discussion about eating at home versus dining out.”
  • “You enjoy having a good conversation. Have a reflective conversation about career changes and feeling of home. You have lived in California for 21 years and consider San Francisco your home. You work as a teacher and have traveled a lot. You dislike meetings.”
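A minimal sketch of how such multi-level prompts could be back-annotated from a transcript, assuming a generic chat-completion client; the exact prompt wording used with GPT-OSS-120B is not shown in this post and is an assumption here.

```python
DETAIL_LEVELS = {
    "minimal": "one generic sentence about enjoying a good conversation",
    "topic":   "the generic sentence plus the main topic of the call",
    "persona": "the above plus a few personal facts the speaker reveals",
}

def back_annotate(llm, transcript, speaker_id):
    """Retrospectively write persona prompts for one speaker of an unscripted
    conversation, at three levels of detail (illustrative sketch)."""
    prompts = {}
    for level, spec in DETAIL_LEVELS.items():
        request = (
            "Read this two-speaker transcript and write a second-person persona "
            f"prompt for speaker {speaker_id} containing {spec}. Only state "
            f"facts supported by the transcript.\n\n{transcript}"
        )
        prompts[level] = llm.complete(request)   # assumed chat-completion client
    return prompts
```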

Synthetic conversations for assistant and customer service roles

PersonaPlex was trained on 39,322 synthetic assistant role conversations (410 hours) and 105,410 synthetic customer service conversations (1,840 hours). The conversation transcripts were generated using Qwen3-32B and GPT-OSS-120B. The conversation speech was generated using Chatterbox TTS.

For question-answering assistant scenarios, we vary the user and agent voices as well as the conversation content, while using a fixed text prompt for all assistant interactions:

You are a wise and friendly teacher. Answer questions or provide advice in a clear and engaging way.

For customer service scenarios, along with varying voices and content, we provide text prompts that contain all the information relevant to the agent’s role, such as the organization name, role type, agent name, and additional context (pricing, hours, rules, etc.):

  • “You work for CitySan Services which is a waste management and your name is Ayelen Lucero. Information: Verify customer name Omar Torres. Current schedule: every other week. Upcoming pickup: April 12th. Compost bin service available for $8/month add-on.”
  • “You work for Jerusalem Shakshuka which is a restaurant and your name is Owen Foster. Information: There are two shakshuka options: Classic (poached eggs, $9.50) and Spicy (scrambled eggs with jalapenos, $10.25). Sides include warm pita ($2.50) and Israeli salad ($3). No combo offers. Available for drive-through until 9 PM.”
  • “You work for AeroRentals Pro which is a drone rental company and your name is Tomaz Novak. Information: AeroRentals Pro has the following availability: PhoenixDrone X ($65/4 hours, $110/8 hours), and the premium SpectraDrone 9 ($95/4 hours, $160/8 hours). Deposit required: $150 for standard models, $300 for premium.”
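A rough sketch of the synthetic data recipe described above: an LLM drafts a role-conditioned transcript, and a TTS system renders each turn in a sampled voice. The function names, the TTS wrapper, and the turn format are assumptions for illustration, not the exact PersonaPlex pipeline.

```python
def make_synthetic_conversation(llm, tts, service_prompt, user_voice, agent_voice):
    """Generate one customer-service training example (illustrative sketch)."""
    transcript = llm.complete(
        "Write a phone conversation between a customer and the following agent. "
        f"Agent instructions: {service_prompt}\n"
        "Prefix each line with 'USER:' or 'AGENT:'."
    )
    turns = []
    for line in transcript.splitlines():
        if line.startswith("USER:"):
            turns.append(("user", tts.synthesize(line[5:], voice=user_voice)))
        elif line.startswith("AGENT:"):
            turns.append(("agent", tts.synthesize(line[6:], voice=agent_voice)))
    return {"text_prompt": service_prompt, "turns": turns}
```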

The synthetic data enables task-following behavior, while the real conversations from the Fisher English corpus provide varied natural interaction patterns that current TTS systems cannot simulate reliably. We use the same text and voice prompt format across real and synthetic data sources so that the model can disentangle the strengths of each source and combine them.

Key findings

We make several observations from our PersonaPlex training experiments:

  1. Efficient specialization from pretrained foundations – Starting from Moshi’s pretrained weights, under 5,000 hours of directed data enables task-following. The pretrained Moshi model already demonstrated broad conversational competence, and fine-tuning appears to retain those skills while adding the ability to guide the behavior using a prompt.

  2. Disentangled speech naturalness and task adherence – Synthetic training data covered a wide range of personalities and contexts in its text prompts and dialogues, but the synthesized audio did not showcase the behavioral richness and realism of real recordings. The Fisher conversations have limited domain diversity in their text prompts and undirected dialogues; however, the recordings contain a wide range of speech patterns. The final model exhibits the speech patterns from Fisher along with the task adherence from the synthetic data. Blending these data sources lets us use the shared hybrid prompt and voice conditioning as a bridge between task knowledge and natural interaction patterns (a minimal blending sketch follows this list).

  3. Emergent generalization beyond training domains – In the experiments documented in the paper, we test the model’s ability to handle new situations and contexts and find that it can retrieve and use the information in its context to respond to new scenarios. While PersonaPlex’s fine-tuning data only covers service and assistant scenarios, examples such as the astronaut scenario above demonstrate generalization beyond those settings: the model handles technical crisis-management vocabulary, appropriate emotional urgency, and domain-specific reasoning about reactor physics, none of which appeared in the training data. We suspect this generalization is inherited from the broad corpus used to pretrain Moshi’s language model, Helium.
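As referenced in finding 2, a data blend of this kind is often implemented as weighted sampling over sources; the weights and source names below are placeholders, not PersonaPlex's actual mixture.

```python
import random

# Hypothetical blend weights over the data sources described above.
BLEND = {
    "fisher_real":         0.4,  # natural interaction patterns
    "synthetic_assistant": 0.2,  # question-answering task adherence
    "synthetic_service":   0.4,  # customer-service task adherence
}

def sample_training_example(datasets, rng=random):
    """Draw the next training conversation from a weighted mix of sources.
    All sources share the same hybrid text+voice prompt format, which is what
    lets the model combine task adherence with natural speech patterns."""
    (source,) = rng.choices(list(BLEND), weights=list(BLEND.values()), k=1)
    return datasets[source].sample()     # assumed per-source sampler
```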

Evaluation

As measured on conversational AI benchmarks and our customer service benchmark, PersonaPlex outperforms other open-source and commercial systems on conversational dynamics, response and interruption latency, and task adherence in both question-answering assistant and customer service roles.

To quantify how PersonaPlex compares with other conversational AI agents, we evaluate PersonaPlex using FullDuplexBench, a well-established benchmark for conversational AI. FullDuplexBench evaluates conversational dynamics metrics for turn-taking, user interruption, and pause handling. It also evaluates the quality of agent responses using GPT-4o as a judge. Since FullDuplexBench evaluates the content of agent responses only on questions intended for a generic question-answering assistant role, we extend FullDuplexBench to cover various customer service roles and scenarios. This extension, which we call ServiceDuplexBench, allows us to evaluate task adherence across real-world scenarios.
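As a simplified illustration of one conversational-dynamics metric, the sketch below estimates response latency from a stereo recording with the user on the left channel and the agent on the right, as in the examples above. This is not FullDuplexBench's implementation; the thresholds and energy-based voice activity detection are assumptions.

```python
import numpy as np

def response_latency(stereo_wav, sr=24000, frame_ms=20, thresh=1e-3):
    """Estimate the gap (in seconds) between the user finishing a turn (left
    channel) and the agent starting to speak (right channel). Energy-threshold
    VAD is a crude stand-in for a real speech detector."""
    frame = int(sr * frame_ms / 1000)
    user, agent = stereo_wav[:, 0], stereo_wav[:, 1]

    def speech_frames(x):
        # Per-frame RMS energy above a threshold counts as speech activity.
        frames = x[: len(x) // frame * frame].reshape(-1, frame)
        return np.sqrt((frames ** 2).mean(axis=1)) > thresh

    u, a = speech_frames(user), speech_frames(agent)
    user_end = np.flatnonzero(u).max()           # last frame with user speech
    agent_onsets = np.flatnonzero(a[user_end:])  # agent speech after that point
    return agent_onsets.min() * frame_ms / 1000 if agent_onsets.size else None
```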

Availability

Code and model weights are released under the MIT License and the NVIDIA Open Model License, respectively. The base Moshi model from Kyutai is licensed under CC-BY-4.0.

The ServiceDuplexBench benchmark will be released in the near future.

Acknowledgments

PersonaPlex builds on Moshi from Kyutai. This work was enabled by their open-source release.

Citation

If you use PersonaPlex in your research, please cite our paper: [BibTeX coming soon]