RADMMM: Multilingual Multiaccented Multispeaker TTS with RADTTS

Published:

Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro

Overview

In our recent paper, we present a multilingual speech synthesis system that can make a speaker speak languages other than their native language without relying on bilingual (or multilingual) data. This is challenging because bilingual training data is expensive to obtain, and its absence results in strong correlations that entangle speaker, language, and accent, leading to poor transfer capabilities.

Our model, RADMMM, is based on RADTTS and RADTTS++ with explicit control over accent, language, speaker, and fine-grained $F_0$ and energy features. The sequence of input tokens determines the language, and we define accent as how those tokens are pronounced by a speaker. We demonstrate the ability to control synthesized language and accent for any speaker in an open-source dataset comprising 7 accents. Human subjective evaluation shows that our model retains a speaker's voice and accent quality better than controlled baselines while synthesizing fluent speech in all target languages and accents in our dataset.

The following are key contributions of our paper:

  • Demonstrate effective scaling of single-language TTS to multiple languages using a shared alphabet set and an alignment learning framework.
  • Introduce explicit accent conditioning to control the synthesized accent.
  • Explore fine-grained control of speech attributes such as $F_0$ and energy and its effects on speaker timbre retention and accent quality.
  • Propose and analyze several strategies to disentangle attributes (speaker, accent, language, and text) without relying on bilingual (or multilingual) training data.

The following demonstrates 44 kHz samples of our model making an English speaker speak multiple languages:

Description Sample
Speaker's Resynthesized Voice (English)
Speaker's Synthetic Voice (Hindi)
Speaker's Synthetic Voice (German)

We apply RADMMM to the Hindi, Telugu, and Marathi datasets from the ICASSP Signal Processing Grand Challenge (LIMMITS 2023). The challenge involves both monolingual and cross-lingual evaluation in these 3 languages, and RADMMM placed 1st in speaker identity retention and 2nd in overall mean opinion score among the 38 teams from around the world that participated. (RADMMM is referred to as VAANI in the Track 1 results available at the challenge link above.)

Key Challenges

Different alphabet sets: every language has its own alphabet, which requires a text-processing system that can represent text from every language. Training a model with a separate alphabet set per language has 2 disadvantages: (a) it entangles speakers with the characters of their language, and (b) it is harder to scale to new languages (adding one requires retraining several components of the model, such as the encoder and aligner).

Speech-text alignment: most TTS systems rely on external aligners. With such dependencies, scaling to multiple languages is challenging because it requires training or finding an aligner for every language.

Disentanglement of attributes: most speakers speak a single language, and obtaining bilingual (or multilingual) data for a speaker can be very challenging (and expensive), so TTS datasets generally have a strong association between language, speaker, accent, and spoken text. Even if multilingual data for a speaker were available, it would contain accented speech in languages other than the speaker's native language, so the speech in every language would carry multiple unknown accents.

Desirable Characteristics of a Multilingual Model

Our main goal is to make a speaker speak, with a native accent, languages other than the language they speak in the training data.

Controllable Accent: generally, we want to synthesize speech in the target language with that language's native accent. When we transfer a speaker to a new language, we should also be able to change their accent and synthesize with the native accent of the target language.

Fine-grained control: TTS models generally don't support finer control of speech characteristics such as fundamental frequency ($F_0$) and emphasis. The ability to control such attributes for the desired language, accent, and speaker can be useful for customers to better control the synthesized speech.

Scaling to any number of languages: We desire a single model that easily scales to many languages.

Method

The following figure covers our model at a high level.

[Figure: RADMMM model overview]

We propose the following methods to address the above challenges:

1. One Alphabet Set to Rule All Languages:

Each language generally has an independent alphabet set, although some languages share alphabets and words (e.g., Spanish and Portuguese). A shared alphabet set like IPA simplifies text processing and supports easy scaling to new languages. Sharing text tokens reduces entanglement between symbols and speakers and, by design, supports code-switching (mixed languages within the same prompt). We construct our own symbol set based on the IPA symbols and treat markers, phonemes, and punctuation as separate symbols. It is important to note that the sequence of these symbols determines the language being spoken, but the symbols themselves are shared across all languages. For grapheme-to-IPA-phoneme conversion we use Phonemizer, which supports many languages.
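Below is a minimal sketch of the grapheme-to-IPA conversion step using Phonemizer with the espeak-ng backend. The language codes, example sentences, and option flags are illustrative; the exact symbol post-processing we apply on top of the phonemizer output is described in the paper.

# Minimal sketch: grapheme-to-IPA conversion with Phonemizer (espeak-ng backend).
from phonemizer import phonemize

texts = {
    "en-us": "The quick brown fox jumps over the lazy dog.",
    "de": "Der schnelle braune Fuchs springt über den faulen Hund.",
    "hi": "नमस्ते दुनिया",
}

for lang, text in texts.items():
    ipa = phonemize(
        text,
        language=lang,
        backend="espeak",           # espeak-ng covers all three languages
        strip=True,                 # drop trailing separators
        preserve_punctuation=True,  # keep punctuation as separate symbols
        with_stress=True,           # keep stress markers
    )
    print(lang, "->", ipa)

Because all languages map into the same IPA-based symbol set, the output of this step can be fed to a single shared text encoder regardless of the input language.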

2. One Alignment Mechanism to Rule All Languages:

By using a shared alphabet set and our past work on One TTS Alignment, we learn alignments online in an unsupervised way, without external dependencies. This allows us to train on as many languages as needed in a single model. In our limited experimentation, adding new languages to the model has not required any architectural modifications, only some fine-tuning with the new language added to the dataset.

Since speakers can have thick accents, and accent affects how the same text token is pronounced, we condition alignment learning on both text and accent.
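The following is a rough sketch of the accent-conditioned soft alignment, assuming encoded text and mel representations of matching channel width. The distance-based alignment energy and the forward-sum (CTC-like) training objective follow the general recipe of our alignment-learning work, but the layers and shapes here are simplified placeholders, not the released implementation.

import torch
import torch.nn.functional as F

def soft_alignment(text_emb, accent_emb, mel_enc):
    """
    text_emb:   (B, T_text, C)  encoded phoneme symbols
    accent_emb: (B, C)          accent embedding, broadcast over tokens
    mel_enc:    (B, T_mel, C)   encoded mel frames
    Returns log soft-alignment probabilities of shape (B, T_mel, T_text).
    """
    # Condition the text side on accent, since accent changes pronunciation.
    keys = text_emb + accent_emb.unsqueeze(1)     # (B, T_text, C)
    # Negative pairwise L2 distance as the alignment energy.
    dist = torch.cdist(mel_enc, keys)             # (B, T_mel, T_text)
    log_attn = F.log_softmax(-dist, dim=-1)       # soft alignment over text
    return log_attn  # fed to a forward-sum loss during training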

3. Strategies for disentanglement of attributes:

We apply the following strategies to disentangle attributes (accent, speaker, language) so that we achieve better synthesis quality, especially when synthesizing unseen combinations of these attributes. Since our goal is to avoid relying on bilingual training data, every speaker in the dataset speaks a single language with their native accent.

Embedding regularization:

We maximize the variance of embeddings and minimize their covariance to decorrelate the dimensions and enhance the information captured in each dimension (following VICReg). Although this decorrelates the dimensions of each embedding, there can still be correlation between accent and speaker embeddings, so we apply an additional cross-correlation minimization constraint between accent and speaker embeddings. More details can be found in our paper.
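A minimal sketch of these embedding losses is shown below, assuming batches of speaker and accent embeddings of equal size. The hinge target, loss weighting, and normalization are illustrative rather than the released implementation.

import torch

def embedding_regularization(spk_emb, acc_emb, eps=1e-4):
    """spk_emb, acc_emb: (N, D) batches of speaker and accent embeddings."""
    def var_cov(z):
        z = z - z.mean(dim=0)
        # Variance term: keep each dimension's std above 1 (hinge).
        std = torch.sqrt(z.var(dim=0) + eps)
        var_loss = torch.relu(1.0 - std).mean()
        # Covariance term: push off-diagonal covariance toward zero.
        cov = (z.T @ z) / (z.shape[0] - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        cov_loss = (off_diag ** 2).sum() / z.shape[1]
        return var_loss, cov_loss

    v_s, c_s = var_cov(spk_emb)
    v_a, c_a = var_cov(acc_emb)

    # Cross-correlation between normalized speaker and accent embeddings.
    s = (spk_emb - spk_emb.mean(0)) / (spk_emb.std(0) + eps)
    a = (acc_emb - acc_emb.mean(0)) / (acc_emb.std(0) + eps)
    xcorr = (s.T @ a) / spk_emb.shape[0]
    xcorr_loss = (xcorr ** 2).mean()

    return v_s + v_a + c_s + c_a + xcorr_loss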

Adversarial loss to disentangle text and speaker:

Since our dataset has a strong correlation between the sequence of tokens and the speaker, these attributes can become entangled. We apply an adversarial loss by attaching a speaker classifier to the text encoder through a gradient reversal layer, backpropagating negated gradients into the text encoder. More details can be found in our paper.
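The sketch below illustrates the gradient reversal setup: a small speaker classifier (layer sizes are illustrative) sits on top of time-pooled text-encoder outputs, and the reversal layer negates gradients flowing back into the encoder so it is pushed to discard speaker information.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class SpeakerAdversary(nn.Module):
    """Speaker classifier behind a gradient reversal layer (sizes illustrative)."""
    def __init__(self, text_dim=512, n_speakers=7, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(text_dim, 256), nn.ReLU(), nn.Linear(256, n_speakers)
        )

    def forward(self, text_enc, speaker_ids):
        # text_enc: (B, T_text, C) -> pool over time before classifying.
        pooled = GradReverse.apply(text_enc.mean(dim=1), self.lambd)
        logits = self.classifier(pooled)
        return nn.functional.cross_entropy(logits, speaker_ids)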

Data Augmentation:

We use data augmentations such as formant, $F_0$, and duration scaling to promote disentanglement between speaker and accent. Each augmentation effectively creates a new speaker whose voice differs due to shifted fundamental frequency, altered rhythm, or scaled formants. This way, we create multiple speakers per accent to disentangle speaker and accent.
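As a rough illustration, the snippet below applies $F_0$ and duration scaling with librosa; the scale factors are arbitrary, and formant scaling (not shown) would typically be done with a separate tool such as Praat. This is a sketch of the idea, not our data pipeline.

# Sketch: fabricate an extra "speaker" for an accent via F0 and duration scaling.
import librosa

def augment_speaker(wav_path, f0_steps=2.0, duration_rate=1.1):
    y, sr = librosa.load(wav_path, sr=None)
    # F0 scaling: shift pitch by a fixed number of semitones.
    y_pitch = librosa.effects.pitch_shift(y, sr=sr, n_steps=f0_steps)
    # Duration scaling: change speaking rate without changing pitch.
    y_aug = librosa.effects.time_stretch(y_pitch, rate=duration_rate)
    return y_aug, sr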

4. Fine-grained control over speech characteristics:

We condition our model on fine-grained attributes like $F_0$ and energy to improve speech quality along with accent and language transfer. During training, we condition our mel decoder on ground-truth frame-level $F_0$ and energy. We train deterministic attribute predictors to predict phoneme durations, $F_0$, and energy conditioned on speaker, encoded text, and accent.
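A sketch of how such frame-level conditioning features could be extracted is shown below, using librosa's pyin for $F_0$ and per-frame STFT magnitude for energy. The hop and window sizes are placeholders and should match the mel front end actually used; the exact feature extraction in RADMMM is described in the paper.

import numpy as np
import librosa

def extract_f0_energy(y, sr, hop_length=256, win_length=1024):
    # Frame-level F0 via probabilistic YIN; unvoiced frames come back as NaN.
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
        sr=sr,
        frame_length=win_length,
        hop_length=hop_length,
    )
    f0 = np.nan_to_num(f0)  # unvoiced frames -> 0
    # Frame-level energy as the L2 norm of the STFT magnitude per frame.
    stft = librosa.stft(y, n_fft=win_length, hop_length=hop_length)
    energy = np.linalg.norm(np.abs(stft), axis=0)
    return f0, energy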

Dataset Details

We curated an open-source dataset with a 16 kHz sampling rate. We restrict the maximum number of training samples per speaker to 10,000 to better balance the data. Train/test splits will be released with the source code. This dataset emulates a limited-data regime, where we only have one speaker per language and hence disentanglement of speaker and accent is more challenging. The following table summarizes the datasets and contains the open-source links to the individual data. All models use speaker-specific HiFi-GAN vocoders trained on ground-truth data per speaker.
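For illustration, a small helper like the one below can enforce the per-speaker cap when building the training filelist; the filelist field layout and function name are assumptions, not part of our released tooling.

# Sketch: cap the number of training utterances per speaker for data balancing.
import random
from collections import defaultdict

def cap_per_speaker(filelist, max_per_speaker=10_000, seed=1234):
    by_speaker = defaultdict(list)
    for entry in filelist:            # entry: (wav_path, text, speaker, accent)
        by_speaker[entry[2]].append(entry)
    rng = random.Random(seed)
    capped = []
    for speaker, entries in by_speaker.items():
        rng.shuffle(entries)
        capped.extend(entries[:max_per_speaker])
    return capped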

[Table: dataset details and open-source links]

Results and Samples

We evaluate the proposed methods against Tacotron2 Multilingual on the data setup described above. The following summarizes the results. We showcase samples with LJS (an English speaker) speaking in Hindi and a Hindi speaker (IndicTTS) speaking in English.

Human Evaluation

We conduct 2 human evaluations with native speakers of each language. We refer to the models as follows:

  • RT Base: RADTTS without disentanglement strategies.
  • RT Final: RADTTS with disentanglement strategies.
  • RM Base: RADMMM (fine-grained $F_0$ and energy control) without disentanglement strategies.
  • RM Final: RADMMM (fine-grained $F_0$ and energy control) with disentanglement strategies.
  • RM Accented: RM Final synthesized in the speaker's original accent.
  • T2: Tacotron2 Multilingual trained on our data.

Accent Evaluation

We conduct an anchored preference test by showing 3 samples to a native speaker: an accent reference in which a native speaker speaks the prompt, and one sample each from the two models being compared. The task is to choose the model sample with the better native accent. The following table shows the results obtained for each language and model pair. A pairwise preference score above 0.0 indicates that Model A was preferred over Model B in accent quality, where the x-axis denotes Model A vs. Model B pairs:
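As an illustration of the scale, one simple way to compute such a score is to average +1 for every rater who picks Model A and -1 for every rater who picks Model B, so values above 0.0 mean Model A was preferred; the exact statistic used in the paper may differ.

# Illustrative pairwise preference score (assumed scoring, not the paper's exact statistic).
def pairwise_preference_score(votes):
    """votes: list of 'A' or 'B' choices from native-speaker raters."""
    signed = [1 if v == "A" else -1 for v in votes]
    return sum(signed) / len(signed)

# Example: 60 of 100 raters prefer Model A -> score = 0.2
print(pairwise_preference_score(["A"] * 60 + ["B"] * 40))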

[Figure: accent preference scores per language for each Model A vs. Model B pair]

Speaker Identity Evaluation

We conduct an anchored preference test by showing native speakers of a language 3 samples: a speaker reference demonstrating the voice of the speaker, and one sample each from the two models being compared, synthesized for that speaker in another language. The task is to choose the model sample that better preserves the speaker identity. The following table shows the results obtained for each model pair across all languages. A pairwise preference score above 0.0 indicates that Model A was preferred over Model B in speaker identity retention, where the first column denotes Model A vs. Model B pairs:

[Table: speaker identity preference scores for each Model A vs. Model B pair]

The following tables demonstrate the samples.

Speaker references

This table contains the original voices representing the speaker identity references for the Hindi (IndicTTS) and English (LJS) speakers.

Hindi Speaker (IndicTTS) English Speaker (LJS)

Samples comparing Tacotron2 vs RADMMM

Sample type Tacotron2 Multilingual RADMMM
Hindi speaker (IndicTTS) speaking in English
English speaker (LJS) speaking in Hindi

Samples comparing RADMMM (fine-grained control over $F_0$ and energy) without disentanglement vs RADMMM with disentanglement

Sample type RADMMM without disentanglement RADMMM with disentanglement
Hindi speaker (IndicTTS) speaking in English
English speaker (LJS) speaking in Hindi

Samples comparing RADTTS without disentanglement vs RADTTS with disentanglement

Sample type RADTTS without disentanglement RADTTS with disentanglement
Hindi speaker (IndicTTS) speaking in English
English speaker (LJS) speaking in Hindi

Samples comparing RADMMM with native accent vs RADMMM with foreign accent (speaker’s original accent)

Sample type RADMMM with native accent RADMMM with foreign accent
Hindi speaker (IndicTTS) speaking in English
English speaker (LJS) speaking in Hindi

The full list of samples can be found here (ablation_samples).

Citation

@misc{badlani2023multilingual,
      title={Multilingual Multiaccented Multispeaker TTS with RADTTS}, 
      author={Rohan Badlani and Rafael Valle and Kevin J. Shih and João Felipe Santos and Siddharth Gururani and Bryan Catanzaro},
      year={2023},
      eprint={2301.10335},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}