Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
Rafael Valle, Kevin Shih, Ryan Prenger, and Bryan Catanzaro
In our recent paper, we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer. Flowtron combines insights from IAF (inverse autoregressive flows) and reworks Tacotron 2 in order to provide high-quality and controllable mel-spectrogram synthesis.
Flowtron is trained by maximizing the likelihood of the training data, which makes the training procedure simple and stable. Flowtron learns an invertible mapping from data to a latent space that can be manipulated to influence many aspects of mel-spectrogram synthesis.
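To make that objective concrete, here is a minimal sketch of a flow's maximum-likelihood loss under a spherical Gaussian prior. The tensor names and normalization are our assumptions for illustration, not the exact implementation in the repository.

```python
import torch

def flow_nll(z, log_det_sum, sigma=1.0):
    """Negative log-likelihood of a normalizing flow under a Gaussian prior.

    z           -- latent obtained by pushing mel frames through the flow
    log_det_sum -- accumulated log|det(Jacobian)| over all flow steps
    sigma       -- standard deviation of the spherical Gaussian prior
    """
    # log p(x) = log N(z; 0, sigma^2 I) + sum(log|det J|); constants dropped.
    prior_ll = -0.5 * torch.sum(z ** 2) / (sigma ** 2)
    return -(prior_ll + log_det_sum) / z.numel()
```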
Below we provide samples produced with Flowtron for mel-spectrogram synthesis and WaveGlow for waveform synthesis. Code for training and inference, along with pretrained models on LJS and LibriTTS, will be available on our GitHub repository.
Flowtron achieves Mean Opinion Scores (MOS) comparable to state-of-the-art text-to-speech models. Here we provide a sample from Flowtron and Tacotron 2, both trained on the LJSpeech dataset.
With Flowtron we can control the amount of prosodic variation in speech by adjusting σ². Despite the variability added by increasing σ², samples synthesized with Flowtron remain high-quality speech. The three columns contain three separate samples, so you can compare the variation for each value of σ² and also compare it with Tacotron 2's variation. With Flowtron we can create samples with highly varying prosody, which makes the voice much less monotonous.
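As a rough illustration of this control, sampling with different σ might look like the sketch below. The model handle and the `infer` signature are assumptions, not the repository's exact API.

```python
import torch

# Hypothetical inference sketch: `flowtron`, `speaker_id`, and `text_encoded`
# stand in for a loaded pretrained model and its encoded inputs. The latent is
# drawn from N(0, sigma^2 I); larger sigma injects more prosodic variation,
# while sigma = 0 yields the most "average" rendition.
n_frames, n_mel = 400, 80                        # assumed mel dimensions
for sigma in (0.0, 0.5, 1.0):
    z = torch.randn(1, n_mel, n_frames) * sigma  # prior sample scaled by sigma
    mel = flowtron.infer(z, speaker_id, text_encoded)  # illustrative signature
```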
Flowtron model with speaker embeddings: we interpolate between two random z-vectors with the speaker Sally and the phrase "It is well known that deep generative models have a rich latent space" (see the sketch after the samples).
Audio samples at interpolation steps 1/100, 33/100, 66/100, and 100/100 (Flowtron, same speaker).
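A minimal sketch of this latent interpolation, assuming the same illustrative model handle as above; each step decodes a point on the straight line from one prior draw to the other.

```python
import torch

# z_a and z_b are two random draws from the Gaussian prior; intermediate
# latents are linear blends of the two. Names and shapes are assumptions.
n_frames, n_mel = 400, 80
z_a = torch.randn(1, n_mel, n_frames)
z_b = torch.randn(1, n_mel, n_frames)
for step in (1, 33, 66, 100):
    alpha = step / 100.0
    z = (1.0 - alpha) * z_a + alpha * z_b
    mel = flowtron.infer(z, sally_id, text_encoded)  # illustrative signature
```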
Please visit our blog post for examples in which we interpolate between z-vectors producing speech from Sally and Helen with the phrase "We are testing this model".
We compare Sally samples from Flowtron and Tacotron 2 GST, generated by conditioning on the posterior computed over the 30 Helen samples with the highest variance in fundamental frequency. The goal is to make speech from a monotone speaker more expressive by sampling a region of Flowtron's z-space associated with a different, more expressive speaker.
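A hedged sketch of that posterior trick: run the expressive Helen samples forward through the flow to obtain their latents, average them, then sample near that mean while conditioning on Sally. All names below, including `forward_to_latent`, are assumptions for illustration.

```python
import torch

n_frames = 400  # assumed output length
# Push each (mel, text) pair through the flow to get its latent.
z_list = [flowtron.forward_to_latent(mel, helen_id, text)  # illustrative call
          for mel, text in helen_samples]
# Average over time and over samples to get a posterior mean per dimension.
z_mean = torch.stack([z.mean(dim=2) for z in z_list]).mean(dim=0)
# Sample around the mean with a small scale and decode as Sally.
z = z_mean.unsqueeze(2) + 0.5 * torch.randn(1, 80, n_frames)
mel = flowtron.infer(z, sally_id, text_encoded)
```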
We illustrate Flowtron's ability to learn and transfer acoustic characteristics that are hard to express algorithmically but easy to perceive acoustically. We transfer a style with a distinctly nasal voice and oscillations in fundamental frequency to our Flowtron baseline speaker.
We modify a speaker's style by using data from the same speaker but in a style not seen during training. Flowtron succeeds in transferring the somber tone and the long pauses associated with the narrative style.
We transfer the style from speaker ID 03 from RAVDESS and the label "surprised" to Sally. Flowtron is able to make Sally sound surprised, which is drastically different from the monotonous baseline.
Audio samples: Flowtron baseline, reference style, and Flowtron style transfer.
We transfer Richard Feynman's prosody and acoustic characteristics to Sally. Flowtron is able to pick up some of the prosody and articulation details particular to Feynman's speaking style and transfer them to Sally.
We select a single component from the Gaussian mixture and translate a dimension associated with pitch. Although the samples have different pitch contours, they have similar durations. A code sketch of this dimension-translation procedure follows the speech-rate example below.
Audio samples: μ (A-flat), μ - 2σ (C), μ - 4σ (E-flat).
We select a single component from the Gaussian mixture and translate a dimension associated with speech rate. Although the samples have different speech rates, they have similar pitch contours.
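The sketch below illustrates both of these manipulations: fix one mixture component and offset a single latent dimension by multiples of that component's standard deviation. The component accessors and the dimension index are assumptions; in practice the dimension controlling pitch or rate is found by inspection and listening.

```python
import torch

n_frames, n_mel = 400, 80
mu_k = torch.zeros(n_mel)     # stand-in for the selected component's mean
sigma_k = torch.ones(n_mel)   # stand-in for its per-dimension std
dim = 7                       # assumed dimension tied to pitch (or speech rate)
for n_sigmas in (0, 2, 4):
    z = mu_k.clone()
    z[dim] = mu_k[dim] - n_sigmas * sigma_k[dim]  # translate one dimension
    z = z.view(1, n_mel, 1).expand(1, n_mel, n_frames)
    mel = flowtron.infer(z, sally_id, text_encoded)  # illustrative signature
```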