Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
Rafael Valle, Kevin Shih, Ryan Prenger, and Bryan Catanzaro
In our recent paper, we propose Flowtron, an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer. Flowtron combines insights from inverse autoregressive flow (IAF) and optimizes Tacotron 2 in order to provide high-quality and controllable mel-spectrogram synthesis.
Flowtron is trained by maximizing the likelihood of the training data, which makes the training procedure simple and stable. Flowtron learns an invertible mapping of data to a latent space that can be manipulated to influence many aspects of mel-spectrogram synthesis.
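To make the idea of an invertible mapping trained by maximum likelihood concrete, here is a minimal NumPy sketch of a single elementwise affine transform and its exact log-likelihood under a standard normal prior. It is a toy, not the Flowtron architecture: the names `forward`, `inverse`, and `log_likelihood` are hypothetical, and the scale and shift are fixed arrays here, whereas an autoregressive flow would predict them from previous frames and the text.

```python
import numpy as np

def forward(x, scale, shift):
    """Map data x to latent z with an invertible elementwise affine transform."""
    z = x * scale + shift
    # For an elementwise affine map, log|det J| is the sum of log|scale|.
    log_det = np.sum(np.log(np.abs(scale)))
    return z, log_det

def inverse(z, scale, shift):
    """Recover x from z exactly (requires scale != 0)."""
    return (z - shift) / scale

def log_likelihood(x, scale, shift):
    """Change of variables: log p(x) = log N(z; 0, I) + log|det dz/dx|."""
    z, log_det = forward(x, scale, shift)
    log_prior = -0.5 * np.sum(z ** 2) - 0.5 * z.size * np.log(2.0 * np.pi)
    return log_prior + log_det

rng = np.random.default_rng(0)
x = rng.normal(size=80)                          # stand-in for one 80-bin mel frame
scale = np.exp(rng.normal(scale=0.1, size=80))   # assumed, strictly positive
shift = rng.normal(scale=0.1, size=80)           # assumed

z, _ = forward(x, scale, shift)
x_rec = inverse(z, scale, shift)
print("max reconstruction error:", np.max(np.abs(x - x_rec)))
print("log-likelihood:", log_likelihood(x, scale, shift))
```

Training would adjust the parameters that produce `scale` and `shift` to increase this log-likelihood over the dataset. Because the map is invertible, the same parameters can then turn latent samples back into mel-spectrogram frames.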
Below we provide samples produced with Flowtron for mel-spectrogram synthesis and WaveGlow for waveform synthesis. Code for training and inference, along with models pretrained on LJ Speech (LJS) and LibriTTS, will be available in our GitHub repository.
With Flowtron we can control the amount of prosodic variation in speech by adjusting σ². Despite the variability added by increasing σ², all samples synthesized with Flowtron remain high-quality speech. The three columns contain three separate samples, so that you can compare variation for each value of σ², and also compare with Tacotron 2 variation. With Flowtron, we can create samples with highly varying prosody, which can make the voice much less monotonous.
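The role of σ² can be sketched as sampling the latent variable from a zero-mean Gaussian with that variance before running the inverse pass of the model. The snippet below only draws the latents. The helper name `sample_latent`, the shape (frames by mel bins), and the specific σ² values are assumptions for illustration, not taken from the released code.

```python
import numpy as np

def sample_latent(n_frames, n_mel, sigma, rng):
    """Draw z ~ N(0, sigma^2 I) with shape (n_frames, n_mel)."""
    return sigma * rng.standard_normal((n_frames, n_mel))

rng = np.random.default_rng(0)
for sigma_sq in (0.0, 0.5, 1.0):  # illustrative values
    z = sample_latent(400, 80, np.sqrt(sigma_sq), rng)
    print(f"sigma^2 = {sigma_sq}: empirical variance of z = {z.var():.3f}")
```

With σ² = 0 every latent is exactly zero, so repeated syntheses of the same text would be identical. Larger values spread the latents further from the origin, which is what allows more prosodic variation between samples.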
We compare Sally samples from Flowtron and Tacotron 2 GST generated by conditioning on the posterior computed over the 30 Helen samples with the highest variance in fundamental frequency. The goal is to make speech from a monotonic speaker more expressive by sampling from a region of Flowtron's z-space that is associated with a different, more expressive speaker.
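One way to picture "conditioning on the posterior" is the standard conjugate Gaussian update: given latent vectors obtained by running reference utterances through the forward pass, combine them with a standard normal prior and sample from the result. The sketch below assumes each reference utterance has already been reduced to a fixed-length vector (for example by averaging over time) and assumes a known likelihood variance `sigma_sq`. Both are simplifications, the function names are hypothetical, and the random array stands in for real latents.

```python
import numpy as np

def gaussian_posterior(z_samples, sigma_sq=1.0):
    """Posterior over the latent mean with prior N(0, I) and
    likelihood N(mu, sigma_sq * I), given n observed latent vectors."""
    n = z_samples.shape[0]
    z_bar = z_samples.mean(axis=0)
    precision = n / sigma_sq + 1.0            # likelihood precision + prior precision
    post_mean = (n / sigma_sq) * z_bar / precision
    post_var = 1.0 / precision
    return post_mean, post_var

def sample_posterior(post_mean, post_var, rng):
    """Draw one latent vector from N(post_mean, post_var * I)."""
    return post_mean + np.sqrt(post_var) * rng.standard_normal(post_mean.shape)

rng = np.random.default_rng(0)
# Placeholder for latents of 30 expressive reference utterances (80 dims each).
z_reference = rng.normal(loc=0.8, scale=1.0, size=(30, 80))

post_mean, post_var = gaussian_posterior(z_reference, sigma_sq=1.0)
z_new = sample_posterior(post_mean, post_var, rng)
print("posterior variance:", post_var)
print("sampled latent shape:", z_new.shape)
```

The more reference samples are used, the closer the posterior mean moves to their average and the smaller the posterior variance becomes, so sampled latents concentrate in the region of z-space occupied by the expressive speaker.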
We illustrate Flowtron's ability to learn and transfer acoustic characteristics that are hard to express algorithmically but easy to perceive acoustically. We transfer a style marked by a distinctly nasal voice and oscillation in fundamental frequency to our Flowtron baseline speaker.
We modify a speaker's style by using data from the same speaker but from a style not seen during training. Flowtron succeeds in transferring the somber style and the long pauses associated with the narrative style.
We transfer the style of speaker ID 03 from RAVDESS, labeled "surprised", to Sally. Flowtron is able to make Sally sound surprised, which is drastically different from the monotonous baseline.
Flowtron Style Transfer
We transfer Richard Feynman's prosody and acoustic characteristics to Sally. Flowtron is able to pick up some of the prosody and articulation details particular to Feynman's speaking style and transfer them to Sally.