Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens



Rafael Valle, Jason Li, Ryan Prenger, and Bryan Catanzaro

In our recent paper, we propose Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data. By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles, ranging from read speech to expressive speech, from slow drawls to rap, and from monotonous voice to singing voice.
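To make the idea of a pitch contour derived from a music score concrete, here is a minimal sketch that turns a list of notes into a frame-level F0 sequence. It is an illustration only and not necessarily how Mellotron builds its conditioning signal: the function names, the frame rate, and the convention of marking rests as 0 Hz are all assumptions made for this example. The only fixed fact used is the standard MIDI-to-frequency mapping, with A4 (MIDI note 69) at 440 Hz.

```python
def midi_to_hz(note):
    """Standard equal-temperament mapping: MIDI note 69 (A4) is 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def score_to_f0(notes, frames_per_second=100):
    """Expand (midi_note, duration_seconds) pairs into one F0 value per frame.

    A midi_note of None is treated as a rest and emitted as 0.0 Hz
    (an assumed convention for unvoiced frames, chosen for this sketch).
    """
    contour = []
    for midi_note, duration in notes:
        n_frames = round(duration * frames_per_second)
        hz = 0.0 if midi_note is None else midi_to_hz(midi_note)
        contour.extend([hz] * n_frames)
    return contour


# Hypothetical three-event score: A4, a rest, then A5.
example_f0 = score_to_f0([(69, 0.5), (None, 0.25), (81, 0.25)], frames_per_second=4)
```

Rhythm is carried by the note durations here, so stretching or compressing the durations changes the rhythm without touching the pitch values.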

Unlike other methods, we train Mellotron using only read speech data without alignments between text and audio. We evaluate our models using the LJSpeech and LibriTTS datasets. We provide F0 Frame Errors and synthesized samples that include style transfer from other speakers, singers and styles not seen during training, procedural manipulation of rhythm and pitch, and choir synthesis.
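For readers unfamiliar with the F0 Frame Error metric, the sketch below shows one common way to compute it: the fraction of frames that have either a voicing decision error or, when both frames are voiced, a pitch deviation above a tolerance. The 20% tolerance, the use of 0 Hz to mark unvoiced frames, and the function name are assumptions for this example; the exact settings used in the paper are not stated here.

```python
def f0_frame_error(f0_ref, f0_syn, tol=0.2):
    """Fraction of frames with a voicing mismatch or a gross pitch error.

    f0_ref, f0_syn: equal-length sequences of F0 values in Hz,
    where a value <= 0 marks an unvoiced frame (assumed convention).
    tol: relative pitch deviation that counts as a gross error (assumed 20%).
    """
    if len(f0_ref) != len(f0_syn):
        raise ValueError("F0 sequences must have the same number of frames")
    if len(f0_ref) == 0:
        raise ValueError("F0 sequences must not be empty")

    errors = 0
    for ref, syn in zip(f0_ref, f0_syn):
        ref_voiced = ref > 0
        syn_voiced = syn > 0
        if ref_voiced != syn_voiced:
            # Voicing decision error: one frame voiced, the other not.
            errors += 1
        elif ref_voiced and abs(syn - ref) > tol * ref:
            # Gross pitch error: both voiced, pitch off by more than tol.
            errors += 1
    return errors / len(f0_ref)


# Toy contours: frame 2 has a voicing mismatch, frame 3 a gross pitch error.
ffe = f0_frame_error([0, 100, 200, 200], [0, 110, 0, 300])
```

A lower value means the synthesized pitch contour follows the reference more closely, both in where it is voiced and in how high the pitch is.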

Below we provide style transfer and singing voice synthesis samples produced with Mellotron and WaveGlow. Code for training and inference, along with a model pretrained on LJS and LibriTTS, will be available on our GitHub repository.

4.3.1 Rhythm Transfer

Nicki Minaj - Your Love | Mellotron 0.5x to 2x speed | Mellotron without F0 | Mellotron with F0

4.3.2 Rhythm and Pitch Transfer

Source | Tacotron 2 | Mellotron
Source | E2E Prosody Transfer | Mellotron

4.4.1 Singing Voice from Audio Signal

Source | E2E Prosody Transfer | Mellotron
Adele - Rumour Has It | Mellotron LJS | Mellotron Sally
Kaushiki Chakrabarty - Raga Multani | Mellotron LJS | Mellotron Sally

4.4.2 Singing Voice from Music Score

Haendel - Hallelujah | Ligeti - Lux Aeterna | Debussy - Prelude To The Afternoon Of A Faun