Tacotron-2 : Implementation and Experiments

Rajanie Prabha
Aug 3, 2018


Why do we want to do Text-to-Speech?

Not one but many: TTS can be used for accessibility features for people with little to no vision, communication aids for people who cannot speak, voice assistants such as Siri, screen readers, automated telephony systems, audiobooks, easier language learning, and more.

In December 2017, Google released its new research called ‘Tacotron-2’, a neural network architecture for text-to-speech synthesis. Before moving forward, I would like you to check out the results they posted on their blog https://google.github.io/tacotron/publications/tacotron2/ and get excited about the mechanism.

Aren’t the results awesome and so human-like? Yes, and that’s what motivated me to figure out how they did it and eventually try to implement it myself. I worked on Tacotron-2’s implementation and experimentation for three months, as part of a grad school course, with a Munich-based AI startup called Luminovo.AI. I wanted to build such a synthesizer for Angela Merkel’s speech.

SEQ2SEQ MODEL WITH ATTENTION

The working of the system was described by Jonathan Shen and Ruoming Pang, software engineers on the Google Brain and Machine Perception teams:

“In a nutshell it works like this: We use a sequence-to-sequence model optimized for TTS to map a sequence of letters to a sequence of features that encode the audio. These features, an 80-dimensional audio spectrogram with frames computed every 12.5 milliseconds, capture not only pronunciation of words, but also various subtleties of human speech, including volume, speed and intonation. Finally these features are converted to a 24 kHz waveform using a WaveNet-like architecture.”

The first part of the model is the Seq2Seq architecture, which converts text into mel-spectrograms; these spectrograms are then fed into a WaveNet vocoder to produce audio waveforms. One interesting thing is that these two parts of the Tacotron-2 architecture (the Seq2Seq model and the WaveNet vocoder) can be trained independently. I worked on the Seq2Seq model.
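To make the target features concrete: the quote above describes 80-dimensional mel-spectrogram frames computed every 12.5 milliseconds from 24 kHz audio. Below is a small sketch of how such features can be extracted with librosa; the FFT size, log compression and clipping floor are my own assumptions and vary between implementations.

```python
import librosa
import numpy as np

def mel_spectrogram(wav_path,
                    sr=24000,         # 24 kHz audio, as in the quoted description
                    n_fft=2048,       # assumption; any size >= win_length works
                    win_length=1200,  # 50 ms window
                    hop_length=300,   # 12.5 ms frame hop
                    n_mels=80):       # 80-dimensional mel features
    """Compute a log-mel spectrogram roughly matching the Tacotron-2 targets."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft,
        win_length=win_length, hop_length=hop_length,
        n_mels=n_mels)
    # Log compression; the clipping floor is an implementation choice.
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))
```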

The model is an encoder-attention-decoder setup that uses ‘location-sensitive attention’. The first part is the encoder, which converts the character sequence into a hidden feature representation. This representation is later consumed by the decoder to predict spectrograms. Since I was using a German dataset, I made sure that my character set covered the German alphabet.

  • The encoder is composed of 3 convolutional layers, each containing 512 filters of shape 5 × 1, followed by batch normalization and ReLU activations.
  • The output of the final convolutional layer is passed into a single bi-directional LSTM layer containing 512 units (256 in each direction) to generate the encoded features (a minimal code sketch of this encoder follows the list).
  • The next part is the attention network, which takes the encoder output as input and tries to summarize the full encoded sequence into a fixed-length context vector for each decoder output step.
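Here is a minimal PyTorch sketch of the encoder described above (character embedding, 3 convolutional layers with batch norm and ReLU, and one bi-directional LSTM). Dropout and sequence masking/packing are omitted for brevity; the 512-dimensional character embedding follows the paper.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Tacotron-2 style encoder: 3 conv layers + 1 bi-directional LSTM."""
    def __init__(self, n_chars, emb_dim=512, conv_channels=512, kernel_size=5):
        super().__init__()
        self.embedding = nn.Embedding(n_chars, emb_dim)
        channels = [emb_dim, conv_channels, conv_channels, conv_channels]
        convs = []
        for in_ch, out_ch in zip(channels[:-1], channels[1:]):
            convs += [nn.Conv1d(in_ch, out_ch, kernel_size,
                                padding=(kernel_size - 1) // 2),
                      nn.BatchNorm1d(out_ch),
                      nn.ReLU()]
        self.convs = nn.Sequential(*convs)
        # 256 units in each direction -> 512-dimensional encoded features
        self.lstm = nn.LSTM(conv_channels, 256, batch_first=True,
                            bidirectional=True)

    def forward(self, char_ids):
        x = self.embedding(char_ids)               # (B, T, 512)
        x = self.convs(x.transpose(1, 2))          # conv over time: (B, 512, T)
        outputs, _ = self.lstm(x.transpose(1, 2))  # (B, T, 512)
        return outputs
```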

ATTENTION-BASED MODELS FOR SPEECH RECOGNITION:

The attention mechanism used here takes into account both the location of the focus in the previous step and the features of the input sequence.

Let’s say we have input data x = {x(1), x(2), …, x(N)}. We pass this data to the encoder, which produces an encoded output sequence h = {h(1), h(2), …, h(N)}.

A(i) = Attention( s(i-1), A(i-1), h ), where s(i-1) is the previous decoder state and A(i-1) is the previous alignment.

For the first decoding step, s(i-1) and A(i-1) are initialized to zero.

The Attention function is usually implemented by scoring each element in h separately and then normalizing the scores. The context vector G(i) is the alignment-weighted sum of the encoded features:

G(i) = A(i,1) h(1) + A(i,2) h(2) + … + A(i,N) h(N)

Y(i) ~ Generate ( s(i-1), G(i) )

where Y(i) is the decoder output, G(i) is the context vector, and A(i) is the vector of attention weights, called the alignment.

Finally, s(i) = Recurrency ( s(i-1), G(i), Y(i) )

The Recurrency is usually an LSTM.
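Putting the pieces together, here is a rough PyTorch sketch of location-sensitive attention: the previous alignment A(i-1) is run through a 1-D convolution to produce location features, which are combined with the decoder state s(i-1) and the encoded features h to score each position; the softmax of the scores gives the new alignment A(i), and the weighted sum of h gives the context G(i). The dimensions (128-dimensional attention space, 32 location filters of length 31) roughly follow the paper; masking of padded frames and other details are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    def __init__(self, enc_dim=512, dec_dim=1024, attn_dim=128,
                 loc_filters=32, loc_kernel=31):
        super().__init__()
        self.query_layer = nn.Linear(dec_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(enc_dim, attn_dim, bias=False)
        self.location_conv = nn.Conv1d(1, loc_filters, loc_kernel,
                                       padding=(loc_kernel - 1) // 2, bias=False)
        self.location_layer = nn.Linear(loc_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, decoder_state, encoder_outputs, prev_alignment):
        # decoder_state: (B, dec_dim), encoder_outputs: (B, T, enc_dim),
        # prev_alignment: (B, T)
        loc = self.location_conv(prev_alignment.unsqueeze(1))   # (B, F, T)
        loc = self.location_layer(loc.transpose(1, 2))          # (B, T, attn_dim)
        energies = self.v(torch.tanh(
            self.query_layer(decoder_state).unsqueeze(1)
            + self.memory_layer(encoder_outputs)
            + loc)).squeeze(-1)                                  # (B, T)
        alignment = F.softmax(energies, dim=-1)                  # A(i)
        context = torch.bmm(alignment.unsqueeze(1),
                            encoder_outputs).squeeze(1)          # G(i)
        return context, alignment
```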

DECODER:

The decoder is an autoregressive recurrent neural network which predicts a mel spectrogram from the encoded input sequence one frame at a time. The prediction from the previous time step is first passed through a small pre-net containing 2 fully connected layers of 256 hidden ReLU units. The pre-net output and attention context vector are concatenated and passed through a stack of 2 uni-directional LSTM layers with 1024 units. Finally, the predicted mel spectrogram is passed through a 5-layer convolutional post-net which predicts a residual to add to the prediction, improving the overall reconstruction. Each post-net layer comprises 512 filters of shape 5 × 1 with batch normalization, followed by tanh activations on all but the final layer.
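A simplified PyTorch sketch of one decoder step as described above: the previous frame goes through the pre-net (whose dropout the paper keeps active even at inference), is concatenated with the attention context, passes through two 1024-unit LSTM cells, and is projected to the next mel frame plus a stop-token logit (the stop token is explained in the loss section below). The post-net and the per-step attention call are left out; how the context is obtained is shown in the attention sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Prenet(nn.Module):
    """2 fully connected layers of 256 ReLU units; dropout stays on at inference."""
    def __init__(self, in_dim=80, hidden=256, p=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.p = p

    def forward(self, x):
        x = F.dropout(F.relu(self.fc1(x)), self.p, training=True)
        return F.dropout(F.relu(self.fc2(x)), self.p, training=True)

class DecoderStep(nn.Module):
    """One autoregressive decoder step: previous frame -> next mel frame + stop logit."""
    def __init__(self, n_mels=80, enc_dim=512, lstm_dim=1024):
        super().__init__()
        self.prenet = Prenet(n_mels)
        self.lstm1 = nn.LSTMCell(256 + enc_dim, lstm_dim)
        self.lstm2 = nn.LSTMCell(lstm_dim, lstm_dim)
        self.mel_proj = nn.Linear(lstm_dim + enc_dim, n_mels)
        self.stop_proj = nn.Linear(lstm_dim + enc_dim, 1)

    def forward(self, prev_frame, context, states):
        (h1, c1), (h2, c2) = states
        x = torch.cat([self.prenet(prev_frame), context], dim=-1)
        h1, c1 = self.lstm1(x, (h1, c1))
        h2, c2 = self.lstm2(h1, (h2, c2))
        out = torch.cat([h2, context], dim=-1)
        mel_frame = self.mel_proj(out)      # next spectrogram frame
        stop_logit = self.stop_proj(out)    # "are we done?" score
        return mel_frame, stop_logit, ((h1, c1), (h2, c2))
```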

LOSS FUNCTION:

Summed mean squared error (MSE) on the mel-spectrogram predictions, taken both before and after the post-net.

In parallel to spectrogram frame prediction, the concatenation of decoder LSTM output and the attention context is projected down to a scalar and passed through a sigmoid activation to predict the probability that the output sequence has completed. This “stop token” prediction is used during inference to allow the model to dynamically determine when to terminate generation instead of always generating for a fixed duration.
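A sketch of how the training objective can be put together from these descriptions: summed MSE on the mel prediction before and after the post-net, plus a binary cross-entropy term for the stop token. The BCE choice follows common open-source implementations rather than an explicit statement in the paper.

```python
import torch.nn.functional as F

def tacotron2_loss(mel_before, mel_after, mel_target,
                   stop_logits, stop_target):
    """mel_*: (B, T, 80); stop_logits/stop_target: (B, T)."""
    # Summed MSE from before and after the post-net.
    mel_loss = (F.mse_loss(mel_before, mel_target)
                + F.mse_loss(mel_after, mel_target))
    # Stop-token term (assumed BCE; stop_target is 1.0 at and after the last frame).
    stop_loss = F.binary_cross_entropy_with_logits(stop_logits, stop_target)
    return mel_loss + stop_loss
```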

I decided to go with PyTorch for my implementation, tracked training with TensorBoard, used a Tesla K80 GPU on Google Cloud, forwarded the server ports with ‘ssh -NfL’, and heavily used JupyterLab during development. [life-saver kit]

I referenced various GitHub repositories [1, 2] to understand the paper and its implementation, and to correct bugs in my own code. Due to the natural complexity of the problem, I could not get astonishing, human-like speech results, but I learned a lot about text-to-speech, and that was the major goal when I started the project.

Some results for reference:

Predicted mel-spectrogram: comparing the upper regions, it has a lot of gaps and still needs a lot of training. The solid green area on the right side is just padding within one batch.
Target mel-spectrogram.
Attention alignment: as you can see in the lower left, it looks like the model is learning to align, but it still needs around one week of training to get that clean diagonal.

All the images above were produced after 50K iterations (1 iteration = 1 batch), i.e. 3 days of training. This model needs around 300K iterations to get anywhere close to human-like speech. Notice that the predicted mel-spectrograms look pretty nice even when the attention is not learned properly. Save yourself from that trap and watch the attention!

Training loss from TensorBoard
Validation loss from TensorBoard

EXPERIMENTS:

Implementing the model and training it was not as trivial as I initially thought. I came across numerous issues that I want you to know about beforehand, so you can save hours on your GPU.

  1. Study your data. This is the most important part of the project. Listen to your data samples, check the lengths of the text samples, the durations of the audio samples, etc. You can save a lot of time during training if you know your data well. M-AILABS announced their huge speech dataset earlier this year, with a humongous amount of speech in many different languages. I used the Angela Merkel data from the German female section, which has 12 hours of speech from her public speeches and interviews. This dataset is small compared to LJSpeech (the most popular English dataset, with 24 hours of speech). I figured this out only when I started training and spent days observing the results. So, heads up!
  2. TTS is highly computationally expensive. Being a student, I only had access to one GPU (an Nvidia Tesla K80) on Google Cloud. Given the structure of the dataset I was using, my GPU only allowed a batch size of 8 during training; Google says they train with a batch size of 64. I first tried a batch size of 2 (because of limited GPU memory), and when the model failed to show any convergence after 2–3 days of training, I sorted my data by text length and audio duration and started training with a batch size of 8. Still, I couldn’t optimize further with the dataset and the GPU I had. So, plan accordingly.
  3. Teacher-forcing ratio. In teacher-forced training, the model is assisted by the true labels, i.e. it uses the current frame of the ground truth to predict the next decoding step. The paper is not clear about what ratio to use. Even if the attention is not learned, the model will predict good frames on the training data in teacher-forced mode, but in evaluation mode it will not work because there is no ground truth (I thought the model was working since the predicted mels looked nice despite the poor alignments). I trained with ratios of 1.0, 0.75 and 0.5 to make the model learn alignments; during eval mode, teacher forcing should be turned off (see the sketch after this list).
  4. It takes days to train and get alignments. Training a TTS system is a really cumbersome process. It might take around 7–10 days to train the model if you have limited GPU support (we are not Google). And then debugging the code with such a model is another story.
  5. Hyperparameter tuning is a very important part of the Tacotron-2 system. The batch size, learning rate, teacher-forcing ratio and batch length are some of the parameters you should pay extra attention to. Things vary with datasets, so the model is very sensitive!
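As promised in point 3, here is a sketch of how a teacher-forcing ratio can be applied in the decoding loop, reusing the hypothetical DecoderStep and LocationSensitiveAttention modules sketched earlier. The ratio values 1.0, 0.75 and 0.5 are the ones I experimented with; setting the ratio to 0.0 corresponds to eval mode.

```python
import random
import torch

def decode_sequence(decoder_step, attention, encoder_outputs, mel_targets,
                    teacher_forcing_ratio=0.75, lstm_dim=1024):
    """Run the decoder over one batch, feeding either the ground-truth frame
    (teacher forcing) or the model's own previous prediction."""
    B, T_out, n_mels = mel_targets.shape
    prev_frame = torch.zeros(B, n_mels)                  # all-zero <GO> frame
    alignment = torch.zeros(B, encoder_outputs.size(1))  # A(0)
    states = ((torch.zeros(B, lstm_dim), torch.zeros(B, lstm_dim)),
              (torch.zeros(B, lstm_dim), torch.zeros(B, lstm_dim)))
    outputs = []
    for t in range(T_out):
        # Query the attention with the first LSTM's hidden state.
        context, alignment = attention(states[0][0], encoder_outputs, alignment)
        mel_frame, stop_logit, states = decoder_step(prev_frame, context, states)
        outputs.append(mel_frame)
        # With probability `teacher_forcing_ratio`, feed the ground-truth frame;
        # at eval time, set the ratio to 0.0 so the model consumes its own output.
        if random.random() < teacher_forcing_ratio:
            prev_frame = mel_targets[:, t]
        else:
            prev_frame = mel_frame.detach()
    return torch.stack(outputs, dim=1), alignment
```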

CONCLUSION

Text-to-speech is still a really complex research problem, and it was exciting to work on it. My overall experience was amazing; I learnt a lot about TTS systems, audio waveforms, recurrent networks, mel-spectrograms and attention mechanisms, and I hope this post can help you in your own journey with TTS systems. In the future, I would like to see an optimized version of the Tacotron-2 model, something that is more robust across languages, easier to train and less computationally heavy.

So, I would just say: preprocess your data well, tune your hyperparameters, log everything on TensorBoard and get going! All the best!

Special thanks to Luminovo.AI for their support!
