Day 7: Natural TTS Synthesis By Conditioning WaveNet On Mel Spectrogram Predictions (Tacotron 2)

Francisco Ingham
A paper a day avoids neuron decay
8 min read · Mar 27, 2019

[Dec 16, 2017] The key to generating human-like speech, without predefined features

Taco-Tron

TL;DR

A unified, entirely neural approach that combines a text-to-mel-spectrogram network similar to Tacotron with a WaveNet vocoder that produces human-like speech. Contrary to the original WaveNet, they did not need to generate any additional linguistic features to feed the vocoder, and they obtained clearly higher-quality excerpts than Deep Voice 3.

I personally like this paper very much: they generated world-class results which are used in production, and they took the time to run ablation studies to push the research bar higher. Thank you Google!

I want to hear! Now!

Ok, ok here.

Architecture

Tacotron 2 complete architecture

Spectrogram prediction network

The network is composed of an encoder and a decoder with attention. The encoder ‘encodes’ the input text into a feature representation that the decoder then converts into a mel spectrogram (the target mel spectrograms themselves are derived from the audio with a short-time Fourier transform followed by a mel filterbank).

1. Encoder

Each input character is represented as a 512-dimensional embedding, which passes through a stack of convolutional layers that encode longer-term context. The output of the final convolutional layer is fed into a bi-directional LSTM to generate the encoded features.
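As a rough sketch of how such an encoder might look (hypothetical PyTorch code based on the paper’s description of 3 convolutional layers and a single bi-directional LSTM, not the official implementation):

```python
import torch
import torch.nn as nn

class TacotronEncoder(nn.Module):
    """Sketch of the Tacotron 2 encoder: char embedding -> 3 conv layers -> bi-LSTM."""
    def __init__(self, n_symbols, embed_dim=512, kernel_size=5):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, embed_dim)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(embed_dim, embed_dim, kernel_size, padding=kernel_size // 2),
                nn.BatchNorm1d(embed_dim),
                nn.ReLU(),
                nn.Dropout(0.5),
            )
            for _ in range(3)
        ])
        # Bi-directional LSTM with 256 units per direction -> 512-dim encoded features.
        self.lstm = nn.LSTM(embed_dim, embed_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, text_ids):                       # (batch, time)
        x = self.embedding(text_ids).transpose(1, 2)   # (batch, 512, time)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                          # (batch, time, 512)
        encoded, _ = self.lstm(x)
        return encoded                                 # (batch, time, 512)
```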

2. Attention

The encoder output is fed into the ‘location-sensitive attention’ module, which uses:

cumulative attention weights from previous decoder time steps as an additional feature.

This section is highly technical and, although I wanted to write about it to make sure I understand it thoroughly, it is not strictly necessary to understand the crux of the paper. If you are not interested in the math, skip directly to the Why is this important? section.

Additive attention/Content-based attention

Let’s start with the basics: additive attention (also called content-based attention). It is called additive because the alignment score is computed by adding projections of the decoder state and each encoder hidden state inside a tanh; the weighted hidden states are then summed to form the context vector:

Context vector in additive attention
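In standard Bahdanau-style notation (with encoder hidden states h_j and attention weights α_ij, an assumed notation), the context vector is the attention-weighted sum:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j$$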

Where the attention weight matrix α is computed as follows:

Note that the computation of α depends only on the previous state and the corresponding encoder hidden state (the content or input)
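With a small feed-forward scoring network (the weight matrices W, V and vector w are notation assumptions), the scores and weights are:

$$e_{ij} = w^{\top}\tanh\!\left(W s_{i-1} + V h_j + b\right), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}$$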

Then the context vector is used, together with the previous state and the previous output, to generate the current state in the decoder LSTM:

State computation in decoder LSTM
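In the same notation, with y_{i−1} the previous output and f the decoder recurrence:

$$s_i = f\!\left(s_{i-1},\, y_{i-1},\, c_i\right)$$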

In this graphical representation of content-based attention we can clearly see how the weighted hidden states are summed before computing the recurrent state:

Additive attention

By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed length vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.

Location-sensitive attention

Location-sensitive attention differs from traditional additive attention in that it considers not only the content of each input element but also its position within the longer sequence. In particular, the big difference here is in the computation of α. In location-sensitive attention, α_i is computed using the previous alignment α_i−1 as an additional input:

The computation of α depends on the previous state’s alignment/weights
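In Chorowski et al.’s formulation (a reconstruction; Tacotron 2 actually feeds the cumulative attention weights rather than just α_{i−1}), the previous alignment is convolved with a filter bank F to produce location features, which enter the score:

$$f_i = F * \alpha_{i-1}, \qquad e_{ij} = w^{\top}\tanh\!\left(W s_{i-1} + V h_j + U f_{i,j} + b\right)$$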

And then the next state is computed in the same manner as in additive attention:

Here the context vector is called g for glimpse
s_i is computed in the same way as additive attention
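So the glimpse and the state update read (same notation as above):

$$g_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j, \qquad s_i = f\!\left(s_{i-1},\, y_{i-1},\, g_i\right)$$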

Let’s inspect this graphically. Notice how the previous state’s α is being used in the computation of the next state’s α:

Location-sensitive attention

Why is this important?

The authors state that:

Informally, we would like an attention model that uses the previous alignment α_i−1 to select a short list of elements from h, from which the content-based attention (…) will select the relevant ones without confusion.

In other words:

This encourages the model to move forward consistently through the input, mitigating potential failure modes where some subsequences are repeated or ignored by the decoder.

For example, say we are training on a speech by Arnold Schwarzenegger. Let’s say that our current input to the network is:

And yes, I can imagine with your name, Arnold Schwartzenschnitzel or whatever the name is, on a billboard.

Let’s say that, for the last output, the network’s alignment weights show a focus on imagine. By using these alignments the network knows that the next focus should at least start at with; it might move a bit further forward, but it should definitely not stay on imagine or jump far ahead (this is the short list from which the content-based attention will select the next focus). This helps the network not to repeat or skip syllables.

3. Decoder

The decoder is composed of 2 LSTMs which predict a mel spectrogram from the encoded input.

The prediction from the previous timestep is passed through a pre-net with 2 fully-connected layers, which bottlenecks the previous timestep’s information. The output of the pre-net is concatenated with the output of the attention module, and both are fed into the LSTMs.

The output of the LSTMs is projected through a linear transform to predict the spectrogram frame. Finally, these frames are passed through a 5-layer convolutional post-net which refines the spectrogram. The loss function in the decoder is a summed MSE over the predicted spectrograms before and after the post-net (this aids convergence).

In parallel, the concatenation of the decoder LSTM output and the attention context is projected to a scalar that predicts whether generation should terminate (the ‘stop token’).

The network is regularized with dropout (probability 0.5) in the convolutional layers and zoneout (probability 0.1) in the LSTM layers.
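To make the data flow concrete, here is a minimal, simplified PyTorch-style sketch of a single decoder step; the class name, dimensions, and wiring are assumptions based on the paper’s description rather than the official implementation:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One Tacotron 2 decoder step: pre-net -> 2 LSTM cells -> frame + stop projections."""
    def __init__(self, n_mels=80, prenet_dim=256, attn_dim=512, lstm_dim=1024):
        super().__init__()
        # Pre-net: 2 fully-connected layers that bottleneck the previous frame.
        self.prenet = nn.Sequential(
            nn.Linear(n_mels, prenet_dim), nn.ReLU(),
            nn.Linear(prenet_dim, prenet_dim), nn.ReLU(),
        )
        self.lstm1 = nn.LSTMCell(prenet_dim + attn_dim, lstm_dim)
        self.lstm2 = nn.LSTMCell(lstm_dim, lstm_dim)
        # Linear projections for the mel frame and the stop token.
        self.frame_proj = nn.Linear(lstm_dim + attn_dim, n_mels)
        self.stop_proj = nn.Linear(lstm_dim + attn_dim, 1)

    def forward(self, prev_frame, attn_context, states):
        (h1, c1), (h2, c2) = states
        x = torch.cat([self.prenet(prev_frame), attn_context], dim=-1)
        h1, c1 = self.lstm1(x, (h1, c1))
        h2, c2 = self.lstm2(h1, (h2, c2))
        y = torch.cat([h2, attn_context], dim=-1)
        frame = torch.sigmoid(self.stop_proj(y)), self.frame_proj(y)
        stop_prob, mel_frame = frame
        return mel_frame, stop_prob, ((h1, c1), (h2, c2))
```

(The post-net, which refines the full predicted spectrogram with 5 convolutional layers, would run after all decoder steps have finished.)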

WaveNet vocoder

The vocoder is a modified version of the original WaveNet. It uses 30 dilated convolution layers, grouped into 3 dilation cycles (remember, as per my previous blog post, that the dilation grows exponentially within each cycle, and with it the receptive field), plus 2 upsampling layers for the conditioning input.
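A quick back-of-the-envelope sketch (the kernel size is an assumption carried over from the original WaveNet) shows how the dilation pattern and receptive field of such a stack could be computed:

```python
# 30 dilated conv layers in 3 cycles; within a cycle the dilation doubles each layer.
kernel_size = 2                        # assumed causal kernel size, as in the original WaveNet
layers_per_cycle, n_cycles = 10, 3
dilations = [2 ** (i % layers_per_cycle) for i in range(layers_per_cycle * n_cycles)]

# Each layer adds (kernel_size - 1) * dilation samples of context.
receptive_field = 1 + sum((kernel_size - 1) * d for d in dilations)
print(dilations[:10])      # [1, 2, 4, ..., 512], repeated for each cycle
print(receptive_field)     # 3070 samples under these assumptions
```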

Instead of using a softmax, the authors used a 10-component mixture of logistic distributions (MoL) to generate 16-bit samples at 24 kHz.

To compute the logistic mixture distribution, the WaveNet stack output is passed through a ReLU activation followed by a linear projection to predict parameters (mean, log scale, mixture weight) for each mixture component. The loss is computed as the negative log-likelihood of the ground truth sample.
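A minimal sketch of this output head (names and dimensions are assumptions), with 10 components and 3 parameters each:

```python
import torch
import torch.nn as nn

class MoLHead(nn.Module):
    """ReLU + linear projection from the WaveNet stack output to MoL parameters."""
    def __init__(self, stack_channels=512, n_mix=10):
        super().__init__()
        self.proj = nn.Sequential(
            nn.ReLU(),
            nn.Linear(stack_channels, 3 * n_mix),  # mean, log scale, mixture logit per component
        )

    def forward(self, stack_out):                  # (batch, time, stack_channels)
        params = self.proj(stack_out)
        means, log_scales, mix_logits = params.chunk(3, dim=-1)
        # Training would minimize the negative log-likelihood of the ground-truth
        # sample under the discretized mixture of logistics defined by these parameters.
        return means, log_scales, mix_logits
```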

Why mel spectrograms?

The authors give an explanation of why mel spectrograms work well in TTS systems. This explanation is relevant to other papers as well, such as Deep Voice 3.

A mel spectrogram is derived from a linear-frequency spectrogram (the short-time Fourier transform magnitude) by applying a nonlinear transform to the frequency axis of the STFT, which reduces dimensionality. It emphasizes details in low frequencies, which are very important for distinguishing speech, and de-emphasizes details in high frequencies, which are usually dominated by noise and contribute little to a correct understanding of the audio.
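For example, with librosa one could go from a waveform to an 80-channel mel spectrogram roughly as follows; the exact parameter values (FFT size, window, hop, frequency range) are approximations of the paper’s setup, not taken from its code:

```python
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=24000)       # hypothetical audio file

# Linear-frequency spectrogram: STFT magnitude (~50 ms frames, ~12.5 ms hop).
stft = np.abs(librosa.stft(y, n_fft=2048, win_length=1200, hop_length=300))

# Mel spectrogram: an 80-channel mel filterbank compresses the frequency axis.
mel_basis = librosa.filters.mel(sr=sr, n_fft=2048, n_mels=80, fmin=125, fmax=7600)
mel_spectrogram = np.log(np.maximum(mel_basis @ stft, 1e-2))   # log dynamic-range compression
```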

Spectrograms in general, and mel spectrograms in particular, are lossy and discard phase information. However, WaveNet had already been shown to work with more complex conditioning features (where it is harder to extract signal from noise) than a mel spectrogram, and the paper demonstrates empirically that it can generate high-quality audio from mel spectrograms as well.

Experiments

Training

The training dataset is an internal US English dataset recorded by a single female speaker.

To train the spectrogram prediction network, the authors feed the ground-truth frame from the previous timestep into the pre-net instead of the decoder’s own prediction (teacher forcing). This ensures that predicted frames align with the target waveform samples.

Evaluation

In evaluation teacher forcing cannot be used because the ground truth is not known 😉.

The evaluation set consists of 100 examples, which are sent to a human-rating service similar to Amazon’s Mechanical Turk where each sample is rated by at least 8 raters on a scale from 1 to 5, from which a Mean Opinion Score (MOS) is calculated.

The authors showed that using mel spectrograms instead of linear spectrograms or linguistic features was key to improving the model’s performance.

Using mel spectrograms and a WaveNet vocoder vs other approaches

The authors also analyzed the types of errors the model incurs on the test set and found that by far the most frequent mistake is unnatural pitch or emphasis (23/100), concluding that this should be the main area of improvement to focus on.

They also tried the model on news headlines to see how it generalized, and found that it sometimes ran into pronunciation difficulties, showing that end-to-end TTS systems still require training on data similar to what they will be evaluated on (they have difficulty with zero-shot examples).

Ablation Studies

Predicted Features vs Ground Truth

An alternative during training is to train the WaveNet vocoder on the ground-truth mel spectrograms instead of the decoder’s predictions. The authors experimented with both options in training and inference. The results show that the best performance is obtained when the features used for training match those used for inference (the network is trained on the same type of data as it is tested on).

Results when using predicted/ground-truth mel spectrograms in training and inference

Linear Spectrograms

The authors tried three different combinations of spectrograms and vocoders. The results show that the WaveNet vocoder makes a big difference compared to Griffin-Lim, but the type of spectrogram does not. Given these results, the authors decided to use the WaveNet vocoder with mel spectrograms, since that representation is more compact.

MOS for different vocoders (1,025-dimensional linear spectrograms and 80-dimensional mel spectrograms)

Post-Processing Network

Another question the authors asked themselves was “Given that the WaveNet vocoder already has convolutional layers in it, how useful is the post-processing network?”.

(…) we compared our model with and without the post-net, and found that without it, our model only obtains a MOS score of 4.429 ± 0.071, compared to 4.526 ± 0.066 with it, meaning that empirically the post-net is still an important part of the network design.

Simplifying WaveNet

The authors tested different receptive fields within the WaveNet vocoder to see whether a smaller receptive field indeed reduces audio quality. They found that reducing the receptive field by roughly 25x still produced high-quality audio. They believe this is because the mel spectrograms used to condition the vocoder already provide a representation of the waveform that captures long-term dependencies across frames.

Eliminating dilated convolutions entirely significantly reduces audio quality, suggesting that there is a context ‘sweet spot’ the model requires to generate high-quality sound.

MOS for WaveNet vocoders with different parameters
