Day 6: Deep Voice 3: Scaling Text to Speech with Convolutional Sequence Learning

Francisco Ingham
A paper a day avoids neuron decay
8 min read · Mar 19, 2019

[Oct 20, 2017] Real-time, high quality, multi-speaker text-to-speech inference

No matter how many layers we stack, we know who holds state of the art in TTS ❤

TL;DR

High-naturalness text-to-speech synthesis in a fraction of the training time, using a fully-convolutional, attention-based architecture. The authors train on large datasets and describe how to serve 10 million queries a day of high-naturalness text-to-speech inference on a single GPU server.

If it is so natural I want to hear it

Suit yourself. Now if you are interested in the how as well, let’s dive into the architecture.

Model Architecture

Deep Voice 3 architecture: attention-based sequence-to-sequence model

Deep Voice 3 encodes text into per-timestep key and value vectors for an attention-based decoder. The decoder uses these vectors to predict the mel-spectrograms that correspond to the audio output. Finally, a converter predicts the vocoder parameters for waveform synthesis.

Encoder: Fully-convolutional encoder which converts textual features to an internal representation.

Decoder: Fully-convolutional decoder, which decodes the learned representation using attention into a low-dimensional mel-spectrogram in an autoregressive manner.

Converter: Fully-convolutional post-processing network which predicts vocoder parameters from decoder hidden states.
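To make the division of labor concrete, here is a minimal PyTorch-style skeleton of how the three components could fit together. The class and argument names are my own illustration, not the official implementation.

```python
import torch.nn as nn

class DeepVoice3Sketch(nn.Module):
    """Structural sketch only: encoder -> attention-based decoder -> converter."""

    def __init__(self, encoder, decoder, converter):
        super().__init__()
        self.encoder = encoder      # text -> (keys, values)
        self.decoder = decoder      # (keys, values, past mel frames) -> mel frames
        self.converter = converter  # decoder hidden states -> vocoder parameters

    def forward(self, text_tokens, mel_targets):
        keys, values = self.encoder(text_tokens)
        # Teacher-forced decoding: the decoder conditions on ground-truth past
        # mel frames during training (autoregressive at inference time).
        mel_out, done_flags, hidden = self.decoder(keys, values, mel_targets)
        vocoder_params = self.converter(hidden)
        return mel_out, done_flags, vocoder_params
```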

Text pre-processing

The authors apply a few normalization steps to the text to improve the quality of the audio prediction (sketched in code after the list):

  1. Uppercase all letters.
  2. Remove punctuation marks.
  3. End every utterance with a period or a question mark.
  4. Replace spaces between words with separator characters that represent the time between words uttered by the speaker.
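A rough Python sketch of these steps; the separator character and the exact punctuation rules are illustrative assumptions, not the authors' exact implementation.

```python
import re

def normalize_text(utterance, separator="%"):
    """Rough sketch of the preprocessing steps described above.
    The separator character and exact rules are illustrative."""
    text = utterance.upper()                  # 1. uppercase all letters
    text = re.sub(r"[^\w\s.?]", "", text)     # 2. drop punctuation (keep . and ?)
    if not text.endswith((".", "?")):         # 3. end with a period or question mark
        text = text + "."
    # 4. replace spaces with a separator; a single character stands in here for
    #    the duration-dependent separators described in the paper.
    text = text.replace(" ", separator)
    return text

print(normalize_text("Hello there, world"))   # -> "HELLO%THERE%WORLD."
```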

Joint representation of characters and phonemes

TTS systems should always include some way to correct common pronunciation mistakes. This is usually done by training on a text-to-phoneme dataset.

The conventional model converts characters directly to acoustic features and thus learns an implicit grapheme-to-phoneme model. To avoid learning wrong representations for certain characters, the authors trained phoneme-only models and mixed character-and-phoneme models. In the mixed character-and-phoneme model, every word is replaced with its phoneme representation with some fixed probability.
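A minimal sketch of that mixed sampling; the pronunciation dictionary lookup and the probability value are assumptions for illustration.

```python
import random

def mixed_representation(words, phoneme_dict, p_phoneme=0.5):
    """Replace each word with its phoneme sequence with fixed probability
    p_phoneme (value chosen here for illustration); words missing from the
    dictionary stay as character sequences."""
    output = []
    for word in words:
        if word in phoneme_dict and random.random() < p_phoneme:
            output.append(phoneme_dict[word])   # e.g. ["HH", "AH0", "L", "OW1"]
        else:
            output.append(list(word))           # fall back to characters
    return output

cmu_like = {"hello": ["HH", "AH0", "L", "OW1"]}   # toy pronunciation dictionary
print(mixed_representation(["hello", "world"], cmu_like))
```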

Convolution Blocks for Sequential Processing

Convolutional block in detail

By providing a sufficiently large receptive field, stacked convolutional layers can utilize long-term context information in sequences without introducing any sequential dependency in computation.

The block consists of a 1-D convolution, a gated linear unit (the convolution output is split in two, with one half gating the other) and a scaling factor of √0.5 on the residual connection. A speaker-dependent embedding is added as a bias after a softsign function to account for speaker differences. The convolutions in the encoder and converter are non-causal, while the ones in the decoder are causal. Inputs are padded with k-1 timesteps of zeros on the left for causal convolutions and (k-1)/2 timesteps of zeros on the left and right for non-causal convolutions, where k is an odd convolution filter width. Dropout is applied before the convolution. Convolution weights are initialized so that activations have zero mean and unit variance.
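Below is a hedged PyTorch sketch of such a block, assuming the convolution doubles the channel count so its output can be split for the gated linear unit; hyperparameters and naming are illustrative, not the authors' code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """Sketch of the gated convolution block (hyperparameters are illustrative)."""

    def __init__(self, channels, kernel_size, speaker_dim=None, causal=False, dropout=0.05):
        super().__init__()
        self.causal = causal
        self.kernel_size = kernel_size
        self.dropout = dropout
        # Project to 2*channels so the output can be split for the gated linear unit.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size)
        self.speaker_proj = (
            nn.Linear(speaker_dim, channels) if speaker_dim is not None else None
        )

    def forward(self, x, speaker_embed=None):
        # x: (batch, channels, time)
        residual = x
        x = F.dropout(x, self.dropout, self.training)   # dropout before the convolution
        pad = self.kernel_size - 1
        if self.causal:
            x = F.pad(x, (pad, 0))                      # k-1 zeros on the left
        else:
            x = F.pad(x, (pad // 2, pad // 2))          # (k-1)/2 zeros on each side
        x = self.conv(x)
        a, b = x.chunk(2, dim=1)                        # split for the gated linear unit
        if self.speaker_proj is not None and speaker_embed is not None:
            # Speaker-dependent bias, added after a softsign non-linearity.
            a = a + F.softsign(self.speaker_proj(speaker_embed)).unsqueeze(-1)
        x = a * torch.sigmoid(b)                        # gated linear unit
        return (x + residual) * math.sqrt(0.5)          # residual + sqrt(0.5) scaling
```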

Encoder

The text embedding is first projected via a fully-connected layer from the embedding dimension to a target dimensionality, then processed through a stack of convolution blocks, and finally projected back to the embedding dimension to create the attention key vectors. The attention value vectors are computed from a combination of the key vectors and the text embedding (through a skip connection).
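A sketch of that pipeline, reusing the ConvBlock sketch from above; dimensions and names are assumptions.

```python
import math
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Sketch of the encoder: embedding -> projection -> conv blocks -> keys,
    with values formed from the keys plus the embedding via a skip connection."""

    def __init__(self, vocab_size, embed_dim, conv_channels, conv_blocks):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.pre_proj = nn.Linear(embed_dim, conv_channels)
        self.blocks = nn.ModuleList(conv_blocks)        # e.g. a stack of ConvBlock
        self.post_proj = nn.Linear(conv_channels, embed_dim)

    def forward(self, text_tokens, speaker_embed=None):
        emb = self.embedding(text_tokens)               # (batch, time, embed_dim)
        x = self.pre_proj(emb).transpose(1, 2)          # (batch, channels, time)
        for block in self.blocks:
            x = block(x, speaker_embed)
        keys = self.post_proj(x.transpose(1, 2))        # back to the embedding dim
        values = (keys + emb) * math.sqrt(0.5)          # skip connection from embedding
        return keys, values
```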

Decoder

The decoder generates audio in an autoregressive manner by predicting a group of r future audio frames conditioned on the past audio frames. Since the decoder is autoregressive, it must use causal convolution blocks.

The decoder uses mel-band log-magnitude spectrograms as the audio frame representation.

It starts with multiple fully-connected layers that process the input mel-spectrograms (the PreNet in the image; dropout is applied to all but the first). These are followed by causal convolution blocks and attention blocks: each convolution block produces the queries for the attention block that follows, which attends over the encoder's output (this is how past audio frames interact with the encoded text information). Finally, a fully-connected layer outputs the next r frames and a binary flag specifying whether the last frame has been reached.

An L1 loss is computed on the output mel-spectrogram and a binary cross-entropy loss on the final-frame prediction.
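A minimal sketch of the combined decoder loss; the unweighted sum of the two terms is an assumption for illustration.

```python
import torch.nn.functional as F

def decoder_loss(mel_pred, mel_target, done_pred, done_target):
    """L1 on the predicted mel-spectrogram frames plus binary cross-entropy
    on the 'final frame' flag (done_pred is assumed to be a logit)."""
    l1 = F.l1_loss(mel_pred, mel_target)
    done = F.binary_cross_entropy_with_logits(done_pred, done_target)
    return l1 + done
```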

Attention Block

The attention block uses dot-product attention (a minimal sketch in code follows the figures below). For more info on how attention works, check out this post.

Dot-product attention
Attention block
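Here is a minimal sketch of plain dot-product attention; in Deep Voice 3 the queries and keys additionally carry the positional encodings described next.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(queries, keys, values):
    """Generic dot-product attention over encoder timesteps.
    queries: (batch, T_dec, d), keys/values: (batch, T_enc, d)."""
    scores = torch.bmm(queries, keys.transpose(1, 2))   # (batch, T_dec, T_enc)
    weights = F.softmax(scores, dim=-1)                 # attention distribution
    context = torch.bmm(weights, values)                # (batch, T_dec, d)
    return context, weights
```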

The authors made some design decisions on the attention block to optimize performance:

  1. Positional encodings

The authors use positional encodings to nudge the attention distribution toward a roughly diagonal alignment (these sine and cosine functions form an orthonormal basis). The encodings are added to the keys and queries in every attention block.

The positional encoding is h_p(i) = sin(w_s · i / 10000^(k/d)) for even timestep indices i and h_p(i) = cos(w_s · i / 10000^(k/d)) for odd i, where i is the timestep index, k is the channel index in the positional encoding, d is the total number of channels in the positional encoding, and w_s is the position rate (1) of the encoding.
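A small NumPy sketch of this encoding, following the even/odd split above; the function name and arguments are my own.

```python
import numpy as np

def positional_encoding(num_timesteps, num_channels, position_rate=1.0):
    """Sketch of the positional encoding: sin for even timestep indices i,
    cos for odd i, with channel index k, d channels and position rate w_s."""
    i = np.arange(num_timesteps)[:, None]    # timestep index
    k = np.arange(num_channels)[None, :]     # channel index
    angles = position_rate * i / np.power(10000.0, k / num_channels)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))  # (T, d)
```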

  2. Monotonic attention

What is monotonic attention?

(…) at a given output timestep the attention probability mass almost never falls before where it was at a previous output timestep.

In other words, it stems from the intuition that, when computing attention, the sounds in the audio output should concentrate on the character(s) they represent, and that, as we progress through the audio signal, the position of these characters should only move forward. As an example, look at the following image, where the heat map over the audio signal advances progressively through the input text.

Monotonic alignment example with input text on the y axis and audio spectrogram on the x axis. Source: Colin Raffel

To be production-ready, a TTS service needs to avoid repeated or missed words, and monotonic attention helps with exactly this problem. In practice, however, this strategy sometimes yields a diffuse attention distribution: attending to more characters than necessary to represent a phoneme negatively affects the quality of the output.

The authors propose an alternative strategy: constrain the attention weights to be monotonic only at inference, by computing the softmax over a fixed window of characters starting at the last attended-to position rather than over all characters. This lets the model train freely and learn how to represent sounds from groups of characters as accurately as possible, and then forces the network to use that knowledge in a monotonic manner at inference. It's like saying to the model:

Learn however you like, as well as you can, but out on the field you will need to follow some rules: no skipping or repeating!

The effect is a clearly monotonic alignment at inference, which would not necessarily happen without this constraint, as we can see below:

Attention alignment for different characters and timesteps. On the left, attention alignment after training without inference constraints, on the right attention alignment after training with inference constraints on the first and third layers
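A sketch of what a single inference step with this windowed softmax could look like; the window size and the bookkeeping of the last attended position are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def monotonic_attention_step(query, keys, values, last_position, window=3):
    """Inference-time sketch: restrict the softmax to a fixed window of encoder
    positions starting at the previously attended position.
    query: (batch, d), keys/values: (batch, T_enc, d)."""
    scores = torch.bmm(query.unsqueeze(1), keys.transpose(1, 2)).squeeze(1)  # (batch, T_enc)
    mask = torch.full_like(scores, float("-inf"))
    for b in range(scores.size(0)):
        start = int(last_position[b])
        mask[b, start:start + window] = 0.0       # only this window is attendable
    weights = F.softmax(scores + mask, dim=-1)
    context = torch.bmm(weights.unsqueeze(1), values).squeeze(1)             # (batch, d)
    new_position = weights.argmax(dim=-1)         # becomes last_position next step
    return context, weights, new_position
```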

Converter

The converter network takes as inputs the activations from the last hidden layer of the decoder, applies several non-causal convolution blocks, and then predicts parameters for downstream vocoders. Unlike the decoder, the converter is non-causal and non-autoregressive, so it can use future context from the decoder to predict its outputs.

The converter’s loss function depends on the vocoder used:

Griffin-Lim: converts spectrograms to time-domain audio waveforms. An L1 loss is used for the prediction of linear-scale log-magnitude spectrograms.

WORLD: predicts four kinds of values: a boolean for whether the frame is voiced, an F0 value (if voiced), the spectral envelope, and aperiodicity parameters. Cross-entropy is used for the boolean and an L1 loss for everything else.

WaveNet: mel-scale log-magnitude spectrograms are fed in as external conditioners. The WaveNet vocoder is trained separately using ground-truth mel-spectrograms and audio waveforms. An L1 loss on mel-scale spectrograms is used at the decoder, and an L1 loss on linear-scale spectrograms is also applied at the converter.
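As an aside, here is what a Griffin-Lim round trip looks like with librosa's off-the-shelf implementation (not the authors' code); the hyperparameters are arbitrary and a synthetic sine wave stands in for a predicted spectrogram.

```python
import numpy as np
import librosa

# Build a stand-in audio signal and its linear-scale magnitude spectrogram,
# then reconstruct a waveform via Griffin-Lim's iterative phase estimation.
sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220 * t)                      # 220 Hz tone
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))    # magnitude spectrogram
y_hat = librosa.griffinlim(S, n_iter=60, hop_length=256)   # reconstructed waveform
```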

Full model as described in the previous sections

Results

Speed

The authors compare their model’s training speed with Tacotron. For single-speaker data, a training iteration (batch size 4) takes 0.06 seconds versus 0.59 seconds for Tacotron (a ~10x speedup). Additionally, DV3 converges after ~500K iterations while Tacotron needs ~2M (~4x fewer). The authors attribute the speedup to the fully-convolutional nature of Deep Voice 3, which takes full advantage of GPU parallelism during training.

Attention error

Attention-based TTS systems may run into errors that harm quality (repeated syllables, mispronounced syllables, missing syllables).

Example phrase to demonstrate common attention-based errors

The first and last of these can be mitigated by constraining the attention function to progress monotonically. The authors tested several approaches on a 100-sentence test set containing particularly challenging words. The joint Phonemes & Characters model with dot-product attention and monotonicity enforced at inference time outperformed the other approaches.

Attention error experiments and results

Naturalness

For single-speaker data, the model is trained on an internal English speech dataset of ~20 hours. The WaveNet vocoder beats the other vocoders in naturalness by a clear margin (3.78 MOS vs 3.63 for WORLD and 3.62 for Griffin-Lim). The WORLD vocoder may still be preferable in production: although it introduces various artifacts, it runs at 40x real time per CPU core, while WaveNet runs at 3x real time.

MOS for different models. DV3 with WaveNet achieves SOTA compared with Tacotron.

Multi-Speaker Synthesis

For multi-speaker synthesis the authors train on VCTK (108 speakers, ~44 hours) and LibriSpeech (2,484 speakers, ~820 hours). For LibriSpeech they apply standard denoising (SoX) and split long utterances at pause locations (using Gentle). The authors demonstrate that DV3 generalizes to multi-speaker datasets, although DV2 still holds the SOTA thanks to its separately optimized phoneme duration and fundamental frequency prediction models.

MOS for different models. DV2 with WaveNet achieves SOTA; the authors speculate that DV3 with a WaveNet vocoder could improve results but would substantially slow down inference.

Deployment

This is a very important section of the paper since the model was developed for real-time, production usage. The authors achieve a throughput of 10M queries a day on a single-GPU server with twenty CPU cores. How do they achieve this?

  1. Use of custom GPU kernels for DV 3 inference.
  2. Launch a single kernel for the whole model (launching individual CUDA kernels for each operation in the graph is impractical since the number of operations is too high).
  3. Each stream (kernel) operates on an utterance and the authors launch as many streams as there are Streaming Multiprocessors on the GPU.
  4. WORLD vocoder synthesis is parallelized across the 20 CPU cores; the bottleneck becomes GPU inference at ~115 QPS (a quick arithmetic check follows below).
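A quick sanity check: the ~115 QPS of GPU-bound inference lines up with the quoted 10M-queries-per-day figure.

```python
# 115 queries per second, sustained over a day:
qps = 115
queries_per_day = qps * 60 * 60 * 24
print(f"{queries_per_day:,} queries/day")   # 9,936,000 ≈ 10M
```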

Notes

(1) The position rate is the average slope of the attention distribution (roughly, the speed of speech).
