Baidu Deep Voice explained: Part 1 — the Inference Pipeline

This post is the first in what I hope will be a series covering recently published ML/AI papers that I think are particularly important. Some of the ideas in these papers are fairly intuitive, and I hope to communicate some of that intuition in this format. For the first paper, I’ll be covering Baidu’s Deep Voice paper, which applies Deep Learning to text-to-speech systems. Read on!

Baidu’s Deep Voice

Arxiv Link: https://arxiv.org/abs/1702.07825

Institution: Baidu Research

Recently, Andrew Ng’s Baidu AI Team released an impressive paper on a new Deep Learning based system for converting text to speech. An example of the speech that Baidu’s system is able to produce is shown below. The results speak for themselves:

Baidu’s text to speech result. Source: http://research.baidu.com/deep-voice-production-quality-text-speech-system-constructed-entirely-deep-neural-networks/

Clearly, Baidu’s results sound natural and human compared to MacOS’ production TTS system. The above comes with one big caveat, though: the Baidu sample had the chance to train on a ground-truth recording of someone saying that sentence, which gives it a far more human-like quality. Additionally, the Baidu sample has access to frequency and duration data as well.

But beyond just the quality of the output, there are a few key ways in which this paper has broken new ground in the speech world:

  1. Deep Voice uses Deep Learning for all pieces of the text to speech pipeline.

Previous TTS (Text to Speech) systems used Deep Learning for individual components of the pipeline, but before this paper no work had gone so far as to replace all of the major components with neural networks.

2. It requires very little feature engineering and is hence easy to apply to different datasets.

Through using Deep Learning, the authors are able to avoid a large amount of feature processing and engineering compared to traditional pipelines. This makes Deep Voice far more general and applicable to different problem domains. In fact, Deep Voice can be retuned in a matter of a few hours as compared to weeks in traditional systems, as described by the authors of the paper below:

In a conventional TTS system [retraining] requires days to weeks of tuning, whereas Deep Voice allows you to do it in only a few hours of manual effort and the time it takes models to train.

3. It is extremely fast compared to the state of the art and is designed to be used in production systems.

The authors of this paper claim speedups of up to 400x over existing implementations of WaveNet, DeepMind’s seminal model for human-like audio synthesis. In particular, they write:

We focus on creating a production-ready system, which requires that our models run in real-time for inference. Deep Voice can synthesize audio in fractions of a second, and offers a tunable trade-off between synthesis speed and audio quality. In contrast, previous results with WaveNet require several minutes of runtime to synthesize one second of audio.

Background Material

So clearly there’s a lot to be excited about! But how does it all work? In the rest of this post, I’ll attempt to go over the different pieces of Deep Voice at a high level and how they fit together. The following are a few useful prerequisites that will help you follow along:

  • Adam Coates’ lecture (watch from 3:49) on applying Deep Learning in Speech at Baidu. Dr. Coates is one of the authors of Deep Voice and worked with Andrew Ng at Stanford previously.

Now that you’ve taken a look at some of the background material, it’s time to dive into how Deep Voice works! The rest of the blog post will adhere to the following structure:

  1. First, we’ll look at how Deep Voice takes an example sentence and converts it into speech at a high level. This is known as the inference pipeline.
  2. We’ll then break down the inference pipeline into smaller pieces and understand the role of each piece.
  3. In a future blog post, we’ll cover how we actually train these separate pieces and what the actual training data and labels look like.
  4. Finally, in another blog post we’ll look into the Deep Learning architectures used to implement these different components.

The Inference Pipeline — Converting New Text to Speech

Let’s now see at a high level how Deep Voice takes a simple sentence and converts it into audio that we can hear.

The pipeline, as we will see, will have the following architecture:

The inference pipeline of Deep Voice. Source: https://arxiv.org/pdf/1702.07825.pdf

Let’s now go through this pipeline step by step to get an understanding of what these pieces are and how they fit together. In particular, we’ll trace the following phrase and see how it is processed by Deep Voice:

It was early spring.

Step 1: Convert Graphemes (Text) to Phonemes

Languages such as English are peculiar in that their spelling isn’t phonetic. For instance, take the following words (adapted from here) that all end in “ough”:

1. though (like o in go)

2. through (like oo in too)

3. cough (like off in offer)

4. rough (like uff in suffer)

Notice how they all have fairly different pronunciations even though they share the same spelling pattern. If our TTS system used spelling as its main input, it would inevitably run into problems trying to reconcile why “though” and “rough” should be pronounced so differently, even though they have the same ending. As such, we need a slightly different representation of words, one that reveals more information about their pronunciation.

This is exactly what phonemes are. Phonemes are the different units of sound that we make. Combining them together, we can recreate the pronunciation for almost any word. Here are a few examples of words broken into phonemes (adapted from CMU’s phoneme dictionary):

  • White Room — [W, AY1, T, ., R, UW1, M, .]
  • Crossroads — [K, R, AO1, S, R, OW2, D, Z, .]

The digits next to the vowel phonemes mark how much stress the sound receives: 1 is primary stress, 2 is secondary stress, and 0 is no stress. Additionally, periods represent empty space in the pronunciation.
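
To make the notation concrete, here is a minimal Python sketch that splits a symbol such as AY1 into its base phoneme and its stress digit. The helper name `split_stress` and the `STRESS_NAMES` table are made up for illustration; the digit meanings follow the CMU dictionary convention (0 for no stress, 1 for primary, 2 for secondary).

```python
# Illustrative only: `split_stress` and `STRESS_NAMES` are hypothetical helpers,
# not part of Deep Voice or of any CMU tooling.
STRESS_NAMES = {0: "no stress", 1: "primary stress", 2: "secondary stress"}


def split_stress(symbol):
    """Split a symbol like 'AY1' into ('AY', 1); symbols without a digit get None."""
    if symbol and symbol[-1].isdigit():
        return symbol[:-1], int(symbol[-1])
    return symbol, None


# The "Crossroads" example from above.
for symbol in ["K", "R", "AO1", "S", "R", "OW2", "D", "Z", "."]:
    base, stress = split_stress(symbol)
    label = STRESS_NAMES.get(stress, "no stress marker")
    print(f"{symbol:>4} -> base={base!r}, {label}")
```

Only vowels carry a stress digit, which is why consonants such as K and the pause symbol come back with `None`.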

So, the first step in Deep Voice will be to simply convert every sentence into its phoneme representation using a simple phoneme dictionary like this one.

Our Sentence

So, for our first step, Deep Voice will have the following inputs and outputs.

  • Input - “It was early spring”
  • Output - [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]

We’ll cover how we train such a model in the next blog post.
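
As a rough illustration of the dictionary-lookup idea, the sketch below maps our sentence to phonemes in Python. The four-entry `PRONUNCIATIONS` table is a hand-written stand-in for a real phoneme dictionary, and `text_to_phonemes` is a hypothetical name; a real system would also need some way to handle words that are missing from the dictionary, which this sketch simply rejects.

```python
# Hand-written stand-in for a pronunciation dictionary (assumption: only the
# four words of the example sentence are covered).
PRONUNCIATIONS = {
    "it": ["IH1", "T"],
    "was": ["W", "AA1", "Z"],
    "early": ["ER1", "L", "IY0"],
    "spring": ["S", "P", "R", "IH1", "NG"],
}
PAUSE = "."  # the post uses a period for empty space between words


def text_to_phonemes(sentence):
    """Look up each word and join the results, adding a pause after every word."""
    phonemes = []
    for raw_word in sentence.lower().split():
        word = raw_word.strip(".,!?")
        if not word:
            continue  # the token was punctuation only
        if word not in PRONUNCIATIONS:
            raise KeyError(f"no pronunciation known for {word!r}")
        phonemes.extend(PRONUNCIATIONS[word])
        phonemes.append(PAUSE)
    return phonemes


print(text_to_phonemes("It was early spring."))
```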

Step 2, Part 1: Duration Prediction

Now that we have the phonemes, we need to estimate how long to hold each one while speaking. This is again an interesting problem, as the same phoneme should be held for a longer or shorter duration depending on its context. Take the following examples surrounding the phonemes “AH N”:

  • Unforgettable
  • Fun

Clearly “AH N” needs to be held out far longer in the first case than the second and we can train a system to do just that. In particular, we’ll take each phoneme and predict how long we’ll hold it for (in seconds).

Our Sentence

Here’s what will happen to our example sentence at this step:

  • Input - [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
  • Output - [IH1 (0.1s), T (0.05s), . (0.01s), … ]
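
One way to read these durations is as "how much audio to generate" for each phoneme. The Python sketch below converts the three example durations above into sample counts. The 16 kHz sample rate and the name `to_sample_counts` are assumptions made for illustration, and the durations are the illustrative values from the bullet above, not predictions from a trained model.

```python
SAMPLE_RATE_HZ = 16_000  # assumed sample rate, chosen only for this illustration

# Illustrative (phoneme, seconds) pairs copied from the example output above;
# the rest of the sentence is elided there, so it is left out here too.
PHONEME_DURATIONS = [("IH1", 0.1), ("T", 0.05), (".", 0.01)]


def to_sample_counts(durations, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert per-phoneme durations in seconds into numbers of audio samples."""
    return [(phoneme, round(seconds * sample_rate_hz)) for phoneme, seconds in durations]


counts = to_sample_counts(PHONEME_DURATIONS)
total_samples = sum(n for _, n in counts)
print(counts)
print(f"{total_samples} samples = {total_samples / SAMPLE_RATE_HZ:.2f} s of audio")
```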

Step 2, Part 2: Fundamental Frequency Prediction

The fundamental frequency (the blue line) is the lowest frequency the vocal cords produce during a voiced phoneme; we perceive it as the pitch of the sound. We’ll aim to predict this for each phoneme.

We’ll also want to predict the tone and intonation of each phoneme to make it sound as human as possible. This is especially important in languages like Mandarin, where the same sound can have an entirely different meaning based on the tone and accent. Predicting the fundamental frequency of each phoneme helps us do just this. The frequency tells the system approximately what pitch or tone the phoneme should be pronounced at.

Additionally, some phonemes aren’t meant to be voiced at all. This means that they are pronounced without any vibrations of vocal cords.

As an example, say the sounds “ssss” and “zzzz” and notice how the former causes no vibrations in your vocal cords (is unvoiced) while the latter does (is voiced).

Our fundamental frequency prediction will also take this into account and predict when a phoneme should be voiced and when it should not.
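
To build some intuition for what "fundamental frequency" means, the Python sketch below creates a toy voiced sound (a 140 Hz tone plus two weaker harmonics) and recovers its fundamental with a simple autocorrelation search. This only illustrates the concept: Deep Voice predicts the frequency from phonemes rather than measuring it from audio, and the sample rate, search range, and function names here are all assumptions.

```python
import math

SAMPLE_RATE_HZ = 16_000  # assumed sample rate for this illustration


def make_voiced_signal(f0_hz, seconds=0.05):
    """Toy voiced sound: the fundamental plus two weaker harmonics."""
    n = int(SAMPLE_RATE_HZ * seconds)
    return [
        sum(math.sin(2 * math.pi * f0_hz * h * i / SAMPLE_RATE_HZ) / h for h in (1, 2, 3))
        for i in range(n)
    ]


def estimate_f0(signal, fmin_hz=60.0, fmax_hz=400.0):
    """Pick the lag at which the signal best matches a shifted copy of itself."""
    min_lag = int(SAMPLE_RATE_HZ / fmax_hz)
    max_lag = int(SAMPLE_RATE_HZ / fmin_hz)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return SAMPLE_RATE_HZ / best_lag


estimate = estimate_f0(make_voiced_signal(140.0))
print(f"estimated fundamental frequency: {estimate:.1f} Hz")
```

Even though the signal also contains energy at 280 Hz and 420 Hz, the estimate lands close to 140 Hz, the lowest of the three, which is the frequency we hear as the pitch.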

Our Sentence

Here’s what will happen to our example sentence at this step:

  • Input - [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
  • Output - [IH1 (140 Hz), T (142 Hz), . (Not voiced), …]

Step 3: Audio Synthesis

In the final step, we’ll combine phonemes, durations, and the fundamental frequencies (F0 profile) to create real audio.

The final step in creating speech is to bring together the phonemes, the durations, and the frequencies to output sound. Deep Voice achieves this step using a modified version of DeepMind’s WaveNet. I highly encourage you to read their original blog post to get a sense of the underlying architecture of WaveNet.

The original WaveNet from DeepMind can have exponentially many different inputs contribute to a single output. Notice the exponential tree structure outlined above. Source: https://deepmind.com/blog/wavenet-generative-model-raw-audio/
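
The exponential growth comes from stacking convolutions whose dilation doubles at every layer. The Python sketch below counts how many past samples can influence one output sample for a given stack. The function name `receptive_field` is made up, and the kernel size of 2 and the ten-layer stack are illustrative settings rather than Deep Voice's exact configuration.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


doubling = [2 ** i for i in range(10)]  # dilations 1, 2, 4, ..., 512
undilated = [1] * 10                    # same depth, no dilation

print(receptive_field(doubling))   # 1024
print(receptive_field(undilated))  # 11
```

With the same ten layers, doubling the dilation lets a single output see 1024 input samples instead of 11.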

At a high level, WaveNet generates raw waveforms allowing you to create all types of sound including different accents, emotions, breaths, and other basic parts of human speech. Additionally, WaveNet can even take this one step further to generate music.

In this paper, the Baidu team modifies WaveNet by optimizing its implementation, especially for high frequency inputs. As such, where WaveNet required minutes to generate a second of new audio, Baidu’s modified WaveNet can require as little as a fraction of a second, as described by the authors of Deep Voice here:

Deep Voice can synthesize audio in fractions of a second, and offers a tunable trade-off between synthesis speed and audio quality. In contrast, previous results with WaveNet require several minutes of runtime to synthesize one second of audio.
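
To put "real-time" into numbers, the Python sketch below computes the time budget per audio sample and a real-time factor (synthesis time divided by audio length). The 16 kHz sample rate and the two timings (two minutes and half a second per second of audio) are hypothetical values chosen to mirror the wording of the quote, not measurements from the paper.

```python
def per_sample_budget_us(sample_rate_hz):
    """Microseconds available per sample if audio must be produced in real time."""
    return 1_000_000 / sample_rate_hz


def real_time_factor(synthesis_seconds, audio_seconds):
    """Values below 1.0 mean synthesis is faster than playback."""
    return synthesis_seconds / audio_seconds


SAMPLE_RATE_HZ = 16_000  # assumed sample rate

print(per_sample_budget_us(SAMPLE_RATE_HZ))  # microseconds per sample
print(real_time_factor(120.0, 1.0))          # hypothetical "minutes per second" case
print(real_time_factor(0.5, 1.0))            # hypothetical "fraction of a second" case
```

Because a WaveNet-style model generates audio one sample at a time, every sample has to fit inside that per-sample budget for the system to keep up with playback.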

Our Sentence

Here are the inputs and outputs at this final step of Deep Voice’s pipeline!

  • Input - [IH1 (140 Hz, 0.5s), T (142 Hz, 0.1s), . (Not voiced, 0.2s), W (140 Hz, 0.3s),…]
  • Output - see below.

Baidu’s text to speech result. Source: http://research.baidu.com/deep-voice-production-quality-text-speech-system-constructed-entirely-deep-neural-networks/
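
To make the shape of this step's input and output concrete, here is a deliberately crude Python sketch that turns the annotated phonemes above into a list of waveform samples. It plays a plain sine tone at each phoneme's frequency and silence for unvoiced entries, so it is in no way a substitute for WaveNet: the function `toy_synthesize`, the 16 kHz sample rate, and the sine-wave "voice" are all assumptions made for illustration.

```python
import math

SAMPLE_RATE_HZ = 16_000  # assumed sample rate

# (phoneme, frequency in Hz or None if not voiced, duration in seconds),
# copied from the illustrative input above; the rest of the sentence is elided.
ANNOTATED = [("IH1", 140.0, 0.5), ("T", 142.0, 0.1), (".", None, 0.2), ("W", 140.0, 0.3)]


def toy_synthesize(annotated_phonemes, sample_rate_hz=SAMPLE_RATE_HZ):
    """Sine tone for voiced phonemes, silence otherwise (a stand-in, not WaveNet)."""
    samples = []
    phase = 0.0  # carried across phonemes so the tone has no jumps
    for _phoneme, f0_hz, seconds in annotated_phonemes:
        for _ in range(round(seconds * sample_rate_hz)):
            if f0_hz is None:
                samples.append(0.0)
            else:
                phase += 2 * math.pi * f0_hz / sample_rate_hz
                samples.append(0.5 * math.sin(phase))
    return samples


audio = toy_synthesize(ANNOTATED)
print(f"{len(audio)} samples = {len(audio) / SAMPLE_RATE_HZ:.1f} s of audio")
```

The result sounds nothing like speech, but it shows the contract of this stage: phonemes, durations, and frequencies go in, and a sequence of raw audio samples comes out. The samples could be written to a WAV file with Python's standard `wave` module to listen to them.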

Summary

And that’s it! With these 3 steps, we’ve seen how Deep Voice takes in a simple piece of text and produces its audio representation. Here’s the summary of the steps once more:

1. Convert text into phonemes.

  • “It was early spring” -> [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]

2. Predict the durations and frequencies of each phoneme.

  • [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .] -> [IH1 (140 Hz, 0.5s), T (142 Hz, 0.1s), . (Not voiced, 0.2s), W (140 Hz, 0.3s),…]

3. Combine the phonemes, the durations, and the frequencies to output a sound wave that represents the text.

  • [IH1 (140 Hz, 0.5s), T (142 Hz, 0.1s), . (Not voiced, 0.2s), W (140 Hz, 0.3s),…] -> Audio

But how do we actually train Deep Voice to be able to carry out the above steps? How does Deep Voice leverage Deep Learning to achieve its goals?

In the next blog post, we’ll cover how each piece of Deep Voice is trained and provide more intuition behind the underlying neural networks. Read it here: