Day 3: WaveNet: A Generative Model for Raw Audio

Francisco Ingham
A paper a day avoids neuron decay
6 min read · Mar 12, 2019

[Sep 12, 2016] The key to generating human-like speech

Yes, this is a wave because WaveNet

TL;DR

WaveNet is a deep neural network that yields state-of-the-art performance in text-to-speech, and it can model several speakers by conditioning on speaker identity. WaveNet also shows promising performance in music modelling and speech recognition (speech to phonemes).

If it is so SOTA let me hear it

Ok, here you go. Now we can get to the nitty-gritty.

Architecture

Full architecture: convolutional network with causal dilated convolutions and residual layers

Convolutions

The general architecture of WaveNet is not very special. It is similar to PixelCNN: it stacks convolutional layers without pooling layers, and the output has the same dimensionality as the input. However, there is an important difference from PixelCNN that makes WaveNet special: instead of using masking to avoid violating the conditional dependence structure, it uses causal dilated convolutions.

Causal: the prediction p(x_{t+1} | x_1, …, x_t) emitted by the model at timestep t cannot depend on any of the future timesteps x_{t+1}, x_{t+2}, …, x_T.

Causal convolutions: Given a position, a neuron in that position will only depend on input on previous timesteps

The main advantage of causal convolutions over RNNs (which also respect the conditional dependence structure) is that they are faster to compute since there are no recurrent connections. However, causal convolutions have a problem: to get a large receptive field you need either a large number of layers or a large kernel size (in the figure above the receptive field is 5). A large receptive field is important because it allows long-range coherence in the audio. Enter dilated convolutions.

Dilated convolutions: A dilated convolution (also called ‘a trous’, or convolution with holes) is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step.

Causal dilated convolutions: a neuron in a given position depends only on a subset of the neurons in the previous layer, and the receptive field grows exponentially with the number of layers.

Dilated convolutions increase the receptive field by skipping some neurons along the way. Notice that although the number of neurons directly feeding each subsequent neuron stays the same (2), each output value is still affected by all input units (see figure above). With the same compute, the authors achieve a larger receptive field and avoid one of the biggest drawbacks of causal convolutions. Furthermore, the receptive field increases exponentially with the number of layers.
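As a minimal sketch (my own illustration in PyTorch, not the authors' code), a causal dilated convolution can be implemented by left-padding the input before an ordinary dilated nn.Conv1d, and stacking layers with doubling dilations makes the receptive field grow exponentially:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """1-D convolution whose output at time t only sees inputs at times <= t."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        # Pad only on the left so no future samples leak into the prediction
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):              # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))

# Doubling the dilation at every layer (1, 2, 4, ..., 512), as in the paper,
# makes the receptive field grow exponentially with depth:
dilations = [2 ** i for i in range(10)]
receptive_field = 1 + sum(dilations)   # kernel_size = 2
print(receptive_field)                 # 1024 samples for one stack of 10 layers
```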

Softmax

Audio is typically stored as a sequence of 16-bit integer values, which corresponds to 65,536 possible output values per timestep. The authors reduce this to a range of 256 values by applying a µ-law companding transformation:

µ-law transformation: f(x_t) = sign(x_t) · ln(1 + µ|x_t|) / ln(1 + µ)

where −1 < x_t < 1 and µ = 255.
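As a rough illustration (my own sketch, not code from the paper), the companding and 8-bit quantization can be written like this:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand a signal in [-1, 1] and quantize it to mu + 1 = 256 integer levels."""
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)   # bins 0..255

def mu_law_decode(q, mu=255):
    """Invert the quantization and the companding."""
    companded = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu

samples = np.linspace(-1.0, 1.0, 5)
print(mu_law_encode(samples))   # resolution is finer near zero, coarser near the extremes
```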

Gated Activation Units

The authors use the same gated activation unit as PixelCNN:

z = tanh(W_{f,k} ∗ x) ⊙ σ(W_{g,k} ∗ x)

∗: convolution operator, ⊙: element-wise multiplication operator, σ: sigmoid function, k: layer index, f: filter, g: gate, W: learnable convolution filter
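Reusing the CausalDilatedConv1d sketch from above, the gated unit is just two parallel convolutions combined with tanh and sigmoid (again, an illustrative sketch rather than the authors' implementation):

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """z = tanh(W_f * x) ⊙ σ(W_g * x), as used in PixelCNN and WaveNet."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.filter_conv = CausalDilatedConv1d(channels, dilation=dilation)
        self.gate_conv = CausalDilatedConv1d(channels, dilation=dilation)

    def forward(self, x):
        # The sigmoid "gate" decides how much of the tanh "filter" output passes through
        return torch.tanh(self.filter_conv(x)) * torch.sigmoid(self.gate_conv(x))
```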

Residual and Skip Connections

Both residual and parameterised skip connections are used throughout the network, to speed up convergence and enable training of much deeper models.
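Putting the pieces together, one residual layer might look like the sketch below (the 1×1 convolutions and layer sizes are illustrative, building on the classes defined above): the gated output is added back to the input as a residual connection and also projected into a skip path that is summed over all layers before the output softmax.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One WaveNet layer: gated dilated conv + 1x1 convs for the residual and skip paths."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.gated = GatedActivation(channels, dilation)
        self.residual_1x1 = nn.Conv1d(channels, channels, kernel_size=1)
        self.skip_1x1 = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        z = self.gated(x)
        skip = self.skip_1x1(z)                  # collected and summed across all layers
        residual = x + self.residual_1x1(z)      # shortcut that feeds the next layer
        return residual, skip
```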

Conditional WaveNet

The WaveNet architecture can be conditioned on additional inputs to generate audio with specific characteristics (e.g. a different speaker). This can be done in one of two ways: global conditioning or local conditioning.

Global conditioning: characterized by a single latent representation h that influences the output distribution across all timesteps, e.g. a speaker embedding in a TTS model.

Activation function with global conditioning: z = tanh(W_{f,k} ∗ x + V_{f,k}ᵀ h) ⊙ σ(W_{g,k} ∗ x + V_{g,k}ᵀ h), where V_{*,k} is a learnable projection.

Local conditioning: a second time series h_t, possibly with a lower sampling frequency than the audio signal, e.g. linguistic features in a TTS model. This time series is first transformed with a transposed convolutional network (learned upsampling) that maps it to a new time series y = f(h) with the same resolution as the audio signal.

Activation function with local conditioning: z = tanh(W_{f,k} ∗ x + V_{f,k} ∗ y) ⊙ σ(W_{g,k} ∗ x + V_{g,k} ∗ y), where y is the result of mapping h to the same frequency as the audio signal and V_{f,k} ∗ y is a 1×1 convolution.
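Both variants add a conditioning term inside the gated unit. The sketch below (parameter names and sizes are my own assumptions, not the paper's code) broadcasts a single vector h over time for global conditioning, while local conditioning first upsamples the feature series with a transposed convolution:

```python
import torch
import torch.nn as nn

class ConditionedGatedActivation(nn.Module):
    """Gated unit with an extra conditioning term added before tanh and sigmoid."""
    def __init__(self, channels, dilation, cond_channels):
        super().__init__()
        self.filter_conv = CausalDilatedConv1d(channels, dilation=dilation)
        self.gate_conv = CausalDilatedConv1d(channels, dilation=dilation)
        # V_f, V_g: 1x1 convs; with a length-1 input they act as simple projections of h
        self.filter_cond = nn.Conv1d(cond_channels, channels, kernel_size=1)
        self.gate_cond = nn.Conv1d(cond_channels, channels, kernel_size=1)

    def forward(self, x, cond):
        # cond: (batch, cond_channels, 1) for global conditioning (broadcast over time),
        #       (batch, cond_channels, time) for local conditioning (already upsampled)
        f = self.filter_conv(x) + self.filter_cond(cond)
        g = self.gate_conv(x) + self.gate_cond(cond)
        return torch.tanh(f) * torch.sigmoid(g)

# Local conditioning: learned upsampling of e.g. 200 Hz linguistic features to 16 kHz audio
# (an 80x rate change; the kernel and stride values here are purely illustrative)
upsample = nn.ConvTranspose1d(64, 64, kernel_size=160, stride=80)
```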

Context Stacks

Another way the authors suggest increasing the receptive field (the main challenge with this architecture, as you might have inferred) is to use context stacks. A context stack is a separate, smaller network that processes a longer part of the audio signal and locally conditions the main model on it (this gives the model a larger effective receptive field, summarized in the conditioning).

Experiments

The most compelling proof of the success of this algorithm is listening to the examples linked at the beginning of this article. The key metric here is ‘naturalness’, so you can put on your reviewer hat and judge the algorithm yourself 🎼.

Multi-speaker speech generation

A single WaveNet was able to model speech from any of the speakers by conditioning it on a one-hot encoding of the speaker [global conditioning]. This confirms that it is powerful enough to capture the characteristics of all 109 speakers from the dataset in a single model. The authors observed that adding speakers resulted in better validation-set performance compared to training solely on a single speaker, which suggests that WaveNet's internal representation was shared among multiple speakers.

Text-to-speech

WaveNet outperformed the baseline statistical parametric and concatenative speech synthesizers in both languages. The authors found that WaveNet conditioned on linguistic features could synthesize speech samples with natural segmental quality, but it sometimes had unnatural prosody, stressing the wrong words in a sentence. This could be due to the long-term dependency of F0 contours: the receptive field of the WaveNet (240 milliseconds) was not long enough to capture such long-term dependencies. WaveNet conditioned on both linguistic features and F0 (logarithmic fundamental frequency) values [local conditioning] did not have this problem: the external F0 prediction model runs at a lower frequency (200 Hz), so it can learn the long-range dependencies that exist in F0 contours.

WaveNet conditioned on linguistic features and F0 beat the baselines in naturalness.

Music

The authors also used this model to generate music. Although no objective metric was presented, there is a subjective assessment of the coherence of the results. Their major takeaway is that a large receptive field is crucial to get good results. The generated music excerpts were harmonic and aesthetically pleasing, but they lacked long-term coherence.

They also used genre and instrument conditioning (global conditioning) on the models and they report promising results.

Speech recognition

The model achieved a SOTA 18.8 PER on the TIMIT dataset (deriving phonemes from speech). The authors modified the architecture to fit this new task: they added a mean-pooling layer after the dilated causal convolutions to compress the activations, followed by a few regular (non-causal) convolutional layers, and they changed the loss to a hybrid between predicting the next sample (similar to language modelling) and classifying the frame (classification), since this helped generalization.

References:

WaveNet: A Generative Model for Raw Audio; van den Oord et al., Google DeepMind, 2016

Conditional Image Generation with PixelCNN Decoders; van den Oord et al., Google DeepMind, 2016

Image source: Wikipedia
