Understanding WaveNet architecture

WaveNet is a deep autoregressive, generative model that produces human-like voice. Raw audio is fed directly as input to the model, taking speech synthesis to another level.

Satyam Kumar
5 min read · May 20, 2019

WaveNet is a combination of two different ideas: wavelets and neural networks. Raw audio is generally represented as a sequence of 16-bit samples. 16-bit samples produce 2¹⁶ (65,536) quantization values, which would have to be processed through a softmax, making it computationally expensive. Hence each sample is reduced to 8 bits using the μ-law companding transformation, F(x) = sign(x) · ln(1 + μ|x|) / ln(1 + μ), −1 ≤ x ≤ 1, where μ = 255 and x denotes the input sample, and then quantized to 256 values.
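The companding and quantization step above can be sketched in a few lines of NumPy. The function names and the toy waveform are illustrative assumptions, not part of any official WaveNet code:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    # F(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu), for x in [-1, 1]
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def quantize(f, channels=256):
    # map companded values in [-1, 1] onto integer classes 0..255
    return np.clip(((f + 1) / 2 * (channels - 1) + 0.5).astype(int), 0, channels - 1)

audio = np.array([-1.0, -0.01, 0.0, 0.01, 1.0])   # toy waveform in [-1, 1]
codes = quantize(mu_law_encode(audio))
```

Note how the logarithmic curve devotes more of the 256 classes to quiet samples near zero, where the human ear is most sensitive.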

The first step is audio preprocessing, where the input waveform is quantized to a fixed integer range. The integer amplitudes are then one-hot encoded, and these one-hot encoded samples are passed through a causal convolution.
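The one-hot encoding step is straightforward; a minimal sketch, assuming 256 quantization channels:

```python
import numpy as np

def one_hot(codes, channels=256):
    # each 8-bit class index becomes a 256-dimensional indicator vector
    return np.eye(channels)[codes]

encoded = one_hot(np.array([0, 128, 255]))   # shape (3, 256)
```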

Causal Convolution Layer

In signals and systems, a causal system is one whose output depends on past and current inputs but not on future inputs; such a system is practically realizable. In WaveNet, the sample the network produces at time step t depends only on data before t.

This layer is the main part of the architecture, as it enforces the autoregressive property of WaveNet and maintains the ordering of samples.

Causal Convolution

For the training of 1 output sample, 5 input samples are used, so the receptive field of this network is 5.
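A causal convolution can be implemented by left-padding the input, so that output t never sees inputs after t. The sketch below (my own minimal NumPy version, with an assumed kernel size of 2) stacks four such layers and verifies that an impulse reaches exactly 5 outputs, matching the receptive field described above:

```python
import numpy as np

def causal_conv1d(x, kernel):
    # left-pad so that y[t] depends only on x[t], x[t-1], ... (never the future)
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(padded[t:t + k], kernel[::-1]) for t in range(len(x))])

x = np.zeros(8)
x[0] = 1.0                        # unit impulse at t = 0
y = x
for _ in range(4):                # four causal layers with kernel size 2
    y = causal_conv1d(y, np.array([0.5, 0.5]))
# the impulse influences exactly 5 consecutive outputs: receptive field = 5
```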

The samples are denoted as x = (x₁, x₂, …, x_N), and p(·) denotes probability. New samples are generated by predicting the probability of the next sample given all previous samples:

p(x) = ∏ₜ₌₁ᴺ p(xₜ | x₁, …, xₜ₋₁)

The problem with causal convolutions is that they require many layers, or large filters, to increase the receptive field.

Dilated Convolution Layer

Dilated convolution is also referred to as convolution with holes or à trous convolution. In a standard convolution (dilation = 1), the kernel is applied to contiguous samples; with a dilation rate greater than 1, the kernel skips a fixed number of input values between taps. It is equivalent to a convolution with a larger filter in which the original filter is dilated with zeros, increasing the receptive field of the network. Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers, while preserving the input resolution throughout the network as well as computational efficiency. For the training of 1 sample, a total of 16 inputs is required (with dilations 1, 2, 4, 8), compared to 5 in the causal convolution above.
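The exponential growth of the receptive field can be checked with a small NumPy sketch (again an illustrative implementation, not the paper's code): four layers with kernel size 2 and dilations 1, 2, 4, 8 spread an impulse over exactly 16 outputs.

```python
import numpy as np

def dilated_causal_conv1d(x, kernel, dilation):
    # causal convolution whose taps are spaced `dilation` samples apart
    k = len(kernel)
    pad = (k - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    return np.array([sum(kernel[i] * padded[pad + t - i * dilation]
                         for i in range(k)) for t in range(len(x))])

x = np.zeros(32)
x[0] = 1.0                        # unit impulse
y = x
for d in (1, 2, 4, 8):            # dilation doubles at every layer
    y = dilated_causal_conv1d(y, np.array([0.5, 0.5]), d)
# receptive field = 1 + (1 + 2 + 4 + 8) = 16 samples
```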

Dilated Convolution with dilation rate of 2.

Each 1, 2, 4, …, 512 block has a receptive field of size 1024, and can be seen as a more efficient and discriminative (non-linear) counterpart of a 1×1024 convolution.

The model gets stuck where the input is nearly silent, as the model is confused about the next samples to be generated.

Gated Activation Units

The purpose of using gated activation units is to model complex interactions. The gated activation unit is represented by the following equation:

z = tanh(W_f,k ∗ x) ⊙ σ(W_g,k ∗ x)

where ∗ is a convolution operator, ⊙ is an element-wise multiplication operator, σ(·) is the sigmoid activation function, k is the layer index, and W_f,k and W_g,k are the weight matrices of the filter and the gate, respectively.
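A minimal NumPy sketch of this gate (treating the convolutions as small causal 1-D kernels, an assumption for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def causal_conv(x, w):
    # minimal causal convolution: y[t] = sum_i w[i] * x[t - i]
    padded = np.concatenate([np.zeros(len(w) - 1), x])
    return np.array([np.dot(padded[t:t + len(w)], w[::-1]) for t in range(len(x))])

def gated_activation(x, w_f, w_g):
    # z = tanh(W_f * x) element-wise-times sigmoid(W_g * x)
    return np.tanh(causal_conv(x, w_f)) * sigmoid(causal_conv(x, w_g))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
z = gated_activation(x, np.array([0.3, -0.2]), np.array([0.1, 0.4]))
```

The sigmoid branch acts as a soft gate in (0, 1) that scales the tanh "filter" branch, so every output stays bounded.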

Residual block and Skip Connections

The use of residual blocks and skip connections is inspired by the PixelCNN architecture for images. Both residual and parameterized skip connections are used throughout the network to speed up convergence and enable the training of much deeper models.

Overview of residual block and complete architecture.
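The residual block can be sketched as follows. This is a simplification of the figure above: the 1×1 convolutions on the residual and skip paths are reduced to scalars, purely for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def causal_conv(x, w):
    padded = np.concatenate([np.zeros(len(w) - 1), x])
    return np.array([np.dot(padded[t:t + len(w)], w[::-1]) for t in range(len(x))])

def residual_block(x, w_f, w_g, w_res=1.0, w_skip=1.0):
    # gated dilated unit, then two 1x1 convolutions (scalars here):
    # one result is added back to the input (residual path), the
    # other branches off as the skip output summed over all blocks
    z = np.tanh(causal_conv(x, w_f)) * sigmoid(causal_conv(x, w_g))
    return x + w_res * z, w_skip * z

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
out, skip = residual_block(x, np.array([0.3, -0.2]), np.array([0.1, 0.4]))
```

`out` feeds the next block in the stack, while the `skip` outputs of all blocks are summed and post-processed to produce the final softmax logits.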

Global and Local Conditioning

When the WaveNet model is conditioned on auxiliary input features (linguistic or acoustic features), denoted by h (a latent representation of the features), it models the conditional distribution p(x | h):

p(x | h) = ∏ₜ₌₁ᴺ p(xₜ | x₁, …, xₜ₋₁, h)

Conditional Probability on auxiliary input features

By conditioning the model on other input variables, we can guide
WaveNet’s generation to produce audio with the required
characteristics.

The WaveNet model is conditioned in two ways, based on the nature of the input: a) global conditioning, b) local conditioning.

Global conditioning characterizes the identity of the speaker with a single representation h that influences the output distribution across all timesteps. It is represented by the following equation:

z = tanh(W_f,k ∗ x + V_f,kᵀ h) ⊙ σ(W_g,k ∗ x + V_g,kᵀ h)

where V_∗,k is a learnable linear projection, and the vector V_∗,kᵀ h is broadcast over the time dimension.
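In a sketch (with the projection reduced to a dot product for illustration, and a toy one-hot speaker vector), the conditioning term is a single scalar added to every timestep:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def causal_conv(x, w):
    padded = np.concatenate([np.zeros(len(w) - 1), x])
    return np.array([np.dot(padded[t:t + len(w)], w[::-1]) for t in range(len(x))])

def global_conditioned_gate(x, h, w_f, w_g, v_f, v_g):
    # the projected conditioning term v . h is a scalar here,
    # broadcast over every timestep of the audio signal x
    f = causal_conv(x, w_f) + np.dot(v_f, h)
    g = causal_conv(x, w_g) + np.dot(v_g, h)
    return np.tanh(f) * sigmoid(g)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
h = np.array([1.0, 0.0, 0.0])     # toy one-hot speaker embedding
z = global_conditioned_gate(x, h, np.array([0.3, -0.2]), np.array([0.1, 0.4]),
                            np.array([0.5, 0.0, 0.0]), np.array([0.2, 0.0, 0.0]))
```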

Local features of speech represent the context of utterances and the speaking style of a speaker. Since wavelets also capture local features of a signal, local conditioning is a must. Here the conditioning input is a second time series h, typically with a lower sampling frequency than the audio (for example, linguistic features in a TTS model); it is first mapped to the audio sample rate, either by upsampling with a transposed convolution or by repeated sampling. Local conditioning is represented by the following equation:

z = tanh(W_f,k ∗ x + V_f,k ∗ y) ⊙ σ(W_g,k ∗ x + V_g,k ∗ y)

where y = f(h) is the upsampled feature sequence and V_f,k ∗ y is a 1×1 convolution.

Local conditioning for upsampling using transposed convolution
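The simpler of the two upsampling options, repeated sampling, just stretches each frame-rate feature vector across its hop of audio samples. A toy sketch (frame count, feature dimension, and hop size are all assumed values):

```python
import numpy as np

# frame-rate features: 2 linguistic frames, 3 feature dimensions
h = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])
hop = 4                               # assumed audio samples per frame
y = np.repeat(h, hop, axis=0)         # stretched to the audio sample rate
```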

If no such conditioning is provided to the network, the model produces a gibberish voice.

Softmax Distributions

One approach to modeling the conditional distributions p(xₜ | x₁, …, xₜ₋₁) over the individual audio samples would be to use a mixture model.

The reason WaveNet uses a softmax distribution instead is that a categorical distribution is more flexible and can more easily model arbitrary distributions, because it makes no assumptions (no prior) about their shape.
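At generation time, the 256 network outputs are turned into a categorical distribution and the next sample class is drawn from it. A sketch with random logits standing in for the network's output:

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the 256 quantization classes
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
logits = rng.standard_normal(256)     # stand-in for the network's output
p = softmax(logits)
sample = rng.choice(256, p=p)         # draw the next 8-bit sample class
```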

The generated samples are later converted back into audio using the μ-law expansion transformation, which is the inverse of the μ-law companding transformation.
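The expansion step inverts the earlier companding: map the class index back into [−1, 1], then apply the inverse of F. A minimal sketch, with the same illustrative naming as before:

```python
import numpy as np

def mu_law_decode(code, mu=255, channels=256):
    # map the class index back to [-1, 1], then invert the companding:
    # x = sign(y) * ((1 + mu)^|y| - 1) / mu
    y = 2.0 * code / (channels - 1) - 1.0
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

decoded = mu_law_decode(np.array([0, 128, 255]))
```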

For further reading and a clearer understanding of WaveNet, refer to this link (http://tonywangx.github.io/pdfs/wavenet.pdf).
