Understanding WaveNet architecture
WaveNet is a deep autoregressive, generative model that produces human-like speech. Raw audio is fed directly as input to the model, taking speech synthesis to another level.
WaveNet is a combination of two different ideas: wavelets and neural networks. Raw audio is generally represented as a sequence of 16-bit samples. 16-bit samples produce 2¹⁶ (65,536) quantization values, which would have to be processed through a softmax, making the model computationally expensive. Hence each sample is reduced to 8 bits using the μ-law transformation, F(x) = sign(x) · ln(1 + μ|x|) / ln(1 + μ), −1 ≤ x ≤ 1, where μ = 255 and x denotes the input sample; the transformed values are then quantized to 256 values.
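As a concrete illustration, here is a minimal sketch of the μ-law companding step in Python; the function name mu_law_encode and the NumPy dependency are illustrative assumptions, not something specified in the text.

import numpy as np

def mu_law_encode(x, mu=255):
    # x: waveform samples normalised to the range [-1, 1]
    # Apply the mu-law companding transformation F(x)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)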
The first step is audio preprocessing, in which the input waveform is quantized to a fixed integer range. The integer amplitudes are then one-hot encoded, and these one-hot encoded samples are passed through a causal convolution layer.
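Continuing the sketch, the companded signal can be quantized to 256 integer bins and one-hot encoded; the helper below reuses the mu_law_encode function above and is again an illustrative assumption.

def quantize_and_one_hot(x, channels=256):
    # Map the companded signal from [-1, 1] to integer bins 0..255
    compressed = mu_law_encode(x)
    bins = ((compressed + 1.0) / 2.0 * (channels - 1)).astype(np.int64)
    # One-hot encode each integer amplitude into a 256-dimensional vector
    return np.eye(channels)[bins]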
Causal Convolution Layer
In signals and systems, a causal system is one whose output depends only on past and current inputs, not on future inputs, which makes it practical to implement. In WaveNet, the acoustic output produced by the network at time step t depends only on data up to t.
This layer is the main part of the architecture, as it enforces the autoregressive property of WaveNet and maintains the ordering of samples.
In the illustrated network, 5 input samples are used to predict 1 output sample; the receptive field of this network is therefore 5.
The following equation is used for the generation of new samples: the joint probability of a waveform x = {x_1, …, x_T} is factorized as a product of conditionals, p(x) = ∏ p(x_t | x_1, …, x_{t−1}) over t = 1, …, T, so each new sample is predicted conditioned on all previous samples.
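As a rough sketch of how causality can be enforced in practice, the convolution can be padded on the left only, so the output at time t never sees inputs after t. PyTorch and the class name CausalConv1d are assumptions made for illustration, not the paper's reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        # No built-in padding; we pad manually on the left only
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)

    def forward(self, x):
        # x has shape (batch, channels, time)
        # Left-padding ensures the output at time t sees only inputs <= t
        x = F.pad(x, (self.pad, 0))
        return self.conv(x)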
The problem with causal convolutions is that they require many layers, or large filters, to increase the receptive field.
Dilated Convolution Layer
A dilated convolution is also referred to as convolution with holes or à-trous convolution. In a standard convolution (dilation = 1), the kernel is applied to consecutive samples; a dilated convolution skips input values with a fixed step. It is equivalent to a convolution with a larger filter in which the original filter is dilated with zeros, thereby increasing the receptive field of the network. Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers, while preserving the input resolution throughout the network as well as computational efficiency. In the illustrated stack, a total of 16 inputs are required to predict 1 output sample, as compared to 5 in the plain causal convolution network.
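A minimal sketch of how a dilation factor extends the causal convolution above: with kernel size 2 and dilations 1, 2, 4, 8, the stack reaches a receptive field of 16. The class name and the stack configuration are illustrative assumptions, reusing the PyTorch imports from the earlier sketch.

class DilatedCausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        # Effective left padding grows with the dilation factor
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):
        x = F.pad(x, (self.pad, 0))
        return self.conv(x)

# Dilations 1, 2, 4, 8 with kernel size 2 give a receptive field of 16
stack = nn.Sequential(*[DilatedCausalConv1d(32, dilation=d) for d in (1, 2, 4, 8)])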
Each 1, 2, 4, …, 512 block has a receptive field of size 1024, and can be seen as a more efficient and discriminative (non-linear) counterpart of a 1×1024 convolution.
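The 1024 figure follows from simple receptive-field arithmetic, which the short check below (plain Python, added only for illustration) reproduces.

# Receptive field of a stack of dilated convolutions with kernel size 2:
# 1 + sum over layers of (kernel_size - 1) * dilation
dilations = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
receptive_field = 1 + sum(1 * d for d in dilations)
print(receptive_field)                    # 1024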
The model gets stuck where the input is nearly silent, as it becomes confused about the next samples to be generated.
Gated Activation Units
The purpose of using gated activation units is to model complex operations. The gated activation unit is represented by the following equation: z = tanh(W_{f,k} ∗ x) ⊙ σ(W_{g,k} ∗ x), where ∗ denotes convolution, ⊙ denotes element-wise multiplication, σ(·) is the sigmoid function, k is the layer index, and f and g denote the filter and the gate, respectively.
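A minimal sketch of a gated activation unit, reusing the PyTorch imports from the causal convolution sketch above; the class name and the framework choice are illustrative assumptions.

class GatedActivation(nn.Module):
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        # Separate filter and gate convolutions, W_f and W_g
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):
        x = F.pad(x, (self.pad, 0))
        # z = tanh(W_f * x) multiplied element-wise with sigmoid(W_g * x)
        return torch.tanh(self.filter_conv(x)) * torch.sigmoid(self.gate_conv(x))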
Residual block and Skip Connections
The use of residual blocks and skip connections is inspired by the PixelCNN architecture for images. Both residual and parameterized skip connections are used throughout the network to speed up convergence and to enable the training of much deeper models.
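A minimal sketch of one residual block with a skip connection, building on the GatedActivation sketch above; all names and the PyTorch framework are illustrative assumptions.

class ResidualBlock(nn.Module):
    def __init__(self, channels, skip_channels, kernel_size=2, dilation=1):
        super().__init__()
        self.gated = GatedActivation(channels, kernel_size, dilation)
        # 1x1 convolutions project the gated output to the residual and skip paths
        self.residual_conv = nn.Conv1d(channels, channels, 1)
        self.skip_conv = nn.Conv1d(channels, skip_channels, 1)

    def forward(self, x):
        z = self.gated(x)
        skip = self.skip_conv(z)
        # The residual connection adds the block input back to its output
        residual = self.residual_conv(z) + x
        return residual, skip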
Global and Local Conditioning
When the WaveNet model is conditioned on auxiliary input features (linguistic or acoustic features), denoted by h (a latent representation of the features), the distribution is represented as p(x|h), i.e. each sample is now predicted as p(x_t | x_1, …, x_{t−1}, h).
By conditioning the model on other input variables, we can guide
WaveNet’s generation to produce audio with the required
characteristics.
The WaveNet model is conditioned in two ways, based on the nature of the input: a) global conditioning, and b) local conditioning.
Global conditioning characterizes, for example, the identity of the speaker; it influences the output distribution across all timesteps and is represented by the following equation: z = tanh(W_{f,k} ∗ x + V_{f,k}ᵀ h) ⊙ σ(W_{g,k} ∗ x + V_{g,k}ᵀ h), where V_{·,k} is a learnable linear projection and the vector V_{·,k}ᵀ h is broadcast over the time dimension.
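A minimal sketch of global conditioning, extending the gated activation with a speaker-level feature vector h that is broadcast over the time dimension; names and the PyTorch framework are illustrative assumptions.

class GloballyConditionedGate(nn.Module):
    def __init__(self, channels, cond_dim, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        # Linear projections V_f and V_g applied to the global feature vector h
        self.filter_proj = nn.Linear(cond_dim, channels)
        self.gate_proj = nn.Linear(cond_dim, channels)

    def forward(self, x, h):
        # x: (batch, channels, time), h: (batch, cond_dim), e.g. a speaker embedding
        x = F.pad(x, (self.pad, 0))
        f = self.filter_conv(x) + self.filter_proj(h).unsqueeze(-1)   # broadcast over time
        g = self.gate_conv(x) + self.gate_proj(h).unsqueeze(-1)
        return torch.tanh(f) * torch.sigmoid(g)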
Local features of speech represent the context of the utterance and the speaking style of a speaker. Since wavelets also capture local features of a signal, local conditioning is a must. Local conditioning can be carried out by upsampling the feature time series to the audio sample rate, either with a transposed convolution or by repeating values, and is represented by the following equation: z = tanh(W_{f,k} ∗ x + V_{f,k} ∗ y) ⊙ σ(W_{g,k} ∗ x + V_{g,k} ∗ y), where y = f(h) is the upsampled feature time series and V_{f,k} ∗ y is a 1×1 convolution.
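A minimal sketch of the local-conditioning path, where a lower-rate feature sequence h is first upsampled with a transposed convolution and then added through 1×1 convolutions; the names, the upsampling factor, and the framework are illustrative assumptions.

class LocallyConditionedGate(nn.Module):
    def __init__(self, channels, cond_channels, upsample_factor, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        # Transposed convolution upsamples h to the audio sample rate, y = f(h)
        self.upsample = nn.ConvTranspose1d(cond_channels, cond_channels,
                                           kernel_size=upsample_factor,
                                           stride=upsample_factor)
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        # 1x1 convolutions V_f and V_g applied to the upsampled features
        self.filter_cond = nn.Conv1d(cond_channels, channels, 1)
        self.gate_cond = nn.Conv1d(cond_channels, channels, 1)

    def forward(self, x, h):
        # x: (batch, channels, time), h: (batch, cond_channels, time // upsample_factor)
        y = self.upsample(h)
        x = F.pad(x, (self.pad, 0))
        f = self.filter_conv(x) + self.filter_cond(y)
        g = self.gate_conv(x) + self.gate_cond(y)
        return torch.tanh(f) * torch.sigmoid(g)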
If no such conditioning is provided to the network, the model produces gibberish-sounding speech.
Softmax Distributions
One approach to modeling the conditional distributions p(x_t | x_1, …, x_{t−1}) over the individual audio samples would be to use a mixture model.
The reason for using a softmax distribution instead is that a categorical distribution is more flexible and can more easily model arbitrary distributions, because it makes no assumptions (no prior) about their shape.
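A minimal sketch of drawing one audio sample from the 256-way softmax output at a single time step, reusing the PyTorch imports from the earlier sketches; the placeholder logits are purely illustrative.

# logits: (batch, 256), the network's output for the current time step
logits = torch.randn(1, 256)            # placeholder for the model's output
probs = F.softmax(logits, dim=-1)       # categorical distribution over 256 values
next_sample = torch.multinomial(probs, num_samples=1)   # sampled integer in 0..255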
The generated samples are later converted back into audio using the μ-law expansion transformation, which is the inverse of the μ-law companding transformation.
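A minimal sketch of the μ-law expansion step, the inverse of the encoding sketch shown earlier; the function name mu_law_decode and the NumPy dependency are illustrative assumptions.

def mu_law_decode(bins, mu=255, channels=256):
    # Map integer bins 0..255 back to the companded range [-1, 1]
    y = 2.0 * bins.astype(np.float64) / (channels - 1) - 1.0
    # Inverse mu-law transformation: sign(y) * ((1 + mu)^|y| - 1) / mu
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu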
For further reading and a clearer understanding of WaveNet, refer to this link (http://tonywangx.github.io/pdfs/wavenet.pdf).