Understanding Normalizing Flows and Its Use Case in Speech Synthesis (Part 2)

Sanjay G
Published in Subex AI Labs · Jun 30, 2021

Welcome to the second part of the article on Normalizing Flows and their applications! In the first part, we discussed what normalizing flows are, along with the math behind them. Check out the first part if you haven’t already.

In this part, let’s see how we can use this concept to generate speech. Specifically, let’s look at a Text-to-Speech (TTS) system, which usually has two networks. The first network, the Feature Prediction Network (FPN), takes in the English text(s) and predicts an intermediate feature representation like a spectrogram. The second network, the Audio Generation Network (AGN), uses this spectrogram to generate the audio samples of the speech for the input text. Our focus is on the AGN.

There have been many deep network architectures proposed for the AGN, but here let’s consider WaveGlow.

WaveGlow uses the concept of normalizing flows to obtain the audio samples from the samples drawn from a known distribution — standard Gaussian distribution in this case. The general idea is that, given samples from a known distribution, the samples from the desired distribution can be obtained by applying a series of sufficiently complex invertible transformations to the samples from the known distribution.
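
As a tiny standalone illustration of that idea (nothing to do with WaveGlow itself), an invertible function like exp turns Gaussian samples into samples from a different (log-normal) distribution, and its inverse maps them back:

import numpy as np

z = np.random.randn(10_000)   # samples from the known (standard Gaussian) distribution
x = np.exp(z)                 # invertible transform -> samples from a log-normal distribution
z_back = np.log(x)            # the inverse transform recovers the original samples exactly
assert np.allclose(z, z_back)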

Here, the information about the spectrogram is included within the transformation that the samples undergo. Speaking in terms of probability theory, we predict audio samples from the input samples conditioned on the spectrogram.

Let ‘x’ and ‘z’ represent audio samples and standard Gaussian samples, respectively, related by a sequence of ‘K’ invertible transformations ‘f’. The WaveGlow model is trained by maximizing the log-likelihood of ‘x’, obtained directly using the change of variables formula:
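
For reference, the change of variables formula discussed in Part 1 gives this log-likelihood as (symbols here are my own and may differ slightly from the figure in the article):

\log p_\theta(x) = \log p(z) + \sum_{i=1}^{K} \log \lvert \det J(f_i^{-1}(x)) \rvert

Here p(z) is the standard Gaussian density, z is obtained by applying the K inverse transforms to x, and J denotes the Jacobian of each inverse transform.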

In the WaveGlow network, each invertible transform consists of two layers, viz. a coupling layer and an invertible 1x1 convolutional layer. Let’s look at each of these layers separately and later see how they are combined in the WaveGlow architecture.

Coupling layer

The input vector ‘x’ to the coupling layer is split into two halves, and the forward transforms are applied to these two halves as described by the equations below. The corresponding inverse transformations are also shown.

The first transform is an identity transform and the other is an affine transform. The parameters s and t are obtained from another function ‘m’ (described later), which takes as input the first half of ‘x’ and the mel-scaled spectrogram corresponding to the input audio samples. The inverse transforms are possible because xa = ya, which results in the same values of s and t during inversion.
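
Written out explicitly (following the description in the WaveGlow paper; the article’s figures may use slightly different symbols), the forward transforms are

x_a, x_b = \mathrm{split}(x)
(s, t) = m(x_a, \text{mel-spectrogram})
y_a = x_a
y_b = s \odot x_b + t

and the corresponding inverse transforms are

x_a = y_a
x_b = (y_b - t) \oslash s

where \odot and \oslash denote element-wise multiplication and division.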

Invertible 1x1 convolutional layer

This is essentially a matrix transformation. The corresponding forward and inverse transforms are given below:

The matrix W is initialized to be an orthonormal matrix (obtained via a QR decomposition of a random matrix), which guarantees that it is invertible at the start of training; its log-determinant contributes a term to the training objective shown later.
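
A minimal numpy sketch of this layer (my own illustration, not the WaveGlow code), showing the forward transform y = W x and its inverse x = W^{-1} y applied over a group of channels at each time step:

import numpy as np

# Toy sketch of the invertible 1x1 "convolution": per time step it is simply a
# matrix multiplication over the channel (group) dimension, so inverting the
# layer only requires inverting W.
channels, time = 8, 100
W, _ = np.linalg.qr(np.random.randn(channels, channels))  # orthonormal initialization
x = np.random.randn(channels, time)
y = W @ x                         # forward transform: y = W x
x_back = np.linalg.inv(W) @ y     # inverse transform: x = W^{-1} y
assert np.allclose(x, x_back)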

WaveGlow architecture

Now that we know both layers of invertible transforms, we can easily comprehend the block diagram of WaveGlow shown below.

During the training phase, ‘x’ is the input (audio samples) and ‘z’ is the output. We’re mapping the samples in the audio space to the standard Gaussian space. The input goes through an invertible 1 x 1 convolutional layer, and the resulting output is fed as input to the coupling layer. This pair of layers makes up one ‘flow’. In total, there are 6 flows. The output of the coupling layer of the last flow is ‘z’.
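
To make the flow structure concrete, here is a small self-contained numpy sketch of a stack of such flows (an invertible matrix transform followed by an affine coupling layer in each flow). It is purely illustrative: the real WaveGlow groups audio samples into vectors, uses an actual 1x1 convolution, and uses the deep network ‘m’ described next rather than the toy conditioning function here.

import numpy as np

# Purely illustrative sketch of stacking flows; not the WaveGlow implementation.
rng = np.random.default_rng(0)
channels, time, num_flows = 8, 50, 6
mel = rng.standard_normal((channels // 2, time))   # stand-in for the spectrogram

def toy_m(x_a, mel, W_m):
    # Toy conditioning function producing s and t from (x_a, mel).
    # It never has to be inverted, so any function would work here.
    h = W_m @ np.concatenate([x_a, mel], axis=0)
    log_s, t = np.split(h, 2, axis=0)
    return np.exp(log_s), t

def forward(x, mel, params):     # training direction: audio x -> Gaussian z
    for W, W_m in params:
        x = W @ x                                        # invertible matrix transform
        x_a, x_b = np.split(x, 2, axis=0)                # coupling layer: split,
        s, t = toy_m(x_a, mel, W_m)
        x = np.concatenate([x_a, s * x_b + t], axis=0)   # identity half + affine half
    return x

def inverse(z, mel, params):     # inference direction: Gaussian z -> audio x
    for W, W_m in reversed(params):
        y_a, y_b = np.split(z, 2, axis=0)
        s, t = toy_m(y_a, mel, W_m)                      # y_a == x_a, so same s and t
        z = np.concatenate([y_a, (y_b - t) / s], axis=0)
        z = np.linalg.inv(W) @ z
    return z

params = []
for _ in range(num_flows):
    W, _ = np.linalg.qr(rng.standard_normal((channels, channels)))  # orthonormal init
    params.append((W, 0.1 * rng.standard_normal((channels, channels))))

x = rng.standard_normal((channels, time))
z = forward(x, mel, params)
assert np.allclose(inverse(z, mel, params), x)   # the whole stack is exactly invertible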

The function ‘m’ is itself another deep network, shown below:

As mentioned before, it takes the first half of the split from the coupling layer and the spectrogram as inputs to generate the s and t values. These inputs essentially go through separate dilated convolutional layers followed by gated activation functions (a combination of tanh and sigmoid). This output is then passed to a 1 x 1 convolutional layer with ReLU activation to get s and t.
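
A rough PyTorch-style sketch of such a conditioning network is given below. It is a simplified illustration of the idea (dilated convolutions on the audio half, the spectrogram injected through its own 1 x 1 convolutions, gated tanh/sigmoid activations, and a final 1 x 1 convolution producing s and t); the layer sizes, number of layers, and other details differ from the actual WaveGlow module, and it assumes the spectrogram has already been upsampled to the audio time resolution.

import torch
import torch.nn as nn

class ConditioningNet(nn.Module):
    # Simplified sketch of an m-like network (not the actual WaveGlow code).
    def __init__(self, audio_channels, mel_channels, hidden=64, layers=4):
        super().__init__()
        self.start = nn.Conv1d(audio_channels, hidden, kernel_size=1)
        self.audio_convs = nn.ModuleList(
            nn.Conv1d(hidden, 2 * hidden, kernel_size=3,
                      dilation=2 ** i, padding=2 ** i)
            for i in range(layers)
        )
        self.mel_convs = nn.ModuleList(
            nn.Conv1d(mel_channels, 2 * hidden, kernel_size=1)
            for _ in range(layers)
        )
        # Final 1 x 1 convolution outputs s and t (one value of each per
        # audio channel and time step) for the affine coupling transform.
        self.end = nn.Conv1d(hidden, 2 * audio_channels, kernel_size=1)

    def forward(self, x_a, mel):
        # x_a: (batch, audio_channels, time); mel: (batch, mel_channels, time)
        h = self.start(x_a)
        for audio_conv, mel_conv in zip(self.audio_convs, self.mel_convs):
            acts = audio_conv(h) + mel_conv(mel)               # condition on the spectrogram
            t_act, s_act = acts.chunk(2, dim=1)
            h = h + torch.tanh(t_act) * torch.sigmoid(s_act)   # gated activation
        s, t = self.end(h).chunk(2, dim=1)
        return s, t

net = ConditioningNet(audio_channels=4, mel_channels=80)
s, t = net(torch.randn(1, 4, 1000), torch.randn(1, 80, 1000))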

Now the final log-likelihood to maximize becomes:
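
For reference, the objective as written in the WaveGlow paper (up to an additive constant, with z assumed to come from a zero-mean Gaussian with variance \sigma^2) is

\log p_\theta(x) = -\frac{z(x)^\top z(x)}{2\sigma^2} + \sum_{j} \log s_j(x, \text{mel-spectrogram}) + \sum_{k} \log \lvert \det W_k \rvert

where the first term comes from the Gaussian log-density of z, the second sums the log of the affine coupling scales over all coupling layers, and the third sums the log-determinants of the invertible 1x1 convolution matrices.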

Inference

Finally, once training is complete, we’re ready for inference. During the inference phase, we use the inverse transformations: we map samples from the standard Gaussian space back to the audio space.

We simply feed the spectrogram predicted by the FPN for the given English text(s), along with a vector sampled from a standard Gaussian, into the WaveGlow network. And there you go: the Gaussian samples get transformed into audio samples conditioned on the spectrogram information, and you have the speech for the given input text!
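
Putting the two networks together, inference boils down to something like the following sketch. It is purely illustrative glue code: fpn and waveglow_inverse are hypothetical placeholders for the trained Feature Prediction Network and the inverse pass of a trained WaveGlow model, not real library functions.

import numpy as np

def synthesize(text, fpn, waveglow_inverse, num_samples, sigma=1.0):
    mel = fpn(text)                            # predicted spectrogram for the input text
    z = sigma * np.random.randn(num_samples)   # samples drawn from a Gaussian
    audio = waveglow_inverse(z, mel)           # inverse flow: z -> x, conditioned on mel
    return audio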

I hope you now know what normalizing flows are and have a fair idea of how they can be applied in deep learning use cases.

Happy learning, Cheers!
