Introduction to Reversible Generative Models

Nolan Kent
AI/ML at Symantec
Feb 12, 2020

How “Glow” produces high-quality images with change of variables

Animation of the coupling layer and its inverse, fundamental to recent advances in flow-based models.

Preface

I adapted this blog on flow-based models from a technical presentation I gave after reimplementing the ‘Glow: Generative Flow with Invertible 1x1 Convolutions’ paper from OpenAI as a personal project. The blog consists of two parts:

  1. Introduce a type of generative model that uses change of variables. These can also be described as flow-based models, reversible generative models, or as performing nonlinear independent component estimation.
  2. Explain the ‘coupling layer’ introduced in “NICE: Non-linear Independent Components Estimation,” which is fundamental to recent advances in these models.

My goal is to help a reader familiar with machine learning brush up on the background necessary to understand the Glow paper, which has shown amazing results. To get an idea of what Glow can do, check out this blog post and play around with the tool:

This model has shown remarkable success on high-resolution images, and the following two papers introduced many of the concepts it uses:

Here’s my quick and dirty reimplementation of Glow using TensorFlow:

Part 1: Introduction

For context, here are some common goals of generative models:

  • Model the data distribution so that we can take samples and evaluate densities
  • Encode data as latent variables and decode the latent variables to exactly reproduce the original data
  • Use latent variables that encode meaningful information useful for downstream tasks
  • Produce subjectively high-quality data
Figure 1: Image to latent to image example. Face from https://openai.com/blog/glow/

Most generative models do not support all of these goals. However, flow-based models like Glow do, with some caveats. Unlike several other popular approaches, they attempt to make finding the density function tractable.

Making the density tractable

Figure 2: Image modified from https://openai.com/blog/glow/

The exact definition is fuzzy, but usually, generative models attempt to model the distribution they are trained on in a way that allows sampling from it (generation). For GANs, this just means learning to sample from that distribution. Variational autoencoders allow sampling but also attempt to approximate the probability density function by maximizing the evidence lower bound. Autoregressive models and change of variables models, on the other hand, attempt to make finding the density function (or rather, the best density function in their hypothesis space) tractable by maximizing the likelihood of the training samples.

The following image from Ian Goodfellow’s NIPS 2016 tutorial illustrates how conventional generative models relate to each other in the context of maximizing the likelihood of a density function:

Figure 3: Image Credit: Ian Goodfellow https://arxiv.org/abs/1701.00160. This blog focuses on change of variables models

Here are some examples of how we might calculate the density function:

Say we have a multivariate distribution over a space H, with samples from that distribution h consisting of independent components h_d. We can calculate the density at h by multiplying the density of each component together:
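The equation in the original post is an image; written out, the independence assumption is simply:

p_H(h) = \prod_{d} p_{H_d}(h_d)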

For example, H might consist of two specific attributes of the face which we assume are independent: h_1 indicates if the face is smiling, h_2 indicates if the face is wearing glasses. To get the probability of a specific face that is wearing glasses and not smiling, we can multiply p(h_1=not smiling)*p(h_2=wearing glasses). The independence assumption reduces the problem to finding the density functions of the components.

Unfortunately, the components we have access to often aren’t independent, so (to avoid naivety) we have to incorporate the dependencies using the chain rule of probability:
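Again written out (the original shows this as an image), the chain rule factorization is:

p_H(h) = \prod_{d} p(h_d \mid h_1, \ldots, h_{d-1})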

Continuing with the previous example, if our dataset consists of before-and-after images of people trying on glasses for the first time, it may be that a person is more likely to be smiling if they are experiencing better vision for the first time (wearing glasses). Therefore, to get the probability of a specific face that is wearing glasses and not smiling, we multiply p(h_1=not smiling|h_2=wearing glasses)*p(h_2=wearing glasses). This calculation incorporates the dependency between the two components.

In the case of images, the probability distribution is over all images X, and we want to calculate the density at a specific image x. Calculating the density at an image requires accounting for the dependencies of the image’s components: the pixels. For example, pixel 2 in an image is usually very dependent on pixel 1. Except for borders, pixels that are close together tend to be part of the same object and are therefore highly correlated. Borders make up a small portion of a natural image, as they are nearly 1-dimensional features in a 2-dimensional space. On the other hand, a latent variable like ‘wearing glasses’ is often independent of a latent variable like ‘smiling.’

Nonlinear independent component estimation, when applied to images, aims to transform the dependent distribution over pixels x to an independent distribution over latent variables h by applying a function f(x). The Glow paper, its predecessors, and this blog are mostly about finding the right function.

f(x) must be invertible so we can go back to X from H (otherwise, we can’t generate samples). The invertibility requirement forbids lossy dimensionality reduction: the latent space must have as many components as the image space (usually height*width*3 color channels). Figure 4 roughly shows the concept of going from a complex distribution to a simpler one and vice versa.

Figure 4: Loose, informal illustration of converting from a complex, dependent distribution to a simpler, independent distribution. Note that most distributions don’t look like snakes.

Once we have a function f(x) to map from samples of X to an independent space H with a known probability density function, we could try to calculate the density of a specific image x using the following naive approach:
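The numbered steps appear as an image in the original post; roughly reconstructing them, the naive claim they build up to (step 4) is to apply h = f(x), evaluate the independent density there, and report that value as the density of the image:

p_X(x) \stackrel{?}{=} p_H(f(x)) = \prod_{d} p_{H_d}\big(f(x)_d\big)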

If you’re as rusty with change of variables as I was when I first read the Glow paper, step 4 may seem intuitive. However, it’s not entirely correct due to how continuous distributions work. I’ll go over why in the next part. If you are familiar with change of variables for integrals, it will be mostly review.

Change of variables

As a simple example, say X is a univariate distribution over beard size instead of pixels. In the illustration below (Figure 5), a light beard is most probable, with larger beards or no beards having lower probabilities. Following step 2, we can apply a function to convert X to H: h = f(x). Say f(x) = x/10. It doesn’t make sense to talk about independent variables when there is only one; this example is meant to illustrate a basic change of variables that can then be extended to multivariate distributions. Figures 5 and 6 show the before and after of this transformation, with a brown curve above the images indicating the probability density function:

Figure 5: Modified images of Dr. Geoffrey Hinton generated at https://openai.com/blog/glow/ using a pre-trained Glow model

Apply h = x/10.

Figure 6: After applying h=x/10, probability density at h=0 is greater than the probability density at x=0. Images are clearly ‘denser.’ The height of the curve increases with the density of images.

The image that was previously at 10 is now at 1, and the image that was at -10 is now at -1. The image that was at 0 is still at 0, but it is clear that the probability density is higher at -1, 0, and 1 in the new image than it was at -10, 0, and 10 in the old image, just as the density of images has increased.
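For this particular linear map, the scaling factor can be read off directly: every interval keeps its probability mass while shrinking to a tenth of its original width, so the density must become ten times taller,

P(a \le X \le b) = P\!\left(\tfrac{a}{10} \le H \le \tfrac{b}{10}\right) \;\Rightarrow\; p_H(h) = 10\, p_X(10h)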

The area under the curve of a probability density function must always be 1, so reducing the width must increase the height and vice versa. This concept can be expressed more formally using the cumulative distribution function F, which is calculated as the integral of the probability density function (pdf) from negative infinity to the point under consideration. The fundamental theorem of calculus indicates that the pdf is, therefore, the derivative of the cumulative distribution function. After we change from x to h = f(x), we need to use the chain rule to calculate the new pdf. The following equation gives the actual pdf for x in terms of the pdf for h:
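Writing F_X and F_H for the two cumulative distribution functions (the original equation is an image), F_X(x) = F_H(f(x)) for an increasing f, and differentiating both sides with the chain rule gives (the absolute value handles the decreasing case):

p_X(x) = p_H(f(x)) \left| \frac{df(x)}{dx} \right|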

Therefore, the correct process is to apply f, evaluate the density of f(x) under the independent distribution, and then multiply by the absolute value of the determinant of the Jacobian of f. The density at x can be calculated as:
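In the multivariate case, the scalar derivative becomes the determinant of the Jacobian of f (the original equation is an image):

p_X(x) = p_H(f(x)) \left| \det\!\left( \frac{\partial f(x)}{\partial x} \right) \right|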

This StackExchange post includes answers with a good explanation for why we use the absolute value of the determinant.

For me, knowing that each step in an equation is mathematically valid does not always mean I understand that equation intuitively. The following MS Paint diagram roughly demonstrates how the height of the new pdf needs to change to match the change in the ‘width’ of the space. My favorite YouTube channel, 3Blue1Brown, has a video that superbly animates derivatives as stretching or squishing of space (https://www.youtube.com/watch?v=CfW845LNObM&t=3m7s), which is an essential intuition for understanding change of variables.

Figure 7: Demonstration of change of variables. The original distribution (black curve, above) is mapped to a new distribution (red curve, below) using f(x). The lines that show the mapping also show the derivative of f(x) around the points the function maps: if two points are farther apart in the new space, the slope between them is greater than 1; if they’re closer together, the slope is less than 1.

Finding the function

I haven’t yet explained why we need the exact density function if the goal is just to generate samples (clearly, we need it if we want to calculate densities). GANs, for example, only use the data distribution implicitly. Up to this point, I’ve assumed the function that maps pixels to independent components is known, but this won’t be the case for a dataset with non-trivial images. This is where machine learning comes in: we can have a computer learn the function instead of writing it ourselves. With the density function, learning the parameters of f(x) can be done via maximum likelihood estimation. Specifically, we can maximize the log-likelihood of f(x)’s parameters θ to produce a distribution with independent components p_H for input x:
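In symbols (the original objective is shown as an image), the quantity maximized over the training samples is the log of the change-of-variables density:

\log p_X(x; \theta) = \sum_{d} \log p_{H_d}\!\big(f_\theta(x)_d\big) + \log \left| \det\!\left( \frac{\partial f_\theta(x)}{\partial x} \right) \right|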

If we choose the independent distribution to be a Gaussian with mean 0 and variance σ², the equation becomes
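With p_H a zero-mean Gaussian with variance σ² in each of the D dimensions, this becomes:

\log p_X(x; \theta) = -\sum_{d} \frac{f_\theta(x)_d^2}{2\sigma^2} - \frac{D}{2}\log(2\pi\sigma^2) + \log \left| \det\!\left( \frac{\partial f_\theta(x)}{\partial x} \right) \right|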

Without the log determinant term, this is equivalent to minimizing the mean squared error between the output of the function and 0. In this case, the best θ is the one that causes f to map every value to the mode of the distribution (0), as illustrated in Figure 8. This mapping produces a Dirac delta distribution rather than a Gaussian:

Figure 8: All points mapped to the mode of the target distribution due to the small slope; as the slope converges to 0, the mapping produces a Dirac delta distribution.

Here’s another way to understand how the log determinant term is required to find the right function:

If we want to map to a 0-centered Gaussian, the log-likelihood loss we minimize is f(x)²/2σ² (the negative log of the Gaussian density with mean 0). Without the log determinant, a function that maps all training data to 0 would have 0 loss (bad: a Gaussian should have some variance). The function cannot do this if we also maximize its slope near the training data: a nonzero slope means not all points are mapped to the same location. Given this, it seems intuitive that the log determinant term should be related to variance in some way. Often when we do maximum likelihood estimation with a Gaussian, we can disregard constant factors such as the Gaussian’s variance; in this case, we cannot, because the log determinant term is not multiplied by the same constant. Because the negative log-likelihood of a Gaussian is the mean squared error divided by 2σ², the variance effectively acts as a weight, with higher variance increasing the influence of the log determinant term by lowering the influence of the log density term. For the log determinant term to disappear completely, the variance would have to be 0, giving infinite weight to the log density term, which would produce the Dirac-delta-type pdf shown in Figure 8.

The next section on the “coupling layer” explains how we might create an invertible function with a tractable Jacobian determinant that is powerful enough to map natural images to their independent components.

Part 2: Coupling Layer

Hopefully, the previous section made it clear that we can create a generative model by using an invertible function with a tractable Jacobian determinant to map the input data to an independent representation. Given the complexity of some types of input data (images), it would be challenging to create that function manually. If the function is differentiable and parameterized by a sufficient number of variables, we may be able to use an optimization process to learn it. Neural networks have shown success at learning complicated differentiable functions, but they are generally not invertible with a tractable Jacobian determinant. The coupling layer fulfills all of these requirements: the layer is invertible, has a tractable Jacobian determinant, is differentiable, and incorporates a neural network powerful enough to handle complex input. It achieves this with a ‘triangular’ function structure, where we split the inputs and outputs into two parts. The first layer in the network needs to perform the initial split into x_1 and x_2, which is usually done across channels or with a spatial checkerboard pattern; to simplify the illustration, Figure 9 shows an example of splitting along the height.

Figure 9: Splitting an image to get x_1 and x_2

The inverse of this specific split is to combine the top and bottom halves into a single image. At the start of a coupling layer, “Key” data is passed through a neural network and combined with “Message” data. The “Key” is also passed along unmodified, which means we do not need to invert the neural network, as the next layer has a version of the network’s input. Coupling layers can be stacked to build more powerful functions that are still invertible with tractable Jacobian determinants.

Figure 10 shows a diagram of a coupling layer: the input is x_1 and x_2, and the output is the ‘new key data’ and ‘new message data.’ The neural network, m, has influenced half of that data. By stacking these layers, we can get a ‘deep’ representation.

Figure 10: Coupling layer (plus mix) used to build powerful invertible functions with tractable Jacobian determinants. One frame shows the coupling layer, and the other shows its inverse.

g is an invertible function combining x_2 and the output of m(x_1) (m can be a neural net). The diagram is equivalent to the following equations (without the mix function), which also demonstrate the invertibility of the layer:
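In the NICE paper’s notation (the original equations are an image), the forward and inverse computations are:

y_1 = x_1, \qquad y_2 = g\big(x_2,\, m(x_1)\big)

x_1 = y_1, \qquad x_2 = g^{-1}\big(y_2,\, m(y_1)\big)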

The triangular nature of the Jacobian for these equations means only the partial derivative of y_2 with respect to x_2 is needed, which does not require differentiating the neural network m (keeping the calculation simple):
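Splitting the Jacobian into blocks over (x_1, x_2) makes the triangular structure explicit; the off-diagonal block involving m never enters the determinant:

\frac{\partial y}{\partial x} = \begin{bmatrix} I & 0 \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix}, \qquad \det\!\left(\frac{\partial y}{\partial x}\right) = \det\!\left(\frac{\partial y_2}{\partial x_2}\right)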

Therefore, by picking the right g and mix functions, we can get an invertible function with a tractable Jacobian determinant even if m is highly complex.

g is often defined as the following simple function, to produce an ‘affine coupling layer’:
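The exact parameterization varies between NICE, RealNVP, and Glow (Glow passes the scale through an exponential or sigmoid nonlinearity for stability), but with the network m producing a scale m_2(x_1) and a shift m_1(x_1), the affine coupling and its inverse look like (⊘ denotes element-wise division):

y_2 = x_2 \odot m_2(x_1) + m_1(x_1), \qquad x_2 = \big(y_2 - m_1(y_1)\big) \oslash m_2(y_1)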

The circle with a dot in the middle used to combine x_2 and m_2(x_1) is the Hadamard product, which is an element-wise multiplication. Backpropagation also uses the Hadamard product.

y_1 and y_2 are ‘mixed’ with an invertible transformation. Without mixing, the “Message” data would never be passed through a neural net, and the network wouldn’t learn a deep representation. The diagram below shows two options for this transformation:

Figure 11: NICE (top) vs. Glow (bottom) mixing layers

The first example is what the original NICE paper did: the ‘key’ and ‘message’ data swap positions. The inverse is trivial: swap them back.

The second example is closer to how the ‘Glow’ paper implements the mixing operation. The paper points out that the ‘swap’ operation is a specific instance of permuting the channels. Glow uses a more general permutation implemented as a square matrix transform with learnable weights, initialized as a random rotation matrix. This transform is where the “Invertible 1x1 Convolutions” in the paper’s title come in: one way to perform a channel-wise ‘permutation’ is with a 1x1 convolution. 1x1 convolutions are convolutions with a filter height and width of 1 and have been used in architectures such as the Inception network to change the number of channels. In Glow, the number of channels must stay the same for the architecture to be invertible (unless the spatial dimensions are also altered to keep the total number of components constant).
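To make the mechanics concrete, here is a minimal NumPy sketch (not the paper’s implementation) of an affine coupling layer with a NICE-style swap as the mix step. The toy two-layer network standing in for m, its parameter shapes, and all function names are made up for illustration:

```python
import numpy as np

def toy_network(x1, params):
    # Tiny stand-in for the neural network m: one hidden layer producing a
    # log-scale and a shift for the "message" half (both shaped like x2).
    W1, b1, W2, b2 = params
    hidden = np.tanh(x1 @ W1 + b1)
    out = hidden @ W2 + b2
    log_scale, shift = np.split(out, 2, axis=-1)
    return np.tanh(log_scale), shift  # tanh keeps the scale well-behaved

def coupling_forward(x, params):
    # Split into "key" (x1) and "message" (x2), transform x2 conditioned on x1,
    # then swap the halves (the NICE-style mix) so the next layer sees the other half.
    x1, x2 = np.split(x, 2, axis=-1)
    log_scale, shift = toy_network(x1, params)
    y2 = x2 * np.exp(log_scale) + shift          # affine coupling g
    log_det = np.sum(log_scale, axis=-1)         # log|det J| = sum of log-scales
    return np.concatenate([y2, x1], axis=-1), log_det

def coupling_inverse(y, params):
    y2, y1 = np.split(y, 2, axis=-1)             # undo the swap
    log_scale, shift = toy_network(y1, params)
    x2 = (y2 - shift) * np.exp(-log_scale)       # invert the affine map
    return np.concatenate([y1, x2], axis=-1)

rng = np.random.default_rng(0)
half = 4  # half the dimensionality of the data
params = (rng.normal(size=(half, 8)) * 0.1, np.zeros(8),
          rng.normal(size=(8, 2 * half)) * 0.1, np.zeros(2 * half))

x = rng.normal(size=(3, 2 * half))               # a batch of 3 toy inputs
y, log_det = coupling_forward(x, params)
print(np.allclose(coupling_inverse(y, params), x))  # True: the layer is invertible
```

The log determinant is just the sum of the log-scales, so the density term in the training objective stays cheap no matter how complicated the network standing in for m becomes.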

Conclusion

I think an important takeaway from this type of model is that we shouldn’t assume the restrictions imposed by a particular approach are insurmountable. I could easily have assumed that the requirement for an invertible function with a tractable Jacobian determinant rules out neural networks. That assumption turns out to be incorrect, and the solution is an effective approach to modeling the independent components of high-dimensional data. The architecture is still restricted, however, which is why more flexible models such as GANs have been able to produce high-quality results with much less training time and memory usage. I’m also interested in gaining a better understanding of how generative models behave and how they might learn useful concepts in an unsupervised way. I discussed approaches to this in previous blogs https://towardsdatascience.com/animating-ganime-with-stylegan-part-1-4cf764578e and https://towardsdatascience.com/animating-ganime-with-stylegan-the-tool-c5a2c31379d in the context of GANs.
