Discovering VAEs: How I Generated Images

Polina S
5 min read · Apr 12, 2022


You’ve probably heard about VAEs and GANs — generative models in ML that can produce synthetic data: images, music, etc.

If not, you have at least heard that ML models can create realistic images of people who do not even exist. Visit this site to see the results: it uses the StyleGAN model.

Synthesized images by StyleGAN. https://syncedreview.com/2019/02/09/nvidia-open-sources-hyper-realistic-face-generator-stylegan/

In this post I'll explore VAE models and give a short overview of how they work.

Disclaimer: this is my first experience with a project like this. I will tell you how I understand VAEs and what results I have achieved. This is only a brief overview, and I strongly encourage you to explore additional resources such as this book.

What are autoencoders?

Typically, AE articles begin with an explanation of latent space, and this one is no exception. Latent space is a lower-dimensional space used to represent complex objects (such as images) in a simpler way.

Images as discrete points of latent space

An autoencoder is a pair of NNs with a latent space between them.

The first NN, called the encoder, compresses the image into a smaller vector represented in the latent space, and the second NN, called the decoder, builds a reconstruction from that vector: it tries to generate an image as close as possible to the given one.

Autoencoder tries to generate image close to the input. Code is an image representation in latent space. Credits: https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798

Why latent space? It is much easier to represent and group objects in a lower-dimensional space, which is why we use it here. While the original image lives in a space of dimension 3 * width * height (if it is an RGB image), the latent space may have a dimension of only about 10–1000.

In summary, how we use autoencoders:

  • training two NNs to teach them how to compress and reconstruct objects (for example, images)
  • send an image of a real object (an existing person, a building, a flower, etc.) and receive a synthesized image based on it (preferably similar to the given one)
  • generate new images from vectors that are in latent space

The encoder and decoder NNs are “functions” for compressing and reconstructing objects (in this article, images).

Credits: https://lilianweng.github.io/posts/2018-08-12-vae/
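To make this concrete, here is a minimal autoencoder sketch in PyTorch. It is my own illustration, not a reference implementation: the 64×64 image size, the layer widths, and the 128-dimensional latent space are arbitrary choices.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=3 * 64 * 64, latent_dim=128):
        super().__init__()
        # Encoder: compress the flattened image into a latent vector ("code").
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        # Decoder: reconstruct the image from the latent vector.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)        # image -> point in latent space
        return self.decoder(z)     # point in latent space -> image

ae = AutoEncoder()
x = torch.rand(16, 3 * 64 * 64)            # a batch of flattened images
loss = nn.functional.mse_loss(ae(x), x)    # reconstruction loss
```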

But that’s how a standard AE works. Why do we use variational AEs, and why are they called that?

VAE vs AE

First, let’s point out the problems with standard AE:

  • after encoding an image, you get the coordinates of a vector in latent space, which is a discrete value (a vector of numbers). Imagine you compress 1000 pizza images and get 1000 pizza vectors in latent space. Now you want to generate a new pizza, and for this you need to know the correct coordinates of a point in the latent space (that is, one from which a pizza is generated). But how can you know them? There are different ways: a random point, the mean of the known vectors, a weighted average, etc. But you can never be absolutely sure that the result will be correct, unless you take a vector that represents a sample from the training set.
  • Let’s say you found some solution for generating pizza images. But what if your AE is trained to generate objects of several classes: for example, images of trucks, food, and animals? For simplicity, let’s say apples and bananas. After training the NN on 1000 apples and 1000 bananas, which point out of 2000 would you choose to create a new apple? And what if you want to create something that basically doesn’t exist: a hybrid of an apple and a banana? Or what if you want it to be somewhere between an apple and a banana, but closer to a banana?

These problems can be solved with VAE. It is called variational because the networks are trained to output not discrete vectors but distributions. Note that this already solves the problem of choosing a point for synthesizing a new object: if you know the parameters of the distribution of banana images, you can sample as many vectors from it as you like.

Usually a simple distribution is chosen, such as the Normal distribution. It has 2 parameters, Mean and Std (standard deviation), and these are the parameters you need to know in order to generate new apples, flowers, and papayas.
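For example, once you know the Mean and Std of the “banana” region of latent space, sampling a new banana vector is a one-liner (the zeros and ones below are made-up placeholder values):

```python
import torch

mean = torch.zeros(128)        # hypothetical parameters of the "banana"
std = torch.ones(128)          # distribution in a 128-dim latent space

z = torch.normal(mean, std)    # any such sample should decode into a banana
```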

Another advantage of VAE over AE is regularization.

In standard AEs, the situation might be as follows:

Objects may not be distributed as you expect.

Say you want to generate an apple-banana (banana-apple?) and chose some point between them, but you can get a potato. You don’t know exactly how objects are distributed in latent space when using a standard AE.

But here is what we want to have, and what VAE gives us:

“Images” are not discrete points and are distributed as expected.

All possible banana-apples are somewhere between an apple and a banana, as they should be.

To achieve this, the VAE design changes: the encoder outputs two vectors, one for the Mean parameter and one for the Std.

The decoder still takes one vector as input, but this vector is now a random point sampled from a Normal distribution with the parameters produced by the encoder.
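Here is a sketch of how this looks in code. The sampling uses the reparameterization trick (z = mean + std * noise) so that gradients can flow through the random step, and the loss adds a KL term that pulls the learned distributions toward a standard Normal; this is the regularization mentioned above. The layer sizes are again arbitrary.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=3 * 64 * 64, latent_dim=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(input_dim, 512), nn.ReLU())
        self.to_mean = nn.Linear(512, latent_dim)    # Mean vector
        self.to_logvar = nn.Linear(512, latent_dim)  # log(Std^2) vector
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.body(x)
        mean, logvar = self.to_mean(h), self.to_logvar(h)
        # Reparameterization trick: a differentiable sample from N(mean, std).
        std = torch.exp(0.5 * logvar)
        z = mean + std * torch.randn_like(std)
        return self.decoder(z), mean, logvar

def vae_loss(x, x_rec, mean, logvar):
    # Reconstruction term + KL term that regularizes the latent space
    # toward a standard Normal distribution.
    rec = nn.functional.mse_loss(x_rec, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mean.pow(2) - logvar.exp())
    return rec + kl
```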

Follow the link for a more detailed explanation: https://lilianweng.github.io/posts/2018-08-12-vae/

Demo

The NN architectures for the encoder and decoder are not restricted in any way, from simple linear layers to complex combinations. I am using something like UNet: https://towardsdatascience.com/unet-line-by-line-explanation-9b191c76baf5. The dataset is “Labeled Faces in the Wild”.

And here are my experiments:

Image reconstruction

Left: original image. Right: reconstructed image.

Generating images based on random points

The result is something “average” among all images.
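In code, this kind of generation is just sampling latent vectors from a standard Normal and decoding them (reusing the toy VAE sketch from above; a real model would of course be trained first):

```python
model = VAE()                  # the sketch from above; untrained here
model.eval()
with torch.no_grad():
    z = torch.randn(8, 128)    # 8 random 128-dim latent vectors
    faces = model.decoder(z)   # 8 synthesized, roughly "average" faces
```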

Generating new faces between two given ones

“Banana-apple” generation.
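These in-between faces are produced by walking along the straight line between the two latent codes and decoding each intermediate point. A sketch, again reusing the toy VAE above (the two inputs here are random stand-ins for real face images):

```python
x1 = torch.rand(1, 3 * 64 * 64)    # stand-in for the first face
x2 = torch.rand(1, 3 * 64 * 64)    # stand-in for the second face

with torch.no_grad():
    # Use the Mean vectors as the latent codes of the two faces.
    z1 = model.to_mean(model.body(x1))
    z2 = model.to_mean(model.body(x2))
    for t in torch.linspace(0, 1, steps=8):
        z = (1 - t) * z1 + t * z2    # t=0: first face, t=1: second face
        frame = model.decoder(z)     # one frame of the morph sequence
```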

Image denoising

A VAE trained to reconstruct clean images from noisy inputs.
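The training step differs from the usual one only in that the model sees a noisy input while the loss compares its output with the clean image (the noise level of 0.1 below is an arbitrary choice):

```python
x = torch.rand(16, 3 * 64 * 64)                      # clean stand-in images
noisy = (x + 0.1 * torch.randn_like(x)).clamp(0, 1)  # corrupted inputs
x_rec, mean, logvar = model(noisy)                   # reconstruct from noise
loss = vae_loss(x, x_rec, mean, logvar)              # but target the clean x
```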

Make people happy

Rightmost: original images of unhappy people. Middle: reconstructed images. Leftmost: reconstructed images, but the people are happy.

Change skin color

From right to left: original image, reconstructed image, reconstruction with a different skin color.
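One common way to make edits like these (I won’t claim it is the only one) is latent attribute arithmetic: average the latent codes of faces with and without the attribute, take the difference as a “direction”, and add it to the code of the face you want to edit. A sketch with stand-in data, reusing the toy VAE above:

```python
happy_batch = torch.rand(100, 3 * 64 * 64)     # stand-ins for happy faces
unhappy_batch = torch.rand(100, 3 * 64 * 64)   # stand-ins for unhappy faces

with torch.no_grad():
    # The "happiness direction": difference of the mean latent codes.
    z_happy = model.to_mean(model.body(happy_batch)).mean(dim=0)
    z_unhappy = model.to_mean(model.body(unhappy_batch)).mean(dim=0)
    direction = z_happy - z_unhappy

    x = torch.rand(1, 3 * 64 * 64)                 # the face to edit
    z = model.to_mean(model.body(x))
    edited = model.decoder(z + 1.5 * direction)    # strength 1.5 is arbitrary
```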

Conclusion

I hope this helped you understand the general idea of VAEs and the potential of generative models. Meanwhile, I will keep working to get better results and more interesting experiments.
