Understanding Autoencoders (& VAEs) for data generation

Setting the tone for GANs (part 3)

Mehul Gupta
Data Science in your pocket
7 min read · Jul 29, 2021


Cover photo: https://unsplash.com/photos/QNc9tTNHRyI (Unsplash)

So, we are already done with the basics of generative modeling, how Naive Bayes acts as a generative model, & the reasons behind its failure on complex data.

So what to do for generating complex data, especially images?

Recollecting some important points from the past 2 posts:

Generative models should have an element of randomness. Hence, the data generated should neither be exact copies of the training data nor come from some completely unrelated space.

Naive Bayes failed for complex data because 1) Neighbouring pixels have no mechanism to associate with each other, so every pixel is generated independently, leading to a complete mess, and 2) There is no way to determine the specific regions of the sample space that can generate relevant images, so the space remains unbounded for Naive Bayes. When this space is humongous (say for a 64x64 image where each pixel has 256 possible values), this becomes problematic, as random points won't help us generate meaningful images.
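Just to get a feel for how humongous that space is, a quick back-of-the-envelope calculation in Python:

    # Rough size of the sample space for a 64x64 grayscale image:
    # each of the 64*64 pixels can independently take any of 256 values.
    num_pixels = 64 * 64
    values_per_pixel = 256

    num_possible_images = values_per_pixel ** num_pixels  # 256^4096
    print(len(str(num_possible_images)))  # the count has ~9865 digits

Randomly sampling from a space that large will almost never land on anything that looks like a real image.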

Keeping this prior knowledge aside for a minute, let's have a brief look at what autoencoders are. They are basically a pair of neural networks:

  • An encoder network that compresses high-dimensional input data into a lower-dimensional representation vector. So you give it a [1xN] or [NxN] input & it maps it to, say, a [1xM] output where M<N (usually)
  • A decoder network that decompresses a given representation vector back to the original domain. So it takes a [1xM] input & converts it into a [1xN] or [NxN] output

So, when used in combination, you:

  • Feed input A to the encoder to get B
  • Then feed B to the decoder to convert it back to A
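A minimal Keras sketch of such an encoder-decoder pair (the layer sizes & dimensions below are my own illustrative assumptions, not a prescribed architecture):

    from tensorflow.keras import layers, models

    N = 784  # e.g. a flattened 28x28 image -> the [1xN] input
    M = 32   # size of the compressed representation, M < N

    # Encoder: [1xN] input -> [1xM] representation
    encoder = models.Sequential([
        layers.Input(shape=(N,)),
        layers.Dense(128, activation="relu"),
        layers.Dense(M, activation="relu"),
    ])

    # Decoder: [1xM] representation -> [1xN] reconstruction
    decoder = models.Sequential([
        layers.Input(shape=(M,)),
        layers.Dense(128, activation="relu"),
        layers.Dense(N, activation="sigmoid"),
    ])

    # Combined: A -> B (encoder) -> reconstructed A (decoder)
    autoencoder = models.Sequential([encoder, decoder])
    autoencoder.compile(optimizer="adam", loss="mse")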

What’s the use? You’re missing an important point here. For this, a story is a must.

* You take a Paper sheet ‘A’

* Fold this paper sheet several times to produce folded-sheet ‘B’

* You unfold this folded-sheet ‘B’ to get ‘A’.

But would you get exactly what you fed? Nopes !!

The ‘A’ output from the decoder (the unfolded sheet) now has crease marks.

Can this output ‘A’ be called an exact copy of the input sheet ‘A’? No, as the output has crease marks while the input had none.

Is output ‘A’ similar to input ‘A’, though not the same? Hell yeah!

Can autoencoders be used for our data generation tasks? Sounds apt, as they:

  • Can generate similar samples, not exact copies
  • Work well for complex data like images

But before moving on, how is all this happening? Because of the latent space.

So, when we convert an input (N-dimensional) using the encoder to a smaller dimension (M-dimensional, N>M usually), this is related to the concept of dimension reduction (which we usually use PCA for), where this smaller dimension helps us:

  • Keep the important features of the input image/vector
  • Lose out on minor features (data loss happens when reducing to a smaller dimension, similar to rounding 3.145 to 3.15: you lose some information, though minor)

Now, this reduced representation is fed to a decoder that tries to reproduce the original input (talking about a vanilla autoencoder). As we lost the minor details, reproducing the exact input is difficult, & here the decoder estimates the minor features to complete the image, bringing the random element into the output (those creases on the paper sheet can be considered such added elements). The reduced representation helps in generating the major features of the image.

This reduced representation is called the latent sample & the space to which this representation belongs is called latent space
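To connect this back to the dimension reduction analogy, here is a tiny scikit-learn sketch (the random data & the 2-component choice are purely illustrative):

    import numpy as np
    from sklearn.decomposition import PCA

    # Dummy "images": 1000 samples, each a 64-dimensional vector
    X = np.random.rand(1000, 64)

    # Compress to a 2-dimensional representation (keeps the major directions of variation)
    pca = PCA(n_components=2)
    latent = pca.fit_transform(X)  # shape: (1000, 2)

    # Map back to the original 64 dimensions; the minor details are lost for good
    reconstructed = pca.inverse_transform(latent)  # shape: (1000, 64)

The encoder plays a similar compressing role, just learned by a neural network instead of a fixed linear projection.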

So far so good. We have easily replaced Naive Bayes.

Or maybe not !!

We haven’t thought about how this encoder-decoder combo can be used for generating images, i.e. what pipeline to follow.

For this, we can remove the encoder once the combination is trained over some sample datasets. Now, we can use the decoder to generate new images. But,

How to generate the reduced representation/latent representation to feed to the decoder once the encoder is removed?

Can this be any random point in the latent space?

I can see two evident problems:

  • As we discussed with Naive Bayes, picking any random point from the latent space for generating images is an awful idea, as not all possible representations in the latent space will generate meaningful images similar to the training dataset. We somehow need to know how to pick meaningful latent representations once the encoder is detached.
  • Also, it misses continuity. Observe the below image:
(Image source: https://www.oreilly.com/library/view/generative-deep-learning/9781492041931/)

We take an image, generate a 1x2 latent representation & regenerate a similar image using the decoder. Now, one thing missing in the entire mechanism is continuity. Assume, instead of [-2,-5], we had fed [-1.9,-5.1] or [-2.1,-4.9] or something close to [-2,-5]. The regenerated image should, ideally, still look similar to ‘6’, I guess. But this doesn't happen, as we really don't have a mechanism to tell the decoder that if [-2,-5] produced a ‘6’, its nearby points should also produce a ‘6’, even if with lower confidence or poorer quality (see the small sketch after this list). This latent space has become discrete & not continuous, my friend.

  • Also, if the training samples contain different types of objects from the same class (say class=digits, different objects=1,2,3,4,5,6,7,8,9,0), the distribution in the latent space is at times uneven & some digits may dominate others. So, when picking latent representations to generate images, we may have one digit (say 6) dominating the others in the regenerated images, leading to an imbalance.
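Here is a small sketch of how both issues show up when we poke a decoder directly (the stand-in decoder, its layer sizes & the 2-D latent space are assumptions for illustration; in practice this would be the decoder of a trained autoencoder):

    import numpy as np
    from tensorflow.keras import layers, models

    # Stand-in 2-D-latent decoder (untrained, purely for illustration)
    decoder = models.Sequential([
        layers.Input(shape=(2,)),
        layers.Dense(128, activation="relu"),
        layers.Dense(784, activation="sigmoid"),
    ])

    # Problem 1: a random latent point carries no guarantee of a meaningful digit
    random_point = np.random.uniform(low=-10, high=10, size=(1, 2))
    random_image = decoder.predict(random_point)

    # Problem 2 (continuity): nothing forces points near [-2, -5] to decode
    # to something that still looks like the '6' decoded from [-2, -5]
    nearby_points = np.array([[-2.0, -5.0], [-1.9, -5.1], [-2.1, -4.9]])
    nearby_images = decoder.predict(nearby_points)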

So, we still have some problems. To counter these, we have got

Variational Autoencoders

They can be considered an upgraded version of autoencoders, specially devised to solve the above problems during data generation. A few notable changes were made compared to a vanilla autoencoder:

  • Instead of mapping the input representation to a latent representation (point in latent space) directly, we now map the input to a Normal distribution & then to a latent representation.
(Image source: https://www.oreilly.com/library/view/generative-deep-learning/9781492041931/)

How? Remember parametric modeling from my 1st post on generative modeling: we map inputs to a set of parameters (mean & stddev) representing a Normal distribution, which in turn gets us a latent representation.
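A minimal sketch of what such an encoder head could look like in Keras (the layer sizes & the 2-dimensional latent space are illustrative assumptions):

    from tensorflow.keras import layers, models

    latent_dim = 2  # illustrative choice

    inputs = layers.Input(shape=(784,))
    x = layers.Dense(256, activation="relu")(inputs)

    # Instead of a single latent vector, the encoder now outputs the parameters
    # of a Normal distribution: a mean & a log-variance per latent dimension
    z_mean = layers.Dense(latent_dim, name="z_mean")(x)
    z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)

    encoder = models.Model(inputs, [z_mean, z_log_var], name="vae_encoder")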

  • We haven’t talked about loss functions yet. The vanilla autoencoder used a reconstruction loss (summation of RMSE over all pixels). For VAEs, one more term is added, called the KL Divergence loss, which measures the difference between two distributions. So, for example, if you have a Normal distribution with mean=1 & stddev=2 and an exponential distribution with lambda=2, it tells us how different the two distributions are. More on KL Divergence can be read here.

But for KL Divergence, we need 2 distributions to compare, right? So far, we have only seen one distribution, the one to which the inputs get mapped by the encoder (as in the 1st point). What are we going to compare it with? The Standard Normal distribution (mean=0, stddev=1). So, in VAEs, the KL Divergence measures the difference between a Normal distribution (mean=x, stddev=y) & the Standard Normal distribution. Why? We will figure that out below.
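For this particular pair (a Normal distribution vs the Standard Normal), the KL Divergence has a neat closed form, which is exactly the extra term that shows up in the VAE loss later. A tiny NumPy sketch (the function name is my own, just for illustration):

    import numpy as np

    def kl_to_standard_normal(mean, log_var):
        # Closed-form KL divergence between N(mean, exp(log_var)) and N(0, 1),
        # summed over the latent dimensions
        return -0.5 * np.sum(1 + log_var - np.square(mean) - np.exp(log_var), axis=-1)

    # The further the encoder's distribution drifts from N(0, 1), the larger the penalty
    print(kl_to_standard_normal(np.array([0.0, 0.0]), np.array([0.0, 0.0])))  # 0.0
    print(kl_to_standard_normal(np.array([1.0, 2.0]), np.array([0.5, 0.5])))  # ~2.65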

(Image source: https://www.researchgate.net/figure/In-a-a-Variational-AutoEncoder-VAE-scheme-with-the-mean-and-standard-deviation_fig1_339447623)

So, how do these two additions help us overcome the problems we observed in vanilla autoencoders?

  • When we map the input to a Normal distribution rather than directly to a latent representation, this gives us a way to produce meaningful latent representations: no need to pick arbitrary points, as samples drawn from this distribution should generate meaningful output images. It also solves the continuity problem for us, as the points now get mapped into a compact region of the latent space, hence the space is largely continuous & the spread converges.
  • Also, as we add a KL Divergence term, it helps converge the above-mentioned Normal distribution to the Standard Normal distribution (mean=0, stddev=1). So, eventually, we aim to map the input representation, through the encoder, to a Standard Normal distribution.

Converging to the Standard Normal distribution helps us to:

  1. Know the parameters of the Normal distribution that can be used for generating latent representations. Now, any sample drawn from Normal(mean=0, stddev=1) should get us a meaningful, relevant image. Hence, random points picked from this distribution give us relevant images.
  2. Maintain symmetry (an even spread for all objects of a class; I myself still need to figure out exactly how)

So, the flow is something like this

1. The encoder encodes the input into 2 vectors: mean & log_variance

2. The input is mapped into the latent space using the formula

Z = mean + exp(log_var/2)*epsilon

Where epsilon = a random point sampled from the standard Normal distribution

3. The decoder takes Z & tries to reproduce the original input

Loss = RMSE + ( -0.5 * sum(1 + log_var - mean^2 - exp(log_var)) )
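Putting the 3 steps & the loss together, here is a minimal Keras-style sketch (the layer sizes, the 2-D latent space & the plain squared-error reconstruction term are my own illustrative assumptions):

    import tensorflow as tf
    from tensorflow.keras import layers, models

    latent_dim = 2      # illustrative assumption
    original_dim = 784  # e.g. a flattened 28x28 image

    # 1. Encoder encodes the input into mean & log_variance vectors
    inputs = layers.Input(shape=(original_dim,))
    h = layers.Dense(256, activation="relu")(inputs)
    z_mean = layers.Dense(latent_dim)(h)
    z_log_var = layers.Dense(latent_dim)(h)

    # 2. Z = mean + exp(log_var/2) * epsilon, with epsilon sampled from N(0, 1)
    def sample_z(args):
        z_mean, z_log_var = args
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(z_log_var / 2) * epsilon

    z = layers.Lambda(sample_z)([z_mean, z_log_var])

    # 3. Decoder takes Z & tries to reproduce the original input
    h_dec = layers.Dense(256, activation="relu")(z)
    outputs = layers.Dense(original_dim, activation="sigmoid")(h_dec)

    vae = models.Model(inputs, outputs, name="vae")

    # Loss = reconstruction term + KL Divergence term
    def vae_loss(x, x_decoded, z_mean, z_log_var):
        # Reconstruction: per-pixel squared error (a squared-error stand-in for the RMSE term above)
        reconstruction = tf.reduce_sum(tf.square(x - x_decoded), axis=-1)
        # KL term: -0.5 * sum(1 + log_var - mean^2 - exp(log_var))
        kl = -0.5 * tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)
        return tf.reduce_mean(reconstruction + kl)

In a real training setup, this loss is minimized end-to-end; once trained, you keep just the decoder, sample points from N(0, 1), & decode them into new images.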

And with this, we are finally done with the pre-basics required for GANs. We will start off with GANs in my next post !!
