Generating Anime Characters with a Variational Auto-encoder

Wuga
7 min read · Apr 7, 2018


VAE-sampled anime images. The GitHub repository for this post is here.

People usually compare the Variational Auto-encoder (VAE) with the Generative Adversarial Network (GAN) in the context of image generation. The common understanding is that the VAE is easier to train and comes with an explicit distribution assumption (Gaussian) for both the latent representation and the observations, while the GAN captures the observation distribution better and makes no assumption about it. The consequence is that everyone believes only GANs can create clear and vivid images. While this is plausible because, in theory, a GAN captures the correlations between pixels, few people have tried to train a VAE on images larger than the 28x28 MNIST digits to verify the claim.

Many VAE implementations only work on the MNIST dataset, and hardly anyone tries anything else. Is this because the original VAE paper only used MNIST as an example?

Mythbusters

Today, let’s do a “rumor breaker” implementation to see how unacceptable a VAE image generator really is. For example, consider the following image.

Blurry VAE Samples.

We start by looking for some GAN competitors. I searched “GAN Applications” on Google and found a quite interesting GitHub repository that summarizes some GAN applications. Why “GAN Applications”? Well, it is hard to find a GAN application that is not image generation, isn’t it? To make this implementation more exciting, I am going to generate some anime characters!

Let’s see how well a GAN model can do on this task first. The following two images are from [One] and [Another], repositories that do anime generation and have been forked/starred a lot.

Not bad, isn’t it? I like the colors; they are really close to the real images!
Although there are several ghosts inside, this one is even better. I guess the trick is to zoom in on the images and only look at the faces.

It turns out GAN is impressively good. This puts the pressure on.

Hmm.. Should we continue..

Where to get Data?

There is no standard anime dataset available online, unfortunately. But this won’t stop people like me from finding one. After browsing through some GitHub repositories, I found several hints:

  1. The Japanese website Getchu has tons of anime images.
  2. You need some tool to download the images from the web, but you have to find one yourself; it may not be legal for me to provide one. 😜
  3. There are lots of pre-trained anime face detectors (U-net/RCNN models, or the simpler lbpcascade_animeface cascade), so you can extract the faces and crop them to 64x64 images. See the sketch below.
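
For example, a minimal cropping sketch using OpenCV’s cascade classifier. The paths (raw_images/, lbpcascade_animeface.xml) are placeholders, and this is not necessarily the exact pipeline I used:

```python
import glob
import os

import cv2

# Assumes lbpcascade_animeface.xml has been downloaded next to this script.
cascade = cv2.CascadeClassifier("lbpcascade_animeface.xml")

os.makedirs("faces", exist_ok=True)
for i, path in enumerate(glob.glob("raw_images/*.jpg")):
    img = cv2.imread(path)
    if img is None:
        continue
    gray = cv2.equalizeHist(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
    # Detect anime faces; parameters are typical defaults, tune as needed.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(32, 32))
    for j, (x, y, w, h) in enumerate(faces):
        face = cv2.resize(img[y:y + h, x:x + w], (64, 64))
        cv2.imwrite(f"faces/{i}_{j}.png", face)
```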

Variational Auto-encoder

I assume you have already read a lot of posts about the Variational Auto-encoder. But in case you haven’t, here are some posts that I would like to recommend:

  1. Intuitively Understanding Variational Autoencoders
  2. Tutorial — What is a variational autoencoder?
  3. Introducing Variational Autoencoders (in Prose and Code)

So, after you know what a VAE is and how to implement it, the question is: is knowing the objective function and the implementation enough to train a VAE? I used to think the answer was yes, but it is not as simple as it is usually presented. For example: where does this objective function come from, and what is the KL divergence component doing here? In this post, I will try to explain the hidden facts behind the VAE.

Variational Inference is a technique used in Probabilistic Graphical Models (PGM) for inference on complex distributions. Intuitively, it says that if you cannot handle a complex distribution directly, you should approximate it from above (an upper bound) or below (a lower bound) using some simple distribution such as a Gaussian. For example, the following figure shows how to use a Gaussian to approximate a local optimum.

Image is from: https://people.duke.edu/~ccc14/sta-663/EMAlgorithm.html

Please ignore the EM in the caption. It is a classic optimization method in PGMs for updating the variational lower bound, but in deep learning we now use Stochastic Gradient Descent (SGD) instead.

KL divergence is another very important technique used in PGMs. It measures the divergence between two distributions. It is not a distance metric, because KL[Q||P] is NOT equal to KL[P||Q]. The following slide shows the difference.

Image from: https://www.slideshare.net/Sabhaology/variational-inference

Obviously, KL[Q||P] does not allow P = 0 where Q > 0. In other words, when minimizing KL[Q||P], you want the Q distribution to capture some of the modes of P, but you risk ignoring the other modes. Conversely, KL[P||Q] does not allow Q = 0 where P > 0. In other words, when minimizing KL[P||Q], you want Q to cover the entire distribution, even if that means smearing probability mass between the modes.
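
A tiny numeric example makes the asymmetry obvious (plain NumPy, with two arbitrary discrete distributions):

```python
import numpy as np

def kl(q, p):
    # KL[Q || P] = sum_i Q(i) * log(Q(i) / P(i))
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return np.sum(q * np.log(q / p))

p = np.array([0.49, 0.49, 0.02])   # two strong modes and one tiny one
q = np.array([0.90, 0.05, 0.05])   # concentrates on a single mode

print(kl(q, p))   # KL[Q||P]
print(kl(p, q))   # KL[P||Q] -- a different value: KL is not symmetric
```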

So far, we intuitively understand two facts:

  1. “Variational” roughly means “approximate” from above or below.
  2. “KL” measures the divergence between two distributions.

Now let’s look back where the VAE objective function comes from.

My derivation of the VAE objective. While it may look different from what you read in the paper, I think it is the most understandable derivation.
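
In standard notation, the derivation goes roughly like this (my figure uses slightly different symbols; theta parameterizes the decoder and vartheta the encoder, as explained further below):

$$
\log P_\theta(x) \;=\; \log \int P_\theta(x \mid z)\, P(z)\, dz
\;=\; \log \mathbb{E}_{Q_\vartheta(z \mid x)}\!\left[\frac{P_\theta(x \mid z)\, P(z)}{Q_\vartheta(z \mid x)}\right]
$$
$$
\;\ge\; \mathbb{E}_{Q_\vartheta(z \mid x)}\!\left[\log P_\theta(x \mid z)\right] \;-\; KL\!\left[Q_\vartheta(z \mid x)\,\|\,P(z)\right]
\;\approx\; \frac{1}{L}\sum_{l=1}^{L}\log P_\theta\!\left(x \mid z^{(l)}\right) \;-\; KL\!\left[Q_\vartheta(z \mid x)\,\|\,P(z)\right]
$$
with $z^{(l)} \sim Q_\vartheta(z \mid x)$.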

Given some images as training data, we want to fit some parameters (theta) that represent the training data as accurately as possible. Formally, we want to fit the model to maximize the joint probability of the observations. That is the left-hand side expression.

Where does z come from?

z is a latent representation that creates the observation (the image). Intuitively, we assume some mysterious artists create the images (x) in the dataset, and we call them z. We also find that z is not fixed: sometimes artist 1 creates the picture, and sometimes artist 2. The only thing we know is that each of these artists has a particular preference for what they draw.

Where does the “greater or equal” come from?

Jensen’s inequality, as shown below. Note: log is concave, so in our case the inequality is reversed.

Image from Youtube: https://www.youtube.com/watch?v=10xgmpG_uTs

Why is the last line an approximation?

We cannot integrate over an infinite number of candidate z values, so we use a numeric approximation: we sample from the distribution to approximate the expectation.

What is the distribution P(x|z)?

In the Variational Auto-encoder, we assume it is Gaussian. That is why you minimize the Mean Squared Error (MSE) when optimizing a VAE.

The f function is the decoder! Oops, there should be a squared symbol after the norm.
P(x|z) assumptions: Gaussian and Bernoulli. The code shows the negative log likelihood, since in deep learning we always minimize an error rather than explicitly maximize a likelihood.

The reason you see so much sigmoid (or softmax) cross-entropy on GitHub is that, for binary images such as MNIST, we assume the distribution is Bernoulli.
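
To make the two assumptions concrete, here is a rough sketch of the corresponding negative log likelihoods in plain NumPy (not the exact code from my repository):

```python
import numpy as np

def gaussian_nll(x, x_decoded, sigma=1.0):
    # -log N(x | f(z), sigma^2 I), up to a constant: squared error scaled by the variance.
    return np.sum((x - x_decoded) ** 2) / (2.0 * sigma ** 2)

def bernoulli_nll(x, x_decoded, eps=1e-8):
    # -log Bernoulli(x | f(z)): the familiar binary cross-entropy.
    x_decoded = np.clip(x_decoded, eps, 1.0 - eps)
    return -np.sum(x * np.log(x_decoded) + (1.0 - x) * np.log(1.0 - x_decoded))
```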

What is the distribution of P(z|x)?

We approximate it with a Gaussian Q(z|x). That is why the KL term has a closed-form implementation. Don’t understand? No worries, see this.

KL closed-form expression in Python
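
A sketch of the usual implementation, assuming the encoder outputs the mean and the log variance of the Gaussian (variable names are mine):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    # Closed-form KL[ N(mu, sigma^2) || N(0, 1) ], summed over the latent dimensions.
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
```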

How can this equation be an Auto-encoder?

There are two types of parameters in the equation. Theta is used to model P(x|z), which decodes z into the image x. And vartheta is used to model Q(z|x), which encodes x into the latent representation z.
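
A rough sketch of how the two parameter sets interact in a forward pass; `encode` and `decode` are hypothetical networks, and the added noise `eps` is the reparameterization trick that keeps the sampled z differentiable with respect to the encoder parameters:

```python
import numpy as np

def vae_forward(x, encode, decode):
    # encode: x -> (mu, log_var), parameterized by vartheta (the encoder, models Q(z|x)).
    # decode: z -> reconstructed x, parameterized by theta (the decoder, models P(x|z)).
    mu, log_var = encode(x)
    eps = np.random.randn(*mu.shape)           # the non-differentiable white noise
    z = mu + np.exp(0.5 * log_var) * eps       # reparameterization: z is differentiable w.r.t. mu, log_var
    return decode(z), mu, log_var
```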

Homemade Variational Auto-encoder graph. Green and blue parts are differentiable; amber represents white noise and is not differentiable. Everyone uses the famous cat image, so I use a dog. 😄 I don’t know where I got this cute dog picture from. If you know, please tell me so that I can credit the original site correctly.

And the corresponding TensorFlow graph:

The meaning of the two components of the VAE objective function

  1. Minimizing the KL term drags the Q(z|x) distribution towards N(0, 1). We want to generate images by sampling from N(0, 1), so it is better to keep the latent distribution as close to the standard normal as possible.
  2. Minimizing the reconstruction loss term creates images that are as vivid/real as possible, by minimizing the error between the real image and the generated image.

It is easy to see that balancing these two components is critical to making a VAE work.

If we completely ignore the KL term, the Variational Auto-encoder degenerates into a standard Auto-encoder, which removes the stochastic part of the objective function. The VAE then cannot generate new images; it only remembers and replays the training data (or produces pure noise, since there is no image encoded at that latent position!). The optimal result, if you are lucky enough, is kernel PCA!

If we largely ignore the reconstruction term, the latent distribution collapses into the standard normal distribution. So no matter what the input is, you always get a similar output, such as:

A collapse example from GAN. Same for VAE. Image from: http://yusuke-ujitoko.hatenablog.com/entry/2017/05/30/011900

Now we understand the trick:

  1. We want the VAE to generate reasonable images, but we do not want it to simply replay the training data.
  2. We want to sample from N(0, 1), but we do not want to see the same image again and again; we want the model to create very different images.

Then, how do we balance them? We set the standard deviation of the observation distribution as a hyperparameter!

I have seen too many cases where people directly put a fixed weight on the KL term, such as 0.001*KL + Reconstruction_Loss, which is not a standard VAE (please check out the conversation below this article)! By the way, is this the reason so many people only run VAEs on MNIST?
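
Concretely, if P(x|z) is Gaussian with a fixed standard deviation sigma, the negative ELBO already contains a principled weight: the reconstruction term is scaled by 1/(2*sigma^2), so tuning sigma plays the role that the ad-hoc 0.001 factor tries to play. A sketch of this idea (again not the exact code from my repository):

```python
import numpy as np

def vae_loss(x, x_decoded, mu, log_var, sigma=0.1):
    # Negative ELBO for a Gaussian observation model with a fixed standard deviation sigma.
    # The reconstruction term carries an implicit weight of 1 / (2 * sigma^2).
    recon = np.sum((x - x_decoded) ** 2) / (2.0 * sigma ** 2)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl
```

A small sigma trusts the pixels more (sharper images, closer to a plain auto-encoder); a large sigma trusts the prior more (smoother, but risks the collapse described above).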

What else? The model capacity has to match the loss function. If the decoder is too complex, even a weak loss cannot prevent it from over-fitting, and the consequence is that the latent distribution gets ignored. Conversely, if the decoder is too simple, the model cannot decode the latent representation well and ends up with very blurry images that only capture rough outlines, like the image shown earlier.

Finally, once we do everything correctly, it is time to see the power of the VAE.

Wow!!

Ok, I confess. Small images are not convincing.

Slightly larger… :)

Conclusion

The rumor that VAEs generate blurry images is true, but I think the results are still acceptable.

If you are interested in the implementation, try the code here.
