# GANs for the Beginner

We have seen computers beat grandmasters and World Champions at chess and Go; we have seen them paint exquisite artworks and compose beautiful melodies. Computers have successfully seeped into manufacturing, healthcare, defense, finance, and what have you.

In these things, we often see a simple-minded objectivity in the way that computers work. They always seem to calculate, compute, or imitate (as in artificial intelligence) rather than create. This quality of being able to “create” something out of nothing seems unique to human beings.

To create and to build are uniquely human pursuits. No other species practices the art of building things quite as we do. Moreover, this industrious spirit of humankind is precisely what has led to the rise of a discipline such as generative modeling in the field of AI.

As we inspect some of the brilliant (yet simple) ideas that constitute this field, I would like you to remember this beautiful fact: we are creatures who create, and so consequently, we want our computers too to be creative, in addition to being calculative.

# A Simple Idea 💡

Generative Adversarial Networks, or GANs for short, are a subset of generative modeling. The original idea was outlined by **Goodfellow et al.** (who have become legends and rockstars in the AI community in their own right) in their *2014 paper introducing GANs*.

In a GAN architecture, we have two models: *a generator* and *a discriminator*. It is the work of the generator to take some random noise and try to generate realistic samples from it. At the same time, it is the work of the discriminator to look at these samples (intermixed with some real samples from existing data) and distinguish between the two. We will intermittently show the discriminator some generated samples and some real samples, and **its work is to tell us the probability that any given sample is real**.

A GAN is called adversarial for this reason; two models are competing against each other. The generator is continually trying to develop better, more realistic samples (which are hopefully indistinguishable from the real ones) to try and fool the discriminator. In contrast, the discriminator tries to tell fake, generated samples from real ones successfully.
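To make this division of labor concrete, here is a minimal, purely illustrative sketch in NumPy. The "generator" and "discriminator" below are stand-in functions, not trained networks: the affine weights and the "samples near 1.0 look real" rule are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z):
    # Toy "generator": a fixed affine map from noise to sample space.
    # (A real GAN would learn these weights; they are placeholders here.)
    return 2.0 * z + 1.0

def discriminator(x):
    # Toy "discriminator": squashes a score into a probability via a sigmoid.
    score = -np.abs(x - 1.0)   # pretend samples near 1.0 look "real"
    return 1.0 / (1.0 + np.exp(-score))

z = rng.standard_normal(4)     # random noise fed to the generator
fake = generator(z)            # generated ("fake") samples
p_real = discriminator(fake)   # probability each sample is real
assert np.all((p_real > 0) & (p_real < 1))
```

The important part is the interface: the generator turns noise into samples, and the discriminator turns any sample into a probability of being real.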

## Generator

**Note:** the words *training data* and *real data* are used interchangeably to refer to pre-existing real data samples. The words *fake data* and *generated data* are likewise used interchangeably to refer to the samples created by our generator by passing in some noise and operating on it.

A neural network **G(z, θ₁)** can be used to model the generator, where G takes some random noise *z* and generates some samples. These samples are then accepted or rejected by the discriminator as real/fake, and in this process G learns a little more about what the real data's distribution looks like, and subsequently tries to map the random noise it is fed to this training data distribution *x* (i.e., the real data).

One can think of it as follows: the training data has some well-defined distribution, while the noise initially has some nonsensical distribution, and over time we tweak the parameters **θ₁** of our G such that this random distribution is mapped to mimic the training data until it almost entirely overlaps with (i.e., maps to) it.

To put it simply, generator G's role is to map the noise *z* to the desired data space *x*, and our goal is to learn the parameters *θ₁* that will result in the correct mapping to the real distribution.
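This "mapping noise onto the data distribution" idea can be seen in a tiny NumPy sketch. The target distribution N(3, 0.5) and the parameter values below are invented purely for illustration; for a Gaussian target the mapping happens to be a simple affine transform.

```python
import numpy as np

rng = np.random.default_rng(42)
z = rng.standard_normal(100_000)   # noise z ~ N(0, 1)

# Suppose the real data x follows N(3, 0.5). A perfectly trained generator
# would have found parameters θ₁ that map z onto that distribution.
# (Target distribution and parameter values are made up for illustration.)
theta_scale, theta_shift = 0.5, 3.0
x_fake = theta_scale * z + theta_shift

# The generated samples now mimic the "real" distribution:
assert abs(x_fake.mean() - 3.0) < 0.01
assert abs(x_fake.std() - 0.5) < 0.01
```

A real generator learns a far more complicated mapping, of course, but the goal is the same: reshape the noise distribution until it overlaps with the data distribution.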

## Discriminator

Conversely, a second neural network **D(x, θ₂)** estimates the probability that a given sample is real, i.e., the probability that it came from the training data.

# Optimization Objectives

Our objective concerning **optimizing the discriminator** is as follows:

- When we have a real sample *x*, we speak of D(x). We want to **maximize this probability D(x)**. In essence, we want to maximize the probability that the discriminator outputs when it is fed a real sample.
- When we have a generated sample (generated from noise *z*), we speak of D(G(z)): the output of the discriminator on random noise passed through the generator. We wish to optimize the discriminator to **minimize D(G(z))** and, hence, correctly classify a fake sample as not real.

Our objective concerning **optimizing the generator** is as follows:

- We want our generator to generate data so realistic that D(G(z)) is 1, i.e., the discriminator is fooled into believing a fake, generated sample is REAL! Therefore, we will optimize to **maximize D(G(z))** for training our generator.
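The two objectives above can be written as losses to minimize (the standard trick: maximizing a quantity means minimizing its negation). A minimal NumPy sketch, with illustrative probability values:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # Maximize log D(x) + log(1 - D(G(z)))  ==  minimize the negation.
    return -(np.log(d_real) + np.log(1.0 - d_fake)).mean()

def generator_loss(d_fake):
    # Maximize D(G(z)): one common way is to minimize -log D(G(z)).
    return -np.log(d_fake).mean()

# A confident, correct discriminator (D(x)=0.9, D(G(z))=0.1) has a low loss...
print(discriminator_loss(np.array([0.9]), np.array([0.1])))  # ≈ 0.211
# ...while the generator's loss is high when the discriminator isn't fooled.
print(generator_loss(np.array([0.1])))  # ≈ 2.303
```

Notice how the same quantity D(G(z)) is pushed down by the discriminator's loss and pushed up by the generator's: that tension is the whole game.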

# The Game of Minimax

At this point, the idea of adversaries kicks in as the generator and discriminator try to optimize a value function in the opposite directions. The generator tries to make the discriminator guess wrongly (by generating realistic samples), and the discriminator tries continuously not to be fooled.

**This push-and-pull, or tug-of-war of sorts, is called a Minimax Game** in Game Theory and Computer Science. We do not need to bother ourselves with the minute details, so the essence is simply this:

*ours is a zero-sum game*, i.e., one where the good performance of the generator is tied to the discriminator’s poor performance (and vice versa).

Therefore, there has to be some equilibrium where the two eventually end up ensuring the best possible performance by both the models without entirely jeopardizing each other.

**In Game Theory, this is called the Nash Equilibrium.** If you have a free evening, the cinephile in me would recommend the film **A Beautiful Mind**, based on the life of mathematician John Nash, who first understood and outlined this idea. Not only did it earn Russell Crowe a Best Actor nomination, but the flick also bagged Best Picture at the 2002 Academy Awards. Snazzy, huh? 🎬

Before we move ahead, let us pause for a moment, take a break, and breathe in what we've learnt so far.

# The Cost Function

Before we address the cost function, I would like to outline the binary cross-entropy loss briefly.

To begin with, let us say we have a distribution P and a distribution Q. Our ML model has come up with Q in an effort to try and map P. Then, we can say that the entropy of P is

H(P) = −Σₓ p(x) log p(x)

summed over each datapoint in P. Now, if we try to map Q onto P (to see how well Q models P), we can find the cross-entropy as

H(P, Q) = −Σₓ p(x) log q(x)
Taking the difference of the cross-entropy and the entropy, D_KL(P ‖ Q) = H(P, Q) − H(P), we can find the **Kullback–Leibler Divergence** (or just KL Divergence), which is a measure of the dissimilarity between two distributions.

This means the closer Q gets to P, the lower the KL Divergence will be.

And the whole point of our machine learning model is to come up with a good Q distribution that maps P well, and hence minimizes this dissimilarity!

If we somehow miraculously map Q onto P perfectly, then our cross-entropy H(P, Q) will equal the entropy H(P), and therefore the KL divergence will be zero. However, that will probably not happen, meaning our Q will always be a little imperfect in mapping P, so H(P, Q) will almost always be greater than H(P), and we will always have some positive value for the KL divergence.
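These three quantities are a few lines of NumPy. The distributions below are arbitrary illustrative values; note how the KL divergence comes out as the (non-negative) gap between cross-entropy and entropy:

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])   # "true" distribution P (illustrative)
q = np.array([0.5, 0.3, 0.2])   # model distribution Q (illustrative)

entropy = -np.sum(p * np.log(p))          # H(P)
cross_entropy = -np.sum(p * np.log(q))    # H(P, Q)
kl = cross_entropy - entropy              # D_KL(P || Q)

assert kl >= 0                            # cross-entropy never undershoots entropy
assert np.isclose(kl, np.sum(p * np.log(p / q)))  # matches the direct formula
```

Nudge `q` closer to `p` and `kl` shrinks toward zero, which is exactly what a good model is trying to do.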

If we take this formula for each point in the distribution (as we do when we find the loss of a model), we get the following formula for binary cross-entropy:

BCE = −(1/N) Σᵢ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ]

where yᵢ is the true label (1 for real, 0 for fake) and ŷᵢ is the predicted probability.
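A minimal NumPy sketch of binary cross-entropy, using made-up labels and discriminator outputs:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions so log(0) never occurs.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 0.0])      # real = 1, fake = 0 (illustrative)
y_pred = np.array([0.9, 0.1, 0.8, 0.3])      # discriminator's probabilities
print(binary_cross_entropy(y_true, y_pred))  # ≈ 0.20
```

Confident, correct predictions (like 0.9 for a real sample) contribute little loss; a confident wrong one would blow the loss up, which is exactly the pressure we want on the discriminator.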

All said and done, BCE is best suited for classification tasks. Therefore, after customizing the loss function further to our purpose (with the addition of the minimax optimization objectives), the researchers arrived at the following:

Let us understand each part of this equation.

1. The first term, E_{x∼p_data(x)}[log D(x)], represents the log of the output of the discriminator when fed with real samples from our training distribution **x**.

2. The second term, E_{z∼p_z(z)}[log(1 − D(G(z)))], represents the logarithm of the complement of the probability output by the discriminator when fed with fake samples generated from random noise **z**.

Here since we now have the log probabilities instead of just the probabilities, we can tweak our optimization objectives just a skosh.

When training the discriminator, we wish to maximize log(D(x)) as well as log(1 − D(G(z))).

When training the generator, we wish to minimize log(1 − D(G(z))). PS: Try the math in your head; the optimization taking place is still the same!
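As a quick numerical sanity check of that PS, here is a NumPy sketch (illustrative values only) comparing the generator objective in this log form with the "maximize D(G(z))" framing from earlier, written here as the loss −log D(G(z)):

```python
import numpy as np

d_fake = np.linspace(0.05, 0.95, 5)   # discriminator's output on fake samples

# The two generator objectives, both written as losses to minimize:
saturating = np.log(1.0 - d_fake)     # minimize log(1 - D(G(z)))
non_saturating = -np.log(d_fake)      # equivalently, maximize log D(G(z))

# Both losses fall as the generator gets better at fooling D (D(G(z)) → 1),
# so they push the generator's parameters in the same direction.
assert np.all(np.diff(saturating) < 0)
assert np.all(np.diff(non_saturating) < 0)
```

Same direction of optimization, just expressed with different formulas, which is why the math "in your head" works out.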

Now that we’re armed with what exactly our loss function is measuring and how we’re planning on optimizing that, we have a pretty firm overall idea of how this system works. Let us check out some of the incredible stuff that GANs have been used to build so far!

# Applications of GANs

GANs have been used in some very cool, and also some very unnerving ways.

**Generating human faces:** This is perhaps the most eerie of their applications. Not only have GANs gotten scarily good at generating human faces, but they have also become surprisingly adept at generating realistic videos, or DeepFakes.

**Image Translation:** Transferring the style of one sample to another. For example, translating images into cartoons, daytime photographs to nighttime photographs, black-and-white to color, and so on.

**Scene Generation:** Generating realistic scenes such as everyday stills of bedrooms, pets, barn animals, etc. If I dare say so, this is by far the cutest application of GANs!

**Super Resolution:** GANs have also been used to obtain significantly higher-resolution images from blurry inputs. They also routinely outperform their generative contemporaries, such as variational autoencoders.

**Extrapolating datasets to create synthetic data:** This is a somewhat boring application but very important nonetheless. With the help of GANs, we can create our own synthetic data: it will be different from the data we have so far, but similar enough to preserve the statistical properties of the original.

An interesting and important conversation is raised when we look at some of the things GANs can be used for: from generating eerily realistic scenes to simulating the faces of public personalities and celebrities. The computer science industry needs to collectively engage in a discussion about how we want the world to use the tools we build, and the values with which we innovate and pave the way ahead.

# The Future

GANs are a relatively nascent innovation in the field of Deep Learning. I am very excited to see how they will progress over the next few years and what real-world applications they will help power.

It would be unfair of me to try and fit everything I wish to talk about with GANs into one blog, and for that purpose, **I will soon be publishing a part II** to this blog. In it, I will walk you through the code for a simple Vanilla GAN that you can build and train yourself.

Until then, I would love to connect with you on LinkedIn and Github! Also, feel free to go through some of the cool stuff that I’ve been working on! 🤗