Face-Morphing Using a Generative Adversarial Network (GAN)

Rudra Raina
Published in The Startup
7 min read · Nov 4, 2019

It was only recently that I started exploring the full scope of Deep-Learning and came across these interesting ideas and projects in Computer Vision.

Even if my knowledge and experience are limited, I hope this may help some other beginners take an interest in the field and try new, exciting things.

I came across a brilliant YouTube channel by the name of Arxiv Insights (or AI for short. Coincidence? I think not), and on this channel I found one of the videos, “Learn how to morph faces with a Generative Adversarial Network!”, quite interesting. This story is a summary of the knowledge I gained from that video, and I hope that by the end of it you will have a good understanding of the idea and may want to play around with it.

The people in the image above DO NOT EXIST in real life. They are computer-generated, and that, my friend, is the power of GANs. If this has caught your attention, keep reading to learn more.

PART 1. GAN: What Is It?

A GAN has a very simple task to do, that is, to generate data from scratch, data of a quality that can fool even humans.

Invented by Ian Goodfellow and colleagues in 2014, this model consists of two neural networks (a Generator and a Discriminator) competing with one another, resulting in the generation of authentic-looking content.

The purpose of the two networks can be summarised as learning the underlying structure of the input dataset as thoroughly as possible, and then using that knowledge to create new content that fits the same category.

As shown above, the input was human faces, from which the model learned exactly what it is that makes a human face, well, human. Using that understanding, it generated random human faces that could easily pass for real ones.

Let’s understand a bit more about it in detail:

Basic Architecture of a GAN

This image is an oversimplified architecture of GAN, but it captures the complete essence of the concept.

This is what happens in a single iteration of GAN:

I. Generator:

  • The Generator receives a random noise vector as input
  • It then performs multiple transposed convolutions to upsample this noise into an image

II. Discriminator:

  • It receives a random input from either the Real World Samples (Real Sample) or the Generated Images (Fake Sample).
  • As the name suggests, it has only one job: to decide whether the input came from the “Real Sample” or the “Fake Sample”
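To make this concrete, here is a minimal PyTorch-style sketch of the two networks. All layer sizes are illustrative assumptions (a tiny 64x64 single-channel setup), not the architecture from any particular paper:

```python
import torch
import torch.nn as nn

latent_dim = 100  # size of the random noise vector z (an illustrative choice)

# Generator: upsamples a noise vector to a 64x64 image via transposed convolutions.
generator = nn.Sequential(
    nn.ConvTranspose2d(latent_dim, 128, kernel_size=4, stride=1, padding=0),  # 1x1 -> 4x4
    nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),          # 4x4 -> 8x8
    nn.ReLU(),
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),           # 8x8 -> 16x16
    nn.ReLU(),
    nn.ConvTranspose2d(32, 1, kernel_size=4, stride=4),                       # 16x16 -> 64x64
    nn.Tanh(),
)

# Discriminator: downsamples an image to a single "probability of being real".
discriminator = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=4, stride=2, padding=1),   # 64x64 -> 32x32
    nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),  # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, kernel_size=16),                       # 16x16 -> 1x1
    nn.Flatten(),
    nn.Sigmoid(),
)

z = torch.randn(8, latent_dim, 1, 1)  # a batch of 8 random noise vectors
fake_images = generator(z)            # shape: (8, 1, 64, 64)
p_real = discriminator(fake_images)   # shape: (8, 1), each entry in (0, 1)
```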

As the ones training the model, we know whether the input came from the real or the fake sample, and using this knowledge we can backpropagate a training loss so that the discriminator does its job better.

But as we know, the Generator is a neural network as well, so we can backpropagate all the way back through it to the random noise sample and thus help it generate better images. In this way, the same loss function drives both the discriminator and the generator.

The trick lies in balancing both of these networks during training. If done right, the discriminator will learn to spot even slight abnormalities, while at the same time the generator will learn to produce the most realistic outputs.
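Continuing the sketch above, a single training iteration might look roughly like this (again an assumption-laden sketch, not production code; `real_images` stands in for a batch from the real dataset):

```python
import torch
import torch.nn.functional as F

opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)

def train_step(real_images):
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # Discriminator step: push the score for real images towards 1
    # and the score for generated images towards 0.
    z = torch.randn(batch, latent_dim, 1, 1)
    fake_images = generator(z).detach()  # detach: don't update G on this step
    loss_d = (F.binary_cross_entropy(discriminator(real_images), real_labels)
              + F.binary_cross_entropy(discriminator(fake_images), fake_labels))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: the same loss idea, now pushing the score for
    # generated images towards 1, backpropagating through the
    # discriminator into the generator's weights.
    z = torch.randn(batch, latent_dim, 1, 1)
    loss_g = F.binary_cross_entropy(discriminator(generator(z)), real_labels)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    return loss_d.item(), loss_g.item()
```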

Technical Understanding of the Working of GAN:

The Generator and the Discriminator are in a mini-max game.

  • The Generator tries to minimize the gap between the real and fake images so as to fool the discriminator.
  • The Discriminator tries to maximize its understanding of real images so as to pick out the fake samples.

In the above image, D(x) is nothing but the probability that an image is a “Real Sample” image.

There is another function, G(z), which is nothing but the output of the Generator given z, the random latent input. The probability that the generated image is from the “Real Sample”, as estimated by the Discriminator, is D(G(z)).

For Discriminator we want:

  • Real Sample images to be rightly identified, and so D(x) must be close to 1
  • At the same time, Fake Sample images to be correctly identified as fake, and so D(G(z)) must be close to 0

For Generator:

  • The Generator has no business with the accuracy of D(x); it only cares about D(G(z)), which it wants identified as a Real Sample, and so D(G(z)) must be as close to 1 as possible.
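Putting these two goals together gives the minimax objective from the original GAN paper (Goodfellow et al., 2014), written out here in LaTeX:

```latex
\min_G \max_D V(D, G) =
    \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
    + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```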
The Objective Function of GAN

This loss function is the backbone of the GAN architecture; only by achieving a fine balance between the two networks do we get a high-performing Generator and Discriminator.

For those of you who are interested in learning about GANs in more detail:

PART 2. The Fun Part

The Principle Behind This Model:

  • After a Generator model has been trained, its latent space has fully learned the underlying structure of the dataset
  • In our example, the model we will be using has learned the structure of the human face. The model is StyleGAN, developed by researchers at NVIDIA.
  • Our objective is to leverage this structure and manipulate it for our fun.

You should know that manipulating images in the pixel domain is much too tedious and difficult, so instead we will be playing with images in the latent space.

Here comes, then, our first obstacle: how, for any given image, can we find the latent vector that will always reproduce our query image? That is:

The Process:

For our first obstacle, the following solution works best:

  • Generate random faces via Generator
  • Using these images as a dataset, train a ResNet to map a source image to its latent vector code (a rough initial estimate)
  • We will be using such a pre-trained ResNet to find the latent code of the Query Image (a rough estimate)
  • Then this estimate is taken as the starting point: we compute an L2 loss between the generated image and the “Original Image”, and update the latent vector code accordingly (while the weights of the generator itself stay fixed), as in the sketch below
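Here is a minimal sketch of that refinement step in PyTorch. The names `generator` (a pre-trained, frozen generator) and `resnet_encoder` (the pre-trained ResNet giving the rough estimate) are placeholders for this illustration:

```python
import torch
import torch.nn.functional as F

def find_latent(query_image, generator, resnet_encoder, steps=500, lr=0.01):
    # Rough initial estimate of the latent code from the encoder,
    # then refine the code itself by gradient descent.
    latent = resnet_encoder(query_image).detach().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)  # only the latent code is updated

    for _ in range(steps):
        generated = generator(latent)              # generator weights stay fixed
        loss = F.mse_loss(generated, query_image)  # L2 loss vs. the original image
        opt.zero_grad()
        loss.backward()  # gradients flow back to the latent code
        opt.step()

    return latent.detach()
```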

Here’s a video of the second part, updating the latent code estimate:

In the later part of the video the changes become barely noticeable; that is because the latent code estimate is converging to the query image’s code.

PART 3. IT’S MORPHING TIME

OK, maybe not this one.

Set-Up:

  • We need another dataset, so we generate a database of random faces again
  • We apply a pre-trained attribute classifier to label attributes such as “gender”, “age”, “smile”, etc.
  • This is done so we can map latent codes to image attributes and find a pattern.

We need to understand that the latent space of StyleGAN is a highly complex 512-dimensional space.

Latent Space of StyleGAN

Here, every point represents a picture, and we need to find a pattern in this space. For example, how will moving along a certain direction in this space change the generated image?

  • It can be observed that these attributes are quite easily separable by a linear hyperplane in this latent space.
  • Taking the normal to this plane gives us the direction to move along to change that attribute; the sketch below shows the idea.
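As an illustration, here is how one might recover such a direction with scikit-learn. It assumes we already have arrays `latents` (N x 512 latent codes) and `labels` (the attribute classifier’s output, e.g. 1 = smiling, 0 = not smiling); these names are assumptions for the sketch:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Fit a linear hyperplane separating the attribute in latent space.
clf = LinearSVC()
clf.fit(latents, labels)

# The hyperplane's normal is the direction that changes the attribute.
direction = clf.coef_[0]
direction = direction / np.linalg.norm(direction)  # normalise to unit length

def edit(latent, alpha):
    # Move the latent code along the normal; larger |alpha| means a
    # stronger change, and the sign picks the direction (e.g. older/younger).
    return latent + alpha * direction
```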

In the end, I would like to show another example where I changed the “age” attribute of an image of Emma Watson.

Here’s a link to my GitHub repository where I have tried this; the videos are actual outputs from it.

Big thanks to Arxiv Insights again, for covering such an interesting topic.

Go ahead, give this a try yourselves.
