Understanding GANs with DeOldify

Bringing pre-21st-century black-and-white photos to life using GANs.

Mar 28, 2019

If you want to dive into the code directly, clone it here.

About the Model (Credits: Jantic)

The goal of this model is to colourise black-and-white images based on its interpretation of what the scene might have looked like when the picture was taken.

Thoughts aside, how was this model actually built?

Technical Specifications:
1. Self Attention GANs
2. Progressive Growing of GANs
3. Two time scale update rule

This is a technical article involving lots of code. You can directly clone my implementation or read my article to set up DeOldify using Clouderizer.

Overview of GANs:

Generative adversarial networks (GANs) are deep neural net architectures composed of two networks, a Generator and a Discriminator (also called the Critic), trained one against the other (thus the word “adversarial”).

Typically, the generator is of main interest; the discriminator is an adaptive loss function that gets discarded once the generator has been trained.

We can think of GANs as a competition between two networks. The generator tries to fool the discriminator, while the discriminator acts as a detective trying to catch the forger. Since we don’t have a predefined function that tells real images from fake ones, we learn one: the discriminator. It is usually dispensable after the generator has been trained.

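The standard objective function for adversarial loss:

$$\min_G \max_D \; \mathbb{E}_{x \sim q_{data}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$

In this function, p(z) is the prior over the noise vector z, and the generator distribution (the distribution of G(z)) is what gets learned through adversarial min-max optimization.

D → Discriminator
G → Generator
z → Noise vector
q_data → Data distribution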
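
To make the objective concrete, here is a minimal sketch in plain Python that evaluates both sides of the adversarial loss for a small batch of discriminator outputs. The helper names `discriminator_loss` and `generator_loss` are hypothetical and not part of DeOldify, and the non-saturating generator loss is a common practical variant that I am assuming here, not something stated above.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Negative of the objective that the discriminator maximises.

    d_real: D(x) for real images, d_fake: D(G(z)) for generated images,
    both given as probabilities in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Non-saturating generator loss: push D(G(z)) towards 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# A confident, correct discriminator has a low loss ...
confident = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
# ... while one that outputs 0.5 everywhere cannot tell real from fake.
undecided = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

When the discriminator outputs 0.5 for every image, its loss equals 2·ln 2, which is the point where it can no longer separate the two distributions.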

Both networks (D and G) need to be trained together. We don’t want to overtrain either model; there has to be healthy competition.

How to train the model?

We can import modules and train GANs with just a few lines of code. Where’s the fun in that? Through this article, I am trying to explain the underlying concepts contributing to the beauty of GANs.

We need 2 models, a critic and a generator.

Model pipeline for training Generator and Discriminator (Credits: KDNuggets)

The generator model takes noise from the latent space as input and tries to generate fake samples. It fails initially but eventually learns to generate images that are indistinguishable from real ones.

We first define a Unet block to be used in the model. We can think of it this way: we normally use Conv2D → activation layer → batch normalisation.

After every convolution we add an activation layer to introduce non-linearity and a batch normalization layer to stabilise the activations of the hidden layers. We wrap this sequence in a module that can be used directly in our model.

All the code is simplified for illustrative purposes (UNet Block)

Our UNet block consists of Conv2D → Upsampling → Activation → BatchNorm.

All the code is simplified for illustrative purposes (Conv Block)

The basic functionality of these modules is to extract features, increase dimensionality and normalize the activations before passing them to the next layers.

This is the ConvBlock class. We are adding 3 layers from PyTorch:
1. Conv2d
2. LeakyReLU (activation)
3. BatchNorm2d

All the layers are imported from the PyTorch modules: nn.Conv2d, nn.LeakyReLU and nn.BatchNorm2d.
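
As a rough sketch of the three layers listed above, a ConvBlock could look like the following. The argument names (`ni`, `no`, `ks`), the padding choice and the LeakyReLU slope of 0.2 are my assumptions for illustration; the real DeOldify block may differ.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv2d -> LeakyReLU -> BatchNorm2d, in the order listed above."""
    def __init__(self, ni, no, ks=3, stride=1):
        super().__init__()
        self.seq = nn.Sequential(
            # padding=ks // 2 keeps height and width unchanged when stride=1
            nn.Conv2d(ni, no, kernel_size=ks, stride=stride, padding=ks // 2),
            nn.LeakyReLU(0.2),
            nn.BatchNorm2d(no),
        )

    def forward(self, x):
        return self.seq(x)

block = ConvBlock(ni=3, no=16)
out = block(torch.randn(4, 3, 32, 32))  # batch of 4 RGB images, 32x32
print(out.shape)  # torch.Size([4, 16, 32, 32])
```

With `stride=2` the same block halves the spatial resolution, which is how a stack of such blocks can downsample an image while widening the channel dimension.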

This ConvBlock is used to build the Unet Block which will in turn be used to build the Unet Model.

Similar to ConvBlock, the UpsampleBlock increases image dimensions by repeating the rows and columns of the data. Batch normalization normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation, which stabilises the training process. Training images are taken from ImageNet.
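
Both operations are easy to inspect on a tiny tensor. The sketch below assumes nearest-neighbour upsampling via `nn.Upsample`, which matches the "repeat rows and columns" description; the actual UpsampleBlock may use a different upsampling method.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]]).view(1, 1, 2, 2)  # (batch, channels, H, W)

# Nearest-neighbour upsampling repeats every row and every column.
up = nn.Upsample(scale_factor=2, mode="nearest")
y = up(x)
print(y[0, 0])
# tensor([[1., 1., 2., 2.],
#         [1., 1., 2., 2.],
#         [3., 3., 4., 4.],
#         [3., 3., 4., 4.]])

# Batch normalization in training mode: subtract the per-channel batch mean
# and divide by the per-channel batch standard deviation
# (affine=False switches off the learned scale and shift).
bn = nn.BatchNorm2d(1, affine=False)
z = bn(y)
```

After the batch-norm step, `z` has a mean of roughly 0 and a variance of roughly 1 across the batch.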

Now we will make the Unet model based on the feature blocks built above.

Lots of moving parts in this Unet model. Let’s break it down into finer details.

model_meta = {
    resnet18: [8, 6], resnet34: [8, 6], resnet50: [8, 6], resnet101: [8, 6], resnet152: [8, 6],
    vgg16: [0, 22], vgg19: [0, 22],
    resnext50: [8, 6], resnext101: [8, 6], resnext101_64: [8, 6],
    wrn: [8, 6], inceptionresnet_2: [-2, 9], inception_4: [-1, 9],
    dn121: [0, 7], dn161: [0, 7], dn169: [0, 7], dn201: [0, 7],
}

model_meta is just a dictionary keyed by the pretrained models imported from here; each entry holds two indices, the first of which is the cut point passed to cut_model below. Please don’t expect me to show the architecture of Resnet34! It can be downloaded from Resnet34 download and used directly with PyTorch.

def cut_model(m, cut):    
    return list(m.children())[:cut] if cut else [m]

m.children() returns the top-level layers of the model, and cut_model keeps only the first cut of them (or the whole model when cut is 0). For ResNet34, cutting at 8 drops the final pooling and fully connected classification layers, so only the pretrained convolutional backbone is reused as the encoder of our Unet.
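
To see what cut_model returns, here is a small sketch that applies it to a toy network standing in for a pretrained classifier. The toy model and its layer sizes are made up for illustration.

```python
import torch.nn as nn

def cut_model(m, cut):
    return list(m.children())[:cut] if cut else [m]

# Stand-in for a pretrained classifier: feature layers, then a head.
toy = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),   # child 0 (features)
    nn.ReLU(),                       # child 1 (features)
    nn.AdaptiveAvgPool2d(1),         # child 2 (head)
    nn.Flatten(),                    # child 3 (head)
    nn.Linear(8, 10),                # child 4 (head)
)

body = nn.Sequential(*cut_model(toy, 2))  # keep only children 0 and 1
print(len(body))  # 2
```

The slice `[:cut]` keeps the leading children, so the classification head at the end of the network is what gets discarded.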

Our generator model is done. Believe it or not! We’ll use the Unet model as the generator directly by calling:

netG = Unet34(nf_factor=2).cuda()

When we instantiate the Unet model, all the layers described in the method _get_pretrained_weights() are loaded. How did these methods get called? I actually hid a small part here. In our code, the Unet class inherits from AbstractUnet. Have a look at the AbstractUnet class:

We are running those methods at class initialization itself.
Did all the parts converge now? This is the Generator model(Unet model).

Coming to the critic model: its inputs are real and fake images presented in random order. The output should be 1 for a real image and 0 for a generated image, so the critic has to learn to separate real images from fake ones.

The critic model is a binary classifier made of basic feature-extraction convolutional layers with occasional Dropout layers to prevent overfitting.

The code is pretty straightforward. We add Conv layers so that the model can extract features from the image. The last layer is important though: ConvBlock(cndf, 1, stride=1) reduces the features to a single channel, the score that separates fake images (towards 0) from real ones (towards 1).
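
A heavily simplified critic in the same spirit might look like this. `TinyCritic` is a hypothetical stand-in and not DeOldify's DCCritic; the layer sizes, the dropout rate and the final sigmoid are assumptions chosen to match the 0/1 description above.

```python
import torch
import torch.nn as nn

class TinyCritic(nn.Module):
    """Hypothetical, heavily simplified critic (not DeOldify's DCCritic)."""
    def __init__(self, ni=3, nf=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(ni, nf, 4, stride=2, padding=1),      # 32 -> 16
            nn.LeakyReLU(0.2),
            nn.Dropout2d(0.2),                              # occasional dropout
            nn.Conv2d(nf, nf * 2, 4, stride=2, padding=1),  # 16 -> 8
            nn.LeakyReLU(0.2),
        )
        # Final conv collapses the features to one channel: the score map.
        self.out = nn.Conv2d(nf * 2, 1, 3, stride=1, padding=1)

    def forward(self, x):
        score_map = self.out(self.features(x))
        # Average the score map and squash to (0, 1): ~1 real, ~0 fake.
        return torch.sigmoid(score_map.mean(dim=(1, 2, 3)))

critic = TinyCritic()
scores = critic(torch.randn(4, 3, 32, 32))
print(scores.shape)  # torch.Size([4])
```

Each image in the batch gets exactly one score, which is what the adversarial loss consumes.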

To load the Critic model,

netD = DCCritic(ni=3, nf=256).cuda()

Think of ni and nf as the convolution parameters: ni is the number of input channels (3 for an RGB image) and nf is the number of filters.

We have made the models! How do we actually train them? This is where we include all the hyperparameters and constantly change learning rates, input image sizes and everything else.

While the generator is training, the critic is frozen, and vice versa. We start and stop weight updates using set_trainable(True/False).

While training the critic, we pass images to self.netD(image). The critic returns a result and we then run backprop. This is the snippet for training a critic:

orig_image, real_image are torch.Tensors.
to_np() : returns an np.array object given an input of np.array, list, tuple, torch variable or tensor.
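
The critic step described above can be sketched roughly as follows. This is my own simplified stand-in: `set_trainable` here is a minimal reimplementation that toggles `requires_grad`, the toy networks are placeholders, and the binary cross-entropy loss is an assumption that matches the 1-for-real, 0-for-fake description, which may differ from the loss DeOldify actually uses.

```python
import torch
import torch.nn as nn

def set_trainable(module, flag):
    """Simplified stand-in: toggle gradient tracking for every parameter."""
    for p in module.parameters():
        p.requires_grad = flag

torch.manual_seed(0)
netG = nn.Conv2d(1, 3, 3, padding=1)  # toy generator: grey -> colour
netD = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Flatten(),
                     nn.Linear(64, 1))  # toy critic for 8x8 images
opt_d = torch.optim.Adam(netD.parameters(), lr=5e-4)
bce = nn.BCEWithLogitsLoss()

def train_critic_step(orig_image, real_image):
    set_trainable(netD, True)
    set_trainable(netG, False)          # generator is frozen for this step
    fake_image = netG(orig_image).detach()
    opt_d.zero_grad()
    loss = (bce(netD(real_image), torch.ones(real_image.size(0), 1)) +
            bce(netD(fake_image), torch.zeros(fake_image.size(0), 1)))
    loss.backward()                     # backprop through the critic only
    opt_d.step()
    return loss.item()

orig_image = torch.randn(2, 1, 8, 8)    # black-and-white inputs
real_image = torch.randn(2, 3, 8, 8)    # matching colour targets
loss_value = train_critic_step(orig_image, real_image)
```

The generator step mirrors this with the roles swapped: the critic is frozen and the generator is updated so that its outputs score closer to "real".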

Generator training is more complex and needs a much longer explanation. This is the link to all the GAN training code.

Despite recent improvements, GANs need tweaks throughout the training process for optimal results. We’ll use a class called GANTrainSchedule to schedule training hyperparameters such as the generator and critic learning rates, augmentations and the number of epochs to run with each configuration. We can think of this as similar to training ConvNets, where in the beginning we train all the layers and after a few epochs train only the final layers.
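
To give a feel for what such a schedule holds, here is a hypothetical sketch in plain Python. `TrainPhase` and `build_schedule` are names I made up, and every number (image sizes, batch sizes, learning rates, the halving rule) is an illustrative assumption and not DeOldify's actual GANTrainSchedule configuration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrainPhase:
    """One phase of a hypothetical GAN training schedule."""
    image_size: int
    batch_size: int
    epochs: int
    gen_lr: float
    critic_lr: float

def build_schedule(sizes: List[int], base_gen_lr: float = 1e-4,
                   critic_lr_factor: float = 5.0) -> List[TrainPhase]:
    """Larger images get smaller batches; learning rates halve each phase."""
    phases = []
    for i, size in enumerate(sizes):
        gen_lr = base_gen_lr / (2 ** i)
        phases.append(TrainPhase(
            image_size=size,
            batch_size=max(1, 64 * sizes[0] // size),
            epochs=1,
            gen_lr=gen_lr,
            critic_lr=gen_lr * critic_lr_factor,
        ))
    return phases

schedule = build_schedule([64, 96, 128])
for phase in schedule:
    print(phase)
```

A training loop would then iterate over the phases, rebuilding the data loader and resetting the optimizer learning rates at the start of each one.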

None of the above details are needed to train the models. There is an abstracted Jupyter notebook for the above explanation. In this notebook we can get clarity about GAN training schedules and see how it is possible to train GANs with zero knowledge.

SAGANs:

Most GAN-based models for image generation are built using convolutional layers. Convolution processes information in a local neighborhood, so using convolutional layers alone is computationally inefficient for modeling long-range dependencies in images.

Self attention exhibits a better balance between the ability to model long range dependencies and computational efficiency. In other words, a self attention layer decides the amount of exposure of the values to the next layer.

These are similar to attention models in NLP. Suppose we are building an NLP model for language translation. Without attention, we input the full sentence, which is then converted into another language. But how does this model perform with a large piece of text? Will it remember long-range dependencies? No.

A human translator reads a small part of the sentence, translates it, and moves on to the next part, i.e. the translator pays attention to a particular portion of the text at a time. This is the intuition behind attention models.

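Let x be a vector representing the features from the previous hidden layer. The features are first transformed into two feature spaces f and g, with W_f and W_g being the weights of the respective feature spaces:

$$f(x) = W_f x, \qquad g(x) = W_g x$$

$$\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})}, \qquad s_{ij} = f(x_i)^T g(x_j)$$

This formula indicates the extent to which the model attends to the ith location when processing the jth location.

Computing the whole attention output of the model, in its simplest form from the SAGAN paper, where h(x_i) = W_h x_i is a third learned transformation:

$$o_j = \sum_{i=1}^{N} \beta_{j,i} \, h(x_i)$$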

The code for GANs presented above didn’t include the self-attention layers. After building several models with self-attention at various stages, the SAGAN authors conclude:

“The SAGAN models with the self-attention mechanism at the middle-to-high level feature maps achieve better performance than the models with the self-attention mechanism at the low level feature maps.”

The self-attention parts were left out of the above code for the sake of simplicity. I’ll now explain briefly how to add self-attention layers.

UnetBlock is the basic building block for our model. Hence, we add Self attention mechanism to UnetBlock.

import torch.nn as nn

class UnetBlock(nn.Module):
    def __init__(self, up_in, x_in, n_out, self_attention=False):
        super().__init__()
        out_layers = []
        # Placeholder: append the ConvBlock, Upsample and Activation layers here.
        if self_attention:
            # Attend over this block's output feature map.
            out_layers.append(SelfAttention(n_out))
        self.out = nn.Sequential(*out_layers)

Note: We introduced self_attention:bool parameter. This way we can pass self_attention=True for middle-to-higher level feature maps only.

Coming to the implementation of the Self Attention module:
We first need to understand how the model gives attention to a particular part of the image. We’ll be implementing the above formulas only, but I would like to simplify them a bit.

We’ll take 3 vectors: query, key and value. Value is the actual vector that we would pass on without self attention. The matrix multiplication of query and key gives us a score matrix (say attn) that determines how much of the value vector is exposed to the next layer. attn is passed through softmax to turn it into a distribution, and the result is multiplied with the value vector to give the final vector that the next layer should attend to.

Here’s the link to the code of the Self Attention module: Self Attention module. It is implemented exactly as explained above.
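
For readers who don't want to follow the link, here is a compact sketch of the query/key/value computation described above. It is my own approximation of a SAGAN-style layer: the channel reduction by 8, the 1x1 `nn.Conv1d` projections and the learnable `gamma` initialised to zero are assumptions based on common implementations, not a copy of the linked module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Sketch of a SAGAN-style self-attention layer over a feature map."""
    def __init__(self, n_channels):
        super().__init__()
        # 1x1 convolutions over the flattened positions act as projections.
        self.query = nn.Conv1d(n_channels, n_channels // 8, kernel_size=1)
        self.key = nn.Conv1d(n_channels, n_channels // 8, kernel_size=1)
        self.value = nn.Conv1d(n_channels, n_channels, kernel_size=1)
        # gamma starts at 0, so the layer is initially an identity mapping.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        size = x.size()
        x = x.view(size[0], size[1], -1)            # (B, C, N), N = H * W
        q, k, v = self.query(x), self.key(x), self.value(x)
        attn = torch.bmm(q.transpose(1, 2), k)      # (B, N, N) similarities
        attn = F.softmax(attn, dim=1)               # each column sums to 1
        out = self.gamma * torch.bmm(v, attn) + x   # weighted values + input
        return out.view(size)

layer = SelfAttention(16)
x = torch.randn(2, 16, 8, 8)
y = layer(x)
print(y.shape)  # torch.Size([2, 16, 8, 8])
```

Because `gamma` starts at zero, the layer initially passes its input through unchanged and only gradually learns how much attention output to mix in.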

Progressive Growing of GANs

Our primary contribution is a training methodology for GANs where we start with low-resolution images, and then progressively increase the resolution by adding layers to the networks. This incremental nature allows the training to first discover large-scale structure of the image distribution and then shift attention to increasingly finer scale detail, instead of having to learn all scales simultaneously. We use generator and discriminator networks that are mirror images of each other and always grow in synchrony. All existing layers in both networks remain trainable throughout the training process.
- Progressive Growing of GANs

Generating high-resolution images is difficult because higher resolution makes it easier for the Critic to tell the generated images apart from the training images, thus amplifying the gradient problem.

The authors propose a solution to train GANs for improved quality, stability and variation. The key takeaway is to grow the layers of both the Critic and the Generator progressively: starting from a low resolution, we add new layers that model increasingly fine details as the training progresses.

Notice how we start training with low-resolution images (4x4 pixels). As the training advances, we gradually add layers to the Generator and Discriminator.

One change in DeOldify is that the number of layers remains constant while the image resolution is increased progressively.
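
The reason a fixed set of layers can handle growing images is that convolutional layers do not depend on the input size. The sketch below illustrates this with a toy fully convolutional network; the network and the 64/128/256 resolution steps are hypothetical and are not DeOldify's actual settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A fully convolutional toy generator: no layer depends on the input size.
net = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 3, 3, padding=1),
)
n_params = sum(p.numel() for p in net.parameters())

images = torch.randn(2, 1, 256, 256)   # stand-in for grey training images
output_sizes = []
for size in (64, 128, 256):            # hypothetical resolution schedule
    batch = F.interpolate(images, size=(size, size), mode="bilinear",
                          align_corners=False)
    out = net(batch)                   # same layers, larger feature maps
    output_sizes.append(tuple(out.shape[-2:]))

print(output_sizes)  # [(64, 64), (128, 128), (256, 256)]
```

The parameter count stays the same across all three resolutions; only the size of the training images changes between phases.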


Two Time-Scale Update Rule in GANs(TTUR):

Simply put, there is no guarantee that generative model will converge since the discriminator model serves as an objective(to the generator).

In typical GAN implementations where the generator and discriminator were trained with the same learning rate, the discriminator often learned faster than the generator. The main problem arises because the discriminator converges to a local minimum even when the generator is fixed. This paper proposes separate learning rates for the generator and the discriminator.

The authors prove that GANs trained with TTUR converge to a stationary local Nash equilibrium.

Nash Equilibrium is a proposed solution of a non-cooperative game involving two or more players in which each player is assumed to know the equilibrium strategies of the other players, and no player has anything to gain by changing only their own strategy.
-Wikipedia

In DeOldify, the critic’s learning rate is 5x the generator’s learning rate.
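
In PyTorch, this amounts to giving each network its own optimizer. The sketch below uses placeholder networks and a hypothetical base learning rate of 1e-4; only the 5x ratio comes from the text above, and the choice of Adam is my assumption.

```python
import torch
import torch.nn as nn

netG = nn.Conv2d(1, 3, 3, padding=1)   # placeholder generator
netD = nn.Conv2d(3, 1, 3, padding=1)   # placeholder critic

gen_lr = 1e-4                # hypothetical base learning rate
critic_lr = 5 * gen_lr       # critic learns on a faster time scale

# Two independent optimizers, one per network, each with its own rate.
opt_g = torch.optim.Adam(netG.parameters(), lr=gen_lr)
opt_d = torch.optim.Adam(netD.parameters(), lr=critic_lr)

ratio = opt_d.param_groups[0]["lr"] / opt_g.param_groups[0]["lr"]
print(ratio)
```

Keeping the two optimizers separate also makes it easy to change either rate independently as the training schedule progresses.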

I posted a template on Clouderizer; you can check it out in this article.

Conclusion:

Hufff! That was long. But the most important thing I want you to realise is that GANs are not limited to generating images (I was under that impression when I was first introduced to them). They can do much more than that. Now that I think about it, GANs must have a good understanding of their inputs (or the underlying vectors). That means that, with good training, GANs can create whatever we feed them.

Footnotes:
Fast.ai : Deep Learning Course
GitHub Repo ( Credits : Jantic)
Self-Attention Generative Adversarial Networks
Progressive Growing of GANs
Two Time-Scale Update Rule in GANs
Improved Techniques for training GANs