PyTorch Deep Learning Nanodegree: Generative Adversarial Networks
The fifth part of the Nanodegree: GANs
Generative Adversarial Networks
In this part we learn about various types of GANs and how to implement them. We'll also work on the fourth project: generating faces.
Generative Adversarial Networks
The first lesson on GANs is led by Ian Goodfellow, who invented GANs! It was very exciting to see him!
In the later lessons, he is presented as a cartoonish character!
MNIST GAN notebooks are available here: https://github.com/udacity/deep-learning-v2-pytorch/tree/master/gan-mnist
Deep Convolutional GANs
The second lesson is about DCGANs and covers some additional topics.
What is Batch Normalization?
Batch normalization was introduced in Sergey Ioffe’s and Christian Szegedy’s 2015 paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. The idea is that, instead of just normalizing the inputs to the network, we normalize the inputs to every layer within the network.
It’s called “batch” normalization because, during training, we normalize each layer’s inputs by using the mean and standard deviation (or variance) of the values in the current batch. These are sometimes called the batch statistics.
Beyond the intuitive reasons, there are good mathematical reasons to motivate batch normalization. It helps combat what the authors call internal covariate shift.
In this case, internal covariate shift refers to the change in the distribution of the inputs to different layers. It turns out that training a network is most efficient when the distribution of inputs to each layer is similar!
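As a quick illustration (my own, not from the course notebooks), here is what normalizing with batch statistics looks like in PyTorch; the toy tensor sizes are hypothetical:

import torch
import torch.nn as nn

# a toy batch: 4 samples with 3 features each
x = torch.randn(4, 3) * 5 + 10

# normalize manually with the batch mean and (biased) variance
eps = 1e-5
x_hat = (x - x.mean(dim=0)) / torch.sqrt(x.var(dim=0, unbiased=False) + eps)

# nn.BatchNorm1d computes the same thing in training mode,
# since its learnable scale and shift default to 1 and 0
bn = nn.BatchNorm1d(num_features=3).train()
print(torch.allclose(bn(x), x_hat, atol=1e-5))  # True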
Benefits of Batch Normalization
Batch normalization optimizes network training. It has been shown to have several benefits:
- Networks train faster — Each training iteration will actually be slower because of the extra calculations, however, it should converge much more quickly, so training should be faster overall.
- Allows higher learning rates — Gradient descent usually requires small learning rates for the network to converge. Using batch normalization allows us to use much higher learning rates, which further increases the speed at which networks train.
- Makes weights easier to initialize — Batch normalization seems to allow us to be much less careful about choosing our initial starting weights.
- Makes more activation functions viable — Some activation functions do not work well in some situations. Sigmoids lose their gradient pretty quickly, which means they can’t be used in deep networks. And ReLUs often die out during training, where they stop learning completely, so we need to be careful about the range of values fed into them. Because batch normalization regulates the values going into each activation function, non-linearities that don’t seem to work well in deep networks actually become viable again.
- Simplifies the creation of deeper networks — Because of the first 4 items listed above, it is easier to build and faster to train deeper neural networks when using batch normalization.
- Provides a bit of regularization — Batch normalization adds a little noise to your network. In some cases, such as in Inception modules, batch normalization has been shown to work as well as dropout. But in general, consider batch normalization as a bit of extra regularization, possibly allowing you to reduce some of the dropout you might add to a network.
- May give better results overall — Some tests seem to show batch normalization actually improves the training results. However, it’s really an optimization to help train faster, so you shouldn’t think of it as a way to make your network better.
Batch Normalization notebook is available here: https://github.com/udacity/deep-learning-v2-pytorch/tree/master/batch-norm
DCGAN notebooks are available here: https://github.com/udacity/deep-learning-v2-pytorch/tree/master/gan-mnist
GANs are not only used for image generation; they are also used to find weaknesses in existing, trained models. The adversarial examples that a generator learns to make can be designed to trick a pre-trained model. Essentially, small perturbations in images can cause a classifier (like AlexNet or another well-known image classifier) to fail pretty spectacularly!
This OpenAI blog post details how adversarial examples can be used to “attack” existing models, and discusses potential security issues. And one example of a perturbation that causes misclassification can be seen below.
Other Applications of GANs
Consider this car classification example. From the abstract, the researchers (Timnit Gebru et al.) wanted to:
develop a computer vision pipeline to predict income, per capita carbon emission, crime rates and other city attributes from a single source of publicly available visual data. We first detect cars in 50 million images across 200 of the largest US cities and train a model to predict demographic attributes using the detected cars. To facilitate our work, we have collected the largest and most challenging fine-grained dataset reported to date consisting of over 2600 classes of cars comprised of images from Google Street View and other web sources, classified by car experts to account for even the most subtle of visual differences.
One interesting thing to note is that these researchers obtained some manually-labeled Streetview data and data from other sources. We’ll call these image sources domains: Streetview is one domain, and another source, say cars.com, is a separate domain.
The researchers then had to find a way to combine what they learned from these multiple sources! They did this with the use of multiple classifiers: adversarial networks that do not include a Generator, just two classifiers.
- One classifier is learning to recognize car types
- And another is learning to classify whether a car image came from Google Streetview or cars.com, given the extracted features from that image
So, the first classifier’s job is to classify the car image correctly and to trick the second classifier so that the second classifier cannot tell whether the extracted image features indicate an image from the Streetview or cars.com domain!
The idea is: if the second classifier cannot tell which domain the features are from, then this indicates that these features are shared among the two domains, and you’ve found features that are domain-invariant.
Domain-invariance can be applied to a number of applications in which you want to find features that are invariant between two different domains. These can be image domains or domains based on different population demographics and so on. This is also sometimes referred to as adversarial feature learning.
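One common way to implement this kind of adversarial feature learning is the gradient-reversal trick from domain-adversarial training. Below is a minimal sketch of that idea (the layer sizes and names are hypothetical, and this is not necessarily how the paper's authors implemented it):

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # identity on the forward pass, flips the gradient sign on backward
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

feature_extractor = nn.Sequential(nn.Linear(256, 64), nn.ReLU())
car_classifier = nn.Linear(64, 10)     # first classifier: car type
domain_classifier = nn.Linear(64, 2)   # second classifier: Streetview vs. cars.com

x = torch.randn(8, 256)                # a batch of pre-extracted image features
f = feature_extractor(x)
car_logits = car_classifier(f)
# the reversed gradient pushes the extractor toward domain-invariant features
domain_logits = domain_classifier(GradReverse.apply(f))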
Ethical and Artistic Applications: Further Reading
- Ethical implications of GANs and when “fake” images can give us information about reality.
- Do Androids Dream in Balenciaga? Note that the author briefly talks about generative models having artistic potential rather than ethical implications, but the two go hand in hand. The generator, in this case, will recreate what it sees on the fashion runway; typically thin, white bodies that do not represent the diversity of people in the world (or even the diversity of people who buy Balenciaga).
Pix2Pix and CycleGAN
In this lesson the lecturer is Jun-Yan Zhu, who is one of the creators of CycleGAN.
As with any new formulation, it’s important not only to learn about its strengths and capabilities, but also, its weaknesses. A CycleGAN has a few shortcomings:
- It will only show one version of a transformed output even if there are multiple possible outputs.
- A simple CycleGAN produces low-resolution images, though there is some research around high-resolution GANs.
- It occasionally fails! (One such case is pictured below.)
Implementing a CycleGAN
The notebooks for this lesson are available here: https://github.com/udacity/deep-learning-v2-pytorch/tree/master/cycle-gan
More generally, skip connections can be made between several layers to combine the inputs of, say, a much earlier layer and a later layer. These connections have been shown to be especially important in image segmentation tasks, in which you need to preserve spatial information over time (even when your input has gone through strided convolutional or pooling layers). One such example, is in this paper on skip connections and their role in medical image segmentation.
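As a toy illustration (my own, not from the course code), a skip connection can be implemented by concatenating an earlier layer's activations with a later layer's, U-Net style; all sizes below are hypothetical:

import torch
import torch.nn as nn

class TinyUNetBlock(nn.Module):
    """Toy encoder-decoder with one skip connection."""
    def __init__(self):
        super().__init__()
        self.down = nn.Conv2d(16, 32, 4, stride=2, padding=1)          # 32x32 -> 16x16
        self.up = nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1)   # 16x16 -> 32x32
        self.out = nn.Conv2d(32, 16, 3, padding=1)

    def forward(self, x):
        skip = x                       # keep the early, high-resolution features
        h = torch.relu(self.down(x))
        h = torch.relu(self.up(h))
        # concatenate the skip connection along the channel dimension
        return self.out(torch.cat([h, skip], dim=1))

y = TinyUNetBlock()(torch.randn(1, 16, 32, 32))  # -> (1, 16, 32, 32)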
Least squares can partly address the vanishing gradient problem for training deep GANs. The problem is as follows: for negative log-likelihood loss, when an input x is quite big, the gradient can get close to zero and become meaningless for training purposes. However, with a squared loss term, the gradient will actually increase with a larger x, as shown below.
Least square loss is just one variant of a GAN loss. There are many more variants such as a Wasserstein GAN loss and others. These loss variants sometimes can help stabilize training and produce better results. As you write your own code, you’re encouraged to hypothesize, try out different loss functions, and see which works best in your case!
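For reference, here is a minimal sketch of the least squares GAN loss in PyTorch (my own illustration, using the common 0/1 targets; other target choices exist):

import torch

def real_loss(d_out):
    # least squares loss: push D's output toward 1 for real images
    return torch.mean((d_out - 1) ** 2)

def fake_loss(d_out):
    # least squares loss: push D's output toward 0 for fake images
    return torch.mean(d_out ** 2)

# discriminator loss: real_loss(D(real)) + fake_loss(D(fake))
# generator loss:     real_loss(D(G(z))), since G wants fakes to score as real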
Project: Generate Faces
In this project our task was to take the CelebA dataset of face images and generate new faces.
In the notebook we resize the faces to 32x32 and this is how they look:
Not really pretty, but this will still be a good exercise.
We are going to use a GAN whose generator ends with a tanh activation, so the first step is rescaling the real images to the same (-1, 1) range:
def scale(x, feature_range=(-1, 1)):
    '''Scale takes in an image x and returns that image, scaled
    with a feature_range of pixel values from -1 to 1.
    This function assumes that the input x is already scaled from 0-1.'''
    # scale from (0, 1) to feature_range and return the scaled x
    low, high = feature_range
    x = x * (high - low) + low
    return x
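For example (my own sanity check, not from the notebook), values in [0, 1] map to [-1, 1]:

import torch
print(scale(torch.tensor([0.0, 0.5, 1.0])))  # tensor([-1., 0., 1.])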
In order to make writing the network architecture easier, we have this helper function:
import torch.nn as nn

def conv(in_channels, out_channels, kernel_size, stride=2, padding=1, batch_norm=True):
    """Creates a convolutional layer, with optional batch normalization."""
    layers = [nn.Conv2d(in_channels, out_channels, kernel_size,
                        stride, padding, bias=False)]
    if batch_norm:
        layers.append(nn.BatchNorm2d(out_channels))
    return nn.Sequential(*layers)
It allows us to add convolutional and batch normalization layers easily. And here is a similar function for upscaling:
def deconv(in_channels, out_channels, kernel_size, stride=2, padding=1, batch_norm=True):
    """Creates a transposed-convolutional layer, with optional batch normalization."""
    # create a sequence of transpose + optional batch norm layers
    layers = [nn.ConvTranspose2d(in_channels, out_channels, kernel_size,
                                 stride, padding, bias=False)]
    if batch_norm:
        # append batchnorm layer
        layers.append(nn.BatchNorm2d(out_channels))
    return nn.Sequential(*layers)
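For instance (hypothetical sizes), conv(3, 32, 4) returns an nn.Sequential of Conv2d followed by BatchNorm2d, halving a 32x32 input to 16x16:

import torch
layer = conv(3, 32, 4)                       # Conv2d + BatchNorm2d
out = layer(torch.randn(1, 3, 32, 32))
print(out.shape)                             # torch.Size([1, 32, 16, 16])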
After several attempts I decided to use the following Discriminator:
import torch.nn.functional as F

class Discriminator(nn.Module):

    def __init__(self, conv_dim=32):
        super(Discriminator, self).__init__()
        self.conv_dim = conv_dim
        # 32x32 input
        self.conv1 = conv(3, conv_dim, 4, batch_norm=False)  # first layer, no batch_norm
        # 16x16 out
        self.conv2 = conv(conv_dim, conv_dim*2, 4)
        # 8x8 out
        self.conv3 = conv(conv_dim*2, conv_dim*4, 4)
        # 4x4 out
        self.conv4 = conv(conv_dim*4, conv_dim*8, 4)
        # 2x2 out
        self.conv5 = conv(conv_dim*8, conv_dim*16, 2, 1, 0)
        # 1x1 out
        # final, fully-connected layer
        self.fc = nn.Linear(conv_dim*4*4, 1)
        self.drop = nn.Dropout(0.1)

    def forward(self, x):
        # all hidden layers + leaky relu activation
        out = F.leaky_relu(self.conv1(x), 0.2)
        out = F.leaky_relu(self.conv2(out), 0.2)
        out = F.leaky_relu(self.conv3(out), 0.2)
        out = F.leaky_relu(self.conv4(out), 0.2)
        out = F.leaky_relu(self.conv5(out), 0.2)
        # flatten (conv_dim*16 == conv_dim*4*4 channels at 1x1)
        out = out.view(-1, self.conv_dim*4*4)
        # final output layer
        out = self.fc(self.drop(out))
        return out
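A quick shape check (my own, with a hypothetical batch size):

import torch
D = Discriminator()
print(D(torch.randn(16, 3, 32, 32)).shape)  # torch.Size([16, 1])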
And this was my Generator:
import torch

class Generator(nn.Module):

    def __init__(self, z_size, conv_dim=32):
        super(Generator, self).__init__()
        self.conv_dim = conv_dim
        # first, fully-connected layer
        self.fc = nn.Linear(z_size, conv_dim*4*4*10)
        # transpose conv layers
        self.t_conv1 = deconv(self.conv_dim*10, conv_dim*8, 4, stride=1, padding=0)
        self.t_conv2 = deconv(conv_dim*8, conv_dim*4, 4)
        self.t_conv3 = deconv(conv_dim*4, conv_dim*2, 4)
        self.t_conv4 = deconv(conv_dim*2, conv_dim, 4, stride=1, padding=0)
        self.t_conv5 = deconv(conv_dim, 3, 4, batch_norm=False, stride=1, padding=1)
        self.drop = nn.Dropout(0.1)

    def forward(self, x):
        bs = x.shape[0]
        # fully-connected + reshape
        out = self.fc(x)
        out = out.view(bs, self.conv_dim*10, 4, 4)  # (batch_size, depth, 4, 4)
        # hidden transpose conv layers + leaky relu
        out = F.leaky_relu(self.t_conv1(out), 0.19)
        out = F.leaky_relu(self.t_conv2(out), 0.18)
        out = F.leaky_relu(self.t_conv3(out), 0.17)
        out = F.leaky_relu(self.t_conv4(out), 0.16)
        # last layer + tanh activation
        out = self.t_conv5(out)
        out = torch.tanh(out)
        return out
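And a shape check for the generator (again my own, with a hypothetical batch and z_size):

import torch
G = Generator(z_size=100)
print(G(torch.randn(16, 100)).shape)  # torch.Size([16, 3, 32, 32])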
Training took a lot of time; sometimes the network didn’t converge at all, sometimes it stopped improving too early. Here is an example of training losses during one of the training runs.
These are the images which I was able to generate as a result.
We can see that some images are distorted or unclear, and some of them are too similar to the training data. Here are some ideas to improve the results:
- The most obvious idea is to use more diverse data for training. We could try using images of people of more diverse ages, races and so on. Also, simply increasing the volume of training data could help;
- For now, we trained a network which isn’t very deep. We could borrow some ideas from famous GAN architectures;
- Losses are very important to the quality of the net. We could try using some other kind of loss. One advanced idea is to train with one loss at first and then switch to another. We could also try using a scheduler to gradually decrease (or otherwise change) the learning rate, as sketched after this list.
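For instance, a minimal sketch of the scheduler idea (my own illustration; the optimizer settings, step size and decay factor are hypothetical, and G is the generator from above):

import torch.optim as optim

g_optimizer = optim.Adam(G.parameters(), lr=0.0002, betas=(0.5, 0.999))
# halve the learning rate every 10 epochs
scheduler = optim.lr_scheduler.StepLR(g_optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... run the usual generator/discriminator training steps here ...
    scheduler.step()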
This was the fifth part of the Deep Learning Nanodegree. We learned how to write various GAN architectures, studied their applications and listened to amazing people. The next part will be about deploying models with AWS SageMaker!