Recently I participated in a contest on Kaggle in which the task was to train a Generative Adversarial Network (GAN) on the Stanford Dogs dataset to generate new unseen images of dogs. This was the first online contest that I ever participated in.
GANs have been in the spotlight ever since they were introduced and have produced some amazing results. I had been reading about them and had implemented some simple architectures on the MNIST dataset to test my knowledge, but I never got a chance to dive deep into them and build a better understanding. So when I came across this contest, I knew I had to participate, and after entering it I learned a lot about GANs.
During the contest, I experimented a lot with architectures and hyperparameters and tried out various techniques posted by other participants to improve the results. In this post, I'll be sharing some of the techniques that worked for me.
Upsampling vs Transposed Convolution:
The generator network takes in random noise and performs operations on it to generate new images. For these operations, we can use either Upsampling layers or Transposed Convolution layers (how these layers work is beyond the scope of this article). But which layer produces better results? There is a very nice article on this topic here; the gist is that Transposed Convolution layers tend to produce images with a checkerboard (grid) artifact, while Upsampling layers produce smoother, more natural-looking images.
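To see where the checkerboard comes from, here is a small NumPy sketch (an illustration, not the framework layers themselves): nearest-neighbor upsampling simply repeats pixels, while a strided transposed convolution covers output positions an uneven number of times, which is the source of the grid pattern.

```python
import numpy as np

def nearest_neighbor_upsample(x, scale=2):
    """Upsample a 2D feature map by repeating each pixel (no learned weights)."""
    return np.repeat(np.repeat(x, scale, axis=0), scale, axis=1)

def transposed_conv_overlap(size, kernel=3, stride=2):
    """Count how many kernel applications touch each output position of a 1D
    transposed convolution. Uneven counts are what cause the checkerboard."""
    out_len = (size - 1) * stride + kernel
    counts = np.zeros(out_len, dtype=int)
    for i in range(size):
        counts[i * stride : i * stride + kernel] += 1
    return counts

x = np.array([[1.0, 2.0], [3.0, 4.0]])
print(nearest_neighbor_upsample(x))     # each pixel becomes a 2x2 block
print(transposed_conv_overlap(4))       # coverage alternates 1, 2, 1, 2, ...
```

With `kernel=3` and `stride=2`, every other output position is covered twice as often as its neighbors, so the transposed convolution has to learn to compensate, which it often does imperfectly.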
The generator takes noise as input, and to my surprise, the size of the noise vector also affected the results. Initially I used a noise vector of size 100, but when I changed it to 128 and 256, my results got better and training also sped up.
The biggest mistake I made when I started was using a very large batch size. While training your GAN, use a batch size smaller than or equal to 64. A bigger batch size might hurt performance because, early in training, the discriminator gets many examples to train on and can overpower the generator, which has a negative effect on training. For me, batch sizes of 8 and 16 worked best.
The loss function plays a huge role in the training process, and researchers have experimented with various custom loss functions for GAN training. I started out with a simple binary_crossentropy loss, which gave decent results. Then I tried the Wasserstein loss used in the WGAN paper, and this slightly improved the results. During the contest, most participants seemed to use the RaLS (Relativistic Average Least Squares) loss, but despite many efforts, I could not get it working in TensorFlow. The takeaway is that trying out different loss functions can help.
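For reference, here is a minimal NumPy sketch of the two losses I did get working. The Wasserstein critic loss is usually implemented as the mean of labels times predictions, with real/fake labeled +1/-1; note that on its own it also requires weight clipping or a gradient penalty to enforce the Lipschitz constraint, which is omitted here.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Standard GAN discriminator loss on sigmoid outputs in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def wasserstein_loss(y_true, y_pred):
    """WGAN critic loss: labels are +1 (real) / -1 (fake), critic outputs
    are unbounded scores. The critic minimizes this quantity."""
    return np.mean(y_true * y_pred)

# Toy example: two samples, one real and one fake
labels = np.array([1.0, -1.0])
scores = np.array([2.0, 3.0])
print(wasserstein_loss(labels, scores))  # mean(2.0 - 3.0) / ... -> -0.5
```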
Overtraining Might Hurt:
While training your network, you might be tempted to train your GAN for as long as you possibly can, but overtraining can degrade the quality of the generated samples. Most of the contest participants reported that their best images were produced partway through training, not towards the end. In the contest, the MiFID (Memorization-informed Fréchet Inception Distance) metric was used to evaluate the generated samples, so I monitored this metric and used early stopping to achieve the best results.
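The early-stopping logic can be sketched framework-free like this; `train_one_epoch` and `evaluate_fid` are hypothetical callables standing in for the real training and scoring code, and the patience threshold is an assumption you would tune.

```python
def train_with_early_stopping(train_one_epoch, evaluate_fid,
                              max_epochs=200, patience=10):
    """Keep the epoch with the best (lowest) FID-style score and stop once
    the score has not improved for `patience` consecutive epochs."""
    best_score, best_epoch, stale = float("inf"), -1, 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        score = evaluate_fid(epoch)
        if score < best_score:
            best_score, best_epoch, stale = score, epoch, 0
            # in real code: save the generator weights here as the checkpoint
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_score

# Toy run: the score dips at epoch 2, then worsens -> stop early
demo_scores = [10, 8, 6, 7, 8, 9]
best = train_with_early_stopping(lambda e: None, lambda e: demo_scores[e],
                                 max_epochs=len(demo_scores), patience=2)
print(best)  # (2, 6): best score of 6, reached at epoch 2
```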
Most of the time, the data we have is multimodal: it has multiple modes, where each mode represents a group of data points that share similar features.
A very common problem while training GANs is mode collapse, in which the generator produces images belonging to only a few modes of the data and ignores all the others, so the generated images are very similar and lack variety. There are a few suggested ways to solve this problem, but I could not try them all. Experimenting with the learning rates of both the generator and the discriminator helped me overcome it. So while training your GANs, you should monitor the generated samples to see if your model is suffering from this problem.
There is also the GAN Hacks repository, which lists many other techniques you can use; most of them helped me converge faster and achieve better results. Some more components you can experiment with:
- The architecture of your GAN is the most important thing. You can take inspiration from various famous GANs and implement your own version of them.
- Although most GANs use the Adam optimizer for training, don't be afraid to try the good old Gradient Descent optimizer.
- For the activation functions, instead of the plain ReLU you can try its variants such as Leaky ReLU, ELU (Exponential Linear Unit), or PReLU (Parametric Rectified Linear Unit).
- You can also use various image augmentation techniques while training.
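For the activation variants mentioned above, here is a quick NumPy sketch of how Leaky ReLU and ELU differ from plain ReLU on negative inputs (framework layers such as `LeakyReLU` in Keras implement the same formulas with learnable or configurable slopes):

```python
import numpy as np

def relu(x):
    """Plain ReLU: zeroes out all negative inputs (gradient dies there)."""
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.2):
    """Keeps a small slope `alpha` for negative inputs instead of zero."""
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    """Smoothly saturates toward -alpha for large negative inputs."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))         # negative input becomes 0.0
print(leaky_relu(x))   # negative input scaled by 0.2 -> -0.4
print(elu(x))          # negative input becomes exp(-2) - 1
```

Leaky ReLU with a slope around 0.2 is the common default in GAN discriminators, since a dead ReLU passes no gradient back to the generator.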
While training various GANs, I also found that they were very sensitive to change; often even a small tweak somewhere altered the results drastically. Despite the extensive research going on in Deep Learning, many topics still remain a black box: for most of the things I implemented and tested, I could not find a reasonable explanation. That is why the configurations that worked for me are not guaranteed to work for you as well, but these are areas you can focus on while training your GANs, because they helped me achieve good results. The right configuration for each of them will depend on the type of problem you are working on and the data you have.