Advancements of Deep Learning 3: Generative Adversarial Networks (GANs).

Dashanka Nadeeshan
Coinmonks
12 min read · Aug 5, 2020


GAN-generated images. Source: fake face images

A brief discussion without math…

There is little need to explain how deep learning has influenced the modern tech world in both research and industry. Convolutional neural networks are one of the leading classes of vision-based deep learning architectures; we discussed their basic operation in part 1 and influential network architectures in part 2. Today we discuss another major advancement of deep learning: generative adversarial networks, or GANs.

Generative adversarial networks, or GANs, were first introduced by Ian Goodfellow in 2014, and the basic idea is to generate new data. Inherently, GANs do not require labelled data to learn, which means they fall under unsupervised learning. Over the past couple of years, GANs and their recent advancements have achieved tremendous progress in computer vision, such as generating high-quality images, synthesizing images from text, changing features of images such as faces, and much more. Let us first briefly discuss how GANs work and then trace the evolution of GANs through their prominent developments. We will not go into the sophisticated mathematics behind these GAN architectures; instead, we give an overview of how they work and what they are.

Generative Adversarial Networks (2014)

Figure 1: An illustration of generative adversarial networks

The basic GAN architecture comprises two separate models: a generator and a discriminator. The generator creates fake data and the discriminator distinguishes between real and fake data; one can also say that the generator is trying to fool the discriminator. Generally, these two models can be convolutional neural networks, recurrent neural networks, or in some cases autoencoders. For an architectural understanding, figure 1 above depicts the basic network architecture of a GAN. During training, the generator learns to generate data that can fool the discriminator, while the discriminator learns to tell how real or fake the input data is. As both models are implemented as deep neural networks, the parameters of both networks are tuned simultaneously and both get better over time. Feedback from the GAN output is used to train both the generator and the discriminator through the backpropagation algorithm. The output of the discriminator is the probability that the input is real: the higher the probability, the more likely the input came from the real data, while a lower probability means it is fake data produced by the generator. When the discriminator can no longer distinguish between fake and real, the probability output becomes 0.5, which gives us the optimal solution. Furthermore, the loss function of a basic GAN is based on the two neural networks competing against each other, each differentiable with respect to its parameters and inputs. There is a lot more to discuss even for the most basic form of GANs, and one can find plenty of literature on Medium itself and across the internet according to one's understanding and requirements.
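
To make the training dynamic more concrete, here is a minimal PyTorch sketch of the loop described above, assuming small fully connected networks and toy dimensions (a 64-dimensional noise vector and 784-dimensional data). It is only an illustration, not the implementation from the original paper.

```python
import torch
import torch.nn as nn

noise_dim, data_dim = 64, 784  # toy sizes, assumed for illustration only

generator = nn.Sequential(
    nn.Linear(noise_dim, 128), nn.ReLU(),
    nn.Linear(128, data_dim), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(data_dim, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1), nn.Sigmoid(),   # outputs P(input is real)
)

bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_batch):
    batch = real_batch.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1) Discriminator: push its output towards 1 for real data, 0 for fakes
    fake = generator(torch.randn(batch, noise_dim)).detach()
    d_loss = bce(discriminator(real_batch), real_labels) + \
             bce(discriminator(fake), fake_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Generator: try to make the discriminator output 1 for its fakes
    fake = generator(torch.randn(batch, noise_dim))
    g_loss = bce(discriminator(fake), real_labels)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Both parameter sets are updated from the same adversarial feedback, which is the "two competing networks" dynamic described above.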

Figure 2: A classification of GAN models. source

Before diving into the evolution of GAN architectures, let us look at a classification of GAN models in terms of objective function, structure, and conditioning, as illustrated in figure 2. These GAN models have tried to tackle the issues that occurred with the original GAN architecture, such as unstable training and gradient disappearance. Furthermore, these GAN formulations and developments have addressed different areas, dimensions, and applications. Now let us briefly discuss several prominent GAN models and see how the field has extended.

Conditional Generative Adversarial Networks: CGAN (late 2014)

Figure 3: Typical CGAN architecture

As mentioned before, GANs employ an unsupervised learning methodology. However, GANs often face the problem of controlling the generated images when the dataset is complex or very large. To mitigate this, conditional GANs were proposed, introducing constraints or pre-defined targets for the generator. This can also be described as conditional generation of images by the generator. The introduction of conditional constraints to the generator converts the learning of a GAN from unsupervised to supervised, to some extent. These conditional inputs could be descriptive class labels or image features of the desired outcome, which guide the generator. As an advantage, adding conditions helps faster convergence, but at the same time it places more requirements on the real dataset, such as labelled or tagged data. The introduction of CGANs has proven very useful for many applications and further developments.
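
As a rough sketch of the conditioning mechanism, assuming one-hot class labels as the condition and toy layer sizes, the label is simply concatenated with the noise vector before it enters the generator (the discriminator receives the label concatenated with its input in the same way):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

noise_dim, num_classes, data_dim = 64, 10, 784  # assumed toy sizes

class ConditionalGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + num_classes, 128), nn.ReLU(),
            nn.Linear(128, data_dim), nn.Tanh(),
        )

    def forward(self, z, labels):
        # The condition (a one-hot class label) is concatenated to the
        # noise vector, so the generator learns class-specific outputs.
        y = F.one_hot(labels, num_classes).float()
        return self.net(torch.cat([z, y], dim=1))

# e.g. generate a batch of samples all conditioned on class 3
g = ConditionalGenerator()
samples = g(torch.randn(16, noise_dim), torch.full((16,), 3, dtype=torch.long))
```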

Deep Convolutional Generative Adversarial Networks: DCGAN (2015)

Figure 4: The typical architecture of a DCGAN. source

As the name suggests, deep convolutional neural networks are used within GANs, and the combination has been a complete success. The key idea is to combine deep CNNs with the GAN architecture to form Deep Convolutional Generative Adversarial Networks (DCGANs). One of the main improvements that helps the DCGAN perform well is the introduction of strided convolutions in the discriminator and the replacement of the generator's pooling layers with fractional-strided (transposed) convolutions. Since the CNN in the generator is required to perform the opposite of what CNNs usually do, both strided and fractional-strided convolutions minimize the loss of information; this is advantageous during feature extraction and also helps retain the completeness of the data. Another main improvement is the use of Batch Normalization to tackle the vanishing gradient problem by correcting poor initialization and normalizing the input to each layer separately. Furthermore, carefully chosen activation functions for the CNNs also help DCGANs perform well. DCGANs can therefore be regarded as a structural improvement of GANs. This architectural improvement has helped overcome issues such as training instability and internal covariate shift and has enabled faster convergence. Following the DCGAN, many further improvements have been proposed.
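
The two structural ideas above can be sketched as follows; the layer widths and the 32x32 output resolution are illustrative choices, not the exact configuration from the DCGAN paper.

```python
import torch.nn as nn

# Generator: upsamples a noise vector (shape N x 100 x 1 x 1) to a 32x32 image
# using fractional-strided (transposed) convolutions + BatchNorm + ReLU.
dcgan_generator = nn.Sequential(
    nn.ConvTranspose2d(100, 128, kernel_size=4, stride=1, padding=0),  # 1x1 -> 4x4
    nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # 4x4 -> 8x8
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),    # 8x8 -> 16x16
    nn.BatchNorm2d(32), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),     # 16x16 -> 32x32
    nn.Tanh(),
)

# Discriminator: strided convolutions replace pooling for downsampling,
# with LeakyReLU activations.
dcgan_discriminator = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),    # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),   # 16x16 -> 8x8
    nn.BatchNorm2d(64), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # 8x8 -> 4x4
    nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, kernel_size=4, stride=1, padding=0),   # 4x4 -> 1x1
    nn.Sigmoid(),
)
```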

InfoGAN (2016)

Figure 5: Basic InfoGAN architecture

InfoGAN is quite similar in spirit to the CGAN. Just as there is a conditional input to the generator in a CGAN, the generator of an InfoGAN is also provided with an input composed of two components: a noise vector (or a dataset label, as in CGANs) and a latent code whose estimate is produced by an additional discriminator output. This information extraction is done via another neural network: the last layer of the discriminator is the input to this network, which provides a statistical estimate of the latent code such that the mutual information between the code and the generated data is maximized. The InfoGAN cost function is formed by subtracting from the usual GAN cost function an additional term containing this mutual information, i.e. how much we learn about the latent input to the generator if we observe the generated output. If the mutual information term is zero, the generated data and the estimated latent features are completely unrelated. Furthermore, a regularization term, the likelihood of the latent features being represented in the generated data, is added to this mutual information term in the InfoGAN loss function.
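
Below is a hedged sketch of the extra piece an InfoGAN adds on top of a regular GAN: an auxiliary head (often called Q) that reads the discriminator's last-layer features and tries to recover the latent code that was fed to the generator. How well it succeeds acts as a lower bound on the mutual information. The feature size, the categorical code, and the weight lam are assumptions for illustration.

```python
import torch
import torch.nn as nn

feat_dim, code_dim = 128, 10   # assumed: shared feature size, categorical code size
lam = 1.0                      # assumed weight of the mutual-information term

# Q head: predicts a distribution over the latent code from the
# discriminator's last-layer features of a generated sample.
q_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                       nn.Linear(64, code_dim))
ce = nn.CrossEntropyLoss()

def info_loss(disc_features_of_fake, true_code_indices):
    """Approximate (negative) mutual information term: the better Q recovers
    the code c from the fake sample, the higher the mutual-information bound.
    This term is added to the generator loss so minimizing it maximizes I(c; G(z, c))."""
    logits = q_head(disc_features_of_fake)
    return lam * ce(logits, true_code_indices)

# e.g. features of 16 fake samples whose categorical code index was 7
feats = torch.randn(16, feat_dim)
loss = info_loss(feats, torch.full((16,), 7, dtype=torch.long))
```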

text-2-image (2016)

Figure 6: text-2-image convolutional GAN architecture. Text encoding input is used by both generator and discriminator. source

As the name suggests, text-2-image, or generative adversarial text-to-image synthesis, is capable of generating realistic and meaningful images based on a given textual description. For example, given a text input describing the colour of a flower and the shape of its petals, the system will generate an image of a flower that matches the description. Compared to a regular GAN, instead of feeding only random noise to the generator, a descriptive text input is fed in as well. According to the original paper, the textual input is first transformed into a 256-dimensional text embedding and then concatenated with a 100-dimensional noise vector sampled from a normally distributed latent space. The generator takes in the concatenated vector and generates an image that aligns with the textual description. The discriminator is fed more than just an image, as in the regular GAN case: it receives a pair consisting of an image and the text embedding. The discriminator also provides more than a single output. It identifies whether the given image is real or fake as usual, and along with that, it predicts how likely it is that the image is aligned with the given textual description. In order to properly train the discriminator, different image-text pairs are fed in: combinations of real images with matching text, real images with wrong (mismatched) text, and fake (generated) images with text, and the target values for these pairs are set accordingly during training. For example, the pair of a real image and matching text lets the system learn how the image and text are aligned. In contrast, the wrong image-text combination expresses the opposite of the previous case, and the target is set to zero to flag that the pair is not aligned. Furthermore, the combination of a fake image and text again sets the target to zero so that the discriminator can distinguish between real and generated fake images. The original paper demonstrated the capabilities of the model mainly by generating images of birds and flowers from detailed text descriptions.
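
A minimal sketch of the conditioning step, using the 256-dimensional text embedding and 100-dimensional noise vector mentioned above; the rest of the network is reduced to toy fully connected layers and a made-up 64x64 output for brevity, so this is not the paper's architecture.

```python
import torch
import torch.nn as nn

text_dim, noise_dim = 256, 100   # dimensions as quoted in the description above

class TextConditionedGenerator(nn.Module):
    """Toy sketch: the text embedding is concatenated with the noise vector
    before image synthesis, so the output is steered by the description."""
    def __init__(self, img_dim=3 * 64 * 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + noise_dim, 512), nn.ReLU(),
            nn.Linear(512, img_dim), nn.Tanh(),
        )

    def forward(self, text_embedding, z):
        return self.net(torch.cat([text_embedding, z], dim=1))

g = TextConditionedGenerator()
fake_images = g(torch.randn(8, text_dim), torch.randn(8, noise_dim))
```

The discriminator is conditioned in the same spirit: it scores an (image, text embedding) pair rather than an image alone.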

Wasserstein Generative Adversarial Networks: WGAN (2017)

In a typical GAN, the generator takes a vector sampled from a low-dimensional random distribution and produces high-dimensional data as it passes through the underlying neural network. This can lead to gradient disappearance while training GANs, as the model is trying to embed a low-dimensional space in a high-dimensional space. The overlap between the input data distribution and the generated data distribution may become very small or near zero; as a result, the similarity between the generated and true distributions can vanish and, with it, the gradients. To solve this, the Wasserstein Generative Adversarial Network, or WGAN, was introduced by Facebook AI Research (FAIR). It employs the Wasserstein distance, a measure of the distance between the real data distribution and the generated data distribution. The Wasserstein distance can be used to measure the distance between two distributions even when they do not overlap, which helps solve the gradient disappearance issue. The authors showed how WGAN provides improved stability of learning, reduced mode collapse, and meaningful learning curves. A further development, WGAN with gradient penalty (WGAN-GP), was introduced later in 2017. In WGAN-GP, the weight clipping of WGAN is replaced by a gradient penalty that enforces the Lipschitz constraint. It is highly recommended to read the original papers to get a full understanding of both concepts.
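
Below is a hedged sketch of the WGAN-GP critic objective: the Wasserstein term is the difference between the critic's scores on generated and real data, and the gradient penalty replaces the weight clipping of the original WGAN to keep the critic approximately 1-Lipschitz. The function signature and the penalty weight of 10 are illustrative choices.

```python
import torch

def critic_loss_wgan_gp(critic, real, fake, gp_weight=10.0):
    """Sketch of the WGAN-GP critic loss (to be minimized).
    The critic wants E[critic(real)] - E[critic(fake)] to be large,
    so we minimize its negation plus a gradient penalty."""
    wasserstein = critic(fake).mean() - critic(real).mean()

    # Gradient penalty on random interpolations between real and fake samples
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

    return wasserstein + gp_weight * penalty
```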

Energy-Based GAN: EBGAN (2017)

Figure 7: Illustration of the discriminator of an EBGAN

As discussed before, the goal of the discriminator is to beat the generator, and in regular GANs a probabilistic cost function is used to estimate the loss. Energy-based GANs (EBGANs) instead take an energy-based view of GANs. In the original GAN, the discriminator is designed like a classifier, but the EBGAN uses an autoencoder and takes the reconstruction error, namely the mean squared error between the input and its reconstruction, as its loss instead of a probability. This autoencoder extracts latent features from the data fed to the discriminator with its encoder and reconstructs them with its decoder, as shown in figure 7. When training the discriminator, the cost function has two objectives: keep the reconstruction cost of the autoencoder low for real data, and penalize the discriminator if the reconstruction error for generated data drops below a predefined threshold. Better training stability and enhanced robustness are the advantages of EBGANs, which help reduce the need for manual tuning of GANs to a considerable extent.
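
A small sketch of the two objectives mentioned above, assuming a toy fully connected autoencoder as the discriminator and using the per-sample reconstruction error as the energy; the margin value is an illustrative choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoencoderDiscriminator(nn.Module):
    """EBGAN-style discriminator: the 'energy' of an input is its
    reconstruction error under a small autoencoder (toy sizes)."""
    def __init__(self, data_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(data_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, data_dim)

    def energy(self, x):
        recon = self.decoder(self.encoder(x))
        return F.mse_loss(recon, x, reduction="none").mean(dim=1)  # per-sample MSE

def discriminator_loss(disc, real, fake, margin=10.0):
    # Keep the reconstruction energy of real data low, and penalize the
    # discriminator only while the energy of fake data is below the margin.
    return disc.energy(real).mean() + F.relu(margin - disc.energy(fake)).mean()

disc = AutoencoderDiscriminator()
loss = discriminator_loss(disc, torch.randn(16, 784), torch.randn(16, 784))
```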

CycleGAN (2017)

Figure 8: Images generated using CycleGANs. source

The key idea of CycleGAN is the translation of an image from one domain to another, in other words a cross-domain transformation. The basic working principle is essentially the reconstruction of an image influenced by, or resembling, a certain style. The architecture of a CycleGAN consists of two generators and two discriminators. The first generator maps original images to the target domain, and the second generator maps images from the target domain back to the original image domain. These are the two domains the model learns to translate between, and the translation is a two-step transformation. Each generator has its own corresponding discriminator, which distinguishes between real and synthesized images. Both discriminators help improve the generated image quality based on a least-squares loss, and they are normally fully convolutional networks that look at one patch of an image at a time and output the probability of that patch being real (PatchGANs). Furthermore, CycleGAN generators typically have an encoder-transformer-decoder architecture, where the encoder section has convolution layers, the transformer section has residual blocks, and the decoder section has transposed convolutions. The objective function has two main components, an adversarial loss and a cycle-consistency loss; a sketch is given below. Moreover, CycleGANs keep a history of generated images to train the discriminators. However, this might lead both the generator and the discriminator to overfit and to mode collapse (due to the cycle of greedy optimization). Possible applications of CycleGANs in the image domain include developing or converting painting styles, changing the season of scenery images, converting 2D drawings to 3D images, and many more.
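
The two loss components named above can be sketched as follows for the A-to-B direction (the B-to-A direction is symmetric); the generator and discriminator bodies are omitted, and the weight lambda_cyc = 10 is an illustrative choice, not necessarily the paper's setting.

```python
import torch
import torch.nn.functional as F

def cyclegan_generator_loss(G_ab, G_ba, D_b, real_a, lambda_cyc=10.0):
    """Sketch of the generator-side objective for the A -> B direction.
    G_ab maps domain A to B, G_ba maps B back to A, D_b judges domain-B images."""
    fake_b = G_ab(real_a)

    # Least-squares adversarial loss: the generator wants D_b(fake_b) close to 1
    pred = D_b(fake_b)
    adversarial = F.mse_loss(pred, torch.ones_like(pred))

    # Cycle-consistency loss: translating A -> B -> A should recover the input
    cycle = F.l1_loss(G_ba(fake_b), real_a)

    return adversarial + lambda_cyc * cycle
```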

CapsGAN (2018)

CapsGAN is a novel model architecture proposed by two researchers from the University of Toronto, coalescing two concurrent ideas: GANs and capsule networks. The baseline GAN architecture is the DCGAN. The idea is to generate images in the 3D domain with a high degree of geometric transformation, and the authors claim that CapsGANs are more robust to geometric transformations than traditional deep-CNN-based approaches. One of the main architectural changes is that the discriminator of the DCGAN is replaced by a capsule network that uses dynamic routing. The generator, however, follows the typical DCGAN framework with a 2D transposed convolutional network. Moreover, the binary cross-entropy loss function helps the CapsGAN model converge stably without mode collapse.

StyleGAN (2018)

Figure 9: StyleGAN Generator. source

StyleGANs are capable of generating very high-resolution images, with the work focused mainly on face images. Regular GANs learn from images in the training data and add random noise when generating a new image. StyleGANs, however, also learn features of the generated images themselves and are capable of generating new high-resolution face images of people who simply do not exist. In the StyleGAN generator, or style-based generator, the input noise, i.e. the latent vector, is passed through a sequence of fully connected layers that perform a mapping transformation; this part is called the mapping network. Its output is known as the style, and it carries major attributes such as pose, colours, and other important features. The images are then produced by a synthesis network situated in the latter part of the style-based generator, which gradually builds up the image over consecutive layers from low resolution to high resolution. The synthesis network starts from a learned constant, and the styles coming from the mapping network are fed in at multiple levels of the network. Adaptive instance normalization is used to normalize each layer, and noise is also injected to create stochastic variations such as texture and colour variations. The results of StyleGANs are phenomenal, as they have been able to generate a vast variety of realistic-looking images, especially in the face image domain.
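
A rough sketch of two of the ideas described above, the mapping network and adaptive instance normalization (AdaIN), with illustrative dimensions; this is not the full progressive synthesis network of the paper.

```python
import torch
import torch.nn as nn

latent_dim, style_dim, channels = 512, 512, 64  # illustrative sizes

# Mapping network: fully connected layers that transform the input latent z
# into an intermediate "style" vector w.
mapping_network = nn.Sequential(
    nn.Linear(latent_dim, style_dim), nn.LeakyReLU(0.2),
    nn.Linear(style_dim, style_dim), nn.LeakyReLU(0.2),
    nn.Linear(style_dim, style_dim), nn.LeakyReLU(0.2),
)

class AdaIN(nn.Module):
    """Adaptive instance normalization: normalize each feature map, then
    re-scale and re-shift it with parameters predicted from the style w."""
    def __init__(self):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)
        self.to_scale = nn.Linear(style_dim, channels)
        self.to_shift = nn.Linear(style_dim, channels)

    def forward(self, feature_maps, w):
        scale = self.to_scale(w).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(w).unsqueeze(-1).unsqueeze(-1)
        return scale * self.norm(feature_maps) + shift

w = mapping_network(torch.randn(4, latent_dim))          # style vectors
styled = AdaIN()(torch.randn(4, channels, 16, 16), w)    # style injected into features
```

In the full generator, an AdaIN step like this is applied at each resolution level of the synthesis network, which is how the style steers the image at multiple scales.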

We have very briefly discussed several widely popular GAN model developments that have shaped GANs in terms of architecture, capabilities, and application fields. It is amazing what GANs can do and how they can be used for a tremendous number of applications. Furthermore, if you look at the academic literature, you will find a large number of GAN models that have already been developed (check the GAN Zoo), and more developments addressing many unexplored dimensions are yet to come. Therefore, if you are interested, it is highly recommended to go through more details and literature, especially the related research papers, to understand how these amazing systems work and how to use them in appropriate applications. However, it is our responsibility to be aware of the repercussions of GANs and to make use of these concepts ethically. See you again in another article, and thank you for reading.


Student at Hamburg University of Technology, studying Mechatronics, Robotics and Intelligent Systems. Visit me @ https://www.linkedin.com/in/dashankadesilva/