Synthetic data generation using Generative Adversarial Networks (GANs): Part 2

Mahmood Mohammadi
Data Science at Microsoft
18 min read · Jun 15, 2021


Generative Adversarial Networks — GANs — employ a deep learning model to generate synthetic data that mimics real data. They have multiple applications, including processing and working with images, text, and other data. The goal of this two-part article series is to give beginners a complete understanding of GANs. In the first article of the series, my colleague Daniel Huang introduced GANs at a beginner level, provided an overview of how they work, and covered various use cases. In this article, I take a deep dive into the inner workings of GANs, providing more information to enable you to gain a depth of understanding that will allow you to begin using GANs in your own work.

GANs overview

A major category of Machine Learning (ML) techniques consists of unsupervised methods. In these methods, the training dataset contains only samples of data (e.g., payroll data or images, among others) and there are no “ground truth” labels for training purposes (e.g., labels such as “high” or “low” for payroll data, or labels such as “dog” or “cat” for images of animals). In contrast, in supervised techniques, a model is trained by using both the training dataset and the ground truth labels such that the labels are used to correct the prediction output of the model.

Generative methods are unsupervised learning techniques focusing on discovering the patterns (without knowing the patterns in advance) and latent features of a dataset (including image, text, or tabular data) such that a trained model can generate new data instances with characteristics similar to the original data. Models that predict the next word in a sentence, along with techniques such as Latent Dirichlet Allocation (LDA) and Variational Autoencoders (VAE), are examples of generative models.

Discriminative models, in contrast, are supervised techniques addressing the data classification problem. Given input such as a dataset of animal images, they can classify each image as a dog or cat, for example. Techniques such as logistic regression, Random Forest (RF), and Support Vector Machines (SVM) are examples of discriminative models.

Generative Adversarial Networks — GANs for short — use a generative method combined with a deep learning–based ML approach. Notably, however, the GAN training process of the generative model is actually formulated as a supervised process, not an unsupervised one as is typical of generative models.

GANs models have two main components:

  • Generator: A generative model to learn the latent features of a target dataset, which, after training, are used to generate new data instances like the original training data.
  • Discriminator: A classification model aiming to distinguish real (the original dataset) and fake (synthetic data from the generator) data, which is discarded after training.

GANs architecture

Figure 1 shows the overall architecture of GANs models.

Figure 1: Overview of GANs architecture.

GANs training and the Nash equilibrium

The training of GANs is based on a zero-sum or minimax game with two players, each one (G and D) trying to maximize its own benefit. The game converges when both players reach a point at which changing their actions (updating the weights of the neural networks) does not bring more benefit (or the loss functions for G and D cannot be further minimized). This point is the Nash equilibrium for the following equation:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

This equation shows that G tries to minimize the loss function while D tries to maximize it.

The Generator is a neural network model responsible for generating realistic samples from the target domain. The input for the generator model is a vector randomly sampled from a uniform or Gaussian distribution. This vector is used as a starting point for the G model to generate synthetic data in the problem domain. This random vector represents a compressed version of features of the outputs referred to as latent features or a latent vector. In fact, during the training process, the Generator converts this random vector to meaningful data points (e.g., human face images). In this way, each new random vector drawn from the latent space (e.g., Gaussian distribution) is converted to a new output in the problem domain.

Figure 2 shows a sample architecture of a Generator composed of different layers (transpose convolutional and dense layers, with the leaky_relu activation function). The input for this architecture is a latent vector of size 100 drawn from a uniform distribution, and the output is a 28 by 28 image, to be compatible with the input shape of the Discriminator.

Figure 2: Sample Generator neural network layers.
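
To make this concrete, here is a minimal sketch of such a Generator in Keras; the filter counts and kernel sizes are illustrative assumptions rather than the exact values from Figure 2.

```python
from tensorflow.keras import layers, models

def build_generator(latent_dim=100):
    model = models.Sequential([
        # Project the latent vector and reshape it into a small feature map
        layers.Dense(7 * 7 * 128, input_dim=latent_dim),
        layers.LeakyReLU(alpha=0.2),
        layers.Reshape((7, 7, 128)),
        # Upsample 7x7 -> 14x14
        layers.Conv2DTranspose(128, kernel_size=4, strides=2, padding="same"),
        layers.LeakyReLU(alpha=0.2),
        # Upsample 14x14 -> 28x28
        layers.Conv2DTranspose(128, kernel_size=4, strides=2, padding="same"),
        layers.LeakyReLU(alpha=0.2),
        # Single-channel 28x28 output image with pixel values in [0, 1]
        layers.Conv2D(1, kernel_size=7, activation="sigmoid", padding="same"),
    ])
    return model
```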

The input of the Generator is a latent vector, and its output (consisting of fake samples) is directly used as the input of the Discriminator. The Discriminator receives input from both the real dataset and the Generator to classify them as real and fake, respectively. In GANs architecture, the Generator (G) tries to minimize the second term of the main loss function (1) but as the first term is independent of G, the effective loss function is:

$$\min_G \; \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{2}$$

Generator loss function

The Generator’s loss has one part: the average of the log of inverted probabilities of the fake samples, log(1 − D(G(z))). During the training process, G aims to minimize this term (which is the part of the loss function it shares with D). This term is usually represented as g_loss in implementations.

The Discriminator functions as a classifier to distinguish the real samples (in the original dataset) from the fake ones (from the Generator model). The inputs for this component are 1) samples from the original dataset, and 2) synthetic data from the Generator component. As this component is a supervised model (a classifier), it uses a binary label representation, with label 1 for real inputs and label 0 for fake inputs. Like any other neural network, a loss function is needed to train the Discriminator. The following formula shows this loss function, which the Discriminator aims to maximize:

$$\max_D \; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{3}$$

Discriminator loss function

The Discriminator’s loss has two terms: 1) the average of the log probability of real samples, and 2) the average of the log of inverted probabilities for the fake samples. These two terms are usually represented as d_loss_real and d_loss_fake in implementations.

Figure 3 shows a sample architecture of a Discriminator composed of different layers (convolutional, dropout, and fully connected layers, with the leaky_relu activation function). The input for this architecture is a 28 by 28 image, and the output is a single sigmoid unit representing whether the input is real or fake.

Figure 3: Sample Discriminator model layers.
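
And here is a corresponding minimal sketch of a Discriminator in Keras, assuming 28 by 28 grayscale inputs; the filter counts and dropout rate are illustrative assumptions.

```python
from tensorflow.keras import layers, models

def build_discriminator(in_shape=(28, 28, 1)):
    model = models.Sequential([
        # Downsample 28x28 -> 14x14
        layers.Conv2D(64, kernel_size=3, strides=2, padding="same",
                      input_shape=in_shape),
        layers.LeakyReLU(alpha=0.2),
        layers.Dropout(0.4),
        # Downsample 14x14 -> 7x7
        layers.Conv2D(64, kernel_size=3, strides=2, padding="same"),
        layers.LeakyReLU(alpha=0.2),
        layers.Dropout(0.4),
        layers.Flatten(),
        # Single sigmoid unit: probability that the input is real
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model
```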

Training process

The training algorithm is an alternating training process based on the backpropagation technique. Alternating training means the Generator and Discriminator are trained in one loop, one after the other: we train the Discriminator for one step (using a mini-batch of input data) and then train the Generator for one step (using a mini-batch of latent vectors), repeating these steps until the model converges. During Discriminator training, the Generator is frozen and does not train, and vice versa.

Figure 4 shows the Discriminator training process.

Figure 4: Discriminator training process.

Real samples are random mini-batches of the original dataset from the problem domain (e.g., images or tabular data), which are sampled to be classified as the real class by the Discriminator.

Fake samples consist of synthetic data output by the Generator (e.g., fake images or tabular data) to be classified as the fake class. It is worth mentioning that the quality of the fake samples is intended to improve during the training process: if we visualize the fake samples from the early training steps of the Generator, we see they are noisy and not meaningful in comparison to the real data samples, which are always high quality.

The Discriminator is connected to two loss functions. One is used to train the Discriminator, and one is used to train the Generator. During the training of the Discriminator, the Generator is frozen and does not train, and the Discriminator classifies its two input sources as real and fake classes. The Discriminator network weights are updated and trained using the backpropagation mechanism and regular stochastic gradient descent (SGD). The loss function used in backpropagation penalizes (via error correction) the Discriminator model for cases in which the Discriminator incorrectly classifies the fake samples as real and vice versa. This error correction is high at the beginning of the training process and gradually converges to near zero (ideally).

Figure 5 shows the Generator training process.

Figure 5: Generator training process.

Training the Generator is more complicated than training the Discriminator. To train a neural network model using the backpropagation method, we need to calculate the error value of the model’s output using a loss function and then update the model weights by propagating the gradients across the neural layers, aiming to minimize the error value. In the GANs architecture, the Generator is not directly connected to a loss function. In fact, the Generator’s output is used as an input of the Discriminator, and the loss function connected to the Discriminator is used to train the Generator. These are the steps used to train the Generator:

  1. A batch of random latent vectors (z) is drawn from a distribution, such as a Gaussian distribution.
  2. A batch of fake samples (G(z)) is generated from the Generator.
  3. A batch of fake data is fed to the Discriminator to be classified as real.
  4. The Generator loss function (connected to the Discriminator) is calculated, and the error is backpropagated to the Generator layers while the Discriminator is frozen (does not update). In this way, the loss function penalizes the Generator for samples the Discriminator classifies as fake. This is the critical point where the Generator weights are updated to generate samples that can fool the Discriminator into classifying them as real.

Complete training process

Figure 6 shows the complete Generator and Discriminator training steps. The first lines show the code that defines G, D, and the GAN (which actually consists of D(G(z))), as well as the real and fake labels. The training process contains many iterations (epochs), the number of which is a hyperparameter not known in advance. At the start of each iteration, we must prepare the real and fake mini-batches (shown in lines 4–6). The fake samples are outputs of G, with a random latent vector as the input. In the early stages, in which G is not yet trained, the fake samples are garbage.

With the real and fake samples, we start by training the Discriminator with two labels: label 1 for real samples and label 0 for fake samples (shown in lines 7 and 8). The train_on_batch() function uses the binary cross-entropy loss to calculate the gradients of D (with regard to the loss function) and backpropagates them through the layers of D to train it. Next, after freezing D (as shown on line 9), the combined GAN model, which is D(G(z)) because G is updated through D, is trained using the loss function connected to D. These steps are repeated until the model converges or fails (as observed by monitoring d_loss_fake, d_loss_real, and g_loss).

Figure 6: Sample GAN training process.
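
For reference, here is a minimal sketch of this alternating training loop in Keras, assuming the build_generator and build_discriminator sketches shown earlier and a real-image array X_train scaled to [0, 1]; the batch size and step count are illustrative, and the line numbers will not match those of Figure 6.

```python
import numpy as np
from tensorflow.keras import models

latent_dim, batch_size, n_steps = 100, 128, 10000
half = batch_size // 2

g_model = build_generator(latent_dim)
d_model = build_discriminator()  # compiled while trainable, so it trains

# Combined model D(G(z)): D is frozen *inside* this model only, because
# trainable is changed after d_model.compile() but before gan.compile().
d_model.trainable = False
gan = models.Sequential([g_model, d_model])
gan.compile(loss="binary_crossentropy", optimizer="adam")

for step in range(n_steps):
    # Prepare the real and fake mini-batches.
    idx = np.random.randint(0, X_train.shape[0], half)
    x_real, y_real = X_train[idx], np.ones((half, 1))
    z = np.random.randn(half, latent_dim)
    x_fake, y_fake = g_model.predict(z, verbose=0), np.zeros((half, 1))

    # Train D: label 1 for real samples, label 0 for fake samples.
    d_loss_real, _ = d_model.train_on_batch(x_real, y_real)
    d_loss_fake, _ = d_model.train_on_batch(x_fake, y_fake)

    # Train G through the frozen D, labeling its fakes as "real" (1).
    z = np.random.randn(batch_size, latent_dim)
    g_loss = gan.train_on_batch(z, np.ones((batch_size, 1)))
```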

GANs training problems

While GANs introduce a variety of benefits in areas such as data privacy, image inpainting, and data augmentation, they also suffer from some training challenges. The main training challenges of GANs include (but are not limited to) mode collapse and non-convergence.

Mode collapse

When we train GANs with different random latent vectors as inputs of the Generator component, we expect the Generator to produce different outputs (e.g., different images) as well. But there are cases where the Generator starts producing the same output, or a limited list of outputs, again and again, which is called mode collapse. For images, this means that the outputs of the Generator share the same texture, color, and image features. Mode collapse can be a full collapse to only one image (i.e., the same images with only tiny and negligible differences) or a partial collapse to multiple images. Figure 7 shows a partial mode collapse in which images with the same underline color are similar.

Figure 7: GAN with mode collapse, adapted from: Berthelot, David, Thomas Schumm, and Luke Metz. “BEGAN: Boundary equilibrium generative adversarial networks.” arXiv preprint arXiv:1703.10717 (2017).

Resolving the mode collapse problem is an active research area and one of the most important challenges in GANs research. There are some early solutions to address mode collapse:

Unrolled GANs: An unrolled GAN is like having multiple copies (e.g., three or five) of the Discriminator grouped together and coupled with the Generator. The Generator is updated through all these Discriminators using the backpropagation mechanism. In this way, the Generator can predict how the Discriminator will be updated in the following k steps. This introduces a “surrogate loss function” for the Generator (while the Discriminator still uses the regular loss function of the GAN).

Surrogate loss function

In this technique, the Generator is updated through multiple Discriminator optimization iterations while the Discriminator is updated just once. This difference breaks the sync between the Generator and Discriminator updates, which can prevent the Generator from collapsing into a local minimum and makes the training more stable. Figure 8 compares the visualized heatmaps of the Generator distribution for a GAN with ten Discriminator unrolling steps (top row) and a standard GAN (bottom row). As we can see, the distribution of the Generator in the unrolled version gradually reaches the target or desired distribution in the last (Target) column, but the standard GAN fails.

Figure 8: Unrolled GAN vs. standard GAN, adapted from: Metz, L., Poole, B., Pfau, D., & Sohl-Dickstein, J. (2016). “Unrolled generative adversarial networks.” arXiv preprint arXiv:1611.02163.

This is how the authors describe it in their paper: “In the unrolled case, however, this undesirable behavior no longer occurs. Now G’s actions take into account how D will respond. G will try to make steps that D will have a hard time responding to. This extra information helps the generator spread its mass to make the next D step less effective instead of collapsing to a point.”

The sample code implementing the unrolled GAN can be found here.

Wasserstein Loss: The Wasserstein distance, also known as the Earth mover’s distance (EMD), measures the distance between two probability distributions. In GANs based on the Wasserstein loss function (WGAN), the Discriminator (called the Critic) does not classify its inputs (real and fake samples) into probabilities between 0 and 1. Instead, the output of the Critic is an unbounded distance score, and so it does not have a sigmoid function in the last layer of its neural network. The Critic tries to maximize the distance metric D(x) – D(G(z)), and the Generator tries to minimize this distance, or more precisely, it tries to maximize the D(G(z)) term. Because the output does not need to fit a binary cross-entropy loss, the Critic can be optimized without limitation by Generator updates. This gives WGANs a reduced vanishing gradient effect and helps the Generator receive good feedback from the Critic, reducing the chance of getting stuck in a local minimum that can lead to mode collapse.

Wasserstein GANs also have some considerations and limitations. They must satisfy the Lipschitz constraint, which can be enforced either by weight clipping or by a gradient penalty. Moreover, the WGAN authors found that momentum-based optimizers like Adam destabilize training, which makes RMSProp a good choice to use instead with these GANs.
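
As an illustration, here is a minimal sketch of the Wasserstein losses with weight clipping in TensorFlow; the Critic is assumed to output unbounded scores (no sigmoid), and the 0.01 clip value follows the original WGAN paper.

```python
import tensorflow as tf

def critic_loss(real_scores, fake_scores):
    # The Critic maximizes D(x) - D(G(z)); we minimize the negative.
    return tf.reduce_mean(fake_scores) - tf.reduce_mean(real_scores)

def generator_loss(fake_scores):
    # The Generator maximizes D(G(z)), i.e., minimizes its negative.
    return -tf.reduce_mean(fake_scores)

def clip_critic_weights(critic, clip_value=0.01):
    # Weight clipping is a crude way to enforce the Lipschitz constraint.
    for w in critic.trainable_weights:
        w.assign(tf.clip_by_value(w, -clip_value, clip_value))
```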

Non-convergence

Non-convergence failure is the case in which the Generator and the Discriminator cannot find an equilibrium point during the training process, causing their loss functions to fluctuate. GANs are based on a zero-sum game, and the game stabilizes only when both players reach a point at which changing their actions does not bring more benefit.

Stable Discriminator loss values are around 0.5 (which means G can generate high-quality fake samples and D cannot distinguish almost any of them). This is hard to achieve, but loss values around 0.7 or 0.8 are also acceptable. Loss values in the range of 1.0–1.5 for the Generator are good numbers for many cases. In non-convergence failure, the Discriminator’s loss may go to zero or the Generator’s loss may rise continuously. This can happen at the beginning of the training process or in cases in which the Generator produces garbage outputs that are easy for the Discriminator to classify as fake samples. Because the training process (updating the neural network weights) starts from the D loss function, when D can classify G outputs easily, the D error rate is close to zero and the gradients propagated from the Discriminator to the Generator are not large enough to train the Generator. For some GANs this instability starts in early epochs and recovers later, while for others it continues throughout the entire training process, leading to a training halt.
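
As a simple illustration, the loss values from a training loop can be sanity-checked against the ranges above; the thresholds here are illustrative rules of thumb, not exact bounds.

```python
# A hypothetical health check for the d_loss_real, d_loss_fake, and
# g_loss values produced by a training loop; thresholds are illustrative.
def check_training_health(d_loss_real, d_loss_fake, g_loss):
    if d_loss_real < 0.05 and d_loss_fake < 0.05:
        return "Warning: D loss near zero; gradients to G may vanish."
    if g_loss > 4.0:
        return "Warning: G loss is very high and may be rising (non-convergence)."
    return "Losses are within typical ranges."
```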

GANs evaluation methods

Evaluation of GANs means assessing a trained Generator rather than the training process: once the Generator is trained, we can evaluate the quality of its outputs. During GAN training, the loss values of the Generator and Discriminator show the effectiveness of the training process and whether G and D are converging. The methods explained below are mainly used for GAN-generated images. This paper by Ali Borji covers most of the GAN evaluation metrics, such as Fréchet Inception Distance (FID), AM Score, Maximum Mean Discrepancy, and Number of Statistically-Different Bins (NDB).

Manual evaluation

In this approach, which is applicable only to GANs generating synthetic images, we can visually assess the quality and diversity (regarding mode collapse) of the generated outputs. While this approach is straightforward and quite simple to apply, it has some drawbacks:

  1. It is time consuming and the number of images that can be evaluated in a particular time is limited (although this can be improved by adding more reviewers).
  2. The image-reviewing results can be biased by how human reviewers analyze the images and by which features they are expected to evaluate.

Inception score

The notion of an inception score (IS) was introduced in the paper “Improved Techniques for Training GANs” and is based on using a pre-trained model (such as Google’s Inception v3) to evaluate images generated by a GAN. The score is computed from all the images generated by a GAN, and the average of the individual scores is the final score of the GAN under test. The range of scores is between one and the number of image classes in the training dataset of the pre-trained model; for example, a pre-trained Inception v3 classifier trained on the ILSVRC 2012 dataset has 1000 classes, while one trained on the CIFAR-10 dataset has 10 classes. This score measures two aspects of the generated images:

  1. Image quality: Can each generated image be classified as one specific class? The output of the pre-trained model (e.g., Inception v3) for each generated image is a likelihood vector of class probabilities (each between 0.0 and 1.0, with a cumulative sum of 1.0). If the image has high quality, then only one index of this likelihood vector has a high value and all other values are small; for example, an image with a clear picture of a cat might have a prediction vector of [0.90, 0.05, 0.02, 0.03], versus an image with no clear picture of a cat with a probability vector of [0.35, 0.25, 0.15, 0.25], for an image set with four image classes (cat, tiger, lion, and jaguar).
  2. Image diversity: Do the generated images cover a wide range of classes? By averaging the individual prediction vectors across all generated images, we get a probability vector (the marginal distribution) representing the class diversity of the generated images. If all image classes are represented, then this averaged vector has a near-uniform distribution (e.g., [0.24, 0.25, 0.26, 0.25]), but if some classes are repeated more than others (making the output less diverse), then the vector has high numbers only for the more repeated classes (e.g., [0.80, 0.10, 0.03, 0.07]).

Now we have one label prediction vector for each image (the conditional probability p(y|x), for label y and image x) and one marginal distribution vector for all images (p(y)). For an ideal GAN, we want one specific and distinct class for each image and high diversity across all the generated images. In this case, each image prediction vector should be a narrow distribution (one large number and many small numbers) and the marginal distribution should be close to a uniform distribution (covering all image classes), so the distance between these two vectors is high. Researchers use the Kullback-Leibler (KL) divergence to measure the difference between these two probability vectors, and exponentiating the average of all the individual KL distances yields the final inception score (IS). Mathematically, the KL formula is:

$$D_{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

Kullback-Leibler (KL) divergence formula

To apply this formula for the inception score, P(x) is the image prediction label vector, and Q(x) is the marginal distribution across all images.

If the generated images are of high quality and simultaneously cover a wide range of classes, then the individual KL distances are high, and the final inception score will be high as well. If either the quality or the diversity of the images is low, then the final score will be low as well.
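
Here is a minimal sketch of this computation, assuming p_yx is a NumPy array of per-image class probabilities (one row per generated image, such as the softmax outputs of the pre-trained classifier):

```python
import numpy as np

def inception_score(p_yx, eps=1e-16):
    # Marginal distribution p(y), averaged over all generated images.
    p_y = p_yx.mean(axis=0, keepdims=True)
    # Per-image KL divergence KL(p(y|x) || p(y)).
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
    # IS is the exponent of the average KL distance.
    return float(np.exp(kl.mean()))
```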

The IS score is highly dependent on the pre-trained classifier and its training dataset. This introduces some limitations for the inception score, such as:

  • If the generated images are not like the images in the classifier’s training dataset, the score can be low.
  • If the generated images cover all the classes but have very little diversity within each class, the score can still be high; diversity inside classes is not considered.
  • If the classifier fails to detect some features of the images but still produces narrow prediction vectors (one class with high confidence), such as classifying animals with two faces the same way as ones with a single face, the score can still be high.

GANs applications

Since the first introduction of the GANs architecture in 2014, many improvements and applications have been developed by researchers. Here, some useful applications are introduced.

Realistic photographs

Researchers describing their work in the paper “Large Scale GAN Training for High Fidelity Natural Image Synthesis” introduced the “BigGAN” model as one that benefits greatly from scaling up GANs (i.e., increasing the batch size and the number of channels). The result is very realistic, high-resolution images, with a high inception score of 166.5 for 128 by 128 images.

High-resolution fake images: “Large Scale GAN Training for High Fidelity Natural Image Synthesis”

Photo inpainting

Authors describing their work in the paper “Context Encoders: Feature Learning by Inpainting” have developed a visual feature learning architecture called “context encoders,” based on an encoder-decoder architecture, that can be used for photo inpainting and filling in the missing parts of images. The proposed architecture trains convolutional neural networks using an adversarial loss function plus a reconstruction loss (as in autoencoders).

Photo inpainting samples: “Context Encoders: Feature Learning by Inpainting”

Text-to-image translation

The StackGAN architecture introduced in the paper “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks” demonstrates a GAN that generates images conditioned on text descriptions. The StackGAN architecture is based on the more general conditional GAN architecture. In conditional GANs, the Generator and Discriminator have an additional input variable c, as G(z, c) and D(x, c), which allows them to generate images based on a specific condition. StackGAN introduces a “conditioning augmentation” technique that receives the text embeddings and outputs a conditioning variable that is paired with the regular latent vector (drawn from a normal or uniform distribution) used to train the GANs.

Text-to-image samples: “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks”
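
To illustrate the conditioning idea (though not the actual StackGAN architecture), here is a minimal sketch of a conditional Generator in Keras, assuming the condition c, such as a text embedding of an assumed size, is simply concatenated with the latent vector z:

```python
# A minimal sketch of G(z, c): the condition c is concatenated with the
# latent vector z before the upsampling layers. All sizes are
# illustrative assumptions; StackGAN itself is more elaborate.
from tensorflow.keras import layers, models

latent_dim, cond_dim = 100, 128  # cond_dim: assumed embedding size

z_in = layers.Input(shape=(latent_dim,))
c_in = layers.Input(shape=(cond_dim,))
x = layers.Concatenate()([z_in, c_in])
x = layers.Dense(7 * 7 * 128)(x)
x = layers.LeakyReLU(alpha=0.2)(x)
x = layers.Reshape((7, 7, 128))(x)
x = layers.Conv2DTranspose(64, kernel_size=4, strides=2, padding="same")(x)
x = layers.LeakyReLU(alpha=0.2)(x)
out = layers.Conv2DTranspose(1, kernel_size=4, strides=2, padding="same",
                             activation="sigmoid")(x)
cond_generator = models.Model([z_in, c_in], out)
```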

Face frontal view synthesis

Generating frontal views of human faces from different face poses and angles has many applications in facial recognition systems. Researchers describing their work in the paper “Beyond Face Rotation: Global and Local Perception GAN for Photorealistic and Identity Preserving Frontal View Synthesis” have developed a GAN architecture called the Two-Pathway Generative Adversarial Network (TP-GAN) to address this issue. At the core of the architecture is a synthesis loss function composed of pixel-wise, identity-preserving, symmetry, and adversarial losses.

Face frontal views: “Beyond Face Rotation: Global and Local Perception GAN for Photorealistic and Identity Preserving Frontal View Synthesis”

Conclusion

In this article, I have taken a deep dive into GANs, providing an overview of the inner workings of the GANs architecture and how GANs are typically used. This article builds on the information presented by my colleague Daniel Huang in the introductory article of this two-part series. Below I present some additional information on GANs, including demos, references, and useful links. We hope that knowing how GANs function will equip you to use GANs in your own work.

Mahmood Mohammadi is on LinkedIn.

Demos

  1. Tabular data generation: This Jupyter Notebook shows a GAN generating synthetic tabular data.
  2. Image Generation: This Jupyter Notebook shows a GAN generating fake images.

References and useful links:

For an introduction to GANs, please see the introductory article of this two-part article series:
