Synthetic data generation using Generative Adversarial Networks (GANs): Part 1

Daniel Huang
Data Science at Microsoft
8 min read · Jun 1, 2021
Photo by Hassan Pasha on Unsplash.

Generative Adversarial Networks (abbreviated as GANs) are a type of deep learning model gaining prominence in the AI community and opening up new directions in research. Because of their versatility, GANs are seeing application in a variety of fields, ranging from medicine to art — and many others in between. The goal of this two-part article series is to give beginners a complete understanding of GANs. In this first article, I introduce GANs at a beginner level, provide an overview of how they work, and cover various use cases. In the next article, my colleague Mahmoud Mohammadi takes a deep dive into the inner workings of GANs, providing more information to enable you to gain a depth of understanding that will allow you to begin using GANs in your own work.

What is a GAN?

GANs generate synthetic data that mimics real data. This deep learning model includes a training process that involves pitting two neural networks against each other: a generator, which generates the synthetic data, and a discriminator, which distinguishes between real and synthetic data. The training process involves a competition between the generator and discriminator such that both models improve. In the end, if all goes well, the generator is able to generate natural-looking synthetic images that are difficult for the discriminator (or a human) to discern as real or synthetic.
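As a concrete illustration, here is a minimal sketch of that adversarial training loop, assuming PyTorch, a simple fully connected generator and discriminator, and flattened 28×28 images. It is meant to convey the structure of the competition rather than any particular paper's setup.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: an MLP generator and discriminator trained
# adversarially on flattened 28x28 images.
latent_dim, img_dim = 100, 28 * 28

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, img_dim), nn.Tanh(),        # synthetic image, values in [-1, 1]
)
discriminator = nn.Sequential(
    nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),           # probability that the input is real
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_images):                   # real_images: (batch, 784) tensor
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1. Train the discriminator to separate real from synthetic images.
    fake_images = generator(torch.randn(batch, latent_dim)).detach()
    d_loss = bce(discriminator(real_images), real_labels) + \
             bce(discriminator(fake_images), fake_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2. Train the generator to fool the discriminator into labeling fakes as real.
    g_loss = bce(discriminator(generator(torch.randn(batch, latent_dim))), real_labels)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Each call to train_step nudges both networks: the discriminator gets better at spotting fakes, and the generator gets better at producing images that pass as real.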

GANs are widely used in image and video generation, among other use cases. A canonical example involves image generation of human faces. In this undertaking, researchers trained a specific GAN architecture called StyleGAN2 to generate synthetic images of human faces and display them on a website. The work is described in more detail in the paper “Analyzing and Improving the Image Quality of StyleGAN.”

GANs were first introduced in 2014 by Ian Goodfellow and his collaborators in the paper “Generative Adversarial Nets.” Since then, they have received significant attention from the research community, with Yann LeCun describing GANs as “the most interesting idea in the last 10 years in Machine Learning” and Andrew Ng recognizing them as “a significant and fundamental advance” in ML research.

GAN data types

GANs can generate several types of synthetic data, including image data, tabular data, and sound/speech data.

Image data

In addition to generating images of human faces, GANs can perform image-to-image translation. In this application, a GAN learns to transform the style of an image while preserving its content; in other words, it takes an image with a style from one domain and learns how to map it to an output image with the style of another domain, thereby “translating” an image from one style to another style.

An example of image-to-image translation involves a project jointly conducted by Microsoft Research Asia and the University of Science and Technology of China, described in “Image-to-Image Translation with Multi-Path Consistency Regularization.” In this paper, the researchers propose a new loss function (i.e., a way to evaluate how the GAN performs on the dataset) called multi-path consistency loss. For translation between two domains X and Z, the authors introduce a third “auxiliary domain” Y so that, in addition to the direct translation from X to Z, the image can also be translated indirectly from X to Y to Z.

The proposed loss function measures the difference between direct translation (X to Z) and indirect translation (X to Y to Z). In the figure below (reproduced from the paper), the GAN transforms the hair color of the subject in the image. Ideally, the direct translation (from brown to blond) should be the same as the indirect translation, using black as the auxiliary domain (from brown to black to blond). Stated differently, we want the subject’s hair to be the same shade of blond, regardless of the number of translations. Below, the set of images on the left is the GAN result without the authors’ proposed constraint (multi-path consistency loss). It shows that the output from direct translation is visually different from indirect translation: The subject’s hair ends up being two different shades of blond. The set of images on the right, the GAN result with multi-path consistency loss, shows the similarity in output: The subject’s hair ends up being the same shade of blond.

Figure from the paper in “Image-to-Image Translation with Multi-Path Consistency Regularization” comparing the direct and indirect translation approaches.
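To make the idea concrete, here is a minimal sketch of how such a consistency term might be computed, assuming PyTorch and three illustrative generator networks for the X-to-Z, X-to-Y, and Y-to-Z translations. The network names and the choice of an L1 distance are assumptions for illustration, not the paper's exact formulation.

```python
import torch.nn.functional as F

def multi_path_consistency_loss(x, g_xz, g_xy, g_yz):
    """Penalize disagreement between the direct and indirect translations.

    g_xz, g_xy, g_yz are generator networks for the X->Z, X->Y, and Y->Z
    translations (illustrative names, not the authors' code).
    """
    direct = g_xz(x)           # translate X -> Z in one step
    indirect = g_yz(g_xy(x))   # translate X -> Y -> Z via the auxiliary domain
    return F.l1_loss(direct, indirect)
```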

Tabular data

GANs can also be used to generate tabular data, in an application that can help preserve privacy. In the paper “Data Synthesis based on Generative Adversarial Networks,” researchers propose a GAN-based method called table-GAN, which uses GANs to synthesize tabular data with similar statistical properties as the original tabular data while minimizing information leakage of the original data. Downstream machine learning models can be trained on the synthetic tabular data and achieve similar results as models trained on the original data, meaning the table has “model compatibility.” A tradeoff exists in this application, however, between privacy level and model compatibility. In other words, a higher privacy level (with less information leakage from the original dataset) leads to a lower degree of model compatibility, because the greater the privacy achieved, the more the synthetic data’s statistical attributes differ from the original data. Application of differential privacy, a related area of research, can help to close this gap.
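As a rough illustration of what “model compatibility” means in practice, the sketch below trains the same classifier once on real data and once on synthetic data, then compares the two models on held-out real data. It assumes pandas DataFrames and scikit-learn; the function name, column name, and choice of classifier are illustrative rather than table-GAN's actual evaluation protocol.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def model_compatibility(real_train, real_test, synthetic_train, target="label"):
    """Train the same model on real vs. synthetic tabular data (pandas
    DataFrames) and compare accuracy on held-out real data."""
    X_test, y_test = real_test.drop(columns=[target]), real_test[target]
    scores = {}
    for name, df in [("real", real_train), ("synthetic", synthetic_train)]:
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(df.drop(columns=[target]), df[target])
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores  # the closer the two scores, the better the model compatibility
```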

Sound and speech data

GANs can also be used to generate sound data. For example, in the application of text-to-speech, GANs can be used to generate human speech audio. In the paper “High Fidelity Speech Synthesis with Adversarial Networks,” researchers introduce a GAN called GAN-TTS that generates high-fidelity human speech. The GAN-TTS architecture includes a generator that produces audio of human speech, as well as discriminators that judge how realistically the audio imitates human speech, including its linguistic and pitch features. An example sound clip of human speech generated by GAN-TTS is available online.

Why use GANs?

There are several reasons to use GANs, including addressing data scarcity, ensuring data privacy protection, and augmenting data.

Data scarcity

GANs can be used to generate synthetic training datasets whenever data is scarce or expensive to obtain. Such scarcity arises in several fields, including medicine (where certain scans may be expensive to perform or certain diseases appear infrequently) and environmental sustainability (where images of natural disasters are scarce).

Data privacy protection

GANs can generate data that has the same distribution as real user data. This is helpful when performing analyses on datasets that contain personal information. For datasets containing customer information, for example, training Machine Learning models on GAN-generated data can protect customer privacy, because actual customer records are never exposed to the downstream models. Operating with synthetic data can also enable data sharing without breaching personal information protection rules.

Data augmentation

Traditionally, data augmentation involves rule-based transformations. For example, augmenting an image dataset might involve mirroring or translating each image, among other operations. GAN-generated data can also be used to augment an existing dataset, and GAN-generated images can offer more diversity than simple transformations of the original data. Augmenting a training dataset with GAN-generated data can help a downstream model make better predictions and become more robust.
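Below is a minimal sketch contrasting the two approaches, assuming PyTorch and torchvision. The generator stands in for any trained GAN generator, and the way labels are assigned to synthetic samples is a placeholder (a conditional GAN, for instance, would produce labeled samples directly).

```python
import torch
from torchvision import transforms

# Rule-based augmentation: mirror the image and apply small translations.
rule_based_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
])

def augment_with_gan(real_images, real_labels, generator, fake_labels, latent_dim=100):
    """Append GAN-generated samples to a real training set (illustrative sketch)."""
    noise = torch.randn(len(fake_labels), latent_dim)
    fake_images = generator(noise).detach()      # synthetic images from the trained generator
    images = torch.cat([real_images, fake_images], dim=0)
    labels = torch.cat([real_labels, fake_labels], dim=0)
    return images, labels
```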

GAN projects at Microsoft

Applications of GANs are in use in several projects at Microsoft.

Project GeNeVA

One area of GAN research focuses on text-to-image translation: the input to the GAN is a textual instruction, and the output is an image generated from that text. One project from Microsoft Research, Generative Neural Visual Artist (GeNeVA), is a GAN-powered conversational technology that lets a user iteratively build up an image by issuing a series of instructions.

The key point of GeNeVA is the iterative process. Prior work in this area has focused mainly on one-step generation, but GeNeVA allows the user to continuously modify the generated image by providing additional instructions to the system. This requires the system to recall information from previous instructions to correctly process current and future instructions.

Figure from the Microsoft Research Blog showing how the GeNeVA-GAN generates an image iteratively through instructions from the user.

Privacy-related projects

Another area of GAN research relates to privacy. As mentioned earlier, synthetic data from GANs can be used for downstream tasks (e.g., training a Machine Learning model) instead of using the original data, which protects the privacy of the original dataset. An emerging line of research has shown, however, that GANs can be victims of membership inference attacks, which are adversarial attacks that could compromise the privacy of samples in the dataset.

Sensitive information from the original dataset can be leaked if the GAN-generated synthetic dataset is too similar to the training set that contains it. A recent paper from researchers at Microsoft, “privGAN: Protecting GANs from membership inference attacks at low cost,” addresses this vulnerability with a GAN architecture called privGAN. In privGAN, the generator is trained not only to produce realistic data but also to defend against membership inference attacks, while maintaining the quality of the synthetic data as measured by performance on downstream Machine Learning tasks. In this way, privGAN helps prevent the membership inference attacks that risk compromising data privacy.
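As an illustration of the attack intuition only (not privGAN's defense), the toy sketch below assumes white-box access to an overfit discriminator: if it assigns an unusually high “real” score to a candidate record, an attacker may infer that the record was part of the training set. The function name and threshold are hypothetical.

```python
import torch

def membership_inference_score(candidate, discriminator, threshold=0.9):
    """Score a single candidate record with the discriminator (toy sketch)."""
    with torch.no_grad():
        score = discriminator(candidate).item()  # discriminator's "realness" score
    return score, score > threshold              # True suggests a likely training-set member
```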

Limitations of GANs

GANs can have several limitations, in both implementation and application.

First, as with most deep learning models, training GANs can be hardware- and time-intensive, and the cost depends on the desired output: Training a GAN to generate full HD images, for example, takes longer than training one to generate lower-resolution images. The silver lining is that after spending the time and resources to train a GAN, generating the synthetic data is quick and lightweight. (The generation process is covered in detail in Part 2 of this article series.)
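Once the generator has been trained, producing new samples is just a forward pass through the network, as in this minimal illustration (reusing the generator and latent_dim names from the earlier training-loop sketch):

```python
import torch

with torch.no_grad():                        # no gradients are needed at generation time
    noise = torch.randn(64, latent_dim)      # 64 random latent vectors
    samples = generator(noise)               # 64 synthetic images in a single forward pass
```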

Second, training the model is not a stable process, and even after training, a GAN may not produce useful or realistic output. A related failure mode is when a GAN generates data with little or no diversity, a problem commonly known as mode collapse (which will also be covered in Part 2 of this article series).

Third, training a GAN can involve a large amount of training data. For example, generating high-quality full HD images requires a larger dataset than generating low-resolution images.

Note, however, that the prior three limitations are not specific to GANs: They apply to most neural network–based methods.

A final limitation applies when generating synthetic data for privacy preservation: there is a tradeoff between utility and privacy. A dataset with a higher degree of privacy differs more from the underlying real dataset than one with a lower degree of privacy, and performing analysis on a synthetic dataset whose distribution differs significantly from the real distribution may not yield results that are as useful, meaning that utility is diminished.

Conclusion

In this article I have provided an overview of the fundamentals of GANs, including some use cases and potential drawbacks. In the concluding article of this two-part series, my colleague Mahmoud Mohammadi covers the inner workings of GANs in more detail. With that understanding of how GANs function, we hope you will be equipped to use GANs in your own work.

Daniel Huang is on LinkedIn.
