Generating synthetic tabular data with GANs — Part 1

Fabiana Clemente · Published in YData
6 min read · May 4, 2020

This is Part 1 of a two-part blog post.

Last Wednesday (April 29th) I gave my first webinar for the Data Science Portugal community, and I have to thank DSPT for the amazing opportunity. In today’s post, I’ll be summarizing my presentation and making all the related links available.

Generative Models

Before jumping into Generative Adversarial Nets (GANs), let’s define what “generative” means in the first place. A generative model belongs to a class of models that contrasts with discriminative ones. Let me give you a concrete example:

Generative vs. discriminative models

Classify whether an animal is a cat or a dog

In a classification problem, where we want to classify whether a set of characteristics belongs to a cat or a dog:

  • A generative model builds a model of what dogs look like and then a model of what cats look like.
  • A discriminative model instead learns a decision boundary that separates cats from dogs.

Formally speaking, this means that, given a set of data instances X and a set of labels Y:

  • Generative models capture the joint probability of X and Y, p(X, Y), or just p(X) if there are no labels.
  • Discriminative models capture the conditional probability of Y given X, p(Y | X), as the sketch after this list illustrates.
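
To make the distinction concrete, here is a minimal scikit-learn sketch of the cat/dog example. The feature values are made up, and picking GaussianNB and LogisticRegression as stand-ins for the generative and discriminative families is my own illustrative choice:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative: models p(X | Y) and p(Y)
from sklearn.linear_model import LogisticRegression   # discriminative: models p(Y | X) directly

# Hypothetical features: [weight_kg, ear_length_cm]; label 0 = cat, 1 = dog
X = np.array([[4.0, 5.5], [3.5, 6.0], [30.0, 10.0], [25.0, 9.0]])
y = np.array([0, 0, 1, 1])

generative = GaussianNB().fit(X, y)              # fits one Gaussian per class, i.e. the joint p(X, Y)
discriminative = LogisticRegression().fit(X, y)  # fits only the decision boundary

sample = np.array([[20.0, 8.0]])
print(generative.predict_proba(sample))      # p(Y | X) derived from the joint via Bayes' rule
print(discriminative.predict_proba(sample))  # p(Y | X) modeled directly
```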

A generative model includes the distribution of the data itself, and tells you how likely a given example is. For example, models that predict the next word in a sequence are typically generative models, as they are able to assign a probability to a sequence of words. Of course, this is a very general definition of generative models, but it’s pretty straightforward to understand.

There are many different generative models, which, in a nutshell, all aim to learn the true data distribution of the dataset. The figure below depicts a taxonomy of deep generative models, which can have either an explicit or an implicit density.

Deep Generative models taxonomy

The focus of this post will be on GANs.

Generative Adversarial Networks

Generative adversarial nets (GANs) were introduced in 2014 by Ian Goodfellow and his colleagues as a novel way to train a generative model, that is, a model able to generate data.

These deep generative models consist of two adversarial models that compete with each other. The generative model (Generator) captures the data distribution, while the discriminative model (Discriminator) estimates the probability of a sample being real or fake. This is a min-max game: the Discriminator tries to maximize the objective while the Generator tries to minimize it.
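
Formally, this corresponds to the minimax objective from the original 2014 paper, in which D maximizes the value function V(D, G) while G minimizes it (z is noise drawn from a prior p_z, and G(z) is a generated sample):

```latex
\min_G \max_D V(D, G) =
    \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] +
    \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
```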

GANs are widely used to generate images from scratch, but they can also be used to generate sound, speech, text, and so on. They have proven to be very useful for semi-supervised learning, fully supervised learning, and reinforcement learning. As Yann LeCun described them, GANs are “the most interesting idea in the last 10 years in machine learning.”

  • From selfie to anime (check out the project here)
Generate anime characters from a selfie using different GAN architectures

With this project you can generate an anime character from a selfie. The results are quite impressive, and each of the characters featured comes from a different GAN architecture.

  • This person is not real (check out the project here)
This Person Does Not Exist, an NVIDIA project

NVIDIA has a side project called This Person Does Not Exist, which generates images of people with AI, in particular using a StyleGAN. These generated pictures are not of real-life people, although at first glance they look like real people.

  • CycleGAN (check out the project here)
CycleGAN-generated images

CycleGAN is a technique that involves the automatic training of image-to-image translation models without paired examples. The models are trained in an unsupervised manner using a collection of images from the source and target domain that do not need to be related in any way.

Synthetic tabular data generation

Now that we have a pretty good overview of what generative models are and the power of GANs, let’s focus on regular tabular synthetic data generation. In one of my previous articles you can get an idea of what synthetic data is and why it might be useful for you.

With the recent advances in the GANs field, such as learning to transfer properties, as we’ve seen, it becomes very tempting to apply them to data science problems. Further in this article I’ll be sharing some architectures that can be used to generate synthetic data, as well as their challenges and drawbacks.

DCGAN

For those who have already worked on image generation, the Deep Convolutional Generative Adversarial Network, or DCGAN, is quite familiar. It’s known for being one of the most popular and successful network designs for GANs, and it uses strided convolutions for downsampling and transposed convolutions for upsampling.

DCGAN architecture diagram

Although it’s widely used for images, some changes to its original architecture are required in order to use it with successful results on tabular data:

  • An auxiliary discriminator is added. And why, you might ask? This new classifier is responsible for keeping the semantic consistency of the generated records. A simple but effective example of the need for such an auxiliary network is preventing a record labeled “Male” from being assigned a disease such as “Uterine cancer”.
  • The architecture of the auxiliary discriminator is the same as the Discriminator’s, with a slight difference concerning the sigmoid function output. This new classifier can improve the quality of the generated records significantly, as the sketch after this list illustrates.
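
To make the idea tangible, here is a hypothetical PyTorch sketch of the three networks. The fully connected layers, the widths, and the dimensions are illustrative assumptions for readability, not the exact architecture discussed in the webinar:

```python
import torch
import torch.nn as nn

N_FEATURES, NOISE_DIM = 16, 32  # assumed tabular width and latent size

class Generator(nn.Module):
    """Maps a noise vector to one synthetic tabular record."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM, 64), nn.ReLU(),
            nn.Linear(64, N_FEATURES),
        )
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Scores records as real vs. fake (sigmoid output)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_FEATURES, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )
    def forward(self, x):
        return self.net(x)

class AuxiliaryDiscriminator(Discriminator):
    """Same architecture; its sigmoid output instead scores semantic
    consistency (e.g. flagging a 'Male' record paired with 'Uterine
    cancer'), trained against records that violate such rules."""
    pass

G, D, D_aux = Generator(), Discriminator(), AuxiliaryDiscriminator()
fake = G(torch.randn(8, NOISE_DIM))
print(D(fake).shape, D_aux(fake).shape)  # both torch.Size([8, 1])
```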

WGAN

Wasserstein GAN (WGAN) brought some changes to the vanilla GAN architecture, described below:

  • A new loss function is introduced, based on the Wasserstein-1 distance. In this case the output of the Discriminator D is no longer a probability of being real or fake, but rather an unbounded real-valued score. For this type of algorithm, it’s not rare to see the Discriminator being called a critic instead.
  • The optimization problem is now constrained so that the critic is a Lipschitz function, which is enforced by clipping the weights of the discriminator.
  • And last but not least, an alternative optimizer is used: instead of, for example, the Adam optimizer, which can suffer from momentum-related convergence problems here, RMSProp is used.

Overall, the Wasserstein Generative Adversarial Network (WGAN) is an extension of the original GAN that improves the model’s training stability and provides a loss function that correlates with the quality of the generated data.
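
As a concrete illustration of the three changes, here is a minimal PyTorch sketch of a single critic update. The network sizes are my own assumptions; the clipping constant (0.01) and the RMSProp learning rate (0.00005) follow the original WGAN paper:

```python
import torch
import torch.nn as nn

N_FEATURES, NOISE_DIM, CLIP = 16, 32, 0.01

# The critic has no sigmoid: it outputs an unbounded score, not a probability
critic = nn.Sequential(nn.Linear(N_FEATURES, 64), nn.ReLU(), nn.Linear(64, 1))
generator = nn.Sequential(nn.Linear(NOISE_DIM, 64), nn.ReLU(), nn.Linear(64, N_FEATURES))
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)  # RMSProp instead of Adam

real = torch.randn(64, N_FEATURES)  # stand-in for a minibatch of real records
fake = generator(torch.randn(64, NOISE_DIM)).detach()

# Wasserstein loss: the critic maximizes E[critic(real)] - E[critic(fake)],
# so we minimize the negative of that difference
loss_c = -(critic(real).mean() - critic(fake).mean())
opt_c.zero_grad()
loss_c.backward()
opt_c.step()

# Weight clipping keeps the critic (roughly) within a Lipschitz ball
for p in critic.parameters():
    p.data.clamp_(-CLIP, CLIP)
```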

There are some WGAN variations that have been shown to improve the overall record generation results: WGAN with gradient penalty (WGAN-GP) and WCGAN.
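
For reference, here is a sketch of the gradient penalty used by WGAN-GP, which replaces weight clipping by pushing the critic’s gradient norm on interpolated samples toward 1. The penalty weight of 10 follows the WGAN-GP paper; the function reuses the critic, real, and fake tensors from the sketch above:

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP penalty: ((||grad critic(x_interp)|| - 1)^2), averaged."""
    eps = torch.rand(real.size(0), 1)  # per-sample mixing weight, broadcast over features
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads, = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores), create_graph=True,
    )
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Usage with the previous sketch: add the penalty to the critic loss
# loss_c = -(critic(real).mean() - critic(fake).mean()) + gradient_penalty(critic, real, fake)
```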

Conclusion

So far this article has covered the concept of generative models, as well as a new way to train and improve their generative power: generative adversarial nets. GANs have a wide range of applications, from semi-supervised learning to reinforcement learning. But what about data generation? When it comes to synthesizing tabular data, this article has covered two of the most commonly used architectures, DCGAN and WGAN.

Part 2 will explore the use of the vanilla GAN and the Conditional GAN to synthesize tabular data, along with the specific challenges inherent to it.

Fabiana Clemente is Chief Data Officer at YData.

Making data available with privacy by design.

YData helps data science teams deliver ML models, simplifying data acquisition, so data scientists can focus their time on things that matter.
