Generating synthetic tabular data with GANs — Part 2

Fabiana Clemente
YData
May 8, 2020

This is part 2 of 2 of the tabular data generation with GANs webinar. In part 1 we gave an overview of what GANs are, their application scope, and some of the most commonly used Generative Adversarial Network architectures.

Now that we have a pretty good idea of the landscape, it's time to cover, in part 2, how to implement a GAN to generate synthetic tabular data. For the examples below, two GAN architectures were chosen: the Vanilla GAN and the Conditional GAN.

You are already pretty familiar with the dataset that will be used: Credit Card Fraud from Kaggle. The dataset contains transactions made by credit cards in September 2013 by European cardholders. It holds a total of 284,807 transactions, with only 492 of them labeled as fraud, resulting in a highly unbalanced dataset where the positive class (fraud) accounts for only 0.17% of all transactions. It would be nice to be able to augment these fraud events, right?

The full dataset includes only numerical input variables, which are the result of a PCA transformation; the only features that have not been transformed are ‘Time’ and ‘Amount’. The dataset has no missing values, which eases our task. Another important point is the skewness of the ‘Amount’ variable.

For the purpose of using GANs to generate synthetic data, we’ll only use the minority class: the fraudulent cases.

Transformation to de-skew the Amount variable
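The article's transform is shown as an image, so here is a minimal sketch of one common way to de-skew a non-negative monetary variable, assuming a simple log transform is acceptable (the original may use a different method, such as a power transform):

```python
import numpy as np

def deskew_amount(amount: np.ndarray) -> np.ndarray:
    """Reduce the right skew of a non-negative variable with a log transform."""
    return np.log1p(amount)  # log(1 + x) is safe for zero-valued amounts

# Illustrative heavily right-skewed sample of transaction amounts
amounts = np.array([0.0, 1.0, 9.99, 149.62, 2125.87])
deskewed = deskew_amount(amounts)
```

The transform is monotonic, so it preserves the ordering of amounts while compressing the long right tail.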

Now that we are familiar with the dataset, let's start with the GANs. Both architectures covered next were implemented using TensorFlow 2.0.

Vanilla GAN

As explained before, the Vanilla GAN includes two different networks in its architecture, a Discriminator and a Generator.

For the Generator, a network with 4 Dense layers was used, as depicted in the code snippet below:

Vanilla GAN generator. Any other network can be used.
Vanilla GAN generator summary.
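Since the original snippet is an image, here is a hedged sketch of what a 4 Dense layer generator can look like in TensorFlow 2. The layer widths (128/256/512), noise dimension, and data dimension are illustrative assumptions, not the article's exact values:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(noise_dim: int, data_dim: int) -> tf.keras.Model:
    """4 Dense layer generator: maps random noise to one synthetic table row.

    Widths are assumptions for illustration; swap in the article's values.
    """
    return tf.keras.Sequential([
        layers.Dense(128, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(512, activation="relu"),
        layers.Dense(data_dim),  # linear output: one unit per table column
    ])

generator = build_generator(noise_dim=32, data_dim=29)
fake_rows = generator(tf.random.normal((16, 32)))  # batch of 16 synthetic rows
```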

On the other hand, the Discriminator was also implemented as a 4 Dense layer network:

Vanilla GAN discriminator. Any other network can be used.
Vanilla GAN discriminator summary
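Again as a sketch, with illustrative widths mirroring the generator in reverse (an assumption, not the article's exact snippet):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator(data_dim: int) -> tf.keras.Model:
    """4 Dense layer discriminator: scores a row as real (1) or fake (0)."""
    return tf.keras.Sequential([
        layers.Dense(512, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # probability of "real"
    ])

discriminator = build_discriminator(data_dim=29)
scores = discriminator(tf.random.normal((16, 29)))  # one score per row
```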

The full code for GAN architecture, including the training step, can be found here.

The image below depicts the training for Vanilla GAN with the following hyperparameters:

Vanilla GAN train
  • Batch size: 128
  • Epochs num: 500
  • Gen LR: 5e-4
  • Disc LR: 5e-4
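As a sketch of what one adversarial training step can look like in TensorFlow 2, using the learning rates listed above (the stand-in networks and shapes are illustrative; the article's full training loop is linked earlier):

```python
import tensorflow as tf
from tensorflow.keras import layers

NOISE_DIM, DATA_DIM, BATCH = 32, 29, 128

# Minimal stand-in networks so the snippet is self-contained; any
# generator / discriminator pair with matching shapes works here.
generator = tf.keras.Sequential([
    layers.Dense(64, activation="relu"), layers.Dense(DATA_DIM)])
discriminator = tf.keras.Sequential([
    layers.Dense(64, activation="relu"), layers.Dense(1, activation="sigmoid")])

bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(5e-4)  # Gen LR from the list above
d_opt = tf.keras.optimizers.Adam(5e-4)  # Disc LR from the list above

def train_step(real_rows: tf.Tensor):
    noise = tf.random.normal((BATCH, NOISE_DIM))
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_rows = generator(noise, training=True)
        real_scores = discriminator(real_rows, training=True)
        fake_scores = discriminator(fake_rows, training=True)
        # Discriminator: push real -> 1 and fake -> 0
        d_loss = bce(tf.ones_like(real_scores), real_scores) + \
                 bce(tf.zeros_like(fake_scores), fake_scores)
        # Generator: fool the discriminator (fake -> 1)
        g_loss = bce(tf.ones_like(fake_scores), fake_scores)
    d_opt.apply_gradients(zip(
        d_tape.gradient(d_loss, discriminator.trainable_variables),
        discriminator.trainable_variables))
    g_opt.apply_gradients(zip(
        g_tape.gradient(g_loss, generator.trainable_variables),
        generator.trainable_variables))
    return g_loss, d_loss

g_loss, d_loss = train_step(tf.random.normal((BATCH, DATA_DIM)))
```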

Conditional GAN

In Vanilla GANs, there is no control over the modes of the data to be synthesized. Conditional GANs (CGAN) introduce the label y as an additional input parameter to both the Generator and the Discriminator. This gives the GAN a head start on what to look for and improves the overall data generation process.

GAN vs Conditional GAN

For this architecture, it was decided to also use a Generator with 4 Dense layers:

Conditional GAN generator. Any other network can be used.

And a Discriminator, also with 4 Dense layers, as per the definition below:

Conditional GAN discriminator. Any other network can be used.
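The conditional networks are shown as images in the original; the key difference from the Vanilla GAN is that the label is fed in alongside the noise (or data). A sketch of a conditional generator, with illustrative widths and an assumed one-hot label encoding:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cgan_generator(noise_dim: int, n_classes: int,
                         data_dim: int) -> tf.keras.Model:
    """Conditional generator: the class label is concatenated with the noise
    so the network can target a specific mode of the data."""
    noise = layers.Input(shape=(noise_dim,))
    label = layers.Input(shape=(n_classes,))  # one-hot encoded label
    x = layers.Concatenate()([noise, label])
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(512, activation="relu")(x)
    out = layers.Dense(data_dim)(x)
    return tf.keras.Model([noise, label], out)

gen = build_cgan_generator(noise_dim=32, n_classes=2, data_dim=29)
z = tf.random.normal((8, 32))
y = tf.one_hot([0, 1, 0, 1, 0, 1, 0, 1], depth=2)  # which class to generate
rows = gen([z, y])
```

The conditional discriminator follows the same pattern: it concatenates the row with its label before the Dense stack.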

As we don’t have labels to condition the training on (we’re only using the minority class), we needed to compute them; in this case, the KMeans algorithm was used to create two classes from the fraudulent events. You can find the step-by-step implementation here.
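The labeling step can be sketched as follows; the synthetic stand-in data below is hypothetical (in the article, these would be the 492 minority-class fraud rows):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the fraud rows: two loose groups in 5 dimensions
rng = np.random.default_rng(42)
fraud_rows = np.vstack([rng.normal(0, 1, (50, 5)),
                        rng.normal(5, 1, (50, 5))])

# Split the single fraud class into two pseudo-labels to condition the CGAN on
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(fraud_rows)
```

Each fraud row now carries a cluster label (0 or 1), which plays the role of y in the CGAN.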

The full code for the implementation of this Conditional GAN can be found here.

Comparing GANs training

To compare both GAN architectures, we trained them for around 5,000 epochs and examined the results along the way. The figure below depicts the actual fraud data and the fraud data generated by the different GAN architectures as training progresses.

At step 0, all of the generated data shows the normal distribution of the random input fed into the generators. As training evolves, we can observe that both GANs start to learn the shape and range of the actual data, but then collapse towards a much smaller distribution: the generator has learned a small range of data that the discriminator has a hard time detecting as fake. Although the CGAN architecture does a little better, spreading out and approaching the distribution of each fraud data class, it also ends up collapsing by step 5000.

Training process of both GAN architectures over 5000 epochs

Statistically, they are quite similar around epoch 500.

Mean differences per variable between synthetic and real datasets

Now you are wondering, how can I check the quality of my generated synthetic dataset?

The figure above compares, in terms of the mean, how statistically similar the real dataset and the synthetic dataset generated by the Conditional GAN are. It’s a fairly simple validation, but there are more robust methods that measure other aspects of dataset similarity. These can be as straightforward as distance metrics, such as the Euclidean distance, or divergence metrics, such as the Kullback-Leibler divergence.
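Both kinds of metric can be computed in a few lines; this sketch uses randomly generated stand-ins for the real and synthetic datasets, and histogram estimates for the KL divergence:

```python
import numpy as np
from scipy.spatial.distance import euclidean
from scipy.stats import entropy  # entropy(p, q) gives KL(p || q)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, (1000, 3))       # stand-in real data
synthetic = rng.normal(0.1, 1.1, (1000, 3))  # stand-in generator output

# Euclidean distance between the per-column means of the two datasets
mean_distance = euclidean(real.mean(axis=0), synthetic.mean(axis=0))

# KL divergence between histogram estimates of one column's distribution
bins = np.linspace(-5, 5, 30)
p, _ = np.histogram(real[:, 0], bins=bins, density=True)
q, _ = np.histogram(synthetic[:, 0], bins=bins, density=True)
eps = 1e-9  # avoid division by zero in empty bins
kl = entropy(p + eps, q + eps)
```

Lower values on both metrics indicate a synthetic dataset closer to the real one; KL divergence is zero only when the two distributions match.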

Nevertheless, there is one that I personally consider more informative: the so-called Train Synthetic Test Real (TSTR) approach. The basic idea of TSTR is to train a model on the synthetic data generated by the GAN and then test that model on a held-out set of real data.
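A minimal TSTR sketch, using hypothetical stand-ins for the real and synthetic datasets (in practice, `synthetic_X, synthetic_y` come from the trained GAN and `real_X, real_y` from the original dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the two datasets
rng = np.random.default_rng(1)
real_X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(2, 1, (200, 4))])
real_y = np.array([0] * 200 + [1] * 200)
synthetic_X = real_X + rng.normal(0, 0.3, real_X.shape)  # pretend GAN output
synthetic_y = real_y

# Hold out real data for testing only
_, X_test, _, y_test = train_test_split(
    real_X, real_y, test_size=0.3, random_state=0)

# Train on synthetic, test on real
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(synthetic_X, synthetic_y)
tstr_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```

If the TSTR score is close to the score of the same model trained on real data, the synthetic dataset captured the signal that matters for the downstream task.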

Visually, dimensionality reduction techniques such as PCA and t-SNE can also give us good insight into the quality of the synthetic dataset.
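The idea is to project both datasets into the same low-dimensional space and check whether the point clouds overlap. A sketch with stand-in data (the scatter plot itself, e.g. with matplotlib, is omitted):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
real = rng.normal(0, 1, (300, 10))       # stand-in real data
synthetic = rng.normal(0, 1, (300, 10))  # stand-in generator output

# Fit PCA on the real data only, then project both datasets into the
# same 2-D space; overlapping clouds suggest similar distributions
pca = PCA(n_components=2).fit(real)
real_2d = pca.transform(real)
synth_2d = pca.transform(synthetic)
```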

Conclusion

The results shown in this article, although very simple, demonstrate the potential of GANs to generate synthetic data with real value for Machine Learning tasks, which can be shared in a privacy-preserving manner.

Image generation is not an easy task, and generating tabular data can be even more challenging. Tabular data usually contains a mix of discrete and continuous columns, and many of the existing statistical and deep neural network models fail to properly model this type of data. Many of these challenges are familiar to Data Scientists, and of course there is a lot more about GANs that we will cover in more depth in the next articles.

In case you want to check out other examples of synthetic data, you can find some cool synthesized datasets and comparisons on our website!

Fabiana Clemente is Chief Data Officer at YData.

Making data available with privacy by design.

YData helps data science teams deliver ML models, simplifying data acquisition, so data scientists can focus their time on things that matter.
