Generating a Dataset with GANs

KJ Schmidt
Published in The Startup
May 9, 2020
Generative Adversarial Networks (GANs) — Adobe Stock Image

Having a dataset is a key requirement for training any sort of machine learning model. But what about cases where you don't have access to the data? Not being able to use a dataset because of data regulation and privacy concerns poses a real problem when trying to apply machine learning. How can we train models without being able to use the relevant dataset?

This is where deep learning can help!

Using generative adversarial networks, or GANs, we can generate a dataset for training. We can solve those issues by creating an entirely new, synthetic dataset that is based on the original and retains its important information.

What are GANs?

GANs are a class of machine learning systems in which two neural networks, a generator and a discriminator, are trained against each other: the generator tries to produce convincing fake samples, and the discriminator tries to tell them apart from real ones. The technique is known for learning to generate new data with the same statistics as the training set. GANs are most often used for images, but we wanted to try them on numerical data.
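To make that concrete, here is a minimal sketch of how a GAN for tabular data is typically wired together. It uses Keras purely for illustration, and the layer sizes, noise dimension, and learning rates are placeholder values, not our exact configuration:

```python
# A minimal GAN skeleton for tabular data (illustrative; sizes are placeholders).
from tensorflow.keras import layers, models, optimizers

NOISE_DIM = 100   # assumed size of the random noise vector fed to the generator
N_FEATURES = 9    # assumed number of columns in one generated row

# The generator maps random noise to one fake data row.
generator = models.Sequential([
    layers.Dense(256, activation="relu", input_shape=(NOISE_DIM,)),
    layers.Dense(N_FEATURES, activation="linear"),
])

# The discriminator scores a row as real (1) or fake (0).
discriminator = models.Sequential([
    layers.Dense(256, activation="relu", input_shape=(N_FEATURES,)),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer=optimizers.Adam(0.001), loss="binary_crossentropy")

# Stacked model used to update the generator: the discriminator is frozen here,
# and the generator is trained to make the discriminator label its output as real.
discriminator.trainable = False
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer=optimizers.Adam(0.001), loss="binary_crossentropy")
```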

Our Experiment

For our experiment, we worked with the Pima Indians Diabetes Database on Kaggle. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases and contains many diagnostic measurements as well as predictor variables such as the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
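For reference, loading this dataset with pandas looks roughly like the snippet below; diabetes.csv is the file name the Kaggle download typically uses, and Outcome is the dataset's label column (a sketch, not necessarily our exact preprocessing):

```python
# Load the Pima dataset (assumed to be saved locally as diabetes.csv).
import pandas as pd

df = pd.read_csv("diabetes.csv")
X_real = df.drop(columns=["Outcome"]).values  # the diagnostic/predictor columns
y_real = df["Outcome"].values                 # 1 = diabetic, 0 = not diabetic
```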

We wanted to create an entirely new dataset based on this original dataset that retains its important information, which would be useful in solving the problem of restricted access to data due to data regulation and privacy concerns.

We based our approach on the paper Data Augmentation Using GANs by Fabio Henrique K. dos S. Tanaka.

Training the GAN

The original paper examined four different primary architectures:

  • One 256-dimensional hidden layer
  • One 128-dimensional hidden layer
  • Two hidden layers of 128 and 256 dimensions
  • Two hidden layers of 256 and 512 dimensions
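These four variants can be written down compactly as lists of hidden-layer sizes passed to a small builder function, shown here applied to the generator (the same idea works for the discriminator). The sketch below again assumes Keras, with placeholder noise and feature dimensions:

```python
# The four hidden-layer configurations, expressed as generator builders.
from tensorflow.keras import layers, models

ARCHITECTURES = {
    "256":     [256],
    "128":     [128],
    "128-256": [128, 256],
    "256-512": [256, 512],
}

def build_generator(hidden_sizes, noise_dim=100, n_features=9):
    """Stack the given hidden layers between the noise input and the output row."""
    model = models.Sequential()
    for i, size in enumerate(hidden_sizes):
        if i == 0:
            model.add(layers.Dense(size, activation="relu", input_shape=(noise_dim,)))
        else:
            model.add(layers.Dense(size, activation="relu"))
    model.add(layers.Dense(n_features, activation="linear"))
    return model

generators = {name: build_generator(sizes) for name, sizes in ARCHITECTURES.items()}
```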

We wanted to see whether we could improve the fake data generated by our GAN by taking the best of these architectures and tuning it further.

We experimented with altering the batch size and learning rate across models with each of the hidden-layer architectures.
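As a rough illustration of that sweep, the loop below tries candidate batch sizes and learning rates for one architecture; the candidate values and the train_gan helper are assumptions standing in for the actual training code:

```python
# Hyperparameter sweep sketch; train_gan is a hypothetical helper that trains
# one GAN configuration and returns the generated (fake) dataset.
import itertools

batch_sizes = [16, 32, 64]               # assumed candidate values
learning_rates = [0.0002, 0.001, 0.002]  # assumed candidate values

fake_datasets = {}
for batch_size, lr in itertools.product(batch_sizes, learning_rates):
    fake_datasets[(batch_size, lr)] = train_gan(hidden_sizes=[256],
                                                batch_size=batch_size,
                                                learning_rate=lr)
```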

The four architectures we experimented with

Evaluation

To measure the success of the fake dataset produced by our GAN, we trained a classification and regression tree (CART) on the fake dataset, and tested the tree on the real dataset.
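In scikit-learn terms, that train-on-fake, test-on-real check looks roughly like this (a sketch, not our exact evaluation code):

```python
# Train a CART on GAN-generated rows, then score it on the real rows.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def evaluate_fake_data(X_fake, y_fake, X_real, y_real):
    """Return accuracy of a CART trained on fake data and tested on real data."""
    cart = DecisionTreeClassifier(random_state=0)
    cart.fit(X_fake, y_fake)
    return accuracy_score(y_real, cart.predict(X_real))
```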

Results

So how did we do? We made sure that our generated fake dataset had a class distribution similar to that of the original real data. In the image below, you can see that across different subcategories, the distributions of the fake data (the blue bars) are pretty close to those of the real data (the red bars). Not bad!
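A comparison like the one in that image can be produced by overlaying histograms of the same column from the real and fake data. The sketch below assumes two pandas DataFrames with matching columns (the names real_df and fake_df are hypothetical):

```python
# Overlay real (red) and fake (blue) histograms for one column.
import matplotlib.pyplot as plt
import numpy as np

def compare_column(real_df, fake_df, column, bins=10):
    """Plot the distribution of one column for the real and fake datasets."""
    edges = np.histogram_bin_edges(real_df[column], bins=bins)
    plt.hist(real_df[column], bins=edges, alpha=0.5, color="red", label="real")
    plt.hist(fake_df[column], bins=edges, alpha=0.5, color="blue", label="fake")
    plt.xlabel(column)
    plt.ylabel("count")
    plt.legend()
    plt.show()
```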

We found that the best results were produced by the single 256-dimensional hidden layer architecture, trained with a larger learning rate than the one reported in the paper.

The paper reported a classification accuracy of 74.8% on this dataset, but we were able to achieve a higher accuracy of 79.1%!

Training these models was very interesting. In the image below, you can see a cost function plot over the epochs of training, where the generator’s cost is in red and the discriminator’s cost is in blue. Notice how the generator and the discriminator are in a constant war to outdo each other; we had to ensure the parameters we chose resulted in a stabilized GAN. We ran multiple iterations of each model architecture to ensure the results we were getting were not by random chance.

The generator’s cost (red) and discriminator’s cost (blue) over epochs of training

Future Work

Based on our experiments, we think using the one 256-dimensional hidden layer architecture and a learning rate of 0.002 may be more successful in creating a new dataset that retains important information from the original.

Since the generator and discriminator are constantly doing battle, we think it might be better to terminate training dynamically, rather than arbitrarily at the 500th epoch. Otherwise, it’s possible to end training on an oscillation of the generator where the cost is quite high, which happened with this model.

As an improvement, we could establish the 500th epoch as the earliest possible endpoint, and instead end training only when we reach a generator cost that is lower than or as low as any previously seen cost, with a maximum cutoff of 600 epochs. This way, we would end up with the best generator seen during training. However, this idea requires further research.
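In rough Python, that stopping rule might look like the loop below, where train_one_epoch is a hypothetical stand-in for one generator/discriminator update pass:

```python
# Proposed stopping rule: train at least 500 epochs, then stop as soon as the
# generator cost is as low as anything previously seen, cutting off at 600.
MIN_EPOCHS, MAX_EPOCHS = 500, 600

best_g_cost = float("inf")
for epoch in range(1, MAX_EPOCHS + 1):
    g_cost, d_cost = train_one_epoch()  # hypothetical: one G/D training pass
    if epoch >= MIN_EPOCHS and g_cost <= best_g_cost:
        break  # generator cost matches or beats every cost seen so far
    best_g_cost = min(best_g_cost, g_cost)
```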

Conclusion

We were able to generate a dataset with the same key features as the original dataset using GANs. Using a learning rate of 0.002 and an architecture with a single 256-dimensional hidden layer, we were able to achieve better accuracy than the paper we based our work on. All of our code can be found on GitHub.

The Authors

This project was created by Master of Science in Artificial Intelligence (MSAI) students at Northwestern University:

Aristana Scourtas, Nayan Mehta, and KJ Schmidt
