Synthetic Data Generation with GANs

Emine Yavuz
Published in KoçDigital · Oct 6, 2022 · 10 min read

Synthetic data can be generated for many different use cases. For example, in the detection of rare diseases, it is hard to find data on affected patients. In such cases, where one of the labels of the target variable is very scarce, the model cannot perform accurately. Similarly, in the banking sector, fraud detection is a crucial task, but a model built with only a few fraud cases will not perform as we aim.

Moreover, in the context of GDPR, sharing the personal data of patients or of a bank's customers is mostly restricted. At this point, we need a solution that both balances the distribution of the target data and preserves the privacy of the individuals, which brings us to the concept of creating synthetic data.

In this post, we will examine the Generative Adversarial Network (GAN) as a way to create synthetic data.

An AI-generated image for the title "Synthetic Data Generation with GANs" (Source: wombo.art)

Why Synthetic Data?

Generating synthetic data enables us to create an equally distributed target variable, to produce as much data as we need, and to guard the privacy of the original records in cases where it is hard to build a machine learning model with the original dataset.

However, synthesizing a data set has its own challenges and problems. If the synthesized data set is too similar to the original data set, it inherits the same issues the original data set has. Hence, it is critical to check the synthesized data set and decide how to use it.

In the next section, we explore the GAN deep learning model in greater detail.

How does the GAN model work?

A Generative Adversarial Network is a deep learning model built from two opposing networks: the generator and the discriminator.

The generator synthesizes new data points from a noise vector. The discriminator then observes both the original data and the synthesized data, and its purpose is to correctly separate the synthetic data from the real data. In essence, the two networks play against each other: at each step, the generator tries to produce data that is indistinguishable from the real data, while the discriminator tries to tell whether a given sample is real or synthetic.

After each step, both networks update their weights using backpropagation.
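To make this loop concrete, here is a minimal sketch of the adversarial training step in PyTorch. Everything in it is an illustrative assumption: a toy 1-D Gaussian stands in for the real data, and the tiny networks and hyperparameters are chosen only to show the mechanics, not the setup used in our project.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
noise_dim, data_dim, batch = 16, 1, 64

# Generator: noise vector z -> synthetic sample G(z)
G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
# Discriminator: sample -> probability D(x) that the sample is real
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()
ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

for step in range(2000):
    real = torch.randn(batch, data_dim) * 0.5 + 3.0   # toy "real" data
    fake = G(torch.randn(batch, noise_dim))           # synthetic data

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0
    d_loss = bce(D(real), ones) + bce(D(fake.detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator by pushing D(fake) toward 1
    # (the common non-saturating form of "minimize log(1 - D(G(z)))")
    g_loss = bce(D(fake), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(float(G(torch.randn(1000, noise_dim)).mean()))  # should drift toward 3.0
```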

Value Function

x: A real data instance

z: An input noise vector provided to the generator

G(z): The synthetic sample the generator produces from the noise z

D(x): The probability the discriminator assigns to a sample being real

E_x, E_z: Expected values over the real data distribution and the noise distribution, respectively
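With this notation, the value function of the minimax game played by the two networks, as given in the original GAN paper (Goodfellow et al., 2014, cited in the references), is:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```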

The value function consists of two parts. The first part, where the real instance x is the input of D, is the expected value of the discriminator correctly classifying the real data. The second part takes the synthetic data generated by the generator as input: D(G(z)) is the probability of classifying the synthetic sample as real, so 1 − D(G(z)) is the probability of classifying it as synthetic, meaning that the discriminator discriminates correctly.

The purpose of the generator is to minimize the second part of the equation. The generator wants to trick the discriminator, and minimizing the second term means that the discriminator can no longer separate the synthetic data from the real data.

Conversely, the purpose of the discriminator is to maximize the whole value function. This means that the discriminator can distinguish the real data from the synthetic data.

The algorithm behind the GAN model

The networks are trained with the Mini-Batch Stochastic Gradient Descent algorithm. We will explain this algorithm step by step.

Gradient Descent is an algorithm that follows the gradient of a loss function from an initial point toward a local minimum. In other words, by calculating partial derivatives with respect to the weights, it searches for the optimum weights of the function. Stochastic Gradient Descent differs in that it randomly takes a single data point from the data set and calculates its gradient to update the weights. Finally, Mini-Batch Stochastic Gradient Descent uses a batch of a fixed number of training examples, not just one point. Unlike plain Stochastic Gradient Descent, the gradient is averaged over each batch, and the weights of the function are updated using that mean gradient.

Summary of the process:

1. A batch of a fixed number of training examples is selected

2. The generator takes this batch as input

3. The mean gradient over the batch is calculated

4. The weights of the function are updated
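As an illustration, here is a minimal NumPy sketch of mini-batch SGD fitting a linear model by least squares. The data, the model, and the learning rate are toy assumptions chosen only to show the update loop described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3x + 1 plus noise
X = rng.uniform(-1, 1, size=(1000, 1))
y = 3 * X[:, 0] + 1 + rng.normal(0, 0.1, size=1000)

w, b = 0.0, 0.0           # initial weights
lr, batch_size = 0.1, 32  # learning rate and mini-batch size

for epoch in range(20):
    idx = rng.permutation(len(X))                 # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]     # 1. select a batch
        pred = w * X[batch, 0] + b                # 2. forward pass on the batch
        err = pred - y[batch]
        grad_w = np.mean(2 * err * X[batch, 0])   # 3. mean gradient over the batch
        grad_b = np.mean(2 * err)
        w -= lr * grad_w                          # 4. update the weights
        b -= lr * grad_b

print(f"w = {w:.2f}, b = {b:.2f}")  # should approach w = 3, b = 1
```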

How did we use it?

In the context of one of our classification projects, the data we were going to use was imbalanced: one of the labels had very few examples.

The first solution could be taking the same amount of data for both labels, i.e., taking as many examples as the rarer label has.

For example, let's consider a data set with 10,000 rows whose target column contains 9,900 0's and 100 1's. We can train the model with the 100 1's and 100 of the 0's. But since 200 rows probably won't be enough to train the model, we should increase the amount of data.
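A quick sketch of this undersampling idea with pandas; the column name `target` and the feature are illustrative stand-ins for a real data set:

```python
import numpy as np
import pandas as pd

# Illustrative imbalanced data set: 9,900 zeros and 100 ones in 'target'
rng = np.random.default_rng(42)
df = pd.DataFrame({"feature": rng.normal(size=10_000),
                   "target": [0] * 9_900 + [1] * 100})

minority = df[df["target"] == 1]                          # all 100 positive rows
majority = df[df["target"] == 0].sample(n=len(minority),  # match the minority size
                                        random_state=42)
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=42)  # shuffle
print(balanced["target"].value_counts())                  # 100 of each label
```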

The second solution is synthesizing data. So, we created synthetic data using one of the GAN variants, DRAGAN, from YData's ydata_synthetic library.
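For orientation, here is a rough sketch of training a DRAGAN synthesizer with ydata_synthetic. The class and parameter names below follow the library's documented examples as we recall them, but the API has changed across versions, so treat everything here (file path, column names, hyperparameters) as assumptions and check the current ydata_synthetic docs:

```python
import pandas as pd
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
from ydata_synthetic.synthesizers.regular import RegularSynthesizer

train_df = pd.read_csv("train.csv")   # hypothetical training split
num_cols = ["amount", "age"]          # hypothetical numerical columns
cat_cols = ["channel"]                # hypothetical categorical columns

gan_args = ModelParameters(batch_size=128, lr=1e-4, noise_dim=128)
train_args = TrainParameters(epochs=300)

# DRAGAN adds a gradient penalty to stabilize discriminator training
synth = RegularSynthesizer(modelname="dragan", model_parameters=gan_args,
                           n_discriminator=3)  # discriminator updates per generator step
synth.fit(data=train_df, train_arguments=train_args,
          num_cols=num_cols, cat_cols=cat_cols)

synthetic_df = synth.sample(5000)     # draw 5,000 synthetic rows
```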

While synthesizing the data, we had some struggles building the model, and below we share these struggles along with our solutions.

Struggle #1: Data Preprocessing

First, we followed the basic data analysis steps: unique value checks, null value checks, and renaming some columns.

In the generation part, we first tried to use the original data without applying any feature engineering methods, but the similarity between the real data and the synthetic data was low. Moreover, features that must be positive in the real world had some negative values.

Solution #1: Data Preprocessing

We used the data after applying the feature engineering methods. In addition, we scaled the data before generation and checked whether the columns were grouped correctly as numerical or categorical. As a result, the synthetic data became similar to the real data.
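A minimal sketch of this kind of preprocessing with pandas and scikit-learn; the example columns and the choice of MinMax scaling are illustrative assumptions, not the project's exact pipeline:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"amount": [120.0, 35.5, 800.0],   # illustrative data
                   "age": [34, 51, 27],
                   "channel": ["web", "branch", "web"]})

# Group columns by dtype so the synthesizer treats them correctly
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

# Scale numerical features into [0, 1] before generation; inverse_transform
# can map synthetic samples back to the original ranges afterwards
scaler = MinMaxScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

print(num_cols, cat_cols)   # ['amount', 'age'] ['channel']
print(df)
```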

Struggle #2: Train Data Selection

For the modeling, we had already split the real data into two parts, train and test. After generating synthetic data from the train split, we could not decide what proportions of real and synthetic data to use for model training.

Solution #2: Train Data Selection

We addressed the problem by considering the purpose of the project, in which the recall score was what mattered. So, we built two loops, one sampling from the real data and the other sampling from the synthetic data. In a procedure similar to cross-validation, we searched for the sample-size combination with the best recall score, and we tested the model with real data the model had not yet seen.
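A hedged sketch of that search with scikit-learn; the classifier, the candidate sample sizes, and the data frame names are illustrative assumptions standing in for the project's setup:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

def best_mix(real_train, synth_train, real_valid, target="target"):
    """Search over (real, synthetic) sample sizes for the best recall.

    real_train / synth_train / real_valid are hypothetical DataFrames with
    identical columns; the candidate sizes and classifier are placeholders.
    """
    best_sizes, best_recall = None, -1.0
    for n_real in (500, 1000, 2000):          # candidate real sample sizes
        for n_synth in (500, 1000, 2000):     # candidate synthetic sample sizes
            train = pd.concat([
                real_train.sample(min(n_real, len(real_train)), random_state=0),
                synth_train.sample(min(n_synth, len(synth_train)), random_state=0),
            ])
            model = RandomForestClassifier(random_state=0)
            model.fit(train.drop(columns=target), train[target])
            recall = recall_score(real_valid[target],
                                  model.predict(real_valid.drop(columns=target)))
            if recall > best_recall:
                best_sizes, best_recall = (n_real, n_synth), recall
    # The winning mix should finally be evaluated on a separate,
    # untouched real test set the model has never seen.
    return best_sizes, best_recall
```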

Model performance: At the end of the process, we increased the recall score by 0.13 points.

Conclusion

Synthesizing data is an important method to overcome problems such as data privacy restrictions, class imbalance, or a lack of data in the existing dataset, and it can be done using GAN models. During training, a GAN uses two networks, the generator and the discriminator, each trying to outplay the other. When the generator wins, we obtain synthetic data similar to the real data and can use it whenever needed.

References

“1.5. Stochastic Gradient Descent”. scikit-learn, https://scikit-learn.org/stable/modules/sgd.html.

Abbasi, Nouman. “What is a Conditional GAN (cGAN)?” Educative: Interactive Courses for Software Developers, https://www.educative.io/answers/what-is-a-conditional-gan-cgan.

Agrawal, Raghav. “An End-to-End Introduction to Generative Adversarial Networks (GANs)”. Analytics Vidhya, 20 October 2021, https://www.analyticsvidhya.com/blog/2021/10/an-end-to-end-introduction-to-generative-adversarial-networksgans/.

Agrawal, Tanay. “GANs Failure Modes: How to Identify and Monitor Them”. neptune.ai, https://neptune.ai/blog/gan-failure-modes.

Ajay, Lakshmi. “Decoding the Basic Math in GAN — Simplified Version”. Medium, 24 February 2022, https://towardsdatascience.com/decoding-the-basic-math-in-gan-simplified-version-6fb6b079793.

Alagözlü, Mert. Stochastic Gradient Descent Variants and Applications. 2022, https://doi.org/10.13140/RG.2.2.12528.53767.

Arjovsky, Martin, et al. Wasserstein GAN. arXiv, 26 January 2017, https://doi.org/10.48550/arXiv.1701.07875.

Ashrapov, Insaf. “GANs for tabular data”. Medium, 26 March 2020, https://towardsdatascience.com/review-of-gans-for-tabular-data-a30a2199342.

Barla, Nilesh. “Generative Adversarial Networks and Some of GAN Applications: Everything You Need to Know”. neptune.ai, 21 July 2022, https://neptune.ai/blog/generative-adversarial-networks-gan-applications.

Biswal, Avijeet. “The Best Introduction to What Generative Adversarial Networks (GANs) Are”. Simplilearn.com, https://www.simplilearn.com/tutorials/deep-learning-tutorial/generative-adversarial-networks-gans.

Bok, Vladimir, and Jakub Langr. “Chapter 8. Conditional GAN · GANs in Action: Deep learning with Generative Adversarial Networks”. GANs in Action, Manning Publications, https://livebook.manning.com/book/gans-in-action/chapter-8/1.

Bourou, Stavroula, et al. “A Review of Tabular Data Synthesis Using GANs on an IDS Dataset”. Information, vol. 12, no. 9, September 2021, p. 375, https://doi.org/10.3390/info12090375.

Brownlee, Jason. “How to Code the GAN Training Algorithm and Loss Functions”. Machine Learning Mastery, 12 July 2019, https://machinelearningmastery.com/how-to-code-the-generative-adversarial-network-training-algorithm-and-loss-functions/.

Brownlee, Jason. A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size. 21 July 2017, https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/.

---. “Difference Between Backpropagation and Stochastic Gradient Descent”. Machine Learning Mastery, 1 February 2021, https://machinelearningmastery.com/difference-between-backpropagation-and-stochastic-gradient-descent/.

---. “A Gentle Introduction to Generative Adversarial Networks (GANs)”. Machine Learning Mastery, 17 June 2019, https://machinelearningmastery.com/what-are-generative-adversarial-networks-gans/.

Chiu, Michael, et al. GAN Foundations. https://www.cs.toronto.edu/~duvenaud/courses/csc2541/slides/gan-foundations.pdf.

Clemente, Fabiana. “What is going on with my GAN?” Medium, 13 July 2020, https://towardsdatascience.com/what-is-going-on-with-my-gan-13a00b88519e.

Dey, Victor. “Beginner’s Guide to Generative Adversarial Networks (GANs)”. Analytics India Magazine, 18 July 2021, https://analyticsindiamag.com/beginners-guide-to-generative-adversarial-networks-gans/.

Dillon, and Aadil Hayat. “Building a simple Generative Adversarial Network (GAN) using TensorFlow”. Paperspace Blog, https://blog.paperspace.com/implementing-gans-in-tensorflow/.

Donges, Niklas. “Gradient Descent: An Introduction to 1 of Machine Learning’s Most Popular Algorithms”. builtin, 23 July 2021, https://builtin.com/data-science/gradient-descent.

Dwivedi, Harshit. “Understanding GAN Loss Functions”. neptune.ai, https://neptune.ai/blog/gan-loss-functions.

Ferreira, Pedro. “Towards data set augmentation with GANs”. Jungle Book, 5 October 2017, https://medium.com/jungle-book/towards-data-set-augmentation-with-gans-9dd64e9628e6.

Gandhi, Rohith. “Generative Adversarial Networks — Explained”. Medium, 10 May 2018, https://towardsdatascience.com/generative-adversarial-networks-explained-34472718707a.

“Generative Adversarial Network (GAN)”. GeeksforGeeks, https://www.geeksforgeeks.org/generative-adversarial-network-gan/.

Goodfellow, Ian J., et al. Generative Adversarial Networks. arXiv, 10 June 2014, http://arxiv.org/abs/1406.2661.

Huang, Daniel. “Synthetic data generation using Generative Adversarial Networks (GANs): Part 1”. Data Science at Microsoft, 1 June 2021, https://medium.com/data-science-at-microsoft/synthetic-data-generation-using-generative-adversarial-networks-gans-part-1-47ecbf46b575.

Hui, Jonathan. “GAN — RSGAN & RaGAN (A new generation of cost function.)”. Medium, 7 July 2018, https://jonathan-hui.medium.com/gan-rsgan-ragan-a-new-generation-of-cost-function-84c5374d3c6e.

---. “GAN — Wasserstein GAN & WGAN-GP”. Medium, 14 June 2018, https://jonathan-hui.medium.com/gan-wasserstein-gan-wgan-gp-6a1a2aa1b490.

---. “GAN — Ways to improve GAN performance”. Medium, 19 June 2018, https://towardsdatascience.com/gan-ways-to-improve-gan-performance-acf37f9f59b.

---. “GAN — What is Generative Adversarial Networks GAN?” Medium, 19 June 2018, https://jonathan-hui.medium.com/gan-whats-generative-adversarial-networks-and-its-application-f39ed278ef09.

---. “GAN — Why it is so hard to train Generative Adversarial Networks!” Medium, 21 June 2018, https://jonathan-hui.medium.com/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b.

Kishore, Pankaj. “Art of Generative Adversarial Networks (GAN)”. Medium, 17 March 2019, https://towardsdatascience.com/art-of-generative-adversarial-networks-gan-62e96a21bc35.

---. “Generative Adversarial Networks (GAN)- An AI — ‘Cat and Mouse Game’”. Medium, 16 December 2018, https://towardsdatascience.com/art-of-generative-adversarial-networks-gan-62e96a21bc35.

Lazarou, Conor. “Why Do GANs Need So Much Noise?” Medium, 26 February 2020, https://towardsdatascience.com/why-do-gans-need-so-much-noise-1eae6c0fb177.

“Loss Functions”. Google Developers, https://developers.google.com/machine-learning/gan/loss.

Manisha, Padala, et al. Effect of Input Noise Dimension in GANs. 16 April 2020, https://arxiv.org/pdf/2004.06882.pdf.

“ML | Stochastic Gradient Descent (SGD)”. GeeksforGeeks, https://www.geeksforgeeks.org/ml-stochastic-gradient-descent-sgd/.

Nicholson, Chris. “A Beginner’s Guide to Generative Adversarial Networks (GANs)”. pathmind, https://wiki.pathmind.com/generative-adversarial-network-gan#:~:text=Generative%20adversarial%20networks%20(GANs)%20are,video%20generation%20and%20voice%20generation.

Olorunniwo, Taiwo, et al. Stochastic gradient descent. https://optimization.cbe.cornell.edu/index.php?title=Stochastic_gradient_descent.

“Overview of GAN Structure”. Google Developers, https://developers.google.com/machine-learning/gan/gan_structure.

Patrikar, Sushant. “Batch, Mini Batch & Stochastic Gradient Descent”. Medium, 1 October 2019, https://towardsdatascience.com/batch-mini-batch-stochastic-gradient-descent-7a62ecba642a.

Pinetz, Thomas. “Answer to ‘Adjusting GAN hyperparameters’”. Stack Overflow, 24 September 2017, https://stackoverflow.com/a/46390590.

Ponte, Norman. “What Are GANs? Generative Adversarial Networks Explained”. Zumo Labs, 18 May 2021, https://www.zumolabs.ai/post/what-are-gans.

“Reducing Loss: Stochastic Gradient Descent”. Google Developers, https://developers.google.com/machine-learning/crash-course/reducing-loss/stochastic-gradient-descent.

Rocca, Joseph. “Understanding Generative Adversarial Networks (GANs)”. Medium, 8 January 2019, https://towardsdatascience.com/understanding-generative-adversarial-networks-gans-cd6e4651a29.

Sanjeevi, Madhu. “Ch:14 Generative Adversarial Networks (GAN’s) with Math.” Deep Math Machine learning.ai, 14 January 2019, https://medium.com/deep-math-machine-learning-ai/ch-14-general-adversarial-networks-gans-with-math-1318faf46b43.

Sankar, Aadhithya. “Demystified: Wasserstein GAN with Gradient Penalty (WGAN-GP)”. Medium, 2 October 2021, https://towardsdatascience.com/demystified-wasserstein-gan-with-gradient-penalty-ba5e9b905ead.

Saxena, Pawan. “Synthetic Data Generation Using Conditional-GAN”. Medium, 12 August 2021, https://towardsdatascience.com/synthetic-data-generation-using-conditional-gan-45f91542ec6b.

Saxena, Shipra. “4 Impressive GAN Libraries Every Data Scientist Should Know!” Analytics Vidhya, 26 August 2020, https://www.analyticsvidhya.com/blog/2020/08/top-5-gan-libraries-you-must-know/.

Sharma, Aditya. Introduction to Generative Adversarial Networks (GANs). 28 June 2021, https://learnopencv.com/introduction-to-generative-adversarial-networks/.

Sharma, Pulkit. “What are Generative Models and GANs? The Magic of Computer Vision”. Analytics Vidhya, 13 January 2020, https://www.analyticsvidhya.com/blog/2020/01/generative-models-gans-computer-vision/.

Srinivasan, Aishwarya V. “Stochastic Gradient Descent — Clearly Explained !!” Medium, 7 September 2019, https://towardsdatascience.com/stochastic-gradient-descent-clearly-explained-53d239905d31.

“Stochastic Gradient Descent”. Deep AI, https://deepai.org/machine-learning-glossary-and-terms/stochastic-gradient-descent.

“---”. kaggle, https://www.kaggle.com/code/ryanholbrook/stochastic-gradient-descent.

Stojiljković, Mirko. “Stochastic Gradient Descent Algorithm With Python and NumPy”. Real Python, https://realpython.com/gradient-descent-algorithm-python/.

Tae, Jake. The Math Behind GANs. https://jaketae.github.io/study/gan-math/.

“The Discriminator”. Google Developers, https://developers.google.com/machine-learning/gan/discriminator.

“The Generator”. Google Developers, https://developers.google.com/machine-learning/gan/generator.

Unzueta, Diego. “How to Generate Tabular Data Using CTGANs”. Medium, 9 November 2021, https://towardsdatascience.com/how-to-generate-tabular-data-using-ctgans-9386e45836a6.

Vadsola, Mayank. “The math behind GANs (Generative Adversarial Networks)”. Medium, 1 January 2020, https://towardsdatascience.com/the-math-behind-gans-generative-adversarial-networks-3828f3469d9c.

Ye, Andre. “GANs for Everyone”. Medium, 26 April 2020, https://towardsdatascience.com/gans-for-everyone-an-intuitive-explanation-of-the-revolutionary-concept-2f962c858b95.
