A Step by Step Guide to Generate Tabular Synthetic Dataset with GANs

fzhurd
Published in Analytics Vidhya · Feb 15, 2021 · 10 min read

Goal

In this article, we will walk through how to generate tabular synthetic data with GANs. The generated data are expected to be similar enough to the real data to be useful for model training and testing.

Introduction

In machine learning work, we frequently meet situations where there is not enough data to train models and we need more artificial data. GANs (Generative Adversarial Networks) are a deep learning architecture introduced by Ian Goodfellow et al. in 2014 (1). GANs can generate synthetic data from scratch and comprise two components: a generator and a discriminator. The generator produces fake data from random input noise; the discriminator classifies samples as real or fake (produced by the generator). The performance of the discriminator is then used to update and optimize both the generator and the discriminator. Currently GANs are most often applied to image data, and there are not many articles on tabular data. One reason is that the quality of non-image synthetic data is difficult to evaluate. In this post, we will try to generate tabular synthetic data from scratch.
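
For reference, the original paper (1) frames this adversarial training as a minimax game, in which the discriminator D tries to maximize and the generator G tries to minimize the same objective (this formula is from the paper, not from the code below):

\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]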

Dataset

The diabetes dataset is from Kaggle public datasets: https://www.kaggle.com/uciml/pima-indians-diabetes-database

Base Accuracy for Real Dataset

In this section we will use the real data to train a Random Forest model and measure its accuracy. The accuracy of the model trained on the real data serves as the baseline to compare against models trained on the generated fake data. The complete code is available in the git repo: https://github.com/fzhurd/fzwork/tree/master/medium/ganspost

We first import all the required Python modules, read the CSV file into a pandas DataFrame and explore the dataset briefly.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from numpy.random import randn
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
data = pd.read_csv('/content/diabetes.csv')
print (data.shape)
print (data.tail())
print (data.columns)

The diabetes dataset includes 9 columns: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age and Outcome. The Outcome column will be the label.

(768, 9)
     Pregnancies  Glucose  ...  Age  Outcome
763           10      101  ...   63        0
764            2      122  ...   27        0
765            5      121  ...   30        0
766            1      126  ...   47        1
767            1       93  ...   23        0
[5 rows x 9 columns]
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
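
Before building the model, it can also help to glance at the summary statistics and the class balance. This quick check is an optional addition (not part of the original notebook) and uses only standard pandas calls:

# optional sanity checks: summary statistics and label distribution
print(data.describe())
print(data['Outcome'].value_counts())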

We will use all columns except the Outcome column as features to train the model. The Outcome column will be used as the label.

features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
label = ['Outcome']

X = data[features]
y = data[label]

The real dataset is split into train and test sets. A random forest classifier is trained and its accuracy is evaluated.

X_true_train, X_true_test, y_true_train, y_true_test = train_test_split(X, y, test_size=0.30, random_state=42)

clf_true = RandomForestClassifier(n_estimators=100)
clf_true.fit(X_true_train, y_true_train)
y_true_pred = clf_true.predict(X_true_test)

print("Base Accuracy:", metrics.accuracy_score(y_true_test, y_true_pred))
print("Base classification report:", metrics.classification_report(y_true_test, y_true_pred))

The accuracy of the base model on the real data is around 0.76, and the precision for class 0 is around 0.82. This base accuracy will be compared with the accuracy of the model trained on the generated fake data in the following steps.

Base Accuracy: 0.7575757575757576
Base classification report:
              precision    recall  f1-score   support

           0       0.82      0.81      0.81       151
           1       0.65      0.66      0.65        80

    accuracy                           0.76       231
   macro avg       0.73      0.74      0.73       231
weighted avg       0.76      0.76      0.76       231

Generate Synthetic Data

From this section on we will start to generate fake data using GANs. As a first step, we define a generate_latent_points function; it creates random noise in the latent space and reshapes it to match the input dimensions of the generator model.

def generate_latent_points(latent_dim, n_samples):
    x_input = randn(latent_dim * n_samples)
    x_input = x_input.reshape(n_samples, latent_dim)
    return x_input
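
As a quick sanity check (an addition for illustration, not in the original code), we can confirm the shape of the latent points:

# 5 latent vectors, each of dimension 10
print(generate_latent_points(10, 5).shape)   # (5, 10)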

We define the generate_fake_samples function to produce fake data. The input of the generator is the created latent points (random noise). The generator runs a prediction on this noise and outputs a NumPy array. Because these are fake samples, their class label is 0.

# use the generator to generate n fake examples, with class labels
def generate_fake_samples(generator, latent_dim, n_samples):
    x_input = generate_latent_points(latent_dim, n_samples)
    X = generator.predict(x_input)
    y = np.zeros((n_samples, 1))
    return X, y

We will define another function to generate real samples; it randomly selects samples from the real dataset. The label for the real samples is 1.

# generate n real samples with class labels; we randomly select n samples from the real data
def generate_real_samples(n):
    X = data.sample(n)
    y = np.ones((n, 1))
    return X, y
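
Note that the sampled real rows keep all 9 columns, including Outcome, which is why the generator defined below outputs 9 values per sample. A quick shape check (added here for illustration) makes this explicit:

# the real samples are (n, 9) DataFrames; the labels are an (n, 1) array of ones
X_real, y_real = generate_real_samples(5)
print(X_real.shape, y_real.shape)   # (5, 9) (5, 1)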

We will create a simple Keras sequential model as the generator. Its input dimension is the size of the latent points, and the kernel is initialized with 'he_uniform'. The model has 3 dense layers: the first two are activated by the 'relu' function, while the output layer is activated by the 'linear' function and its dimension equals the number of columns in the dataset (9).

def define_generator(latent_dim, n_outputs=9):
    model = Sequential()
    model.add(Dense(15, activation='relu', kernel_initializer='he_uniform', input_dim=latent_dim))
    model.add(Dense(30, activation='relu'))
    model.add(Dense(n_outputs, activation='linear'))
    return model

We can check the architecture of the generator model by passing in some example parameter values.

generator1 = define_generator(10, 9)
generator1.summary()

The summary of the generator model is as follows:

Model: "sequential" _________________________________________________________________ Layer (type)                 Output Shape              Param #    ================================================================= dense (Dense)                (None, 15)                165        _________________________________________________________________ dense_1 (Dense)              (None, 30)                480        _________________________________________________________________ dense_2 (Dense)              (None, 9)                 279        ================================================================= Total params: 924 Trainable params: 924 Non-trainable params: 0 _________________________________________________________________

After defining the generator, we will define the discriminator. The discriminator is also a simple sequential model with 3 dense layers. The first two layers are activated by the 'relu' function; the output layer is activated by the 'sigmoid' function because it classifies input samples as real (1) or fake (0).

def define_discriminator(n_inputs=9):
    model = Sequential()
    model.add(Dense(25, activation='relu', kernel_initializer='he_uniform', input_dim=n_inputs))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

We can check the architecture of the discriminator model in the same way.

discriminator1 = define_discriminator(9)
discriminator1.summary()

The summary of the discriminator is as follows:

Model: "sequential_7" _________________________________________________________________ Layer (type)                 Output Shape              Param #    ================================================================= dense_15 (Dense)             (None, 25)                250        _________________________________________________________________ dense_16 (Dense)             (None, 50)                1300       _________________________________________________________________ dense_17 (Dense)             (None, 1)                 51         ================================================================= Total params: 1,601 Trainable params: 1,601 Non-trainable params: 0

We will define the GAN model after we have defined the generator and discriminator models. It is also a sequential model that stacks the generator and the discriminator. NOTE: the discriminator weights must be set to not trainable inside the combined model, so that only the generator is updated when the GAN is trained.

# define the combined generator and discriminator model, for updating the generator
def define_gan(generator, discriminator):
    # make weights in the discriminator not trainable
    discriminator.trainable = False
    model = Sequential()
    # add the generator
    model.add(generator)
    # add the discriminator
    model.add(discriminator)
    # compile model
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return model

We will write a plot_history function to visualize the discriminator and generator losses over the training run.

# create a line plot of loss for the gan
def plot_history(d_hist, g_hist):
    # plot loss
    plt.subplot(1, 1, 1)
    plt.plot(d_hist, label='d')
    plt.plot(g_hist, label='gen')
    plt.legend()
    plt.show()
    plt.close()

Finally we will train the generator and discriminator. In each epoch, the discriminator is updated on half a batch of real data and half a batch of fake data, and the average discriminator loss is recorded. The combined GAN model is then updated via train_on_batch with inverted (real) labels so that the generator learns to fool the discriminator. The trained generator is saved for later use.

# train the generator and discriminator
def train(g_model, d_model, gan_model, latent_dim, n_epochs=10000, n_batch=128, n_eval=200):
    # determine half the size of one batch, for updating the discriminator
    half_batch = int(n_batch / 2)
    d_history = []
    g_history = []
    # manually enumerate epochs
    for epoch in range(n_epochs):
        # prepare real samples
        x_real, y_real = generate_real_samples(half_batch)
        # prepare fake examples
        x_fake, y_fake = generate_fake_samples(g_model, latent_dim, half_batch)
        # update discriminator
        d_loss_real, d_real_acc = d_model.train_on_batch(x_real, y_real)
        d_loss_fake, d_fake_acc = d_model.train_on_batch(x_fake, y_fake)
        d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)
        # prepare points in latent space as input for the generator
        x_gan = generate_latent_points(latent_dim, n_batch)
        # create inverted labels for the fake samples
        y_gan = np.ones((n_batch, 1))
        # update the generator via the discriminator's error
        g_loss_fake = gan_model.train_on_batch(x_gan, y_gan)
        print('>%d, d1=%.3f, d2=%.3f d=%.3f g=%.3f' % (epoch+1, d_loss_real, d_loss_fake, d_loss, g_loss_fake))
        d_history.append(d_loss)
        g_history.append(g_loss_fake)
    plot_history(d_history, g_history)
    g_model.save('trained_generated_model.h5')

We set the latent_dim value to 10 and start the training.

# size of the latent space
latent_dim = 10
# create the discriminator
discriminator = define_discriminator()
# create the generator
generator = define_generator(latent_dim)
# create the gan
gan_model = define_gan(generator, discriminator)
# train model
train(generator, discriminator, gan_model, latent_dim)

The training procedure will take a few minutes depending on your computer; its output is as follows:

.........
>9991, d1=0.858, d2=0.674 d=0.766 g=0.904
>9992, d1=1.023, d2=0.833 d=0.928 g=0.816
>9993, d1=0.737, d2=0.863 d=0.800 g=0.910
>9994, d1=0.780, d2=0.890 d=0.835 g=0.846
>9995, d1=0.837, d2=0.773 d=0.805 g=0.960
>9996, d1=0.762, d2=0.683 d=0.723 g=1.193
>9997, d1=0.906, d2=0.515 d=0.710 g=1.275
>9998, d1=0.814, d2=0.412 d=0.613 g=1.228
>9999, d1=0.701, d2=0.668 d=0.685 g=1.105
>10000, d1=0.461, d2=0.814 d=0.638 g=1.097

The generator and discriminator losses over training are plotted as follows (blue: discriminator loss; orange: generator loss).

Evaluate the Quality of Generated Fake Data With Model

We have trained the generator successfully in the steps above. In this section, we will produce fake data with the trained model and test its quality. First, we load the trained generator model.

from keras.models import load_model
model = load_model('/content/trained_generated_model.h5')
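
Since the standalone generator was never compiled with its own loss, Keras may warn that no training configuration was found when loading it. If you only need predictions, the optional compile=False flag of load_model skips restoring the training state (this variant is an optional alternative, not part of the original post):

# optional: load the generator for inference only, skipping the training configuration
model = load_model('/content/trained_generated_model.h5', compile=False)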

We will create fake data with the trained generator model: 750 rows of latent points are generated and passed through the generator, and the result is converted into a pandas DataFrame.

latent_points = generate_latent_points(10, 750)
X = model.predict(latent_points)
data_fake = pd.DataFrame(data=X, columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'])
data_fake.head()

The first 5 rows of the fake data are as follows:

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome 
3.042421 84.372429 41.264584 15.499371 75.576080 16.862654 0.643298 30.715979 0.131986
2.379814 65.569473 34.632591 9.681239 153.032700 14.792008 0.301202 11.963096 -0.200955
-0.212970 104.455383 40.059303 9.538709 0.783831 20.410034 0.439094 13.447835 0.229936
12.437524 257.148895 125.773453 2.465484 1.408619 50.760799 0.756833 113.432060 0.949813
3.571342 34.856190 30.242983 17.523539 1.804614 18.132822 0.289309 23.509460 -0.023842

The Outcome column in the real data is 0 or 1. Therefore, we need to map the generated Outcome values to 0 or 1; here we threshold them at their mean.

outcome_mean = data_fake.Outcome.mean()
data_fake['Outcome'] = data_fake['Outcome'] > outcome_mean
data_fake["Outcome"] = data_fake["Outcome"].astype(int)

We apply the same feature and label split to the fake data; the label is again the Outcome column.

features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
label = ['Outcome']
X_fake_created = data_fake[features]
y_fake_created = data_fake[label]

We will train a random forest classifier on the fake data and measure its accuracy, which we then compare with the base model accuracy.

X_fake_train, X_fake_test, y_fake_train, y_fake_test = train_test_split(X_fake_created, y_fake_created, test_size=0.30, random_state=42)

clf_fake = RandomForestClassifier(n_estimators=100)
clf_fake.fit(X_fake_train, y_fake_train)
y_fake_pred = clf_fake.predict(X_fake_test)

print("Accuracy of fake data model:", metrics.accuracy_score(y_fake_test, y_fake_pred))
print("Classification report of fake data model:", metrics.classification_report(y_fake_test, y_fake_pred))

The outputs of the model are as follows:

Accuracy of fake data model: 0.88
Classification report of fake data model:
              precision    recall  f1-score   support

           0       0.86      0.94      0.90       127
           1       0.92      0.80      0.85        98

    accuracy                           0.88       225
   macro avg       0.89      0.87      0.88       225
weighted avg       0.88      0.88      0.88       225

The accuracy of the model trained on the generated fake data is around 0.88, compared with around 0.76 for the model trained on the real data. This gap suggests the fake data is still somewhat skewed relative to the real data.

Evaluate the Quality of Generated Fake Data With Table_evaluator

table_evaluator is a library for evaluating how similar a synthesized dataset is to a real dataset, which makes it well suited to our generated synthetic data. First we install the table_evaluator module.

!pip install table_evaluator

After the installation, we use table_evaluator to compare the fake dataset with the real dataset, using the Outcome column as the target.

from table_evaluator import load_data, TableEvaluator

table_evaluator = TableEvaluator(data, data_fake)
table_evaluator.evaluate(target_col='Outcome')

The output of the similarity evaluation is as follows. We can see that the generated synthetic data is broadly similar to the real data: the mean correlation between fake and real columns is about 0.9337 and the overall similarity score is around 0.6011.

Correlation metric: pearsonr

Classifier F1-scores and their Jaccard similarities:
f1_real f1_fake jaccard_similarity
index
LogisticRegression_real_testset 0.7467 0.6333 0.5075
LogisticRegression_fake_testset 0.4867 0.9267 0.3514
RandomForestClassifier_real_testset 0.7267 0.6133 0.4634
RandomForestClassifier_fake_testset 0.4467 0.9200 0.2658
DecisionTreeClassifier_real_testset 0.7200 0.6333 0.4634
DecisionTreeClassifier_fake_testset 0.4600 0.8733 0.3043
MLPClassifier_real_testset 0.6800 0.5600 0.3393
MLPClassifier_fake_testset 0.3800 0.9133 0.2000

Miscellaneous results:
Result
Column Correlation Distance RMSE 0.4230
Column Correlation distance MAE 0.3552
Duplicate rows between sets (real/fake) (0, 0)
nearest neighbor mean 1.5898
nearest neighbor std 0.7154

Results:
Result
Basic statistics 0.9364
Correlation column correlations 0.1430
Mean Correlation between fake and real columns 0.9337
1 - MAPE Estimator results 0.3912
Similarity Score 0.6011
{'1 - MAPE Estimator results': 0.3911924307763016,
'Basic statistics': 0.9364221364221365,
'Correlation column correlations': 0.1430372959033057,
'Mean Correlation between fake and real columns': 0.9336602090844196,
'Similarity Score': 0.6010780180465408}

With the table_evaluator tool, we can also explore the real and synthetic data visually:

table_evaluator.visual_evaluation()

Summary

From the model accuracy evaluation and the table_evaluator evaluation, we can conclude that some of the generated features match the real data closely, while others still need improvement. Further work on the model training and data normalization should yield better results.

Reference

  1. https://arxiv.org/abs/1406.2661
  2. https://machinelearningmastery.com/how-to-develop-a-generative-adversarial-network-for-a-1-dimensional-function-from-scratch-in-keras/
