Simultaneous generation of structured tabular data and images using GAN

Timur Abdualimov
8 min read · Jan 4, 2023

GANs are well known for their success in creating realistic images. They are less well known for generating tabular data. However, a single GAN can be used to generate tabular data and images at the same time.

Why generate tabular data and an image at the same time?

I have created the coronarography.ai application. Structured data (risk factors for the development of heart disease) and an ECG image are fed to the input of the neural network, and the output is a prediction of pathology in the main arteries of the heart. I became interested in testing the prediction accuracy of the trained neural network on synthetically generated data. Why not? Let’s augment the sample using a GAN and see how accurate the trained network is on synthetic data. To do that, we first need to obtain the synthetic data.

Description of the problem.

We have structured data. For each observed patient, it records the presence of risk factors for cardiovascular disease in binary form. An ECG image is attached to each observation. That is, each patient corresponds to one set of risk factors and one ECG image taken no more than a day before invasive coronary angiography. These are the data on which the main trained neural network makes its predictions.

We need to simultaneously generate structured data (risk factors and targets in the form of damage to the arteries of the heart) and a picture, namely an ECG image. I have not seen examples in the literature where tabular data and a picture are generated at the same time. Well, let’s do it for the first time. Let’s generate 1,500,000 synthetic observations, each consisting of a row of tabular data and an ECG image.

Block diagram of the study.

My image

Obtaining synthetic data using GAN.

A vector of 100 random numbers drawn from a normal distribution was fed to the generator input. The generator output an image of size (200, 200) and a row of structured tabular data of size (1, 35), that is, one row with 35 columns. A shared layer inside the generator feeds both the image branch and the table branch, so that each generated table row stays linked to its image.

The discriminator received a generated image of size (200, 200) together with its generated table row of size (1, 35), as well as real ECG images of size (200, 200) together with their real table rows. At the output, the discriminator produced a binary classification: real data or synthetic data.

Thus, the two neural networks had to outperform each other. The generator tried to produce an image and a table row that the discriminator could not distinguish from real ones. The discriminator, in turn, looked for features characteristic of real images and tables in order to tell the generated ones apart from the real ones.

My image

Implementation on TensorFlow.

Let’s take the GAN structure described in my article about ECG generation as a basis.

Used Libraries

import glob
import os
import pickle
import time

import imageio
import joblib
import matplotlib
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import PIL
import seaborn as sns
import tensorflow as tf
from IPython import display
from matplotlib.pyplot import figure
from PIL import Image
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras import layers
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
from tqdm import tqdm

print(tf.__version__)
# Jupyter magic: works only inside a notebook
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (8,6)

Loading and preparing a dataset

Here we load the ECG images from a folder into an array, convert each one to a single-channel (grayscale) image, and normalize it. We also apply a small correction to the tabular data and normalize it.

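# table data
data = pd.read_csv('../AI_coronarography/DATA_WORK/DATA_WORK/DP_cor.csv', sep=';')
data.drop(['FIO', 'number_of_affected_coronary_artery'], axis=1, inplace=True)
# binarize the stenosis columns: 1 if the value is 50 or more, else 0
for i in [
    'trunk_st',
    'LAD_st',
    'lcx_stenosis',
    'RCA_stenosis'
]:
    data[i] = data[i].apply(lambda x: 1 if x >= 50 else 0)

# scale every column to the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(data)
data[data.columns] = scaler.transform(data[data.columns])

# image ('ЭКГ' is the name of the ECG folder)
# note: os.listdir returns files in arbitrary order, so pairing with the
# table rows relies on the file order matching the row order
data_image = []
for k in os.listdir('../AI_coronarography/DATA_WORK/DATA_WORK/ЭКГ'):
    if k.endswith('.jpg'):
        img = Image.open('../AI_coronarography/DATA_WORK/DATA_WORK/ЭКГ/'+k)
        img = img.convert('L')          # single channel (grayscale)
        img = img.resize((200, 200))
        # scale pixels from [0, 255] to [-1, 1] to match the generator's tanh output
        data_image += [(np.array(img) - 127.5) / 127.5]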

Example of a loaded ECG image

My image

Let’s set the batch size, combine the images and structured data into one dataset for the GAN, and shuffle the combined ECG images and table rows.

train_images = np.array(data_image).reshape(np.array(data_image).shape[0], 200, 200, 1).astype('float32')
train_data = np.array(data).reshape(np.array(data).shape[0], 35).astype('float32')
BUFFER_SIZE = 100
BATCH_SIZE = 10
train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_data)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

Let’s create a generator

def make_generator_model():

    input_1 = Input(shape=(100, ), name = "InputRandomNoise")

    # shared layer: both the image branch and the table branch start from conc
    x = Dense(25*25*256, use_bias=False)(input_1)
    x = BatchNormalization()(x)
    conc = LeakyReLU()(x)

    # image branch: upsample 25 -> 25 -> 50 -> 100 -> 200
    x = Reshape((25, 25, 256))(conc)
    x = Conv2DTranspose(256, (5, 5), strides=(1, 1), padding='same', use_bias=False)(x)
    x = BatchNormalization()(x)
    x = LeakyReLU()(x)
    x = Conv2DTranspose(128, (5, 5), strides=(2, 2), padding='same', use_bias=False)(x)
    x = BatchNormalization()(x)
    x = LeakyReLU()(x)
    x = Conv2DTranspose(64, (5, 5), strides=(2, 2), padding='same', use_bias=False)(x)
    x = BatchNormalization()(x)
    x = LeakyReLU()(x)
    x = Conv2DTranspose(1, (5, 5), strides=(2, 2), padding='same', use_bias=False, activation='tanh', name = "OutputImage")(x)

    # table branch: dense layers down to 35 values in [0, 1]
    y = Dense(400)(conc)
    y = BatchNormalization()(y)
    y = LeakyReLU()(y)
    y = Dense(200)(y)
    y = BatchNormalization()(y)
    y = LeakyReLU()(y)
    y = Dense(128)(y)
    y = BatchNormalization()(y)
    y = LeakyReLU()(y)
    y = Dense(64)(y)
    y = BatchNormalization()(y)
    y = LeakyReLU()(y)
    y = Dense(35, activation='sigmoid', name = "OutputTableData")(y)

    model = Model(inputs=input_1, outputs=[x, y])

    return model

generator = make_generator_model()

# quick check: one noise vector gives one image and one table row
noise = tf.random.normal([1, 100])
generated_image = generator(noise, training=False)

Generator structure

My image
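
To check that the image branch really ends at 200 × 200, the spatial size can be traced layer by layer. The sketch below assumes the usual Keras behavior for Conv2DTranspose with padding='same', where the output size is the input size multiplied by the stride. The helper name is hypothetical.

```python
def transposed_conv_same_size(size, stride):
    """Spatial size after Conv2DTranspose with padding='same' (size * stride)."""
    return size * stride

# Strides of the four Conv2DTranspose layers in the image branch, in order.
strides = [1, 2, 2, 2]

size = 25  # side length after Reshape((25, 25, 256))
trace = [size]
for s in strides:
    size = transposed_conv_same_size(size, s)
    trace.append(size)

print(trace)  # [25, 25, 50, 100, 200]
```

Starting from 25 and doubling three times is what makes the output match the 200 × 200 training images.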

Let’s create a discriminator

def make_discriminator_model():
    input_1 = Input(shape=(200, 200, 1), name = "InputImage")
    input_2 = Input(shape=(35,), name = "InputTableData")

    # image branch
    x = Conv2D(64, (5, 5), strides=(2, 2), padding='same')(input_1)
    x = LeakyReLU()(x)
    x = Dropout(0.3)(x)
    x = Conv2D(128, (5, 5), strides=(2, 2), padding='same')(x)
    x = LeakyReLU()(x)
    x = Dropout(0.3)(x)
    x = Conv2D(256, (5, 5), strides=(2, 2), padding='same')(x)
    x = LeakyReLU()(x)
    x = Dropout(0.3)(x)
    x = Conv2D(256, (5, 5), strides=(1, 1), padding='same')(x)
    x = LeakyReLU()(x)
    x = Dropout(0.3)(x)
    x = Flatten()(x)

    # table branch
    y = Dense(400)(input_2)
    y = LeakyReLU()(y)
    y = Dropout(0.3)(y)
    y = Dense(200)(y)
    y = LeakyReLU()(y)
    y = Dropout(0.3)(y)
    y = Dense(128)(y)
    y = LeakyReLU()(y)
    y = Dropout(0.3)(y)
    y = Dense(64)(y)
    y = LeakyReLU()(y)
    y = Dropout(0.3)(y)

    # merge both branches and output a single real/fake logit
    z = concatenate([x, y])
    z = Dense(25)(z)
    z = Dropout(0.3)(z)
    z = Dense(1)(z)

    model = Model(inputs=[input_1, input_2], outputs=z)

    return model

# the checkpoint and the training step below use this instance
discriminator = make_discriminator_model()

Structure of the discriminator

My image

Define loss functions and optimizers for both models, create checkpoints

cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_output, fake_output):
    real_loss = cross_entropy(tf.ones_like(real_output), real_output)
    fake_loss = cross_entropy(tf.zeros_like(fake_output), fake_output)
    total_loss = real_loss + fake_loss
    return total_loss

def generator_loss(fake_output):
    return cross_entropy(tf.ones_like(fake_output), fake_output)

generator_optimizer = tf.keras.optimizers.Adam(1e-4)
discriminator_optimizer = tf.keras.optimizers.Adam(1e-4)

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(generator_optimizer=generator_optimizer,
                                 discriminator_optimizer=discriminator_optimizer,
                                 generator=generator,
                                 discriminator=discriminator)
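
As a sanity check on these losses, the sketch below recomputes binary cross-entropy on logits with plain NumPy rather than TensorFlow. The function names are hypothetical stand-ins that mirror the ones above. For a discriminator that cannot tell real from fake (logit 0 everywhere), each cross-entropy term equals ln 2 ≈ 0.693.

```python
import numpy as np

def bce_from_logits(labels, logits):
    """Mean binary cross-entropy computed from raw logits (numerically stable)."""
    labels = np.asarray(labels, dtype=np.float64)
    logits = np.asarray(logits, dtype=np.float64)
    # Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))
    per_example = (np.maximum(logits, 0) - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))
    return per_example.mean()

def discriminator_loss_np(real_logits, fake_logits):
    real_loss = bce_from_logits(np.ones_like(real_logits), real_logits)
    fake_loss = bce_from_logits(np.zeros_like(fake_logits), fake_logits)
    return real_loss + fake_loss

def generator_loss_np(fake_logits):
    return bce_from_logits(np.ones_like(fake_logits), fake_logits)

# An undecided discriminator: logit 0 for every real and fake sample.
undecided = np.zeros((10, 1))
d_loss = discriminator_loss_np(undecided, undecided)  # 2 * ln 2
g_loss = generator_loss_np(undecided)                 # ln 2

# A confident, correct discriminator: low loss for it, high loss for the generator.
d_loss_confident = discriminator_loss_np(np.full((10, 1), 10.0), np.full((10, 1), -10.0))
g_loss_confident = generator_loss_np(np.full((10, 1), -10.0))
```

Watching how far the two losses drift from these reference values during training is one rough way to see whether one network is overpowering the other.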

Defining the training loop

The training loop begins with the generator receiving a random noise vector as input. This vector is used to generate the ECG image and the tabular data. The discriminator is then used to classify real images and tabular data (taken from the training set) and fake images and tabular data (produced by the generator). The loss is calculated for each of the two models, and the gradients are used to update the generator and the discriminator.

EPOCHS = 5000
noise_dim = 100
num_examples_to_generate = 16

# fixed noise, reused every epoch so that progress is easy to compare visually
seed = tf.random.normal([num_examples_to_generate, noise_dim])

@tf.function
def train_step(images, tables):
    noise = tf.random.normal([BATCH_SIZE, noise_dim])

    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        # the generator returns [image batch, table batch]
        generated_images, generated_tables = generator(noise, training=True)

        real_output = discriminator([images, tables], training=True)
        fake_output = discriminator([generated_images, generated_tables], training=True)

        gen_loss = generator_loss(fake_output)
        disc_loss = discriminator_loss(real_output, fake_output)

    gradients_of_generator = gen_tape.gradient(gen_loss, generator.trainable_variables)
    gradients_of_discriminator = disc_tape.gradient(disc_loss, discriminator.trainable_variables)

    generator_optimizer.apply_gradients(zip(gradients_of_generator, generator.trainable_variables))
    discriminator_optimizer.apply_gradients(zip(gradients_of_discriminator, discriminator.trainable_variables))

def train(dataset, epochs):
    for epoch in range(epochs):
        start = time.time()

        # each batch is a pair: (ECG images, table rows)
        for image_batch, table_batch in dataset:
            train_step(image_batch, table_batch)

        display.clear_output(wait=True)
        generate_and_save_images(generator,
                                 epoch + 1,
                                 seed)

        # save a checkpoint every 500 epochs
        if (epoch + 1) % 500 == 0:
            checkpoint.save(file_prefix = checkpoint_prefix)

        print ('Time for epoch {} is {} sec'.format(epoch + 1, time.time()-start))

    # Generate after the final epoch
    display.clear_output(wait=True)
    generate_and_save_images(generator,
                             epochs,
                             seed)

def generate_and_save_images(model, epoch, test_input):

    predictions = model(test_input, training=False)

    fig = plt.figure(figsize=(7, 7))

    # predictions[0] holds the images, predictions[1] the table rows
    for i in range(predictions[0].shape[0]):
        plt.subplot(4, 4, i+1)
        plt.imshow(predictions[0][i, :, :, 0] * 127.5 + 127.5, cmap='gray')
        plt.axis('off')

    # include the epoch number so each file is not overwritten
    plt.savefig('image_at_epoch_{:04d}.png'.format(epoch))
    plt.show()

Model Training

Call the train() function defined above to train the generator and discriminator at the same time. It is important that neither network overpowers the other (for example, they should learn at a similar rate).

train(train_dataset, EPOCHS)

We can monitor learning visually from ECG images.

At the beginning of training, the generated images look like random noise. As the neural networks learn, the generated ECG images will look more and more real.

Let’s generate 1,500,000 synthetic observations and save them.

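dataset_pred = []

# make sure the output folder exists before saving images
os.makedirs("./new_GANECG", exist_ok=True)

for i in tqdm(range(0, 1500000)):
    noise = tf.random.normal([1, 100])
    generated_image = generator(noise, training=False)

    # output 0 is the image: map from [-1, 1] back to [0, 255] and save it
    Image.fromarray(np.array(generated_image[0] * 127.5 + 127.5, dtype='uint8').reshape(200, 200)).save(f"./new_GANECG/{i}.jpg")

    # output 1 is the matching table row
    dataset_pred.append(generated_image[1].numpy())

with open('dataset_pred.pickle', 'wb') as f:
    pickle.dump(dataset_pred, f)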
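
The generator's table output uses a sigmoid activation, so every value lies between 0 and 1 on the scaled range. Before comparing with real patients, the rows presumably need to be mapped back to the original units, and binary risk factors rounded to 0 or 1. Here is a minimal sketch of that step in plain NumPy, using the same arithmetic as min-max inversion for feature_range=(0, 1). The column layout, value ranges, and the 0.5 threshold are assumptions for illustration, not taken from the original pipeline.

```python
import numpy as np

def inverse_min_max(scaled, data_min, data_max):
    """Undo min-max scaling to (0, 1): x = x_scaled * (max - min) + min."""
    return scaled * (data_max - data_min) + data_min

# Hypothetical layout: column 0 is age in years, columns 1-2 are binary risk factors.
data_min = np.array([30.0, 0.0, 0.0])
data_max = np.array([90.0, 1.0, 1.0])
binary_columns = [1, 2]

# Two made-up generator outputs on the scaled [0, 1] range.
generated = np.array([[0.50, 0.93, 0.08],
                      [0.25, 0.41, 0.77]])

restored = inverse_min_max(generated, data_min, data_max)
# Round binary columns to hard 0/1 labels (the 0.5 threshold is an assumption).
restored[:, binary_columns] = (restored[:, binary_columns] >= 0.5).astype(float)

print(restored)
```

With the list saved above, np.concatenate(dataset_pred) would give the (N, 35) array to pass in, and the fitted scaler's data_min_ and data_max_ attributes hold the real per-column ranges.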

Let’s compare the ECG images. Visually, the generated ECG images are practically indistinguishable from real ones.

My image

Tabular data is a harder problem: how close is it to the real data?

The distribution of patients by age was analyzed. The real data follow a normal distribution, while the synthetic data show three peaks, around the median, minimum, and maximum values.

My image. Violin plot of the age distribution of real and generated data.

The counts of each generated feature were also analyzed. The distribution is close to the real one; the differences that do appear are in the relative proportions of the features.

My image. Quantitative distribution of real and generated data.

A heat map was created to compare basic descriptive statistics (median, mean, 25th percentile, 75th percentile, minimum, and maximum values). Significant differences were found in half of the features.

My image. Heat map of the difference in base descriptive statistics.
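
A comparison like this can be sketched with pandas. The data below is randomly generated stand-in data, and the column names and the absolute-difference measure are assumptions, since the text does not say how the differences were scored.

```python
import numpy as np
import pandas as pd

def descriptive_stats(df):
    """Median, mean, quartiles, min and max for every column (rows = statistics)."""
    return pd.DataFrame({
        "median": df.median(),
        "mean": df.mean(),
        "q25": df.quantile(0.25),
        "q75": df.quantile(0.75),
        "min": df.min(),
        "max": df.max(),
    }).T

rng = np.random.default_rng(0)
# Stand-in data with hypothetical columns, not the real patient table.
real = pd.DataFrame({"age": rng.normal(60, 10, 500),
                     "smoking": rng.integers(0, 2, 500)})
synthetic = pd.DataFrame({"age": rng.normal(62, 14, 500),
                          "smoking": rng.integers(0, 2, 500)})

# Absolute difference of each statistic, feature by feature.
stat_diff = (descriptive_stats(real) - descriptive_stats(synthetic)).abs()
print(stat_diff.round(2))
```

The resulting table can be passed to sns.heatmap(stat_diff, annot=True) to draw a heat map of the differences.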

A heat map of the difference between the correlation matrices of the real and synthetic datasets was created. The main correlations are preserved.

My image. Heat map of the difference in the correlation matrices of the datasets.
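
The difference between two correlation matrices can be computed in a few lines. This is a rough sketch on made-up data with NumPy; the three stand-in features and the make_table helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_table(n, noise):
    """Stand-in table with 3 features; feature 1 depends on feature 0."""
    a = rng.normal(size=n)
    b = 0.8 * a + noise * rng.normal(size=n)
    c = rng.normal(size=n)
    return np.column_stack([a, b, c])

real = make_table(1000, noise=0.3)
synthetic = make_table(1000, noise=0.5)

# np.corrcoef treats rows as variables by default, so pass rowvar=False
# to correlate the columns (features) instead.
corr_real = np.corrcoef(real, rowvar=False)
corr_synth = np.corrcoef(synthetic, rowvar=False)
corr_diff = corr_real - corr_synth

print(np.round(corr_diff, 2))
```

Cells close to zero mean the synthetic data reproduces that pairwise correlation. Note that a constant column would produce NaN entries, which may matter for rare binary features.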

The principal components (PCA) of the real and generated datasets were calculated and visualized.

My image. Principal components of the real and generated datasets.
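
One way to produce such a plot is to fit the principal axes on the real rows and project both datasets into the same 2-D space. The sketch below uses a plain NumPy SVD instead of a library PCA class. Fitting on the real data only is an assumption, as is the random stand-in data.

```python
import numpy as np

def fit_pca(data, n_components=2):
    """Return the column means and the top principal axes via SVD."""
    mean = data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(data, mean, axes):
    """Project rows onto the principal axes."""
    return (data - mean) @ axes.T

rng = np.random.default_rng(2)
real = rng.normal(size=(300, 35))        # stand-in for real table rows
synthetic = rng.normal(size=(300, 35))   # stand-in for generated table rows

# Fit on the real data only, then project both sets into the same space.
mean, axes = fit_pca(real)
real_2d = project(real, mean, axes)
synthetic_2d = project(synthetic, mean, axes)
```

Scattering real_2d and synthetic_2d in two colors then shows whether the generated rows occupy the same region as the real ones.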

A t-distributed Stochastic Neighbor Embedding (t-SNE) of both datasets was also visualized.

My image. t-SNE of real and generated datasets.

Comparing the synthetic data with the real data, we can conclude that the generated data are close to the real ones. The main dependencies between features are preserved. However, the generated dataset does not completely copy the dependencies of the real one, so we can conclude that it contains new, distinct “random” observations.

Prediction accuracy of coronarography.ai on the synthetic input data.

This is what all the fuss was about. So, what is the accuracy?

Damage to the main coronary arteries and transient myocardial ischemia were predicted for 1,500,000 synthetic observations.

The AUC was 0.79. Accuracy reached 88%, precision 73%, recall 63%, and the F1 score 67%.

+-------------------------------------------------------------------------+------+-------------+--------------+-----------+-------------+
| Predicting damage to the main coronary arteries and myocardial ischemia | AUC  | Accuracy, % | Precision, % | Recall, % | F1 score, % |
+-------------------------------------------------------------------------+------+-------------+--------------+-----------+-------------+
| Non-invasive predictive AI coronary angiography                         | 0.79 | 88          | 73           | 63        | 67          |
+-------------------------------------------------------------------------+------+-------------+--------------+-----------+-------------+
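
For reference, accuracy, precision, recall, and F1 all follow from the four confusion-matrix counts. The sketch below shows the arithmetic on made-up counts. They are for illustration only and are not the study's results, and the text does not say how the metrics were aggregated across the predicted targets.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts: 8 true positives, 2 false positives,
# 4 false negatives, 6 true negatives.
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=6)
print(metrics)
```

Because accuracy counts true negatives and F1 does not, accuracy can sit well above F1 when negative cases dominate.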

Mission complete!
