Synthetic Data Generation for Tabular Data — GAN

Sanjjushri Varshini R
3 min read · Mar 24, 2024


A Streamlit Approach to Test Data Generation Using GAN (Generative Adversarial Network)

The demand for diverse datasets keeps growing in today’s data-driven world. However, acquiring real-world data for analysis and model training is often difficult because of privacy concerns, data scarcity, or regulatory constraints. Synthetic data generation addresses these hurdles by creating artificial data that mimics the statistical properties of real data while preserving privacy and confidentiality. This article walks through generating synthetic tabular data with CTGAN and serving it from a Streamlit app.

Install the following:

pip install streamlit==1.31.1
pip install pandas==2.2.1
pip install ctgan==0.9.0

In this app, users upload a sample CSV file directly in the Streamlit interface and specify how many rows of synthetic data they want. The app then trains a CTGAN model on the uploaded data and generates that many synthetic rows. Test data generation is the main use case here, but the same approach works wherever a stand-in for a real dataset is needed.

import streamlit as st
import pandas as pd
from ctgan import CTGAN  # Requires the ctgan package installed above

# File uploader to get the dataset
uploaded_file = st.file_uploader("Choose a file")

if uploaded_file is not None:
    # Input field for the user to specify the number of rows for synthetic data generation
    number = st.number_input('Number of rows', min_value=0, step=1000)

    # Read the uploaded dataset into a DataFrame
    df = pd.read_csv(uploaded_file)

    # Identify categorical features in the dataset (every non-numeric column)
    categorical_features = df.select_dtypes(exclude="number").columns.tolist()

    # Initialize CTGAN model with 5 epochs to keep the demo fast
    # (CTGAN's default is 300; adjust the number of epochs based on your requirements)
    ctgan = CTGAN(epochs=5)

    # Fit the CTGAN model to the original dataset
    ctgan.fit(df, categorical_features)

    # Generate synthetic data using CTGAN
    synthetic_df = ctgan.sample(number)

    # Display the synthetic data
    st.write(synthetic_df)
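
One detail in the code worth checking on your own data is which columns `select_dtypes(exclude="number")` actually picks up. The following is a minimal sketch, assuming a small made-up DataFrame; the column names and the `find_categorical` helper are hypothetical and not part of the original app. It shows that string and boolean columns are detected, while an integer-coded category is treated as numeric unless it is listed by hand.

```python
import pandas as pd


def find_categorical(df, extra=()):
    """Return non-numeric columns plus any manually listed ones, in column order.

    `extra` is a hypothetical parameter for integer-coded categories
    that dtype-based detection cannot recognize.
    """
    detected = set(df.select_dtypes(exclude="number").columns)
    wanted = detected | set(extra)
    return [col for col in df.columns if col in wanted]


# Hypothetical sample data, made up for illustration only
df = pd.DataFrame(
    {
        "age": [34, 51, 27],
        "income": [52000.0, 61000.5, 48000.0],
        "city": ["Chennai", "Mumbai", "Delhi"],
        "is_member": [True, False, True],
        "plan_code": [1, 2, 1],  # integer-coded category, looks numeric to pandas
    }
)

print(find_categorical(df))                       # ['city', 'is_member']
print(find_categorical(df, extra=["plan_code"]))  # ['city', 'is_member', 'plan_code']
```

The resulting list can be passed to `ctgan.fit` in place of `categorical_features` when a dataset contains coded categories.
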

Step by step, the app code does the following:

  1. File Uploader: Lets the user upload a CSV file containing the original dataset.
  2. Number Input: Lets the user specify how many rows of synthetic data to generate.
  3. Read CSV: The uploaded CSV file is read into a pandas DataFrame.
  4. Identify Categorical Features: Every non-numeric column is treated as categorical. CTGAN uses this list during training to model those columns as discrete values.
  5. CTGAN Initialization: The CTGAN model is initialized with 5 epochs, which keeps training fast for a demo. CTGAN's default is 300, so increase this value for better-quality output on your dataset.
  6. Model Fitting: The CTGAN model is fitted to the original dataset along with the identified categorical features.
  7. Synthetic Data Generation: The fitted CTGAN model generates the requested number of synthetic rows.
  8. Display Synthetic Data: The synthetic data is shown with Streamlit’s `st.write` function, which renders the DataFrame as an interactive table inside the application. Users can also download the synthetic data as a CSV file from the table's toolbar for further analysis.
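
If an explicit download control is preferred over the table toolbar, one option is Streamlit's `st.download_button`. The sketch below is an assumption about how this could be wired in, not code from the original app; the helper names `to_csv_bytes` and `render_download` and the default file name are hypothetical.

```python
import pandas as pd


def to_csv_bytes(df: pd.DataFrame) -> bytes:
    """Serialize a DataFrame to UTF-8 encoded CSV bytes, without the index column."""
    return df.to_csv(index=False).encode("utf-8")


def render_download(df: pd.DataFrame, file_name: str = "synthetic_data.csv") -> None:
    """Show a download button for df; intended to be called inside the Streamlit app."""
    # Imported lazily so to_csv_bytes can be used and tested without Streamlit
    import streamlit as st

    st.download_button(
        label="Download synthetic data as CSV",
        data=to_csv_bytes(df),
        file_name=file_name,  # hypothetical default name
        mime="text/csv",
    )
```

In the app, calling `render_download(synthetic_df)` right after `st.write(synthetic_df)` would place the button under the table.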

Synthetic data generation offers a practical answer to the challenges of acquiring and using real-world data, particularly for tabular datasets. By combining statistical methods, machine learning techniques, and rule-based approaches, organizations can pursue data-driven work while maintaining privacy, security, and compliance. As synthetic data generation methods continue to mature, they have considerable potential to speed up research, development, and decision-making across many domains.

GitHub link: https://github.com/Sanjjushri/synthetic-data-generator
