Generating and Populating Tables with Synthetic Data at Adjustable Rates

Prasoon Parashar
3 min read · Jun 17, 2023


This Python-based data generator was built to produce dummy data for an e-commerce application. It is aimed at anyone who wants to continuously populate their tables at a configurable rate.

Photo by Chris Ried on Unsplash

Let’s Get Started

The E-Cart Data Generator has been Dockerized and can be run seamlessly with Docker Compose. Before you start, ensure you have Docker, Docker Compose, and a database client such as pgAdmin or DBeaver installed on your machine.
To dive in, clone the repository, navigate to the directory shown below, and run Docker Compose to initialize the PostgreSQL database.

git clone https://github.com/Noosarpparashar/startupv2.git
cd startupv2/python/dataGenerator/ecart
docker-compose up -d

You can connect to this database using tools like pgAdmin or DBeaver using the following details:

Host: localhost
Port: 5433
Database: postgres
Password: 9473
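
If you'd rather verify the connection programmatically than through a GUI client, a short script along the lines below works. This is only a sketch: the psycopg2 driver and the postgres username are my assumptions, not something the article or repository prescribes.

import psycopg2  # assumed PostgreSQL driver; install with: pip install psycopg2-binary

# Connection details from above; the username is not listed in the article,
# so "postgres" (the image default) is an assumption.
conn = psycopg2.connect(
    host="localhost",
    port=5433,
    dbname="postgres",
    user="postgres",
    password="9473",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])  # prints the PostgreSQL server version string
conn.close()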

Setting Up the Stage: Database Creation

Once the services are up and running, your next step is to use the provided DDL scripts to create the necessary database objects. Create a database named PINNACLEDB and, inside it, a schema named ECART.

create DATABASE PINNACLEDB;

Switch to PINNACLEDB and create the schema inside it:

create SCHEMA ECART;

Then, using the provided scripts, create the Customer, ProductInfo, StoreInfo, and Fact_Order tables; a small hand-seeded example follows the DDL below. Remember that all of the operations below must be run inside the PINNACLEDB database.

create table ECART.CUSTOMER (
CUSTID VARCHAR(50),
CUSTNAME VARCHAR(100),
CUSTADD VARCHAR(400)
);
create table ECART.PRODUCTINFO (
PRODUCTID INTEGER,
PRODUCTNAME VARCHAR(150),
PRODCAT VARCHAR(400),
STOREID varchar(70)
);

create table ECART.STOREINFO (
STOREID varchar(70),
STORENAME VARCHAR(150),
STOREADD VARCHAR(400)
);
create table ECART.FACT_ORDER (
ORDERID SERIAL PRIMARY key,
CUSTID VARCHAR(50),
PRODUCTID INTEGER,
PURCHASETIMESTAMP TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
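
Before the generator starts writing rows, it can help to seed one consistent set by hand to see how the tables relate: FACT_ORDER carries the CUSTID and PRODUCTID of the dimension rows it references, while ORDERID and PURCHASETIMESTAMP fill themselves in. The sketch below assumes the psycopg2 driver and the postgres username; the values are made up for illustration.

import uuid
import psycopg2  # assumed driver

# Host, port, database and password come from the article; the username is assumed.
conn = psycopg2.connect(host="localhost", port=5433, dbname="pinnacledb",
                        user="postgres", password="9473")
conn.autocommit = True

cust_id = str(uuid.uuid4())   # CUSTID is a VARCHAR, so a UUID string works
product_id = 1001             # PRODUCTID is an INTEGER
store_id = "STORE-001"

with conn.cursor() as cur:
    cur.execute("insert into ECART.STOREINFO (STOREID, STORENAME, STOREADD) values (%s, %s, %s)",
                (store_id, "Pinnacle Store", "42 Example Street"))
    cur.execute("insert into ECART.CUSTOMER (CUSTID, CUSTNAME, CUSTADD) values (%s, %s, %s)",
                (cust_id, "Test Customer", "1 Demo Lane"))
    cur.execute("insert into ECART.PRODUCTINFO (PRODUCTID, PRODUCTNAME, PRODCAT, STOREID) values (%s, %s, %s, %s)",
                (product_id, "Test Product", "Electronics", store_id))
    # The fact row just ties a customer to a product.
    cur.execute("insert into ECART.FACT_ORDER (CUSTID, PRODUCTID) values (%s, %s)",
                (cust_id, product_id))
conn.close()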

Dockerize the Python Code and Create a Docker Network

The next step is to Dockerize the Python code. Build the Docker image and create a Docker network for inter-container communication. Connect your PostgreSQL container to this network. Now, you’re ready to run the Docker container that hosts the data generator.

docker build -t my-ecart-data-generator .
docker network create my-network

Connect the PostgreSQL container to the Docker network. Replace ecart-db-1 with the name of your PostgreSQL Docker container:

docker network connect my-network ecart-db-1

Run the Docker container containing the data generator:

docker run --network=my-network my-ecart-data-generator

Note: The code connects to the database on port 5432, but when connecting from pgAdmin or DBeaver use port 5433: inside Docker the database listens on 5432, and that port is mapped to 5433 on localhost.

Once the Docker container is up and running, the data generator will start writing data into the tables. If you wish to increase the rate of data generation, modify the dataGenerator.py file and rerun the Docker build and run commands; a sketch of the rate-control idea follows below.
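
For context, the rate control boils down to a loop like the one below. This is a minimal sketch of the idea rather than the repository's actual dataGenerator.py: the ROWS_PER_SECOND knob, the psycopg2 driver, and the postgres username are my assumptions, and the real script generates richer customer and product attributes.

import random
import time
import uuid
import psycopg2  # assumed driver

ROWS_PER_SECOND = 5  # hypothetical knob: raise it to generate data faster

# Inside Docker the generator reaches the database on port 5432;
# from the host (as here) the same database is exposed on 5433.
conn = psycopg2.connect(host="localhost", port=5433, dbname="pinnacledb",
                        user="postgres", password="9473")  # username assumed
conn.autocommit = True

while True:
    start = time.time()
    with conn.cursor() as cur:
        for _ in range(ROWS_PER_SECOND):
            cur.execute(
                "insert into ECART.FACT_ORDER (CUSTID, PRODUCTID) values (%s, %s)",
                (str(uuid.uuid4()), random.randint(1, 1000)),
            )
    # Sleep off the remainder of the second to hold the target rate.
    time.sleep(max(0.0, 1.0 - (time.time() - start)))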

Connect with pgAdmin and run the following queries to check that data is arriving; a small script to measure the actual insert rate follows them.

Note: Make sure you are connected to the PINNACLEDB database.

Host: localhost
Port: 5433
Database: pinnacledb
Password: 9473

select * from ECART.CUSTOMER;
select * from ECART.PRODUCTINFO;
select * from ECART.FACT_ORDER;
select * from ECART.STOREINFO;
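
And to measure the rate the generator is actually achieving, poll the fact table's row count and take the difference per interval; again a sketch under the same assumptions (psycopg2 driver, postgres username):

import time
import psycopg2  # assumed driver

conn = psycopg2.connect(host="localhost", port=5433, dbname="pinnacledb",
                        user="postgres", password="9473")  # username assumed

def fact_rows(cur):
    cur.execute("select count(*) from ECART.FACT_ORDER;")
    return cur.fetchone()[0]

with conn.cursor() as cur:
    previous = fact_rows(cur)
    for _ in range(5):          # sample the insert rate five times
        time.sleep(10)
        current = fact_rows(cur)
        print(f"~{(current - previous) / 10:.1f} rows/second")
        previous = current
conn.close()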

Once data is flowing successfully, you can proceed to more advanced operations such as preprocessing, building a dimensional fact model, or streaming changes with Kafka or Kinesis. You can also explore Debezium, AWS DMS, and other technologies, depending on your learning objectives and how you envision your data pipeline.

For more detailed instructions, I encourage you to visit the repository. Your contributions and feedback are most welcome as we journey together through the fascinating world of big data pipelines.

Happy data journeying!

