Generating and Populating Tables with Synthetic Data at Adjustable Rates
This Python-based data generator produces dummy data for an e-commerce application. It is aimed at anyone who wants to continuously populate database tables at a configurable rate.
Let’s Get Started
The E-Cart Data Generator is Dockerized and can be run seamlessly with Docker Compose. Before you start, make sure Docker (with Docker Compose) and a SQL client such as pgAdmin or DBeaver are installed on your machine.
To dive in, clone the repository, navigate to the directory shown below, and run Docker Compose to start the PostgreSQL database.
git clone https://github.com/Noosarpparashar/startupv2.git
cd startupv2/python/dataGenerator/ecart
docker-compose up -d
You can connect to this database using tools like pgAdmin or DBeaver using the following details:
Host: localhost
Port: 5433
Database: postgres
Password: 9473
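If you prefer to connect from Python instead of a GUI client, a minimal sketch looks like the following. The username postgres is the PostgreSQL default and is an assumption here; confirm it against the docker-compose.yml in the repository.

```python
DB_CONFIG = {
    "host": "localhost",
    "port": 5433,        # host port; maps to 5432 inside the container
    "dbname": "postgres",
    "user": "postgres",  # assumption: the default superuser
    "password": "9473",
}

def make_dsn(cfg):
    """Build a libpq-style DSN string from the config above."""
    return " ".join(f"{k}={v}" for k, v in cfg.items())

def connect():
    """Open a connection (requires psycopg2-binary and a running container)."""
    import psycopg2
    return psycopg2.connect(make_dsn(DB_CONFIG))

print(make_dsn(DB_CONFIG))
```

Calling `connect()` returns a live connection you can use for the verification queries later in this guide.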
Setting Up the Stage: Database Creation
Once the services are up and running, the next step is to create the necessary database objects using the provided DDL scripts. Create a database named PINNACLEDB and a schema named ECART inside it.
create DATABASE PINNACLEDB;
Switch to PINNACLEDB and create the schema inside it:
create SCHEMA ECART;
Then, using the provided scripts, create the Customer, ProductInfo, StoreInfo, and Fact_Order tables. Make sure all of the statements below are run inside the PINNACLEDB database.
create table ECART.CUSTOMER (
CUSTID VARCHAR(50),
CUSTNAME VARCHAR(100),
CUSTADD VARCHAR(400)
);
create table ECART.PRODUCTINFO (
PRODUCTID INTEGER,
PRODUCTNAME VARCHAR(150),
PRODCAT VARCHAR(400),
STOREID varchar(70)
);
create table ECART.STOREINFO (
STOREID varchar(70),
STORENAME VARCHAR(150),
STOREADD VARCHAR(400)
);
create table ECART.FACT_ORDER (
ORDERID SERIAL PRIMARY key,
CUSTID VARCHAR(50),
PRODUCTID INTEGER,
PURCHASETIMESTAMP TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
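To see what the generator's inserts look like against this schema, here is a hedged sketch: the column names come from the DDL above, but the value distributions and the helper name `random_customer` are illustrative, not taken from dataGenerator.py.

```python
import random
import string

def random_customer():
    """Produce one synthetic row matching ECART.CUSTOMER (CUSTID, CUSTNAME, CUSTADD)."""
    custid = "C" + "".join(random.choices(string.digits, k=8))
    custname = random.choice(["Asha", "Ravi", "Meera", "Karan"])
    custadd = f"{random.randint(1, 999)} Main Street"
    return (custid, custname, custadd)

# Parameterized INSERT statements keyed by table; %s placeholders follow
# the psycopg2 convention. ORDERID and PURCHASETIMESTAMP are omitted from
# FACT_ORDER because the DDL fills them in with defaults.
INSERTS = {
    "customer": "INSERT INTO ECART.CUSTOMER (CUSTID, CUSTNAME, CUSTADD) VALUES (%s, %s, %s)",
    "fact_order": "INSERT INTO ECART.FACT_ORDER (CUSTID, PRODUCTID) VALUES (%s, %s)",
}

row = random_customer()
print(INSERTS["customer"], row)
```

With a live connection, `cursor.execute(INSERTS["customer"], row)` would write one row; the parameterized form keeps the generated values safely separated from the SQL.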
Dockerize the Python Code and Create a Docker Network
The next step is to Dockerize the Python code. Build the Docker image and create a Docker network for inter-container communication. Connect your PostgreSQL container to this network. Now, you’re ready to run the Docker container that hosts the data generator.
docker build -t my-ecart-data-generator .
docker network create my-network
Connect the PostgreSQL container to the Docker network. Replace ecart-db-1 with the name of your PostgreSQL Docker container:
docker network connect my-network ecart-db-1
Run the Docker container containing the data generator:
docker run --network=my-network my-ecart-data-generator
Note: the code connects on port 5432, but when connecting from pgAdmin or DBeaver on the host, use port 5433. Inside Docker the database listens on 5432, which is mapped to 5433 on localhost.
Once the Docker container is up and running, the data generator starts writing rows to the tables. To change the rate of data generation, edit the dataGenerator.py file and rerun the Docker build and run commands.
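The rate-control idea can be sketched as a simple timed loop. This is not the actual contents of dataGenerator.py (the variable names there may differ); it just shows one common way to pace inserts to roughly N rows per second.

```python
import time

def run_generator(insert_row, rate_per_sec=5.0, total_rows=20):
    """Call insert_row() at roughly rate_per_sec calls per second.

    insert_row is a callback that writes one row (e.g. a cursor.execute
    wrapper); rate_per_sec is the knob you would tune to speed up or
    slow down data generation.
    """
    interval = 1.0 / rate_per_sec
    for _ in range(total_rows):
        start = time.monotonic()
        insert_row()
        # Sleep off the remainder of this row's time slice, if any.
        time.sleep(max(0.0, interval - (time.monotonic() - start)))

# Example: count calls instead of touching the database.
calls = []
run_generator(lambda: calls.append(1), rate_per_sec=100.0, total_rows=10)
print(len(calls))  # 10
```

Raising `rate_per_sec` shrinks the sleep between inserts; an interval-based loop like this also absorbs the time each insert itself takes.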
Connect with pgAdmin and run the following queries to confirm that data is arriving. Note: make sure you are connected to the PINNACLEDB database.
Host: localhost
Port: 5433
Database: pinnacledb
Password: 9473
select * from ECART.CUSTOMER;
select * from ECART.PRODUCTINFO;
select * from ECART.FACT_ORDER;
select * from ECART.STOREINFO;
Once data is flowing, you can move on to more advanced operations such as preprocessing, building a dimensional fact model, or streaming with Kafka or Kinesis. You can also explore Debezium, AWS DMS, and other technologies depending on your learning objectives and how you envision your data pipeline.
For more detailed instructions, I encourage you to visit the repository. Your contributions and feedback are most welcome as we journey together through the fascinating world of big data pipelines.
Happy data journeying!