Implementing a Simple Custom Data Pipeline and Superset
A real-life case: Python, Crontab, Docker, Superset, SQLite, and PostgreSQL
Hi. In this story, I will talk about a job I recently finished for a client: I implemented a custom data pipeline to power their BI reports, and then set up a Superset instance so they would have their own BI tool.
Python was the main tool I used to build the pipeline. I wrote a custom script to extract the data from the SQLite database on the server and load it into a new Postgres database I created as a warehouse on the same server. After implementing the extract-and-load process, I scheduled the refresh frequency of the data using crontab. Lastly, I set up a Superset instance to create the BI reports. In addition, I set up a PgAdmin instance so I could easily fix issues myself when needed. Let's explore each step I took.
Setting Up Postgres & PgAdmin
At this step, I created a new Postgres data warehouse, along with a PgAdmin instance to monitor and manage it. To create the Postgres database, I got help from Docker. I also bound a volume to the warehouse container so the data survives restarts.
#! /bin/bash
sudo docker run --name wh-db -p 5434:5432 -e POSTGRES_PASSWORD=MY_PASSWORD -d -v /tmp/path/:/var/lib/postgresql/data postgres
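Note that the container maps host port 5434 to Postgres's default port 5432, so every client on the server connects through 5434. As a quick sanity check (a sketch using the placeholder credentials from above), the SQLAlchemy connection URL the pipeline will later use can be built like this:

```python
from urllib.parse import quote_plus

# Placeholder credentials; the container maps host port 5434 -> container port 5432
user, password, host, port, db = "postgres", "MY_PASSWORD", "localhost", 5434, "postgres"

# Percent-encode the password so special characters don't break the URL
url = f"postgresql://{user}:{quote_plus(password)}@{host}:{port}/{db}"
print(url)
```

If your real password contains characters like `@` or `/`, the `quote_plus` call keeps the URL parseable.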
Then, I created a new container to bring up PgAdmin.
#! /bin/bash
sudo docker container run --name pgadmin-wh -d -e PGADMIN_DEFAULT_EMAIL=info@MYMAIL.com -e PGADMIN_DEFAULT_PASSWORD=MY_PASSWORD -e PGADMIN_LISTEN_PORT=5435 -p 5435:5435 dpage/pgadmin4
Nginx Configuration For PgAdmin
I also bound a domain to the PgAdmin instance to make it accessible on the web. To do that, I used the configuration gist below.
Implementing the ETL Script
I created a script.py file in the web app's folder and coded the ETL process in it. Of course, I first created a virtualenv and installed the required packages (pandas and SQLAlchemy, plus a Postgres driver such as psycopg2):
#! ./venv/bin/python3
import sqlite3

import pandas as pd
from sqlalchemy import create_engine

# Read SQLite query results into a pandas DataFrame
con = sqlite3.connect("/home/path/project/db.sqlite3")
df = pd.read_sql_query("SELECT * FROM <TABLE_NAME>", con)
df['revenue'] = pd.to_numeric(df['revenue']) / 100
df['datetime'] = pd.to_datetime(df['datetime'])
con.close()

# PostgreSQL connection
engine = create_engine('postgresql://postgres:<PASSWORD>@localhost:5434/postgres')

# Load the data into the warehouse
df.to_sql('<TABLE_NAME>', con=engine, if_exists='replace', index=False)
The first line of the script is a shebang: when the file is executed directly, it runs with the Python interpreter inside the virtualenv rather than the system Python.
Basically, the script extracts the transactions table from the production database and, after a couple of data type transformations, loads it into the Postgres warehouse we created. Another very important requirement is keeping the data fresh. Let's explore how I achieved that in the next step.
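To make the transformation step concrete, here is a self-contained sketch of the same extract-and-transform logic run against a throwaway in-memory SQLite table (the table name and sample rows are made up for illustration):

```python
import sqlite3

import pandas as pd

# Build a throwaway SQLite database standing in for the production db.sqlite3
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (revenue TEXT, datetime TEXT)")
con.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("1250", "2023-01-01 10:00:00"), ("399", "2023-01-02 16:30:00")],
)
con.commit()

# Extract, then apply the same transformations as the pipeline script
df = pd.read_sql_query("SELECT * FROM transactions", con)
df["revenue"] = pd.to_numeric(df["revenue"]) / 100  # cents -> currency units
df["datetime"] = pd.to_datetime(df["datetime"])     # text -> real timestamps
con.close()

print(df["revenue"].tolist())
```

The only difference from the real script is the connection target; the `to_numeric` and `to_datetime` calls behave identically.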
Providing Fresh Data by Using Crontab
crontab is a built-in program in most Linux distributions; we can use it to schedule cron jobs. If you want deeper knowledge about crontab, you should visit the link below.
crontab(5) - Linux manual page
We can use it by executing the crontab -e command in the terminal. I added the line below to the crontab file.
0 0,8,16 * * * /home/path/project/venv/bin/python3 /home/path/project/script.py
This refreshes the data every day at 00:00, 08:00, and 16:00. You can learn more about cron schedule expressions by visiting the link below.
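For reference, the five cron fields are minute, hour, day-of-month, month, and day-of-week. A tiny sketch of how the hour field of the entry above expands:

```python
# Cron fields: minute hour day-of-month month day-of-week
entry = "0 0,8,16 * * *"
minute, hour, dom, month, dow = entry.split()

# Expand the comma-separated hour field into concrete values
hours = [int(h) for h in hour.split(",")]
print(hours)  # the job fires at these hours, at minute 0, every day
```

This only handles comma lists; real cron fields also allow ranges (`1-5`) and steps (`*/2`), which tools like crontab.guru decode for you.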
Crontab.guru - The cron schedule expression editor
With that, we have created a simple ETL pipeline using pure Python. Now we need to build reports on top of this data. Let's see how in the next section.
Setting Up Superset & Creating BI Reports
To create a Superset instance, I got help from a resource I created earlier for exactly this problem.
GitHub - mebaysan/Superset-Production-Environment: I created this repo to prepare a Superset…
You can easily bring up Superset by using the repo. I just cloned it onto the client's server and brought up the instance. The image below shows an example of one of the client's reports.
I also used the configuration I mentioned above to bind a domain to the Superset instance.
Hopefully, you enjoyed the story. I thought it would be helpful to share a real-life case. I know these processes could be improved, but the client's needs were quite simple, so I chose the simplest way to solve their problem.