Implementing a Simple Custom Data Pipeline and Superset

A real-life case: Python, Crontab, Docker, Superset, SQLite, and PostgreSQL

Hi. In this story, I'll walk through one of the jobs I recently finished for a client: I implemented a custom data pipeline to feed their BI reports, and then I set up a Superset instance so they could have their own BI tool.

Photo by Emily Morter on Unsplash

Introduction

Python was the main tool I used to build the pipeline. I wrote a custom script to extract the data from the SQLite database on the server and load it into a new Postgres database that I created on the same server to act as a warehouse. After implementing the extract-and-load process, I scheduled how often the data refreshes by using crontab on the server. Lastly, I set up a Superset instance to create the BI reports, plus a PgAdmin instance so I could easily fix issues when needed. Let's explore each step I took.

Setting Up Postgres & PgAdmin

In this step, I created a new Postgres data warehouse, along with a PgAdmin instance to monitor and manage it.

To create the Postgres database, I used Docker. I also bound a volume to the warehouse container so the data survives container restarts.

#! /bin/bash
sudo docker run --name wh-db -p 5434:5432 -e POSTGRES_PASSWORD=MY_PASSWORD -d -v /tmp/path/:/var/lib/postgresql/data postgres
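As an optional check, you can verify the warehouse is reachable from the host by connecting through the mapped port with the psql client (assuming it is installed):

psql -h localhost -p 5434 -U postgres -d postgres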

Then, I created another container to bring up PgAdmin, which listens on port 5435.

#! /bin/bash
sudo docker container run --name pgadmin-wh -d -e PGADMIN_DEFAULT_EMAIL=info@MYMAIL.com -e PGADMIN_DEFAULT_PASSWORD=MY_PASSWORD -e PGADMIN_LISTEN_PORT=5435 -p 5435:5435 dpage/pgadmin4

Nginx Configuration For PgAdmin

I also bound a domain to the PgAdmin instance to make it accessible on the web. To do that, I used the configuration in the gist below.

The instance I deployed
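As a rough sketch (not the exact configuration used), a typical Nginx reverse-proxy server block for PgAdmin listening on port 5435 looks like this; pgadmin.example.com stands in for the real domain:

server {
    listen 80;
    server_name pgadmin.example.com;  # placeholder domain

    location / {
        # Forward requests to the PgAdmin container published on port 5435
        proxy_pass http://127.0.0.1:5435;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}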

Implementing ETL Script

I created a script.py file in the web app's folder and coded the ETL process in it. Of course, I first created a virtualenv and installed the following packages (a short setup sketch follows the list):

  • pandas
  • sqlalchemy
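A minimal setup sketch, assuming the project lives under /home/path/project, might look like the following. Note that SQLAlchemy's postgresql:// URLs also need a driver package such as psycopg2-binary, which is why it is added here:

#! /bin/bash
cd /home/path/project
python3 -m venv venv
./venv/bin/pip install pandas sqlalchemy psycopg2-binary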
#! ./venv/bin/python3
import pandas as pd
import sqlite3
from sqlalchemy import create_engine

# Read sqlite query results into a pandas DataFrame
con = sqlite3.connect("/home/path/project/db.sqlite3")
df = pd.read_sql_query("SELECT * from <TABLE_NAME>", con)

# Basic type fixes: numeric revenue (scaled down by 100) and parsed datetimes
df['revenue'] = pd.to_numeric(df['revenue']) / 100
df['datetime'] = pd.to_datetime(df['datetime'])
con.close()

# PostgreSQL connection
engine = create_engine('postgresql://postgres:<PASSWORD>@localhost:5434/postgres')

# Load the data into the warehouse
df.to_sql('<TABLE_NAME>', con=engine, if_exists='replace', index=False)

The first line of the script (the shebang) tells the system which interpreter to use, so when the script is executed directly it runs with the Python interpreter from the project's virtualenv.

Basically, the script extracts the transactions table from the production database and, after a few data type transformations, loads it into the Postgres warehouse we created. Another very important requirement is keeping the data fresh. Let's explore how I achieved that in the next step.
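Before moving on to scheduling, an optional sanity check is to count the rows that landed in the warehouse after a run; the connection string and placeholder table name below mirror the script above:

#! ./venv/bin/python3
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://postgres:<PASSWORD>@localhost:5434/postgres')
with engine.connect() as conn:
    # Count the rows loaded by the last run of script.py
    rows = conn.execute(text("SELECT COUNT(*) FROM <TABLE_NAME>")).scalar()
    print(f"Rows in warehouse table: {rows}")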

Providing Fresh Data by Using Crontab

crontab is a built-in program in most Linux distros that lets us schedule cron jobs. If you want to learn more about crontab, you should visit the link below.

We can use it by executing the crontab -e command in the terminal. I added the line below to the crontab file.

0 0,8,16 * * * /home/path/project/venv/bin/python3 /home/path/project/script.py

This refreshes the data every day at 00:00, 08:00, and 16:00. You can learn more about cron scheduling expressions by visiting the link below.
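An optional tweak: redirecting the script's output to a log file makes failed runs easier to diagnose. The etl.log path below is hypothetical:

0 0,8,16 * * * /home/path/project/venv/bin/python3 /home/path/project/script.py >> /home/path/project/etl.log 2>&1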

With that, we have built a simple ETL pipeline using plain Python. Now we need to create reports from this data. Let's see how in the next section.

Setting Up Superset & Creating BI Reports

To create a Superset instance, I reused resources I had previously created for exactly this problem.
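As a rough, generic sketch, a Docker-Compose-based Superset setup (for example, from the official apache/superset repository) is usually brought up along these lines; the exact compose file and steps vary by version:

#! /bin/bash
git clone https://github.com/apache/superset.git
cd superset
sudo docker compose up -d  # the compose file name and options may differ between Superset versions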

You can easily bring up Superset by using that repo. I simply cloned it onto the client's server and started the instance. The image below is an example of one of the client's reports.

Image by Author

I also used the script I mentioned above to bind a domain into the Superset instance.

Finally

Hopefully, you enjoyed the story. I thought it would be helpful to share a real-life case. I know these processes could be improved; however, the client's needs were quite simple, so I chose the simplest way to solve their problem.

Kind regards.
