Data Warehouse Tech Stack with Postgres, dbt, and Airflow

Biruk Getaneh
5 min read · Jul 20, 2022


Introduction

This blog presents the ELT pipeline designed for a project that covers everything from data extraction to presentation. An ELT pipeline refers to the process of extracting data from source systems, loading it into a data warehouse environment, and then transforming it using database operations.

ELT pipeline

The workflow of this pipeline starts by reading the raw CSV data from the source into a Postgres database, transforming it using dbt, and presenting the output with Redash; the final output can then be used by data consumers such as BI tools or machine learning models. In the meantime, Airflow schedules a daily job that syncs the data from the source and performs the overall operations.

Objective of the project

A city traffic department wants to collect traffic data using swarm UAVs (drones) from a number of locations in the city and use the data collected for improving traffic flow in the city and for a number of other undisclosed projects. This project aims to develop a scalable data warehouse that will host the vehicle trajectory data extracted by analyzing footage taken by swarm drones and static roadside cameras.

The Data used

The data used in this project is the pNEUMA public dataset. pNEUMA is an open large-scale dataset of naturalistic trajectories of half a million vehicles that have been collected by a one-of-a-kind experiment by a swarm of drones in the congested downtown area of Athens, Greece. Each file for a single (area, date, time) is approximately 87MB in size.

Tech Stack

  • Airflow: an open-source workflow management platform which offers the ability to develop, monitor, and schedule workflows programmatically. Airflow pipelines are defined in Python as Directed Acyclic Graphs (DAGs). Airflow offers numerous integrations with third-party tools, including the Postgres Airflow operator, and can be run locally using Docker Compose. Airflow is used in our pipeline to schedule a daily job that triggers the Postgres sync, followed by the dbt transformation.
  • dbt (Data Build Tool): an open-source data transformation tool that relies on SQL to build production-grade data pipelines. dbt replaces the usual boilerplate DDL/DML required to transform data with simple, modular SQL SELECT statements and handles dependency management. dbt provides a cloud-hosted option, a CLI, a Python API, and integration with Airflow. In our pipeline, dbt applies a simple transformation on the ingested data using SQL queries.
  • PostgreSQL: an advanced, enterprise-class open-source relational database that supports both SQL (relational) and JSON (non-relational) querying. I used it to store the raw and transformed data.
  • Redash: an open-source web application used to explore, query, visualize, and share data from our data sources.
  • Docker: an open-source platform for building, deploying, and managing containerized applications.
  • Python: an object-oriented programming language used to write the data extraction, loading, and transformation scripts.

The Implementation

1. Data Extraction and Loading

The pNEUMA dataset used in this project is publicly available in CSV format. Each row of the dataset holds the data of a single vehicle. The first row contains the names of the first ten columns: the first four carry trajectory-level information such as the unique track ID, the type of vehicle, the distance traveled in meters, and the average speed of the vehicle in km/h, while the remaining six hold per-timestep readings that repeat every six columns based on the recording frequency. Therefore, to put the data into a rectangular (relational) format, each row is rearranged by taking the first four columns and iterating through the repeating groups of six columns.
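Below is a minimal sketch of this rearrangement, assuming a semicolon-delimited file and illustrative column names (track_id, type, traveled_d, avg_speed for the trajectory fields and lat, lon, speed, lon_acc, lat_acc, time for the repeating readings); the actual extraction script in the repository may differ.

import pandas as pd

# Assumed layout based on the description above: four trajectory-level fields
# followed by repeating groups of six per-timestep readings.
TRAJECTORY_COLS = ["track_id", "type", "traveled_d", "avg_speed"]
READING_COLS = ["lat", "lon", "speed", "lon_acc", "lat_acc", "time"]

def melt_row(values):
    """Turn one raw pNEUMA row (a list of string values) into tidy records."""
    head = dict(zip(TRAJECTORY_COLS, values[:4]))
    records = []
    step = len(READING_COLS)
    # Walk the remaining values in chunks of six, one chunk per timestep.
    for i in range(4, len(values) - step + 1, step):
        records.append({**head, **dict(zip(READING_COLS, values[i:i + step]))})
    return records

def extract_csv(path):
    """Read the raw semicolon-delimited CSV and return a rectangular DataFrame."""
    rows = []
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            # Values are kept as strings here; type casting can happen on load or in dbt.
            values = [v.strip() for v in line.strip().split(";") if v.strip()]
            rows.extend(melt_row(values))
    return pd.DataFrame(rows)

The resulting DataFrame has one record per (vehicle, timestep) pair, which is the shape the warehouse table expects.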

2. PostgreSQL (The Data Warehouse)

A data warehouse is a central repository for information coming from one or more data sources, organized for analysis and reporting rather than day-to-day transactions. Nowadays we can store data in the cloud through tools such as Snowflake, but for this project I have used PostgreSQL as our data warehouse. In this scenario, these are the tasks we need to perform with our data warehouse (a minimal loading sketch follows the list):

  • Create a database in PostgreSQL
  • Create a table by specifying the data type of each column
  • Import the CSV data into our Postgres warehouse
  • Execute SQL queries to check the data and transform it using dbt
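Here is a minimal sketch of the table creation and CSV import, assuming psycopg2, a database called traffic_dwh, a table called raw_trajectories, and placeholder connection details and file paths; the actual project may name things differently.

import psycopg2

# Placeholder connection details; adjust to your environment.
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="traffic_dwh",
    user="postgres", password="postgres",
)

CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS raw_trajectories (
    track_id    BIGINT,
    type        TEXT,
    traveled_d  DOUBLE PRECISION,
    avg_speed   DOUBLE PRECISION,
    lat         DOUBLE PRECISION,
    lon         DOUBLE PRECISION,
    speed       DOUBLE PRECISION,
    lon_acc     DOUBLE PRECISION,
    lat_acc     DOUBLE PRECISION,
    time        DOUBLE PRECISION
);
"""

with conn, conn.cursor() as cur:
    cur.execute(CREATE_TABLE_SQL)
    # Bulk-load the rectangular CSV produced by the extraction step.
    with open("data/trajectories_rectangular.csv") as f:
        cur.copy_expert("COPY raw_trajectories FROM STDIN WITH CSV HEADER", f)

conn.close()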

3. The Airflow Orchestration

An Airflow DAG script is written that reads the data, creates a table, and loads the extracted data into the created Postgres table, making use of the Postgres and Python operators. The dbt transformation DAG is also maintained by Airflow and involves dbt seed, run, test, and docs generate.

The ELT DAG
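As a rough sketch of what such a loading DAG can look like (the DAG id, connection id, SQL file path, and the extract_and_load callable are illustrative placeholders, not the exact names from the repository):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Hypothetical helper from the extraction step; replace with the real callable.
from extract_load import extract_and_load

with DAG(
    dag_id="traffic_elt",
    start_date=datetime(2022, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    create_table = PostgresOperator(
        task_id="create_table",
        postgres_conn_id="postgres_dwh",        # assumed Airflow connection id
        sql="sql/create_raw_trajectories.sql",  # assumed path to the DDL file
    )

    load_data = PythonOperator(
        task_id="load_data",
        python_callable=extract_and_load,
    )

    # Create the table first, then load the reshaped CSV into it.
    create_table >> load_data

The dbt DAG follows the same pattern, wrapping the dbt commands described in the next section.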

4. The dbt Transformation

A Python script is written that runs the dbt transformation models on the raw data to generate the overall vehicle information, such as the total number of vehicles in each category, the average speed per vehicle, and the total distance traveled. In addition, the individual vehicle information for the six vehicle types (car, taxi, bus, motorcycle, medium, and heavy vehicles) is extracted from the dataset.

dbt Lineage Graph
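For reference, a minimal sketch of such a runner script is shown below, assuming the dbt CLI is installed and the project lives in a placeholder directory; the actual script in the repository may invoke dbt differently.

import subprocess

# Placeholder path to the dbt project directory.
DBT_PROJECT_DIR = "/opt/dbt_project"

# The dbt steps orchestrated by the Airflow DAG: seed, run, test, docs generate.
DBT_COMMANDS = [
    ["dbt", "seed"],
    ["dbt", "run"],
    ["dbt", "test"],
    ["dbt", "docs", "generate"],
]

def run_dbt():
    """Run each dbt command inside the project directory, stopping on the first failure."""
    for cmd in DBT_COMMANDS:
        subprocess.run(cmd, cwd=DBT_PROJECT_DIR, check=True)

if __name__ == "__main__":
    run_dbt()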

5. Redash and Visualization

Visualizing the data with Redash is the last step of the overall process.
After connecting PostgreSQL as a data source, Redash lets us write queries that evaluate the data against a variety of criteria and execute them directly on top of the warehouse. A dashboard is then built from these queries to produce effective visualizations with a variety of charts and user-customizable controls. Along the way, any query can be created, saved for later use, and used as an input to a dashboard.

Redash Dashboard

Summary

From this data warehousing project I have learned that, by combining these tools, we can carry out fairly complicated tasks. The data is extracted, loaded into our Postgres data warehouse, transformed using SQL queries in dbt, and then published to Redash for exploration. This is a typical illustration of how an ELT pipeline reduces overhead.

Here is the link to the source code on GitHub.
