Data Warehouse Tech Stack with PostgreSQL, DBT, Airflow, and Redash

Degaga Wolde
10 min read · Sep 24, 2022


Table of Contents

Objective
Introduction
Data Warehousing
ETL/ELT
PostgreSQL
Airflow
DBT
Redash
Conclusion
Future Work

Objective

The objective of this week’s challenge is to design and build a data warehouse for a city traffic department. The department collects traffic data using swarm UAVs (drones) from several locations in the city and uses the collected data to improve traffic flow in the city and for several other undisclosed projects. My responsibility is to create a scalable data warehouse that will host the vehicle trajectory data extracted by analyzing footage taken by the swarm drones. Hence, I am expected to use the ELT process with dbt, Airflow to schedule the processes, and PostgreSQL for storage.

Introduction

Organizations can collect massive amounts of data, and they need the right people and technology to ensure it is in a highly usable state by the time it reaches data scientists and analysts. A data engineer’s tasks are to design and build systems for collecting, storing, and analyzing data at scale.

A data warehouse (unified data repository) is a system by which data engineers make data accessible so that organizations can use it to evaluate and optimize their performance. The ETL process is used to combine data from different sources and load it into a data warehouse.

There are tools like Airflow, dbt, Postgres, and Redash for data extraction, transformation, and loading. In this blog, we are going to see in detail what these tools are for, their unique characteristics, and how they work.

Data Warehousing

A data warehouse stores historical data from across an organization, processes it, and makes it possible to use the data for critical business analysis, reports, and dashboards. It stores structured, cleaned-up, and organized data for specific business purposes and serves it for reporting or downstream tasks. Data warehouses usually consolidate historical and analytic data derived from multiple sources, and they separate the analysis workload from the transaction workload.

ETL/ELT

According to IBM, ETL is a process that extracts, transforms, and loads data from multiple sources into a data warehouse or other unified data repository. ETL stands for extract, transform, and load. These operations combine data from multiple sources into a single, consistent data store that is loaded into a data warehouse. The most obvious difference between ETL and ELT is the order of operations: ELT copies or exports the data from the source locations, but instead of loading it into a staging area for transformation, it loads the raw data directly into the target data store, where it is transformed as needed.

ELT is beneficial for high-volume, unstructured datasets as loading can occur directly from the source. ELT can be ideal for extensive data management since it does not need much upfront planning for data extraction and storage.

Raw data is copied or exported from source locations to a staging area during data extraction. Data sources can be structured or unstructured. Those sources include but are not limited to:

  • SQL or NoSQL servers
  • CRM and ERP systems
  • Flat files
  • Email
  • Web pages

In the staging area, the raw data undergoes data processing. The data is transformed and consolidated here for its intended analytical use case. This phase can involve the following tasks:

  • Filtering, cleansing, de-duplicating, validating and authenticating the data.
  • Performing calculations, translations, or summarizations based on the raw data. This can include changing row and column headers for consistency, converting currencies or other units of measurement, editing text strings, and more.
  • Conducting audits to ensure data quality and compliance
  • Removing, encrypting, or protecting data governed by industry or governmental regulators
  • Formatting the data into tables or joined tables to match the schema of the target data warehouse.
In the last step, the transformed data is moved from the staging area into the target data warehouse. Typically, this involves an initial load of all data, followed by periodic loads of incremental data changes and, less often, full refreshes to erase and replace the data in the warehouse.
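To make the ELT flow concrete, here is a minimal sketch in Python: the raw data is loaded into Postgres as-is, and the transformation happens afterwards inside the warehouse. The CSV path, connection string, and table and column names are illustrative placeholders, not part of the actual project.

# Minimal ELT sketch: extract, load the raw data, then transform inside the warehouse.
# The connection string, CSV path, and table/column names are placeholders.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost:5434/warehouse")

# Extract + Load: copy the raw vehicle trajectory records into a staging table as-is.
raw = pd.read_csv("vehicle_trajectories.csv")
raw.to_sql("raw_vehicles", engine, if_exists="replace", index=False)

# Transform: the cleanup happens inside the warehouse, after loading.
with engine.begin() as conn:
    conn.execute(text("""
        CREATE OR REPLACE VIEW vehicles AS
        SELECT *
        FROM raw_vehicles
        WHERE avg_speed IS NOT NULL AND traveled_distance >= 0
    """))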

PostgreSQL

PostgreSQL is a powerful, open-source object-relational database system. Installation guidelines for each OS can be found here.

Opening and accessing the database with Adminer on macOS.
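Besides a GUI client, the connection can be sanity-checked from Python. A minimal sketch, assuming the container’s port 5432 is mapped to 5434 on the host and using placeholder credentials:

# Quick connectivity check against the Postgres container (placeholder credentials).
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5434,  # container port 5432 mapped to 5434 on the host
    dbname="warehouse",
    user="postgres",
    password="postgres",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()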

Airflow

Apache Airflow is used as a workflow management platform for data engineering pipelines. It is an open-source package written in Python, and workflows are created via Python scripts. Airflow has six core components.

  1. Scheduler — Triggers scheduled workflows and submits tasks to the executor
  2. Executor — Handles running tasks
  3. Worker — Runs the actual tasks
  4. Webserver — User interface to inspect, trigger, and debug the behavior of DAGs and tasks
  5. Metadata database — Stores the state of DAGs and tasks
  6. DAGs folder — Directory where all DAG code is persisted, read by the scheduler and executor

I used the Docker installation to set up Airflow on my local machine.

  1. Fetch the docker-compose.yaml file, which defines the Airflow services:
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.4.0/docker-compose.yaml'

2. Run docker-compose up airflow-init and then docker-compose up, in that order. This will create and run around seven containers for the Airflow services. A folder named ‘dags’ will be created, and the DAG scripts are placed there.

docker-compose up airflow-init
docker-compose up
Containers created by the docker-compose.yml installation of airflow.

To access the Airflow webserver in a browser, go to localhost:8080.

The Airflow webserver and DAG logs opened on localhost:8080.
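Any DAG file dropped into the dags/ folder is picked up by the scheduler. As a sketch of how the scheduling ties into the dbt project set up in the next section, the DAG below runs dbt debug and dbt run daily via the BashOperator; the dbt project path, the schedule, and the assumption that dbt is available inside the Airflow containers are mine, not part of the original setup.

# dags/dbt_pipeline.py — a minimal DAG that runs dbt on a daily schedule.
# The dbt project path and the schedule interval are assumed placeholders,
# and dbt is assumed to be installed in the Airflow environment.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_pipeline",
    start_date=datetime(2022, 9, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_debug = BashOperator(
        task_id="dbt_debug",
        bash_command="cd /opt/airflow/dags/pNEUMA_DBT && dbt debug",
    )
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/airflow/dags/pNEUMA_DBT && dbt run",
    )

    dbt_debug >> dbt_run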

DBT (Data Build Tool)

dbt enables analytics engineers to transform data in their warehouse: it operates on data already within a warehouse, making it easy for data engineers to build complex pipelines from the comfort of their laptops. While it does not perform extraction and loading of data, it is powerful at transformations.

dbt is a powerful tool because it enables data engineers to mainly focus on writing models that reflect core business logic. There is no need to write boilerplate code to create tables and views. Boilerplate code to materialize queries as relations is handled by dbt. dbt ships with the following built-in materializations:

  • View (default): The model is built as a view in the database.
  • Table: The model is built as a table in the database.
  • Ephemeral: The model is not directly built in the database, but is instead pulled into dependent models as common table expressions.
  • Incremental: The model is initially built as a table, and in subsequent runs, dbt inserts new rows and updates changed rows in the table.

Its advantages:

  • Quickly and easily provide clean, transformed data that is ready for analysis.
  • Apply software engineering practices, such as modular code, version control, testing, and CI/CD, to analytics code.
  • Build reusable and modular code using Jinja.
  • Maintain data documentation and definitions within dbt while building and developing lineage graphs.

dbt works with different data platforms (Postgres, Spark, Snowflake, BigQuery, etc.). To install dbt, we can use either the pip package or the Docker images distributed via GitHub Packages. The full guide can be found at https://docs.getdbt.com/dbt-cli/install/overview. Let us see how to install dbt-postgres using pip.

pip install dbt-postgres
# To create a dbt project directory structure in <foldername>, run
dbt init <foldername>

We will have folders like models, analyses, etc., and a file named dbt_project.yml. To connect to a warehouse, in this case the Postgres database, we have to configure profiles.yml, found at ~/.dbt/profiles.yml. Detailed configuration steps are explained at https://docs.getdbt.com/dbt-cli/configure-your-profile

# example profiles.yml file
pNEUMA_DBT:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: <username>
      password: <password>
      port: 5434
      dbname: <database>
      schema: <schema>
      threads: 4

Configuration of the dbt_project.yml file:

# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models.
name: 'pNEUMA_DBT'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'pNEUMA_DBT'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"

# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
# In this example config, we tell dbt to build all models in the example/ directory
# as views. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
models:
  pNEUMA_DBT:
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: view

I connected to the Postgres database running alongside Airflow in Docker Desktop, using localhost and port 5434, which exposes port 5432 of the Postgres container to the external network.

Then I created three models: fastest_and_longest_distance, avg_speed, and traveled_distance.

-- first model: fastest_and_longest_distance
{{ config(materialized='view') }}
with fastest_and_longest_distance as (
    select *
    from {{ ref('fastest_veh_dbt_model') }}
    order by traveled_distance desc
)
select * from fastest_and_longest_distance

-- second model: avg_speed
{{ config(materialized='view') }}
with source_data as (
    select * from vehicles where avg_speed > 40
)
select * from source_data

-- third model: traveled_distance
{{ config(materialized='view') }}
with source_data as (
    select * from vehicles where traveled_distance > 400
)
select * from source_data
‘dbt debug’ — dbt should successfully connect to the data warehouse before running the models.
‘dbt run’ — runs the models and saves the views to the database.
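After dbt run succeeds, the views can be queried directly from Postgres to confirm the materializations. A small verification sketch; the connection string mirrors the profile above and is a placeholder:

# Query the dbt-built views to confirm they were materialized (placeholder credentials).
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5434/warehouse")

# The avg_speed model keeps vehicles with avg_speed > 40.
print(pd.read_sql("SELECT * FROM avg_speed LIMIT 5", engine))

# The traveled_distance model keeps vehicles with traveled_distance > 400.
print(pd.read_sql("SELECT * FROM traveled_distance LIMIT 5", engine))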

Redash

Redash is a Business Intelligence tool that contains many robust integration capabilities compared to existing Data Analytics platforms, making it a favorite for companies that have deployed numerous applications for managing their business processes. You can quickly integrate with Data Warehouses, write SQL queries to pull subsets of data for visualizations, and share dashboards with greater ease. It is designed in a way that it can be used by anyone regardless of their level of technical expertise. The team’s main objective was to create a tool that makes it easy for data practitioners to collaborate and democratize access to insights for all teams.

Although Redash is a web-based Business Intelligence tool, it offers an open-source version that can be deployed by hosting it on your private servers. AWS EC2 AMIs, DigitalOcean, Docker, etc. can be used to create instances, with a minimum of 2 GB of RAM. The key features of Redash are as follows:

  1. There is no need for the users to set up an environment since this is a Web-based tool.
  2. This tool houses a Query Editor that enables users to compose database queries by leveraging the Schema Browser and auto-complete functionality.
  3. It gives users the ability to create visualizations using the drag-and-drop functionality and combine them into a single dashboard.
  4. Allows users to share visualizations and their associated queries easily and enable peer review of reports and queries.
  5. Allows automatic updating of charts and dashboards at custom time intervals.

Redash is installed using the docker-compose file downloaded from https://github.com/getredash/redash/blob/master/docker-compose.yml.

version: "2"
x-redash-service: &redash-service
  image: redash/redash:8.0.0.b32245
  depends_on:
    # - postgres
    - redis
  env_file: redash.env
  restart: always
services:
  server:
    <<: *redash-service
    command: server
    ports:
      - "5001:5000"
    environment:
      REDASH_WEB_WORKERS: 4
    networks:
      - proxynet
  scheduler:
    <<: *redash-service
    command: scheduler
    environment:
      QUEUES: "celery"
      WORKERS_COUNT: 1
    networks:
      - proxynet
  scheduled_worker:
    <<: *redash-service
    command: worker
    environment:
      QUEUES: "scheduled_queries,schemas"
      WORKERS_COUNT: 1
    networks:
      - proxynet
  adhoc_worker:
    <<: *redash-service
    command: worker
    environment:
      QUEUES: "queries"
      WORKERS_COUNT: 2
    networks:
      - proxynet
  redis:
    image: redis:5.0-alpine
    restart: always
    networks:
      - proxynet
  # postgres:
  #   image: postgres:9.6-alpine
  #   env_file: redash.env
  #   volumes:
  #     - /opt/redash/postgres-data:/var/lib/postgresql/data
  #   restart: always
  nginx:
    image: redash/nginx:latest
    ports:
      - "80:80"
    depends_on:
      - server
    links:
      - server:redash
    restart: always
    networks:
      - proxynet
# Create a network for communication among the services in the
# Redash containers and the Airflow container.
networks:
  proxynet:
    name: custom_network

Using docker-compose up, we can build and run the Redash containers. After that, we can access the Redash web server at localhost:5001; port 5001 exposes the container’s port 5000 to the external network.

The Redash home page after you create an account and log in.
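Redash also exposes a REST API, so queries and dashboards can be managed programmatically. A minimal sketch that lists the saved queries, assuming a user API key copied from the Redash profile page and the /api/queries endpoint of recent Redash versions:

# List saved queries through the Redash REST API (URL, port, and API key are assumptions).
import requests

REDASH_URL = "http://localhost:5001"
API_KEY = "<your-user-api-key>"  # copied from the Redash user profile page

resp = requests.get(
    f"{REDASH_URL}/api/queries",
    headers={"Authorization": f"Key {API_KEY}"},
    timeout=10,
)
resp.raise_for_status()
for query in resp.json().get("results", []):
    print(query["id"], query["name"])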

Redash Data Visualization

The average speed of vehicles on a trajectory vs. time
Average speed and distance traveled vs. vehicle type

Conclusion

In this blog, we have seen how to set up a data warehouse tech stack with PostgreSQL, DBT, Airflow, and Redash: PostgreSQL for storage, Airflow for scheduling and managing data operations, dbt for data transformations, and Redash for building dashboards and visualizations.

Future Work

My next blog will focus on how to use Snowflake, dbt, Spark, and other tools to fully automate data-driven traffic management. I will explain this concept using AWS cloud services rather than a local machine.
