Building a Modern Data Stack with Docker: PostgreSQL, Airflow, dbt, and Superset

Matt Heiting
7 min read · Jul 25, 2024


In this article, we will walk through setting up a modern data stack using Docker. We will cover how to configure PostgreSQL as a data warehouse, use Airflow to orchestrate data ingestion and transformation, run dbt models, and visualize the data with Superset. By the end of this guide, you'll be able to clone my GitHub repository and get everything set up and running. Don't mind the overcomplicated Airflow setup ;) I built this demo on top of an Airflow configuration from a previous project, which is more complex than this demo strictly needs.

Notes:

  • If you are using Windows, you will need to alter the code slightly; you can see those adjustments in the video below.
  • Once you get everything up and running through Docker Compose, you will need to initialize the dbt project from within the Airflow Docker container. To do this, run the following commands:
# get the container id for the airflow service
docker ps

# open the container in interactive mode
docker exec -it <container_id> /bin/bash

# cd into the dbt folder
cd dbt

# run dbt deps and dbt compile
dbt deps
dbt compile
  • Once the above commands have been executed, you will see a logs folder and a target folder generated inside the dbt directory. All of the dbt-related DAGs should now import correctly and you will be able to run everything.
  • Upload the prebuilt dashboards zip file into Superset to get the prebuilt dashboard, charts, datasets, and connection details. The zip file is located in the assets folder in the superset directory. To import it, select the Dashboards section from the menu, then click the upload icon on the right side of the page.

Table of Contents

1. Introduction and Project Setup

2. Configuring PostgreSQL

3. Setting Up Airflow

4. Running dbt Models

5. Visualizing Data with Superset

6. Conclusion

Follow along in the video if you’d like!

First, ensure you have Docker and Docker Compose installed on your machine. If you do not, follow the instructions here to do so. Clone my GitHub repository to get started with the project setup. If you do not have git installed, follow the instructions here.

git clone https://github.com/mattydoez/data-tech-demo.git
cd data-tech-demo
code .

This project uses the following Docker services:

  • PostgreSQL for storing and processing data
  • Airflow for managing data workflows
  • dbt for data transformations
  • Superset for data visualization

The sample data was pulled from Maven Analytics. If you would like to play with other sample data, they have a great selection of datasets you can download and use to build your own dbt models and pipelines in Airflow.

Configuring PostgreSQL

PostgreSQL is used as the primary data warehouse in this project. The docker-compose.yml file sets up three PostgreSQL instances: one for Airflow’s metadata (airflowdb), another for the data warehouse (company_dw), and one for Superset (superset_db).

PostgreSQL Details

We added a postgres.conf file so that we could change some default settings for our data warehouse, mostly so that we can run concurrent queries in Superset without much issue. You will notice that this file lives in its own folder and is referenced in the Docker service command that starts the company_dw service. Note that we first have to mount the file into the image so that the container has access to the conf file when it runs the command.
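As a quick sanity check, you can confirm that the custom settings were picked up by querying the warehouse from your host machine. The Python snippet below is a minimal sketch and not part of the repository: the host port, credentials, and the max_connections setting are assumptions, so adjust them to match the docker-compose.yml and postgres.conf in the repo.

# check_pg_settings.py: minimal sketch; connection details are assumptions,
# adjust them to match the ports and credentials in docker-compose.yml
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,            # assumed host port mapping for the company_dw service
    dbname="company_dw",
    user="postgres",      # assumed credentials
    password="postgres",
)

with conn, conn.cursor() as cur:
    # settings like max_connections are typical candidates for a postgres.conf override
    cur.execute("SHOW max_connections;")
    print("max_connections:", cur.fetchone()[0])

conn.close()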

Setting Up Airflow

Apache Airflow is used to orchestrate data workflows, including data ingestion and running dbt models. The Airflow setup includes several services: scheduler, webserver, worker, and an init service.

Airflow Details

First, I want to point out that we are using a single Dockerfile for all of the Airflow services. To make this work, I created an entrypoint.sh script that changes the command used to start each service based on the first argument passed when the container starts.

Additionally, we run dbt by installing it directly in the Airflow image; you can see it listed in the Dockerfile. There are many ways to run dbt models from Airflow, and I chose this approach for this project in an attempt to try something new!

Airflow DAGs Folder

ingest_crm_sales_data.py

  • This DAG is responsible for ingesting CRM sales data into the PostgreSQL data warehouse. It reads CSV files from a specified location, creates the necessary tables in the database, and loads the data. A rough sketch of this pattern is shown after the task list below.

Key Tasks:

1. Create Schema: Ensures that the crm_sales_data schema exists.

2. Delete Existing Tables: Drops any existing tables to ensure clean ingestion.

3. Ingest Data: Loads data from CSV files into the corresponding tables.
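To give a feel for what this DAG does before you open the repository, here is a minimal sketch of the same pattern in Python. It is not the actual DAG from the repo: the file paths, table names, and the company_dw connection id are assumptions, and the real version loops over all of the Maven Analytics CSV files.

# Minimal sketch of a CSV-to-Postgres ingestion DAG; file paths, table names,
# and the "company_dw" connection id are assumptions, not the repo's actual code.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

SCHEMA = "crm_sales_data"
CSV_FILES = {"sales_pipeline": "/opt/airflow/data/sales_pipeline.csv"}  # assumed path


def create_schema():
    hook = PostgresHook(postgres_conn_id="company_dw")  # assumed connection id
    hook.run(f"CREATE SCHEMA IF NOT EXISTS {SCHEMA};")


def drop_tables():
    hook = PostgresHook(postgres_conn_id="company_dw")
    for table in CSV_FILES:
        hook.run(f"DROP TABLE IF EXISTS {SCHEMA}.{table};")


def ingest_csvs():
    hook = PostgresHook(postgres_conn_id="company_dw")
    engine = hook.get_sqlalchemy_engine()
    for table, path in CSV_FILES.items():
        df = pd.read_csv(path)
        df.to_sql(table, engine, schema=SCHEMA, if_exists="replace", index=False)


with DAG(
    dag_id="ingest_crm_sales_data_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    create = PythonOperator(task_id="create_schema", python_callable=create_schema)
    drop = PythonOperator(task_id="delete_existing_tables", python_callable=drop_tables)
    ingest = PythonOperator(task_id="ingest_data", python_callable=ingest_csvs)

    create >> drop >> ingest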

run_dbt_init_tasks.py

  • This DAG initializes dbt tasks, such as setting up the dbt environment and running initial commands like dbt deps and dbt seed. A sketch of this pattern using BashOperator follows the task list below.

Key Tasks:

1. Install dbt Dependencies: Runs dbt deps to install package dependencies.

2. Seed Data: Executes dbt seed to load seed files into the data warehouse.
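Because dbt is installed directly in the Airflow image, these init steps can be expressed as plain BashOperator tasks. The sketch below illustrates that pattern rather than reproducing the repository's actual DAG; the /opt/airflow/dbt path is an assumption.

# Minimal sketch of a dbt init DAG using BashOperator; the project path is
# an assumption, not the repo's actual layout.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/airflow/dbt"  # assumed location of the dbt project inside the image

with DAG(
    dag_id="run_dbt_init_tasks_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    dbt_deps = BashOperator(
        task_id="install_dbt_dependencies",
        bash_command=f"cd {DBT_DIR} && dbt deps",
    )
    dbt_seed = BashOperator(
        task_id="seed_data",
        bash_command=f"cd {DBT_DIR} && dbt seed",
    )

    dbt_deps >> dbt_seed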

run_dbt_models_crm_sales.py

  • This DAG runs the dbt models specific to CRM sales data. It ensures that the transformations defined in dbt are applied to the ingested data.

Key Tasks:

1. Run dbt Models: Executes dbt run to run all the models and apply transformations.

2. Run dbt Tests: Executes dbt test to run tests and validate the models.

run_dbt_module.py

  • This file contains helper functions and modules that are used across various dbt-related DAGs. It might include functions for loading the dbt manifest, creating dynamic tasks, or other utilities.
  • In this file specifically, I created a module that lets me generate a DAG for any set of dbt models I want, specify a schema, and test each model, all from within one function that can be reused across multiple DAGs. A simplified sketch of this idea follows below.
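I have not reproduced the actual module here, but the general idea can be sketched as a small DAG factory: pass in a DAG id, a list of dbt models, and a schema, and get back a DAG that runs and then tests each model. Everything below is an illustration under that assumption; the model names, selectors, paths, and the way the schema is passed as a dbt var are hypothetical, not the repository's real code. A DAG like run_dbt_models_crm_sales could then be generated with a single call, as shown at the bottom of the sketch.

# Sketch of a reusable DAG factory for dbt models; the per-model
# "dbt run --select" / "dbt test --select" pattern, the project path,
# and the schema var are assumptions for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/airflow/dbt"  # assumed dbt project location inside the Airflow image


def build_dbt_dag(dag_id: str, models: list[str], schema: str) -> DAG:
    """Create a DAG that runs and then tests each dbt model in `models`."""
    dag = DAG(
        dag_id=dag_id,
        start_date=datetime(2024, 1, 1),
        schedule=None,
        catchup=False,
    )
    with dag:
        previous = None
        for model in models:
            run = BashOperator(
                task_id=f"run_{model}",
                # passing the schema as a dbt var is an assumption about how the
                # real project wires this up
                bash_command=(
                    f"cd {DBT_DIR} && "
                    f"dbt run --select {model} --vars 'target_schema: {schema}'"
                ),
            )
            test = BashOperator(
                task_id=f"test_{model}",
                bash_command=f"cd {DBT_DIR} && dbt test --select {model}",
            )
            run >> test
            if previous is not None:
                previous >> run
            previous = test
    return dag


# hypothetical usage; the model names here do not come from the repo
crm_sales_dag = build_dbt_dag(
    dag_id="run_dbt_models_crm_sales_sketch",
    models=["stg_sales_pipeline", "fct_sales_performance"],
    schema="crm_sales",
)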

Running dbt Models

dbt (Data Build Tool) is used for transforming data in the data warehouse. The dbt models are stored in the dbt directory and are executed by Airflow as part of the data pipeline. Below, we outline the key components of the dbt setup and how it integrates with the overall data stack.

dbt Project Structure

A typical dbt project includes the following directories and files:

  • models/: Contains SQL files that define your dbt models. Each file corresponds to a table or view in the data warehouse.
  • seeds/: Contains CSV files that dbt can load directly into the data warehouse as tables.
  • snapshots/: Contains SQL files that define snapshot logic to track changes over time.
  • dbt_project.yml: The main configuration file for the dbt project.
  • profiles.yml: Configuration file containing connection details for the data warehouse.

Staging Models

Staging models are typically materialized as views. The purpose of staging models is to clean, format, and join raw data from source tables without creating physical tables, thus saving storage space and ensuring data freshness. By using views, we can leverage the latest data from source tables without the need for periodic table refreshes.

Final Models

Final models, often referred to as marts or fact tables, are materialized as tables. These models are typically more complex and computationally intensive, involving joins, aggregations, and other transformations. Materializing these as tables ensures faster query performance for downstream analysis and reporting.

Visualizing Data with Superset

Apache Superset is used for data visualization. The Superset setup includes a PostgreSQL instance for storing metadata and a Superset service for the web interface. I have provided some out-of-the-box visualizations and models for you to build on, or simply to see as an example of what we can do with this service.

Superset Details

In Superset, we first have to connect a database to create a connection, then create a dataset. A dataset can be a table, or you can define a custom query to create one. Additionally, if you want your custom query to persist in the data warehouse as a table, you can do that as well.

Once a dataset is created, you can explore the data to create charts, which eventually get added to dashboards.

Note: I added some custom CSS to make the dashboard look better. The CSS can be found in the superset folder; copy and paste it into the custom CSS editor from the edit dashboard menu.

Analysis

Prior to creating the dbt models and dashboard, I had to understand the data and what it signifies. In the sample data we are looking at sales pipelines and their performance, along with some other data we can join to the pipeline data to provide more insightful information. I asked myself what types of things this fictional company might be interested in, and I came up with these general questions:

  1. How is each sales team performing compared to the rest?
  2. Are any sales agents lagging behind?
  3. Can you identify any quarter-over-quarter trends?
  4. Do any products have better win rates?

I then focused on creating fact tables that could support these questions, so that I could build a dashboard with all of the information that could be valuable. I would like to note here that just because I think something is important does not mean a business will. In the real world, I would work closely with the stakeholders to understand how I could present data that not only answers their questions but also provides some action items. In the end, we get a dashboard that looks like this:

Conclusion

By following this guide, you have set up a complete data stack using Docker. You can ingest raw data from Maven Analytics, transform it using dbt models orchestrated by Airflow, and visualize it using Superset. This setup provides a robust, scalable, and easy-to-manage data infrastructure.

To get started, clone the repository and run:

git clone https://github.com/mattydoez/data-tech-demo.git
cd data-tech-demo
docker-compose up -d --build

Next, run the dbt commands above from within the Airflow Docker container.

Run the DAGs in the following order:

  1. ingest_crm_sales_data
  2. run_dbt_init_tasks
  3. run_dbt_models_crm_sales

Resources

Docker Documentation

PostgreSQL Documentation

Airflow Documentation

dbt Documentation

Superset Documentation

By using this setup, you can quickly deploy a powerful data stack for your projects, enabling efficient data management, transformation, and visualization.
