Building a Modern Data Pipeline Part 3: Docker Configuration

Andy Sawyer
5 min read · Feb 20, 2024

This is the third part of a six-part series titled ‘Building a Modern Data Pipeline: A Journey from API to Insight’, which accompanies this GitHub repo. The series steps through setting up a data pipeline and running it end-to-end on your local machine.

Part 1 was a high-level overview, while part 2 stepped through how to download and run the pipeline. This is part 3, which looks at the Docker configuration.

What is Docker?

In the field of Data Engineering, Docker has changed the way we build, deploy, and manage pipelines. By encapsulating applications into containers, Docker ensures consistency across multiple development and deployment environments, significantly reducing the “it works on my machine” problem. This containerization approach not only streamlines development workflows but also enhances scalability and efficiency in data processing tasks.

For data engineers, Docker’s ability to isolate dependencies, replicate data science environments, and facilitate microservices architectures makes it an indispensable tool in tackling complex data challenges. Embracing Docker means embracing a future where deployment is as reliable as your data analysis, ensuring that your data pipelines are both robust and resilient.

Custom Docker Images

As noted in previous posts, the repository includes two custom images: one for Airflow and one for Jupyter.

There are thousands of pre-built images already available online. The reason for building custom images here is that I wanted the Polars and Delta Lake packages bundled in from the start. Let’s have a look at how they work…

Starting out, we can build our images by running the build.sh file in the root folder. If you open the file, it looks like this:

#!/bin/bash

cd build/airflow
docker build -t data-pipeline-airflow-demo . -f DockerFile
cd ../jupyter
docker build -t data-pipeline-jupyter-demo . -f DockerFile

First it moves into the build/airflow folder and runs a docker build command, then moves into the build/jupyter folder and runs a similar command.
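If you’ve just cloned the repo, the script may not be executable yet. A quick sketch of running it from the repository root (assuming a bash-compatible shell):

# make the script executable, then run both builds
chmod +x build.sh
./build.sh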

Each command builds a custom image, using the specified DockerFile as the build instructions. Let’s have a look at one of those DockerFiles:

FROM quay.io/jupyter/scipy-notebook:2024-01-15
COPY requirements.txt /
RUN pip install --no-cache-dir -r /requirements.txt

This is the DockerFile for the custom Jupyter image, and it’s fairly simple. We start from a base image, copy our requirements.txt file into it, and then run pip install against that file to add the packages we want baked into the image.

That’s it. We now have our own image that has everything we want installed and ready for use.
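If you want to confirm the build worked, you can list the images and run a quick throwaway container from the Jupyter one. This is just a sketch; it assumes Polars is one of the packages listed in requirements.txt:

# list the freshly built images
docker images | grep data-pipeline

# run a one-off container from the custom Jupyter image and check
# that the bundled packages import cleanly
docker run --rm data-pipeline-jupyter-demo python -c "import polars; print(polars.__version__)"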

Docker-Compose

The docker-compose.yml file in the root folder is used to stand up a number of containers, including ones built from the two custom images we just created.
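For reference, the whole stack described in this file is typically brought up and torn down with a couple of commands from the repo root (a sketch; older Docker installs may need the hyphenated docker-compose binary instead of docker compose):

# start every service defined in docker-compose.yml in the background
docker compose up -d

# list the running containers and their exposed ports
docker compose ps

# stop and remove the containers when you're done
docker compose down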

A docker-compose file is usually made up of three main parts, including definitions for:

  • Services: The containers your application uses, specifying images, build contexts, ports, environment variables, and more.
  • Networks: Communication paths and policies between containers, possibly defining custom networks for isolation or specific communication needs.
  • Volumes: Persistent storage configurations for data that needs to survive container restarts or be shared between containers.

Let’s have a look inside the file included in the repo:

version: '3.4'

services:
  airflow_base: &airflow_base
    image: data-pipeline-airflow-demo
    user: "${AIRFLOW_UID}:0"
    env_file:
      - .env
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./plugins:/opt/airflow/plugins
      - ./pipelines:/opt/airflow/pipelines
      - ./seeds:/opt/airflow/seeds
      - /var/run/docker.sock:/var/run/docker.sock

  postgres:
    image: postgres:13
    container_name: postgres
    ports:
      - "5434:5432"
    networks:
      - pipeline_network
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    env_file:
      - .env

  scheduler:
    <<: *airflow_base
    depends_on:
      postgres:
        condition: service_healthy
      airflow-init:
        condition: service_completed_successfully
    container_name: airflow-scheduler
    command: scheduler
    restart: on-failure
    ports:
      - "8793:8793"
    networks:
      - pipeline_network

  webserver:
    <<: *airflow_base
    depends_on:
      postgres:
        condition: service_healthy
      airflow-init:
        condition: service_completed_successfully
    container_name: airflow-webserver
    restart: always
    command: webserver
    ports:
      - "8080:8080"
    networks:
      - pipeline_network
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 30s
      retries: 5

  airflow-init:
    <<: *airflow_base
    container_name: airflow-init
    entrypoint: /bin/bash
    command:
      - -c
      - |
        mkdir -p /sources/logs /sources/dags /sources/plugins
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins}
        exec /entrypoint airflow version
    networks:
      - pipeline_network

  minio:
    image: docker.io/bitnami/minio:latest
    ports:
      - '9000:9000'
      - '9001:9001'
    networks:
      pipeline_network:
        ipv4_address: 10.5.0.5
    volumes:
      - 'minio_data:/data'
    environment:
      - MINIO_ROOT_USER=${MINIO_ROOT_USER}
      - MINIO_ROOT_PASSWORD=${MINIO_ROOT_PASSWORD}
      - MINIO_DEFAULT_BUCKETS=${MINIO_DEFAULT_BUCKETS}
    env_file:
      - .env

  jupyter:
    image: data-pipeline-jupyter-demo
    ports:
      - '8888:8888'
    environment:
      - JUPYTER_TOKEN=easy
    volumes:
      - ./notebooks:/home/jovyan/work
      - ./seeds:/home/jovyan/seeds
    networks:
      - pipeline_network

networks:
  pipeline_network:
    driver: bridge
    ipam:
      config:
        - subnet: 10.5.0.0/16
          gateway: 10.5.0.1

volumes:
  minio_data:
    driver: local

Lots there. But don’t worry, we can break it down.

It specifies configurations for Airflow (orchestration), PostgreSQL (database for Airflow), MinIO (object storage), and Jupyter (interactive notebooks), interconnected within a custom network. Here’s a breakdown, followed by a few quick checks you can run against the running stack:

  • Airflow Base: Template for Airflow services, specifying volumes for DAGs, logs, plugins, pipelines, seeds, and the Docker socket so Docker operations can run from within Airflow containers. Putting this in a template (a YAML anchor) saves duplicating the same configuration in every Airflow service that uses it.
  • PostgreSQL: Airflow database service with health checks to ensure readiness before dependent services start.
  • Scheduler & Webserver: Airflow components using the base template, set to depend on PostgreSQL’s health and initial setup completion. Ports are exposed for web access and command execution.
  • Airflow-Init: Initializes Airflow, setting up necessary directories and permissions.
  • MinIO: Object storage for data handling, accessible via specified ports, with environment variables for access control and bucket setup. It sets up three buckets for this project: bronze, silver and gold.
  • Jupyter: Service for interactive Python notebooks, with port exposure for web access and volume mappings for notebooks and data seeds.
  • Networks & Volumes: Defines a custom bridge network for inter-service communication and a volume for MinIO data persistence. This allows all of the containers to interact with each other on their own local network.

Creating Different Docker-Compose Files

You rarely need to set up a docker-compose.yml file from scratch. If there is a particular application that you want to run, you can generally search for that application along with the term docker-compose and find a number of articles or steps on how to get that specific application up and running. A good place to start is the Docker Hub.
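When you do adapt a compose file you’ve found online, docker compose config is a handy sanity check; it validates the YAML and prints the fully resolved configuration with environment variables substituted:

# validate docker-compose.yml and print the resolved configuration,
# with ${...} values filled in from .env
docker compose config

# pull any pre-built images referenced in the file before the first start
docker compose pull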

Next Steps

That’s all for this post. The next post will be coming shortly, and is going to look at the Python code that Airflow uses to orchestrate the pipeline. It’s relatively simple code, but it makes sure that things run in the right order. Stay tuned, and please feel free to share your thoughts. Your feedback and questions are highly welcome. Follow me for updates on this series and more insights into the world of data engineering.


Andy Sawyer

Bringing software engineering best practices and a product driven mindset to the world of data. Find me at https://www.linkedin.com/in/andrewdsawyer/