Building a Modern Data Pipeline Part 3: Docker Configuration

Andy Sawyer
5 min read · Feb 20, 2024

This is the third part of a six-part series titled ‘Building a Modern Data Pipeline: A Journey from API to Insight’, which accompanies this GitHub repo. The series steps through setting up a data pipeline and running it end-to-end on your local machine.

Part 1 was a high-level overview, while part 2 stepped through how to download and run the pipeline. This is part 3, which looks at the Docker configuration.

What is Docker?

In the field of Data Engineering, Docker has changed the way we build, deploy, and manage pipelines. By encapsulating applications into containers, Docker ensures consistency across multiple development and deployment environments, significantly reducing the “it works on my machine” problem. This containerization approach not only streamlines development workflows but also enhances scalability and efficiency in data processing tasks.

For data engineers, Docker’s ability to isolate dependencies, replicate data science environments, and facilitate microservices architectures makes it an indispensable tool in tackling complex data challenges. Embracing Docker means embracing a future where deployment is as reliable as your data analysis, ensuring that your data pipelines are both robust and resilient.

Custom Docker Images

As noted in previous posts, the repository includes two custom images: one for Airflow and one for Jupyter.

There are thousands of pre-built images already available online. The reason for building custom images here is that I wanted the Polars and Delta Lake packages bundled in from the start. Let’s have a look at how they work…

Starting out, we can build our images by running the build.sh file in the root folder. If you open the file, it looks like this:

#!/bin/bash

cd build/airflow
docker build -t data-pipeline-airflow-demo . -f DockerFile
cd ../jupyter
docker build -t data-pipeline-jupyter-demo . -f DockerFile

First it moves into the build/airflow folder and runs a docker build command, then moves into the build/jupyter folder and runs a similar command.
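If you’ve just cloned the repo, the script may not be executable yet. A quick sketch of running it from the repository root (assuming a bash-compatible shell):

# make the script executable, then run both builds
chmod +x build.sh
./build.sh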

Each command builds a custom image, using the specified DockerFile as the build instructions. Let’s have a look at one of those DockerFiles:

FROM quay.io/jupyter/scipy-notebook:2024-01-15
COPY requirements.txt /
RUN pip install --no-cache-dir -r /requirements.txt

This is the DockerFile for the custom Jupyter image, and it’s fairly simple. We start from a base image, copy our requirements.txt file into it, and then run pip install against that file to add the packages we want baked into the image.

That’s it. We now have our own image that has everything we want installed and ready for use.
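If you want to confirm the build worked, you can list the images and run a quick throwaway container from the Jupyter one. This is just a sketch; it assumes Polars is one of the packages listed in requirements.txt:

# list the freshly built images
docker images | grep data-pipeline

# run a one-off container from the custom Jupyter image and check
# that the bundled packages import cleanly
docker run --rm data-pipeline-jupyter-demo python -c "import polars; print(polars.__version__)"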

Docker-Compose

The docker-compose.yml file in the root folder is used to stand up a number of containers, including ones built from the two custom images we just created.
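For reference, the whole stack described in this file is typically brought up and torn down with a couple of commands from the repo root (a sketch; older Docker installs may need the hyphenated docker-compose binary instead of docker compose):

# start every service defined in docker-compose.yml in the background
docker compose up -d

# list the running containers and their exposed ports
docker compose ps

# stop and remove the containers when you're done
docker compose down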

A docker-compose file is usually made up of three main parts, including definitions for:

  • Services: The containers your application uses, specifying images, build contexts, ports, environment variables, and more.
  • Networks: Communication paths and policies between containers, possibly defining custom networks for isolation or specific communication needs.
  • Volumes: Persistent storage configurations for data that needs to survive container restarts or be shared between containers.

Let’s have a look inside the file included in the repo:

version: '3.4'

services:
  airflow_base: &airflow_base
    image: data-pipeline-airflow-demo
    user: "${AIRFLOW_UID}:0"
    env_file:
      - .env
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./plugins:/opt/airflow/plugins
      - ./pipelines:/opt/airflow/pipelines
      - ./seeds:/opt/airflow/seeds
      - /var/run/docker.sock:/var/run/docker.sock

  postgres:
    image: postgres:13
    container_name: postgres
    ports:
      - "5434:5432"
    networks:
      - pipeline_network
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    env_file:
      - .env

  scheduler:
    <<: *airflow_base
    depends_on:
      postgres:
        condition: service_healthy
      airflow-init:
        condition: service_completed_successfully
    container_name: airflow-scheduler
    command: scheduler
    restart: on-failure
    ports:
      - "8793:8793"
    networks:
      - pipeline_network

  webserver:
    <<: *airflow_base
    depends_on:
      postgres:
        condition: service_healthy
      airflow-init:
        condition: service_completed_successfully
    container_name: airflow-webserver
    restart: always
    command: webserver
    ports:
      - "8080:8080"
    networks:
      - pipeline_network
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 30s
      retries: 5

  airflow-init:
    <<: *airflow_base
    container_name: airflow-init
    entrypoint: /bin/bash
    command:
      - -c
      - |
        mkdir -p /sources/logs /sources/dags /sources/plugins
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins}
        exec /entrypoint airflow version
    networks:
      - pipeline_network

  minio:
    image: docker.io/bitnami/minio:latest
    ports:
      - '9000:9000'
      - '9001:9001'
    networks:
      pipeline_network:
        ipv4_address: 10.5.0.5
    volumes:
      - 'minio_data:/data'
    environment:
      - MINIO_ROOT_USER=${MINIO_ROOT_USER}
      - MINIO_ROOT_PASSWORD=${MINIO_ROOT_PASSWORD}
      - MINIO_DEFAULT_BUCKETS=${MINIO_DEFAULT_BUCKETS}
    env_file:
      - .env

  jupyter:
    image: data-pipeline-jupyter-demo
    ports:
      - '8888:8888'
    environment:
      - JUPYTER_TOKEN=easy
    volumes:
      - ./notebooks:/home/jovyan/work
      - ./seeds:/home/jovyan/seeds
    networks:
      - pipeline_network

networks:
  pipeline_network:
    driver: bridge
    ipam:
      config:
        - subnet: 10.5.0.0/16
          gateway: 10.5.0.1

volumes:
  minio_data:
    driver: local

Lots there. But don’t worry, we can break it down.

It specifies configurations for Airflow (orchestration), PostgreSQL (database for Airflow), MinIO (object storage), and Jupyter (interactive notebooks), interconnected within a custom network. Here’s a breakdown, followed by a few quick checks you can run against the running stack:

  • Airflow Base: Template for Airflow services, specifying volumes for DAGs, logs, plugins, pipelines, seeds, and the Docker socket so Docker operations can run from within Airflow containers. Putting this in a template (a YAML anchor) saves duplicating the same configuration in every Airflow service that uses it.
  • PostgreSQL: Airflow database service with health checks to ensure readiness before dependent services start.
  • Scheduler & Webserver: Airflow components using the base template, set to depend on PostgreSQL’s health and initial setup completion. Ports are exposed for web access and command execution.
  • Airflow-Init: Initializes Airflow, setting up necessary directories and permissions.
  • MinIO: Object storage for data handling, accessible via specified ports, with environment variables for access control and bucket setup. It sets up three buckets for this project: bronze, silver and gold.
  • Jupyter: Service for interactive Python notebooks, with port exposure for web access and volume mappings for notebooks and data seeds.
  • Networks & Volumes: Defines a custom bridge network for inter-service communication and a volume for MinIO data persistence. This allows all of the containers to interact with each other on their own local network.

Creating Different Docker-Compose Files

You rarely need to set up a docker-compose.yml file from scratch. If there is a particular application that you want to run, you can generally search for that application along with the term docker-compose and find a number of articles or steps on how to get that specific application up and running. A good place to start is the Docker Hub.
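When you do adapt a compose file you’ve found online, docker compose config is a handy sanity check; it validates the YAML and prints the fully resolved configuration with environment variables substituted:

# validate docker-compose.yml and print the resolved configuration,
# with ${...} values filled in from .env
docker compose config

# pull any pre-built images referenced in the file before the first start
docker compose pull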

Next Steps

That’s all for this post. The next post will be coming shortly, and is going to look at the Python code that Airflow uses to orchestrate the pipeline. It’s relatively simple code, but it makes sure that things run in the right order. Stay tuned, and please feel free to share your thoughts. Your feedback and questions are highly welcome. Follow me for updates on this series and more insights into the world of data engineering.


Andy Sawyer

Bringing software engineering best practices and a product driven mindset to the world of data. Find me at https://www.linkedin.com/in/andrewdsawyer/