Getting started with Airflow: Deploying your first pipeline on Kubernetes

A tutorial to get you started locally in 15 mins

Rupert Arup
7 min read · Feb 4, 2024

My name is Rupert, better known as “Ru”, and I’m a Data Engineer based in London. My most recent position was Data Team Lead at Lendinvest plc, where I was responsible for building the data platform from the ground up. I saw the business go from a start-up in 2016 to IPO in 2021, and we hit all the typical pitfalls of working with data along the way. Prior to that, I spent two years working at Facebook. I love everything to do with data, and particularly like getting my hands dirty POCing tools before going to production.

As I enter a new Airflow-centric role I’ve been brushing up my knowledge of deploying DAGs to a Kubernetes deployment of Airflow. The following article covers how I achieved it using Minikube, a local version of Kubernetes that runs on Docker.

What I will cover:

  • What is Airflow?
  • Installing Minikube
  • Deploying Airflow to Minikube
  • How to load DAGs/code to Airflow
  • How to add dependencies to our Airflow image

What is Airflow?

Airflow is one of the leading open source data orchestration tools — originally created by Airbnb in 2014 to manage its increasingly complex data pipelines — that every data practitioner should know about. A data orchestration tool provides a framework for writing data pipelines neatly and scheduling tasks to fulfil jobs.

When testing out new tools, it is always useful to spin them up locally to minimise cost. Whilst it’s impossible to run a production-grade, multi-machine Kubernetes cluster locally, there are ways to mimic one on a single machine using Minikube or Kind, both of which can spin up multiple nodes using Docker. This article demonstrates how to get an Airflow environment up and running on Kubernetes within 15 minutes using Minikube. This is not intended for production use, but it should provide an environment in which to test an Airflow data pipeline.

I will demonstrate using Minikube, GitSync and a custom Docker image if you need to extend the official apache/airflow image.

NB. This is a Mac-focused tutorial that assumes an understanding of Kubernetes.

Installing Minikube

Pre-requisites:

  • Docker desktop
  • Brew package manager
  • Understanding of Kubernetes container orchestration

Brew install minikube, helm and kubectl:

brew install minikube kubectl helm
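Before moving on, it can be worth confirming that all three CLIs actually landed on your PATH. A small sketch (the summary messages are my own):

```shell
# Record any of the three tools that did not install; prints a summary line.
missing=""
for tool in minikube kubectl helm; do
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -z "$missing" ]; then
  echo "all tools installed"
else
  echo "still missing:$missing"
fi
```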

Check the Minikube dashboard works:

minikube start
minikube dashboard

Test accessing the dashboard that is linked in the terminal output:

“Opening http://127.0.0.1:61095/api/v1/namespaces/kubernetes-dashboard/services/http:kubernetes-dashboard:/proxy/ in your default browser…”

In a new terminal create the deployment and then expose this as a service:

kubectl create deployment hello-minikube --image=kicbase/echo-server:1.0
kubectl expose deployment hello-minikube --type=NodePort --port=8080

Check that the test services are live by running:

kubectl get services hello-minikube

This should return the service’s details in your CLI.

Clean up and bring down tests:

kubectl delete service hello-minikube
kubectl delete deployment hello-minikube

Deploying Airflow to Minikube

Now that we are happy Minikube is up and running on Docker, and we can access the Kubernetes dashboard, we will install the latest image of Airflow. To do this quickly we will use the helm chart that Airflow maintains; more information can be found in the chart’s documentation.

Run the following commands:

helm repo add apache-airflow https://airflow.apache.org
helm upgrade --install airflow apache-airflow/airflow --namespace airflow --create-namespace

This should take a few minutes. Heading into your dashboard and switching to the airflow namespace, you should see the newly deployed services.

Or simply run the following commands to check the pods and application are up and running in our new namespace airflow:

> kubectl get pods -n airflow

NAME                                 READY   STATUS    RESTARTS      AGE
airflow-postgresql-0                 1/1     Running   0             10h
airflow-redis-0                      1/1     Running   0             10h
airflow-scheduler-777f7947dc-8vnn4   2/2     Running   0             10h
airflow-statsd-786d447967-gz98g      1/1     Running   0             10h
airflow-triggerer-0                  2/2     Running   0             10h
airflow-webserver-87b9b8f6f-9knl7    1/1     Running   4 (29m ago)   10h
airflow-worker-0                     2/2     Running   0             10h

> helm ls -n airflow

NAME      NAMESPACE   REVISION   UPDATED                                STATUS     CHART            APP VERSION
airflow   airflow     1          2024-02-01 23:19:26.069568 +0000 UTC   deployed   airflow-1.11.0   2.7.1
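The READY column can also be checked mechanically rather than by eye. A sketch that flags any pod whose ready count does not match its container count; it is demonstrated here on a sample of the listing above, and against a live cluster you would pipe in kubectl get pods -n airflow --no-headers instead of the printf:

```shell
# Flag pods whose READY column (e.g. "1/2") shows fewer ready containers than total.
ready_check=$(printf '%s\n' \
  'airflow-postgresql-0 1/1 Running 0 10h' \
  'airflow-scheduler-777f7947dc-8vnn4 2/2 Running 0 10h' \
  'airflow-worker-0 2/2 Running 0 10h' |
  awk '{ split($2, r, "/"); if (r[1] != r[2]) { print "not ready: " $1; bad = 1 } }
       END { if (!bad) print "all pods ready" }')
echo "$ready_check"
```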

To access the Airflow UI in our browser we now need to port forward the airflow-webserver. We will use port 8080 (ensure you have nothing else running on this port before doing so).
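One way to confirm the port is free first, assuming lsof is available (it ships with macOS):

```shell
# lsof exits non-zero when nothing is listening on the port.
if lsof -nP -i :8080 >/dev/null 2>&1; then
  port_status="in use"
else
  port_status="free"
fi
echo "port 8080 is $port_status"
```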

kubectl port-forward svc/airflow-webserver 8080:8080 --namespace airflow

Now head into your browser to localhost:8080 and you should arrive at the Airflow UI:

Airflow UI

How to load your code/DAGs to Airflow?

Now that we have Airflow running locally on Kubernetes, it is time to add your user code (a collection of Python-defined data pipelines). We are going to use a common approach known as GitSync, which lets us define a Git repository (I am using GitHub) that will be synced periodically with our application.

Steps:

1. Add a test python DAG to your git repo (set this as public for ease for now). Feel free to copy my mock DAG: https://github.com/rarup1/airflow-demo-dags.

2. Make changes to the default values.yaml file that is loaded in our application:

a) Export values to a local file

mkdir airflow-config
cd airflow-config
helm show values apache-airflow/airflow > values.yaml
open values.yaml

b) Update the default gitSync settings (under the dags: section) in values.yaml to enable it:

  gitSync:
    enabled: true
    repo: https://github.com/rarup1/airflow-demo-dags.git
    branch: main
    rev: HEAD
    depth: 1
    # interval between git sync attempts in seconds
    # high values are more likely to cause DAGs to become out of sync between different components
    # low values cause more traffic to the remote git repository
    wait: 5
    containerName: git-sync
    uid: 65533
    securityContext: {}
    securityContexts:
      container: {}
    containerLifecycleHooks: {}
    extraVolumeMounts: []
    env: []
    resources: {}
    #  limits:
    #    cpu: 100m
    #    memory: 128Mi
    #  requests:
    #    cpu: 100m
    #    memory: 128Mi

Replace the repo value with wherever your DAGs have been committed (see my GitHub repo for the example code). If this is a private repository you will need to add an SSH key, which we will not cover in this demo.

c) Now upgrade our local deployment using helm command:

helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml --debug

This should enable gitSync, and after reloading the Airflow UI you should see the test DAG loaded. If the UI is unreachable, that is expected: the port-forward drops during the upgrade, so return to your CLI and rerun kubectl port-forward svc/airflow-webserver 8080:8080 --namespace airflow.
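A side note on values files: helm layers any file passed with -f over the chart’s defaults, so instead of exporting and editing the full values.yaml you can keep a minimal override file containing only the keys you change. A sketch (the filename is my own choice):

```yaml
# values-override.yaml: only the keys we change; helm merges this over the chart defaults
dags:
  gitSync:
    enabled: true
    repo: https://github.com/rarup1/airflow-demo-dags.git
    branch: main
    rev: HEAD
    depth: 1
    wait: 5
```

It is applied with the same command, swapping in the new file: helm upgrade --install airflow apache-airflow/airflow -n airflow -f values-override.yaml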

If you have reached this stage you can now kick off a DAG in the UI using the play button.

How to add dependencies to your Airflow image

Since we are creating a simple mock pipeline above, we haven’t added any providers or Python packages to our Docker image. You can check the pre-installed providers and Python packages by running:

kubectl exec -n airflow airflow-scheduler-<scheduler_pod_id> -- airflow providers list
kubectl exec -n airflow -it airflow-scheduler-<scheduler_pod_id> -- pip freeze

NB. You can fetch <scheduler_pod_id> using the command kubectl get pods -n airflow
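If you would rather not copy the pod name by hand, the listing can be filtered with awk. A small sketch, shown here against a sample of the earlier output; on a live cluster, replace the printf with kubectl get pods -n airflow --no-headers:

```shell
# Pull the first pod whose name starts with "airflow-scheduler".
sample_pods() {
  printf '%s\n' \
    'airflow-postgresql-0 1/1 Running 0 10h' \
    'airflow-scheduler-777f7947dc-8vnn4 2/2 Running 0 10h' \
    'airflow-webserver-87b9b8f6f-9knl7 1/1 Running 4 10h'
}
scheduler_pod=$(sample_pods | awk '$1 ~ /^airflow-scheduler/ { print $1; exit }')
echo "$scheduler_pod"
```

The same filter works for the webserver or worker pods by changing the prefix.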

If you do want to add dependencies, it is best to create a Dockerfile that installs them from a requirements.txt file (as you would for any dockerized Python application) and then push the resulting image to a registry accessible to the cluster. Baking dependencies into the image means each pod spun up to run a task starts with them pre-installed, rather than installing packages at runtime. We will build the Docker image from the Dockerfile and then update our values.yaml file.

Follow these steps:

1. Add a requirements.txt file with the Airflow provider for dbt (as an example):

apache-airflow-providers-dbt-cloud==3.6.0

2. Add a Dockerfile:

FROM apache/airflow:2.7.1

COPY requirements.txt .

RUN pip install -r requirements.txt

3. Build the Docker image and load it into Minikube:

docker build -t my-airflow:1.0.0 .
minikube image load my-airflow:1.0.0

4. Update values.yaml

In order for our Minikube Airflow deployment to pick up the new image (including our dbt provider), we need to amend values.yaml:

# Default airflow repository -- overridden by all the specific images below
defaultAirflowRepository: my-airflow

# Default airflow tag to deploy
defaultAirflowTag: "1.0.0"

NB. By replacing apache/airflow with my-airflow we ensure our local instance uses the Docker image we just built. To deploy this to production we would point it at a container registry such as AWS Elastic Container Registry or Docker Hub instead of a locally loaded image.

And then we can run helm upgrade again:

# upgrade
helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml --debug
...

# check apache-airflow-providers-dbt-cloud is present in the providers
kubectl exec airflow-webserver-<pod_id> -n airflow -- airflow providers list

We should be able to see that the provider is now present in our deployment. Success!
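One optional refinement to the Dockerfile above: Airflow publishes per-version constraint files, and installing against one reduces the chance of pip resolving provider versions that conflict with the Airflow core. A sketch; the Python version in the URL (3.8 here) is an assumption and should match the Python baked into your base image:

```dockerfile
FROM apache/airflow:2.7.1

COPY requirements.txt .

# Pin dependency resolution to the official constraints for Airflow 2.7.1.
# NB: the Python version segment of the URL is an assumption; check your base image.
RUN pip install -r requirements.txt \
    --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.1/constraints-3.8.txt"
```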

The final state of the code can be found in my github repo here: https://github.com/rarup1/airflow-demo-dags

That is it! You now have a custom deployment of Airflow based on the official helm chart, with GitSync working. This will allow you to test Airflow locally using the same configuration you can expect if your company runs Airflow on Kubernetes in production.

I hope this helps you prepare for your Airflow proof-of-concept and builds your knowledge of running Airflow on Kubernetes. Thank you for reading!
