Deploying Airflow on GKE using Helm

Rodel van Rooijen
Apr 26, 2024

In this tutorial I will walk you through how to successfully deploy Apache Airflow (tested with 2.9.1) on Google Kubernetes Engine (GKE) using the community Airflow Helm chart. In my own search I couldn’t find a detailed guide on how to do this, so I’ve written one to hopefully help people in similar situations.

The outcome of this tutorial should be:

  • A GKE cluster that can automatically scale its number of nodes.
  • Airflow deployed on Kubernetes with multiple schedulers and Celery workers.
  • A persistent volume that contains your DAGs to which you can Git-sync or manually push DAG updates.
  • An external Postgres database that contains the Airflow metadata.
  • An external location (bucket) that contains the Airflow logs.

Note that logic similar to what is described in this tutorial should apply to deploying Airflow on Amazon EKS and Azure AKS. However, the detailed steps are specific to Google Cloud Platform (GCP).

Why Kubernetes?

When I started to build out a data platform from scratch at a start-up, it began with a single virtual machine (VM) hosting Airflow with a local executor set-up. While that is a great way to start, it’s not very scalable. Especially with many DAGs and tasks, memory and CPU can quickly become a bottleneck. Moreover, in this set-up there is no resource isolation, which means that, for example, the scheduler and the local tasks can take up more and more CPU and memory. In turn this can severely impact the VM’s performance, to the point of it potentially crashing.

A way to “fix” this is to just provision a larger VM, right? Not really. In many cases Airflow will have peak loads at certain specific times, perhaps midnight UTC, or the start of every hour, etc. This is difficult to fully control, and hence you will probably provision a larger VM than necessary.

Airflow on Kubernetes aims to provide an alternative to this. Running Airflow on a Kubernetes cluster gives you resource isolation, autoscaling options and increased stability out of the box. However, it also comes with increased complexity, which might make it more difficult to maintain. Nevertheless, I would say it’s still a much better solution than hosting Airflow on a single instance.

Where to start?

If you’re not familiar with Kubernetes, this is a great tutorial to get started. If you haven’t used Helm yet, this is a great guide on how it works.

Requirements

This tutorial requires you to have:

  • An existing GCP project.
  • Docker Desktop installed.

In addition your GCP account should have:

  • GKE Admin permissions (roles/container.admin)
  • The ability to create service accounts (roles/iam.serviceAccountAdmin)
  • The ability to create a Postgres Cloud SQL instance (roles/cloudsql.admin)
  • The ability to create repositories and store images in GCP Artifact Registry (roles/artifactregistry.admin)
  • The ability to manage Google Cloud Storage (GCS) buckets (roles/storage.admin)

Before continuing, make sure that you have the Google Cloud CLI installed so that you can use gcloud commands. In addition you will need kubectl to interact with your GKE cluster.
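
To verify the tooling is in place, you can for example run the commands below (kubectl can also be installed as a gcloud component):

# Check the Google Cloud CLI installation
gcloud --version

# Check that kubectl is available
kubectl version --client

# Optionally, install kubectl as a gcloud component
gcloud components install kubectl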

Setting things up

To select the relevant project you can run gcloud config set project <YOUR_PROJECT_ID>. From here on we will assume that your project ID is available in bash as $PROJECT_ID. Throughout this tutorial we will use europe-west1 as the primary GCP region.
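
For example (replace the placeholder with your own project ID; the export is just a convenience so later commands can reference $PROJECT_ID):

# Make the project ID available to later commands and set it as the active project
export PROJECT_ID=<YOUR_PROJECT_ID>
gcloud config set project $PROJECT_ID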

Service account creation

In this tutorial we will set up a service account that will be used by your Kubernetes cluster. In addition we will make use of Workload Identity Federation, which ensures that your cluster can use the roles and permissions assigned to the service account we create here.

To create the service account you can use:

gcloud iam service-accounts create airflow-kubernetes \
--description="User-managed service account for the Airflow deployment" \
--display-name="Airflow Kubernetes"

Then assign the following roles:

# add-iam-policy-binding accepts a single --role per call, so loop over the roles
for role in \
roles/container.admin \
roles/iam.serviceAccountUser \
roles/iam.workloadIdentityUser \
roles/storage.admin
do
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:airflow-kubernetes@$PROJECT_ID.iam.gserviceaccount.com" \
--role="$role"
done

If the commands above do not work, you can also create the service account and assign the roles in the Google Cloud Console.

Cluster creation

The cluster we will use is contained in a single zone and starts with a single node. We will also use a Standard cluster instead of an Autopilot cluster to have more control over the deployment; in addition, small Kubernetes clusters are generally cheaper to run on Standard than on Autopilot.

We will create our cluster using the following bash commands:

gcloud container clusters create airflow-cluster \
--zone "europe-west1-b" \
--project $PROJECT_ID \
--machine-type n2-standard-4 \
--num-nodes 1 \
--scopes "cloud-platform" \
--autoscaling-profile "optimize-utilization" \
--enable-autoscaling --min-nodes=1 --max-nodes=3 \
--workload-pool $PROJECT_ID.svc.id.goog

Feel free to adjust the max-nodes argument to the maximum number of nodes the cluster should be able to auto-scale to. Notice that the machine-type is set to n2-standard-4; this is a small machine, but it should comfortably handle a small Airflow deployment on its own. In addition the scope is set to "cloud-platform", meaning that the Kubernetes cluster will have access to all enabled GCP APIs, and the autoscaling profile is set to "optimize-utilization", which is a more aggressive down-scaling policy.

Namespace

After the cluster is successfully created we will create a namespace that will contain the Airflow deployment. But first we need to fetch the cluster credentials so that kubectl can connect to the cluster:

gcloud container clusters get-credentials airflow-cluster \
--zone "europe-west1-b" \
--project $PROJECT_ID

From here you can create a namespace called “airflow”:

kubectl create ns "airflow"

After you have done this, execute the following command to set up Workload Identity properly:

gcloud iam service-accounts add-iam-policy-binding airflow-kubernetes@$PROJECT_ID.iam.gserviceaccount.com \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:$PROJECT_ID.svc.id.goog[airflow/airflow]"

Load balancer

In this tutorial we will also make the webserver API available through an internal load balancer IP. This enables you to call the Airflow API from within your VPC without exposing it externally. In addition we will show how to connect to the Airflow UI locally using port-forwarding. This tutorial will not cover exposing the Airflow webserver through a public IP or URL.

To reserve a static internal IP for the Airflow webserver you can run:

gcloud compute addresses create airflow-api-ip \
--region europe-west1 \
--subnet default

gcloud compute addresses describe airflow-api-ip \
--region europe-west1

In case you use a VPC subnet other than default, you will need to replace it. Make sure to save the internal IP address that is created; in this tutorial we assume it is INTERNAL_IP = 10.132.0.2.
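
If you prefer not to copy the IP by hand, you can capture it into a variable; a small sketch (the INTERNAL_IP variable name is just this tutorial’s convention):

# Store the reserved internal address in a variable for later use
export INTERNAL_IP=$(gcloud compute addresses describe airflow-api-ip \
--region europe-west1 \
--format="value(address)")

echo $INTERNAL_IP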

Image repository

Using a Dockerfile we will build a custom Airflow image to which you can add your DAGs, plugins and other requirements. To be able to save these images we will use GCP Artifact Registry.

gcloud artifacts repositories create airflow-custom-images \
--repository-format=docker \
--location=europe-west1 \
--description="Airflow docker image repository" \
--project=$PROJECT_ID

Cloud SQL

It is recommended to use an external database (like Cloud SQL Postgres) to host the Airflow metadata. To create a non-publicly exposed instance you can run:

gcloud sql instances create postgres-airflow \
--database-version=POSTGRES_16 \
--cpu=1 \
--memory=4GB \
--region=europe-west1 \
--no-assign-ip \
--enable-google-private-path \
--root-password=DB_ROOT_PASSWORD

Make sure you replace DB_ROOT_PASSWORD with a secure password that you save. In addition, make sure you save the internal IP address of the instance; in our case DB_INTERNAL_IP = 10.132.0.1.
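
You can look up the private IP with a describe command, and since the custom-values.yaml later in this tutorial expects a database called airflow, it is convenient to create it now; a sketch (the database name matches the externalDatabase section below, adjust if you use a different one):

# Look up the private IP of the instance
gcloud sql instances describe postgres-airflow \
--format="value(ipAddresses[0].ipAddress)"

# Create the database that will hold the Airflow metadata tables
gcloud sql databases create airflow --instance=postgres-airflow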

Log bucket

The last set-up step is to create a GCS bucket that can host your Airflow logs. Make sure you save the name of the bucket.

gcloud storage buckets create gs://<YOUR BUCKET NAME> \
--location="europe-west1"

Building a custom Airflow image

To take full control over what is deployed on the cluster, we build a custom Airflow image.

An example of what the Dockerfile could look like is given below:

# Use the official Airflow image as a parent image
FROM apache/airflow:2.9.1-python3.11

USER root

# Copy requirements to the working directory
COPY <PATH TO REQS>/requirements/requirements.txt /var/airflow/requirements.txt

# Set the working directory in the container
WORKDIR /var/airflow

# Create the plugins directory
RUN mkdir -p /var/airflow/plugins

# Make the plugins directory writable
RUN chmod -R 777 /var/airflow/plugins

# Copy plugins into the image
COPY <PATH TO PLUGINS>/plugins/. /var/airflow/plugins/

USER airflow

# Install the necessary dependencies, using the constraints file that matches the Airflow version
RUN pip install \
--no-cache-dir \
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.9.1/constraints-3.11.txt" \
"apache-airflow==2.9.1" -r /var/airflow/requirements.txt

Make sure you replace the paths to your requirements file and plugins. In our case a requirements file of just one line is used to add the necessary provider packages; you can add any other packages that Airflow should be able to use. Note that the base Airflow image already comes with a lot of packages pre-installed.

apache-airflow[async,google,slack,http,postgres,pagerduty,airbyte]

To then build and push this image you can use

docker build . \
-f lib_deployment/airflow_deployment/Dockerfile \
-t europe-west1-docker.pkg.dev/$PROJECT_ID/airflow-custom-images/airflow-custom:$VERSION \
--platform=linux/amd64

docker push europe-west1-docker.pkg.dev/$PROJECT_ID/airflow-custom-images/airflow-custom:$VERSION

Make sure you set the platform correctly for your machine’s specifications and replace $VERSION with your selected version, e.g. 0.0.1.
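
If Docker is not yet authenticated against Artifact Registry in this region, the push will fail; you can register the credential helper once (a standard gcloud command, shown here for the europe-west1 registry used in this tutorial):

# Configure Docker to authenticate to Artifact Registry via gcloud
gcloud auth configure-docker europe-west1-docker.pkg.dev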

Deploying using Helm

To deploy Airflow on our newly created Kubernetes cluster we use the community-maintained Airflow Helm chart (airflow-helm/charts). This requires Helm to be installed; please refer to the Helm docs for installation instructions.

To retrieve the latest Airflow helm charts you can execute the following helm commands:

## Add this helm repository
helm repo add airflow-stable https://airflow-helm.github.io/charts

## Update your helm repo cache
helm repo update

Persistent volume

There are different ways to host your Airflow DAGs; in this tutorial we will use a persistent volume. Using this persistent volume you can either Git-sync or manually push DAG updates to it.

In my experience Git-sync makes sense when you always want to sync a certain branch of a repository to Airflow, e.g. in a development environment. In production this is less desirable, making a “manual” sync preferable.

First create a persistent disk using:

gcloud compute disks create airflow-dags-disk --size=10GB --zone="europe-west1-b"

You can apply the following Kubernetes configuration, persistent-volumes.yaml, to your cluster to set up the volumes:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: airflow-dags-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  storageClassName: "standard"
  gcePersistentDisk:
    pdName: airflow-dags-disk
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-dags-pvc
  namespace: airflow
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: "standard"

To apply this you can use

kubectl apply -f ./persistent-volumes.yaml -n "airflow"
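
Before moving on, it is worth checking that the claim actually bound to the pre-created volume rather than a dynamically provisioned one; for example:

# The PV should report STATUS "Bound" to the airflow/airflow-dags-pvc claim
kubectl get pv airflow-dags-pv

# The PVC should report STATUS "Bound" as well
kubectl get pvc airflow-dags-pvc -n airflow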

Helm chart custom values

As a baseline values.yaml, the example GKE custom-values.yaml from the chart repository is used, with a few adjustments.

The airflow.config section is changed to use remote logging via the default GCP connection. In addition, the core plugins folder is set to the plugins folder baked into the custom Airflow image. Make sure to replace the log bucket with the bucket you created previously.

config:
  AIRFLOW__WEBSERVER__EXPOSE_CONFIG: "False"
  AIRFLOW__CORE__LOAD_EXAMPLES: "False"

  ## remote log storage
  AIRFLOW__LOGGING__REMOTE_LOGGING: "True"
  AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: gs://<LOG BUCKET>/airflow/logs
  AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: "my_gcp"

  ## plugins
  AIRFLOW__CORE__PLUGINS_FOLDER: "/var/airflow/plugins"
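
Note that the my_gcp connection referenced above has to exist in Airflow. One way to create it, once Airflow is running, is through the Airflow CLI in a scheduler pod; a sketch, assuming the scheduler pods carry a component=scheduler label (check with kubectl get pods -n airflow --show-labels and adjust if needed). With Workload Identity no key file is required, as the Google provider falls back to Application Default Credentials:

# Find a scheduler pod (the label selector is an assumption, verify it on your cluster)
SCHEDULER_POD=$(kubectl get pods -n airflow -l component=scheduler \
-o jsonpath='{.items[0].metadata.name}')

# Create an empty google_cloud_platform connection that relies on ADC
kubectl exec -n airflow $SCHEDULER_POD -- \
airflow connections add my_gcp --conn-type google_cloud_platform

Alternatively, you can create the connection through the Airflow UI under Admin → Connections.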

The airflow.image section is changed to use the Artifact Registry repository you’ve created:

image:
  repository: europe-west1-docker.pkg.dev/<PROJECT ID>/airflow-custom-images/airflow-custom
  tag: <YOUR TAG>
  pullPolicy: IfNotPresent

The web.service section is changed to use a pre-defined load balancer.

service:
  type: LoadBalancer
  loadBalancerIP: <INTERNAL IP>
  externalPort: 443
  loadBalancerSourceRanges: []
  annotations:
    cloud.google.com/load-balancer-type: "Internal"

The dags section is changed to the correct folder and persistent volume.

dags:
  ## the airflow dags folder
  path: /var/airflow/dags

  ## configs for the dags PVC
  persistence:
    enabled: true
    existingClaim: airflow-dags-pvc
    accessMode: ReadOnlyMany
    mountPath: /var/airflow/dags

The serviceAccount section is changed to use Workload Identity of the created service account.

serviceAccount:
  ## if a Kubernetes ServiceAccount is created
  create: true

  ## the name of the ServiceAccount
  name: "airflow"

  ## annotations for the ServiceAccount
  annotations:
    iam.gke.io/gcp-service-account: "airflow-kubernetes@<PROJECT_ID>.iam.gserviceaccount.com"

The external database is set to use the Cloud SQL Postgres instance:

externalDatabase:
  type: postgres

  ## the address of the external database
  host: 10.132.0.1
  port: 5432

  ## the database which will contain the airflow tables
  database: airflow

  ## the name of a pre-created secret containing the external database user
  userSecret: "airflow-cluster-postgres-credentials"
  userSecretKey: "username"

  ## the name of a pre-created secret containing the external database password
  passwordSecret: "airflow-cluster-postgres-credentials"
  passwordSecretKey: "password"

To set the number of schedulers you should adjust scheduler.replicas to the desired number (we picked 2), and likewise for the workers using workers.replicas.

Secrets

As you might have seen in previous steps, secrets are used to store sensitive information such as usernames and passwords. There are different ways to deal with this, but here we choose to create Kubernetes configuration files that can be applied directly to the cluster.

You will need to apply at least the following one:

Postgres credentials (postgres-secrets.yaml):

apiVersion: v1
kind: Secret
metadata:
  name: airflow-cluster-postgres-credentials
stringData:
  username: "postgres"
  password: "<YOUR PASSWORD>"

And apply this through

kubectl apply -f ./postgres-secrets.yaml -n "airflow"

Note that you can also create a dedicated airflow user in Postgres if you do not want to use the root postgres user. You can store other secrets in the same manner and use them in your custom-values.yaml.
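
If you go that route, the user can be created directly through gcloud; a minimal sketch (the username airflow is an example, and the secret contents above would then need to be updated accordingly):

# Create a dedicated database user for Airflow on the Cloud SQL instance
gcloud sql users create airflow \
--instance=postgres-airflow \
--password=<A SECURE PASSWORD>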

Deploy

The last step is to deploy Airflow using the custom-values.yaml you have created. To do this you can now execute:

helm install \
deploy-airflow \
airflow-stable/airflow \
--namespace airflow \
--version "8.9.0" \
--values ./custom-values.yaml

This should successfully deploy Airflow to your GKE cluster. It can take up to 10 minutes for the deployment to complete, as the database is set up first, after which all services are started.
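
You can follow the rollout with the commands below (deploy-airflow is the release name used in the helm install above):

# Watch the pods come up; the webserver, scheduler and workers should eventually report Running
kubectl get pods -n airflow --watch

# Check the overall release status
helm status deploy-airflow -n airflow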

Access Airflow UI

To access the Airflow UI you can use:

kubectl port-forward svc/airflow-web 8080:443 --namespace airflow > /dev/null 2>&1 &

This will make the Airflow UI locally accessible at localhost:8080.
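
As a quick sanity check you can query the webserver’s unauthenticated health endpoint through the port-forward (this assumes the webserver serves plain HTTP on its container port, which is what the 443 service port maps to in the values above):

# The response should report the metadatabase and scheduler as healthy
curl http://localhost:8080/health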

Sync DAGs to the persistent volume

You can sync DAGs to the persistent volume by creating a pod that removes the contents of the DAGs folder and copies in the latest contents. Note that in a production environment this might cause DAGs to be “gone” for a short amount of time, so copying only the changes is a better alternative. Nevertheless, this is how you can sync the DAGs:

export ROOT_FOLDER_DAGS=your/path/to/dags

kubectl run airflow-dags-sync \
--namespace airflow \
--image=busybox \
--restart=Never \
--overrides='{"apiVersion": "v1", "spec": {"containers": [{"name": "airflow-dags-sync", "image": "busybox", "command": ["tail"], "args": ["-f", "/dev/null"], "volumeMounts": [{"mountPath": "/var/airflow/dags", "name": "airflow-dags-pv"}]}], "volumes": [{"name": "airflow-dags-pv", "persistentVolumeClaim": {"claimName": "airflow-dags-pvc"}}]}}'
while [[ $(kubectl get pod -n airflow -l run=airflow-dags-sync -o 'jsonpath={..status.conditions[?(@.type=="Ready")].status}') != "True" ]]; do
echo "Waiting for the pod to be Ready..."
sleep 2
done

POD_NAME=$(kubectl get pod -n airflow -l run=airflow-dags-sync -o jsonpath="{.items[0].metadata.name}")

kubectl exec -n airflow $POD_NAME -- sh -c 'rm -rf /var/airflow/dags/*' && kubectl cp ./$ROOT_FOLDER_DAGS/dags/. airflow/$POD_NAME:/var/airflow/dags/

kubectl delete pod airflow-dags-sync -n airflow

This will sync the DAGs from your ROOT_FOLDER_DAGS directly to the persistent volume of your GKE Airflow deployment. It might take a little while for the DAGs to show up. When using GitHub/GitLab you can, for example, run this from a repository using actions/pipelines.

Debug problems

If the deployment fails, it usually happens when trying to run the database migrations.

It’s usually the case that your database cannot be reached. I recommend checking that your secrets are defined on the cluster. In most cases you can look at the logs of the check-db container to see what the error is.
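
A typical debugging sequence looks like this (pod names depend on your release, so list them first):

# List all pods and spot the failing one
kubectl get pods -n airflow

# Inspect the events of the failing pod
kubectl describe pod <FAILING POD NAME> -n airflow

# Look at the logs of the check-db container
kubectl logs <FAILING POD NAME> -n airflow -c check-db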

In any case Google Cloud Logging is your best friend to show you where the error is coming from.

Final remarks

I hope you found this detailed tutorial useful. In my case I could definitely have used a starting point for deploying Airflow on Kubernetes, which is what this tutorial tries to provide. If you have any remarks, feedback or questions, please let me know.

In the short term I will update this post such that all of the above can be found in a GitHub repo. Stay tuned!

EDIT:
The GitHub repo is now live: https://github.com/rodelvr/helm-gke-deployments.

The post has also been updated and tested with Airflow 2.9.1 and Helm chart version 8.9.0.
