Deploying Apache Airflow on Kubernetes for local development

Hugo Sykes
Go City Engineering
5 min read · May 18, 2022

Steps to run Apache Airflow 2.3.0 on a local Kubernetes cluster using kind and helm

Apache Airflow is an extremely flexible and scalable data orchestration tool for authoring, monitoring and scheduling DAGs (Directed Acyclic Graphs). It offers out-of-the-box operators for interacting with all kinds of analysis, storage and transformation services. Here at Go City, we’re in the midst of migrating our data pipelines over to Airflow.

Extra reading

To find more information about other modes of installation and some of the available operators, look no further than the official documentation.

This blog post is inspired by another one by Marc Lamberti, a great source of Airflow-related knowledge and wisdom. While his blog covers a lot of very helpful ground, it omits how to load DAGs that are stored locally, which I’ve attempted to rectify here.

Prerequisites

There are a few tools required for this particular install. I use brew on my Mac, but installing these tools should be easy enough on other operating systems.

You must have a Docker daemon running.

brew install helm kind kubectl yq

helm manages the Airflow deployment using a pre-made ‘chart’, which makes installing Airflow straightforward.

kind makes running a local Kubernetes cluster very simple.

kubectl is used to interact with a Kubernetes cluster.

yq manipulates yaml files and simplifies editing them from the command line.
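
Before going any further, it’s worth a quick sanity check that the tools are installed and that the Docker daemon is actually up (the exact version output will vary):

docker info > /dev/null && echo "Docker daemon is running"  # fails if the daemon is not up
helm version
kind version
kubectl version --client
yq --version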

Setup

1. Clone the GitHub repository

git clone git@github.com:gocityengineering/airflow-local-setup
cd airflow-local-setup

2. Set an environment variable pointing to your local DAGs

export LOCAL_DAGS_FOLDER=/path/to/your/local/dags
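
If you don’t have any DAGs to hand yet, you can drop a minimal placeholder DAG into that folder so there’s something to look at later. This one is purely illustrative (the filename and dag_id are arbitrary) and isn’t part of the repository:

cat > "$LOCAL_DAGS_FOLDER/hello_local_dag.py" <<'EOF'
# A minimal Airflow 2.x DAG, used only to verify the local mount works
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_local_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,  # trigger manually from the UI
    catchup=False,
) as dag:
    BashOperator(task_id="say_hello", bash_command="echo hello from a locally mounted DAG")
EOF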

From here you have two options: run ./setup.sh, which will do everything for you, or follow each step of the process below for a better understanding.

3. Set configuration for mounting your local DAGs folder to the Kubernetes cluster

The following command configures kind to mount your DAGs folder onto each of the Kubernetes nodes. This is crucial, as it connects your local filesystem directly to each Kubernetes pod that will be running a part of the Airflow cluster.

yq -i "
.nodes[1].extraMounts[1].hostPath = \"$LOCAL_DAGS_FOLDER\" |
.nodes[1].extraMounts[1].containerPath = \"/tmp/dags\" |
.nodes[2].extraMounts[1].hostPath = \"$LOCAL_DAGS_FOLDER\" |
.nodes[2].extraMounts[1].containerPath = \"/tmp/dags\" |
.nodes[3].extraMounts[1].hostPath = \"$LOCAL_DAGS_FOLDER\" |
.nodes[3].extraMounts[1].containerPath = \"/tmp/dags\"
" kind-cluster.yaml

If you look in the kind-cluster.yaml file, you should now see that yq has added an extraMount to each of your nodes.
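
You can also confirm the edit from the command line rather than opening the file (this uses the same yq v4 syntax as the command above):

yq '.nodes[].extraMounts' kind-cluster.yaml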

4. Create the kind cluster

We need a Kubernetes cluster on which to run Airflow and kind makes that very simple.

kind create cluster --name airflow-cluster --config kind-cluster.yaml

Once this command has completed, you can run the following to query the initial state of the cluster.

kubectl cluster-info
kubectl get nodes -o wide

5. Create airflow namespace

To adequately separate concerns in our Kubernetes cluster, it is common practice to create a namespace for Airflow, which we will do using the following command:

kubectl create namespace airflow

You can check this was successful by listing all namespaces:

kubectl get namespaces

6. Generate a webserver secret key

Documentation on what Fernet means and why it is used can be found here, but note that the secret created in this step is actually the webserver secret key (used to sign web sessions) rather than the Fernet key itself (which encrypts connection credentials). The Helm chart recommends providing a static webserver secret key; without one, a fresh key is generated on each deploy, which can invalidate sessions and cause components to restart unnecessarily. It is required for this setup and can be created using the following command.

kubectl -n airflow create secret generic my-webserver-secret --from-literal="webserver-secret-key=$(python3 -c 'import secrets; print(secrets.token_hex(16))')"

For clarity, let’s break this command down. Firstly, we’re interacting with our Kubernetes cluster using kubectl.

Then we add -n airflow to specify that we’re using the airflow namespace we created above.

create secret creates a Kubernetes secret. Who knew.

The generic part indicates that we want an Opaque-type secret, which means it contains unstructured secret data, as opposed to, e.g., a service-account-token-type secret, which has a specific structure to its data. Docs here.

my-webserver-secret is the name of the secret.

Lastly, the --from-literal=... section is the content of the secret: the key is webserver-secret-key and the value is the result of that small bit of Python code, which generates a random 32-character hex string.
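
To check the secret landed as expected, you can read the value back; Kubernetes stores it base64-encoded, so decode it to see the generated hex string:

kubectl -n airflow get secret my-webserver-secret -o jsonpath='{.data.webserver-secret-key}' | base64 --decode; echo  # older macOS may need base64 -D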

7. Create the PersistentVolume and PersistentVolumeClaim

To finish setting up the link between your local DAG files and the soon-to-be-running Airflow cluster, we need to set up the Kubernetes resources that allow such a connection.

kubectl apply -f dags_volume.yaml

This command applies the configuration in the file to the Kubernetes cluster. Kubernetes can be very complicated so I will delegate the explanation of these resources to the official documentation.
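
It’s worth confirming the volume resources exist before installing Airflow; the claim should match the existingClaim referenced later in values.yaml:

kubectl get pv                # PersistentVolumes are cluster-scoped
kubectl -n airflow get pvc    # expect a claim named airflow-dags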

8. Add and update the Airflow helm repository

We are using an official helm chart so as not to get bogged down in the details of installing all of Airflow’s dependencies.

helm repo add apache-airflow https://airflow.apache.org
helm repo update
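
If you’re curious which chart versions are available (the chart version is separate from the Airflow version it deploys), you can list them before installing:

helm search repo apache-airflow/airflow --versions | head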

9. Install Airflow

Here we use helm to install Airflow from the official chart using the configuration that we have in values.yaml:

helm install airflow apache-airflow/airflow --namespace airflow --debug -f values.yaml
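
The first install can take a few minutes while images are pulled and the metadata database is initialised. In another terminal, you can watch the pods come up; once the scheduler and webserver are Running, you’re good to continue:

kubectl -n airflow get pods --watch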

Important config in values.yaml

webserverSecretKeySecretName: my-webserver-secret

This ensures that Airflow uses the webserver secret key that we created above.

dags:
  persistence:
    # Enable persistent volume for storing dags
    enabled: true
    # Volume size for dags
    size: 1Gi
    # If using a custom storageClass, pass name here
    storageClassName:
    # access mode of the persistent volume
    accessMode: ReadWriteOnce
    ## the name of an existing PVC to use
    existingClaim: airflow-dags

Other than enabled: true, the key line here is existingClaim: airflow-dags. This tells Airflow where to pick up the local DAGs, as we’ve mounted them through that PersistentVolumeClaim (PVC).
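
Once Airflow is running, a quick way to confirm the claim is wired all the way through is to list the DAGs folder inside the scheduler pod. The deployment name, container name and path below assume the chart defaults and the release name airflow used earlier; adjust if yours differ:

kubectl -n airflow exec deploy/airflow-scheduler -c scheduler -- ls /opt/airflow/dags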

executor: "KubernetesExecutor"

Again, the documentation explains this better than I.
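
The short version: with the KubernetesExecutor there are no long-running workers; the scheduler launches a fresh pod for each task instance and removes it when the task finishes. One way to see this (besides watching pods, as above) is to follow the namespace’s events while a DAG run is in progress:

kubectl -n airflow get events --watch   # shows task pods being scheduled, started and completed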

Accessing the UI

The following command forwards port 8080 on the webserver pod (through its Kubernetes service, or svc) to port 8080 on your machine. Navigate to http://localhost:8080 to view the UI and see your DAGs.

kubectl port-forward svc/airflow-webserver 8080:8080 -n airflow
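
With the port-forward running, a quick curl of the webserver’s health endpoint confirms it’s reachable before you log in. Unless you’ve overridden webserver.defaultUser in values.yaml, the chart’s default credentials are admin / admin.

curl -s http://localhost:8080/health    # returns JSON with metadatabase and scheduler status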

Conclusion

If all this setup has worked as expected, you should be able to view your DAGs in your browser and see them update in real time as you edit them. In my experience this makes for an optimal development setup, as you can instantly test whether the code you’re writing behaves the way you intend.
