Deploying Apache Airflow on Kubernetes for local development
Steps to run Apache Airflow 2.3.0 on a local Kubernetes cluster using kind and helm
Apache Airflow is an extremely flexible and scalable data orchestration tool for authoring, monitoring and scheduling DAGs (Directed Acyclic Graphs). It offers out-of-the-box operators for interacting with all kinds of analysis, storage and transformation services. Here at Go City, we’re in the midst of migrating our data pipelines over to Airflow.
Extra reading
To find more information about other modes of installation and some of the available operators, look no further than the official documentation.
This blog post is inspired by another one by Marc Lamberti, a great source of Airflow-related knowledge and wisdom. While his blog covers lots of very helpful ground, it omitted how to load DAGs that are stored locally, which I’ve attempted to rectify here.
Prerequisites
There are a few tools required for this particular install. I use brew on my Mac, but installing these tools should be easy enough on other operating systems.
You must have a Docker daemon running.
brew install helm kind kubectl yq
helm manages the Airflow deployment using a pre-made ‘chart’ and makes Airflow deployment easy.
kind makes running a local Kubernetes cluster very simple.
kubectl is used to interact with a Kubernetes cluster.
yq manipulates yaml files and simplifies editing them from the command line.
Setup
1. Clone the GitHub repository
git clone git@github.com:gocityengineering/airflow-local-setup
cd airflow-local-setup
2. Set an environment variable pointing to your local DAGs
export LOCAL_DAGS_FOLDER=/path/to/your/local/dags
From here you have two options: run ./setup.sh, which will do everything for you, or follow each step of the process for better understanding.
3. Set configuration for mounting your local DAGs folder to the Kubernetes cluster
The following command sets up mounting your DAGs folder to each of the Kubernetes nodes. This is crucial for connecting your local filesystem directly to each Kubernetes pod that will be running a part of the Airflow cluster.
yq -i "
.nodes[1].extraMounts[1].hostPath = \"$LOCAL_DAGS_FOLDER\" |
.nodes[1].extraMounts[1].containerPath = \"/tmp/dags\" |
.nodes[2].extraMounts[1].hostPath = \"$LOCAL_DAGS_FOLDER\" |
.nodes[2].extraMounts[1].containerPath = \"/tmp/dags\" |
.nodes[3].extraMounts[1].hostPath = \"$LOCAL_DAGS_FOLDER\" |
.nodes[3].extraMounts[1].containerPath = \"/tmp/dags\"
" kind-cluster.yaml
If you look in the kind-cluster.yaml file, you should now see that yq has added an extraMount to each of your nodes.
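For reference, each worker node entry in kind-cluster.yaml should now look something like the sketch below. The exact roles and the first mount will depend on the repository’s original config; the second mount is the one yq just added:

```yaml
# Illustrative worker node entry in kind-cluster.yaml after running yq
- role: worker
  extraMounts:
    - hostPath: ...                       # pre-existing mount from the repo's config
      containerPath: ...
    - hostPath: /path/to/your/local/dags  # your $LOCAL_DAGS_FOLDER
      containerPath: /tmp/dags            # where the nodes will see your DAGs
```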
4. Create the kind cluster
We need a Kubernetes cluster on which to run Airflow and kind makes that very simple.
kind create cluster --name airflow-cluster --config kind-cluster.yaml
Once this command has completed, you can run the following to query the initial state of the cluster.
kubectl cluster-info
kubectl get nodes -o wide
5. Create airflow namespace
To adequately separate concerns in our Kubernetes cluster, it is common practice to create a namespace for Airflow which we will do using the following command:
kubectl create namespace airflow
You can check this was successful by listing all namespaces:
kubectl get namespaces
6. Generate a webserver secret key
This secret is often confused with Airflow’s Fernet key, but it is actually the webserver secret key, which the webserver uses to sign session cookies; documentation on why it is needed can be found here. It is possible to set up Airflow without providing a static one, but each component then generates its own key on startup, which leads to instability, and the documentation recommends generating one yourself. It is required for this setup and can be created with the following command.
kubectl -n airflow create secret generic my-webserver-secret --from-literal="webserver-secret-key=$(python3 -c 'import secrets; print(secrets.token_hex(16))')"
For clarity, let’s break this command down. Firstly, we’re interacting with our Kubernetes cluster using kubectl, then adding -n airflow to specify that we’re using the airflow namespace we created above. create secret creates a Kubernetes secret. Who knew. The generic part indicates that we want it to be an Opaque type secret, which means that it contains unstructured secret data, as opposed to, e.g., a service-account-token type secret, which has a specific structure to its data. Docs here. my-webserver-secret is the name of the secret. Lastly, the --from-literal=... section is the content of the secret: the key being webserver-secret-key and the value being the result of that small bit of Python code.
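As a quick sanity check, that Python one-liner has no Airflow dependency at all — it simply produces 16 random bytes rendered as a 32-character hex string:

```python
import secrets

# token_hex(16) returns 16 random bytes encoded as hexadecimal,
# giving a 32-character string suitable for use as the secret value
key = secrets.token_hex(16)
print(key)       # a fresh random value on every run
print(len(key))  # 32
```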
7. Create the PersistentVolume and PersistentVolumeClaim
To finish setting up the link between your local DAG files and the soon-to-be-running Airflow cluster, we need to set up the Kubernetes resources that allow such a connection.
kubectl apply -f dags_volume.yaml
This command applies the configuration in the file to the Kubernetes cluster. Kubernetes can be very complicated so I will delegate the explanation of these resources to the official documentation.
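In short, the file defines a hostPath-backed PersistentVolume pointing at /tmp/dags (the path we mounted onto each node) and a PersistentVolumeClaim named airflow-dags that binds to it. The repository’s actual file may differ in its details; a minimal sketch looks roughly like this:

```yaml
# Illustrative sketch of dags_volume.yaml — the repo's file is authoritative
apiVersion: v1
kind: PersistentVolume
metadata:
  name: airflow-dags
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /tmp/dags   # the containerPath from kind-cluster.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-dags  # referenced by existingClaim in values.yaml
  namespace: airflow
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```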
8. Add and update the Airflow helm repository
We are using an official helm chart so as not to get bogged down in the details of installing all of Airflow’s dependencies.
helm repo add apache-airflow https://airflow.apache.org
helm repo update
9. Install Airflow
Here we use helm to install Airflow from the official chart using the configuration that we have in values.yaml:
helm install airflow apache-airflow/airflow --namespace airflow --debug -f values.yaml
Important config in values.yaml
webserverSecretKeySecretName: my-webserver-secret
This ensures that Airflow uses the webserver secret key that we created above.
dags:
persistence:
# Enable persistent volume for storing dags
enabled: true
# Volume size for dags
size: 1Gi
# If using a custom storageClass, pass name here
storageClassName:
# access mode of the persistent volume
accessMode: ReadWriteOnce
## the name of an existing PVC to use
existingClaim: airflow-dags
Other than enabled: true, the key line here is existingClaim: airflow-dags. This ensures that Airflow knows where to pick up the local DAGs from, as we’ve mounted them through that PersistentVolumeClaim (PVC).
executor: "KubernetesExecutor"
Again, the documentation explains this better than I.
Accessing the UI
The following command forwards port 8080 on the webserver pod (through its service, or svc) to port 8080 on your machine. Navigate to http://localhost:8080 to view the UI and see your DAGs.
kubectl port-forward svc/airflow-webserver 8080:8080 -n airflow
Conclusion
If all this setup has worked as expected, you should be able to view your DAGs in your browser, and they should update in real time as you edit them. In my experience this is an optimal setup, as you can instantly test whether the code you’re writing functions as expected.