Airflow On Azure Kubernetes — Part 1

Chen Meyouhas
9 min read · Nov 13, 2019

--

Background

At ciValue, the needs of our various data pipelines and maintenance workflows drove us to explore some of the widely adopted workflow orchestration solutions out there.

After several POCs, here’s why we chose Apache Airflow:

  • It’s an open source project with wide community support
  • It’s cloud agnostic
  • Workflows are defined as code (see the minimal DAG sketch after this list)
  • It’s written in Python and provides many operators for different services, such as Databricks, PostgreSQL, SSH, Bash, Slack and more
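
To make the “workflows as code” point concrete, here is a minimal DAG sketch for Airflow 1.10.x (the DAG id, schedule and command are illustrative placeholders only, not something used later in this post):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2019, 11, 1),
}

# A single-task DAG that simply echoes a message on whichever worker picks it up
with DAG(dag_id="hello_demo",
         default_args=default_args,
         schedule_interval="@daily",
         catchup=False) as dag:
    say_hello = BashOperator(task_id="say_hello",
                             bash_command="echo 'Hello from Airflow'")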

Airflow Architecture

In general, Airflow can be deployed in two different kinds of architecture:

  1. Single node – all airflow components are installed on one machine
  2. Multi node – each airflow component is installed on a different machine

The single-node architecture might be good for development, while the multi-node architecture is better suited for production environments.

In this post I’m going to show how to easily deploy and manage a multi-node Airflow architecture on AKS (Azure Kubernetes Service), using the stable Airflow Helm chart.

First, let’s take a quick look at how the Airflow components interact in a multi-node architecture:

There are five different kinds of Airflow components:

  1. Webserver
    Exposes the Airflow WebUI, which lets users manage their workflows and configure global variables and connections interactively.
    It also accepts REST API requests for interacting with DAGs (triggering a DAG, getting information about a DagRun, and more…); an example request follows this list.
    https://airflow.apache.org/api.html
  2. Database
    The database stores all the Airflow metadata, such as DAGs, DagRuns, task instances, XComs (for sharing data between tasks), Variables, users and connections.
  3. Scheduler
    The scheduler periodically checks if there are new DAGs that should be registered and if a DAG and its task instances should be triggered and sent to the queue.
  4. Queue
    In order to work at scale with multiple workers, Airflow uses a queue to store tasks, which are then polled and executed by the Airflow workers.
  5. Worker
    A worker polls a task from the queue and executes the task logic.
    The number of concurrent tasks a worker can handle at a time is configurable (for example, via the worker_concurrency setting when using the CeleryExecutor).
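
As referenced in the Webserver description above, this is roughly how a DAG can be triggered through the experimental REST API of Airflow 1.10.x (the host, port and DAG id “hello_demo” are placeholders; the airflow trigger_dag CLI command achieves the same thing):

$ curl -X POST \
    http://<webserver-host>:8080/api/experimental/dags/hello_demo/dag_runs \
    -H 'Content-Type: application/json' \
    -d '{}'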

Here are some reasons why deploying such an architecture on Kubernetes with Helm is a good idea:

  • Fast initial deployment:
    Kubernetes pulls the requested container images from a container registry and deploys them on the Airflow pods. It builds the internal network between the Airflow components and exposes the Airflow WebUI as a network service. It also provisions cloud storage, mounts it to the database pod and creates web server and worker replicas for a highly available Airflow application. In our test environment, all of the above took 5 minutes.
  • Easy to upgrade and rollback versions:
    Thanks to the Helm package manager, deploying new versions or rolling back to previous versions becomes an easy and fast task.
  • Kubernetes autoscaler:
    Kubernetes automatically changes the number of cluster nodes to meet the pods’ resource demands.
    For example, changing the number of Airflow workers or their required resources may trigger the autoscaler.
  • Cloud independent:
    The same Airflow architecture can be deployed with the same Helm chart on any cloud provider.

How to deploy Airflow on AKS?

Prerequisites

  • Python (v2.7)
  • Azure CLI (v2.0.76)
  • kubectl CLI (v1.15.3)
  • Helm CLI (v2.12.1)

Configure a new AKS
First, let’s create a service principal “service-principal-demo” for the cluster. A service principal is needed for the cluster to interact with Azure resources. For example: creating a persistent volume for the database pod.

$ az ad sp create-for-rbac --skip-assignment --name service-principal-demo

Save the response JSON, we will need it when creating the AKS.
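
If we prefer not to copy values out of the JSON by hand, the create command’s output can instead be redirected to a file and the two fields we need extracted with jq (a sketch; it assumes jq is installed, and the file and variable names are arbitrary):

$ az ad sp create-for-rbac --skip-assignment --name service-principal-demo > sp.json
$ SP_APP_ID=$(jq -r .appId sp.json)
$ SP_PASSWORD=$(jq -r .password sp.json)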

Create a new resource group “airflow-aks-demo-rg”

$ az group create --name airflow-aks-demo-rg --location westeurope

Now, let’s create a new AKS “airflow-aks-demo” in the new resource group “airflow-aks-demo-rg”

We will provide the following arguments:

  • Service principal application ID and password as “service-principal” and “client-secret” respectively
  • AKS version as the “kubernetes-version”
  • Initial number of cluster nodes as “node-count”
  • Cluster nodes type and size as “node-vm-size”
  • “enable-cluster-autoscaler” if we want the cluster to autoscale upon new resources demand
  • “VirtualMachineScaleSets” as “vm-set-type” (if we enable the cluster autoscaler)
  • Minimum and maximum number of cluster nodes as “min-count“ and “max-count” (if we enable cluster autoscaler)
  • Location where we want the cluster to be deployed as “location”

Note:
The following command will automatically deploy a new virtual network with default address space 10.0.0.0/8. If we want to use an existing virtual network, we should provide “vnet-subnet-id” as well.
Also, the Docker bridge address defaults to 172.17.0.1/16, so we need to make sure it doesn’t overlap with any other subnet in our subscription. If it does overlap, we might want to provide a non-overlapping address space as “docker-bridge-address”.
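
For reference, this is roughly how those two extra arguments would look when appended to the az aks create command below (the subnet resource ID is a placeholder):

--vnet-subnet-id "/subscriptions/<subscription-id>/resourceGroups/<vnet-rg>/providers/Microsoft.Network/virtualNetworks/<vnet-name>/subnets/<subnet-name>" \
--docker-bridge-address 172.18.0.1/16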

$ az aks create \
--resource-group airflow-aks-demo-rg \
--name airflow-aks-demo \
--service-principal ABCDABCD-ABCD-ABCD-ABCD-ABCDABCDABCD \
--client-secret ABCDABCD-ABCD-ABCD-ABCD-ABCDABCDABCD \
--kubernetes-version 1.13.12 \
--node-count 1 \
--vm-set-type VirtualMachineScaleSets \
--node-vm-size Standard_D2s_v3 \
--load-balancer-sku basic \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 2 \
--location westeurope

The output is a large JSON object describing the AKS deployment.

Configure kubectl to connect to the “airflow-aks-demo” AKS by downloading its credentials and adding them to ~/.kube/config

$ az aks get-credentials --resource-group airflow-aks-demo-rg --name airflow-aks-demo

Bind the kubernetes-dashboard service account to the cluster-admin role to get access to the dashboard:

$ kubectl create clusterrolebinding kubernetes-dashboard --clusterrole=cluster-admin --serviceaccount=kube-system:kubernetes-dashboard

Installing the Helm server
Create a service account “tiller”

$ kubectl -n kube-system create serviceaccount tiller

In order to let Helm manage the cluster resources, the tiller service account needs a cluster-admin role:

$ kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller

Deploy the Helm server (Tiller)

$ helm init --service-account tiller --wait

Let’s verify that Tiller has been successfully deployed:

$ kubectl get deployments -n kube-system

We should see a new deployment called “tiller-deploy”.
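
The output should look roughly like the following (other deployments, counts and ages will differ from cluster to cluster):

NAME            READY   UP-TO-DATE   AVAILABLE   AGE
...
tiller-deploy   1/1     1            1           1m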

Alternatively, we can verify this using the Kubernetes dashboard:

$ az aks browse --resource-group airflow-aks-demo-rg --name airflow-aks-demo

Installing Airflow using Helm package manager
Let’s create a new Kubernetes namespace “airflow” for the Airflow application

$ kubectl create ns airflow

Generate secrets for the postgres and redis components and add them under the “airflow” namespace:

$ kubectl create secret generic airflow-postgres -n airflow --from-literal=postgres-password=$(openssl rand -base64 13)
$ kubectl create secret generic airflow-redis -n airflow --from-literal=redis-password=$(openssl rand -base64 13)

Check if they were created

$ kubectl get secrets -n airflow
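
If we want to double-check one of the generated values, the secret can be decoded, for example (a sketch using the airflow-postgres secret created above):

$ kubectl get secret airflow-postgres -n airflow -o jsonpath='{.data.postgres-password}' | base64 --decode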

Clone the following helm chart:
https://github.com/helm/charts/tree/master/stable/airflow

Configure the following secret names for postgres and redis components in the values.yaml file of the Airflow chart.

postgresql:
  ## The name of an existing secret that contains the postgres password.
  existingSecret: airflow-postgres

redis:
  ## The name of an existing secret that contains the redis password.
  existingSecret: airflow-redis

Generate a Fernet key to enable encryption of passwords when creating new connections.
First, install the crypto package:

$ pip install 'apache-airflow[crypto]'

Then, get the fernet key:

$ python -c "from cryptography.fernet import Fernet; FERNET_KEY = Fernet.generate_key().decode(); print(FERNET_KEY)"

Update the generated key in the values.yaml file:

airflow:
  fernetKey: ABCDABCDABCDABCDABCDABCDABCDABCDABCDABCD
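
Alternatively, instead of editing values.yaml, the key can be passed on the command line by adding a --set flag to the helm install command shown later (a sketch, with a placeholder key):

--set airflow.fernetKey=<generated-fernet-key>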

If we explore the requirements.yaml file of the Airflow chart, we will notice that this chart has two dependencies, postgresql and redis.
Let’s install these dependencies:

Execute under the Airflow chart directory:

$ helm dep update

Make sure the dependencies are in status “ok”:

$ helm dep list
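
The output should look roughly like this (the chart versions depend on when the chart was cloned; “x.y.z” and the repository column are placeholders):

NAME        VERSION   REPOSITORY                STATUS
postgresql  x.y.z     <chart repository URL>    ok
redis       x.y.z     <chart repository URL>    ok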

Now we are ready to install the Airflow application.
First, let’s install it in a “dry-run” mode to make sure the generated charts are valid:

$ helm install --name airflow  --namespace airflow --debug . --dry-run

The output is a large YAML describing the Airflow deployment.
Let’s run it again without the “dry-run” flag and check out the pod statuses:

$ helm install --name airflow --namespace airflow --debug .
$ kubectl get pods -n airflow --watch

Open access to the Airflow WebUI

Usually, this kind of deployment (internal workflows) should not be accessible through the public network, so in this post the Airflow WebUI is accessed through a VPN gateway that is peered to the AKS virtual network.
Assuming you have a connection from your local machine to your Azure VPN gateway, and the gateway is peered to the AKS virtual network, let’s configure a load balancer that will expose the WebUI on a private IP in the AKS subnet:

In this example the AKS subnet is 10.97.0.0/16 so I’m going to use 10.97.0.200 as the load balancer IP. (make sure the chosen IP is not already taken by another resource)

Update the following values in the values.yaml file:

airflow:
  service:
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    type: LoadBalancer
    loadBalancerIP: 10.97.0.200
    externalPort: 8080
    nodePort:
      http:

Now we can upgrade the Airflow application:

$ helm upgrade airflow . --debug

Verify that the “airflow-web” service got the correct external IP:

$ kubectl get services -n airflow --watch

Then, navigate to 10.97.0.200:8080...

Congratulations! We’ve just deployed Apache Airflow on Azure Kubernetes Service!

Triggering the AKS autoscaler

If we look again at how we configured the AKS, we will notice that:

  • Cluster nodes type is Standard_D2s_v3 (2 cores and 8 GB memory)
  • Default node count is 1
  • Cluster autoscaler is enabled.

List the current pods and their nodes:

$ kubectl get pods -n airflow -o wide --watch

We can see that all the Airflow pods are deployed on the cluster node
“aks-nodepool1-12545537-vmss000000”.
Let’s deploy a new version to trigger the autoscaler to add another cluster node.

Add another Airflow worker and configure each worker to have 1 CPU.

##
## Workers configuration
workers:
  enabled: true
  ##
  ## Number of worker pods to launch
  replicas: 2
  ##
  ## Custom resource configuration
  resources:
    limits:
      cpu: "1"
      memory: "2G"
    requests:
      cpu: "1"
      memory: "2G"

This change should trigger the autoscaler, as the AKS has just one cluster node with 2 CPUs and we are now requesting more than 2 CPUs (2 for the workers and some more for the other Airflow components).

Upgrade the airflow application and watch the new pod creation:

$ helm upgrade airflow . --debug
$ kubectl get pods -n airflow -o wide --watch

First, we will see a new worker pod in a “Pending” status, as it is waiting for new resources.

Then, after several minutes, a new cluster node is automatically added and the new worker pod is running on the new cluster node
“aks-nodepool1-12545537-vmss000003”

Rolling back to a previous version

Let’s roll back to a version where we had only one Airflow worker.
First, check the revision history:

$ helm history airflow

We can see that revision 3 of the “airflow” release is currently deployed.
Revision 2 is the version where we had one worker and the load balancer configured, so let’s roll back to revision 2:

$ helm rollback airflow 2

Check out the pod statuses:

$ kubectl get pods -n airflow --watch

Pod “airflow-worker-1” has changed its status to “Terminating” and is about to disappear.

Check the airflow revisions again

$ helm history airflow

We rolled back to revision 2!

That’s it for now! I hope you found this post useful and informative!
In part II of this post, I’ll cover advanced Airflow configuration topics, including:

  • Best practices for deploying DAGs in production
  • Azure Container Registry integration for deploying private Docker images
  • Configuring Azure Files as shared storage between Airflow workers
  • Configuring a static Azure disk as the Airflow database storage
  • Azure Key Vault integration for storing secrets

References:

https://airflow.apache.org/
https://github.com/helm/charts/blob/master/stable/airflow/README.md
https://docs.microsoft.com/bs-latn-ba/azure/aks/configure-azure-cni
https://docs.microsoft.com/en-us/azure/aks/cluster-autoscaler
