Deploy Airflow 1.10.10 in Kubernetes using Terraform and Helm

Carlos Escura
Typeform's Engineering Blog
7 min read · May 8, 2020


The Data Platform team at Typeform is a multidisciplinary group of engineers, ranging from Data to Tracking and DevOps specialists. We use Airflow, love Kubernetes, and deploy our infrastructure with Terraform, so we asked ourselves:

Can we move from our current Airflow running on a single EC2 instance to a new setup that combines all the good parts of Kubernetes (and Airflow’s new Kubernetes Operator) and Terraform?

Of course, the answer was “at least, let’s try it”. And a few weeks later, we are delighted to share our migration journey with you.

This article is the first part of our series of posts about the migration; the next installments will also cover some cool features, like running Spark jobs from Airflow while reaping the benefits of K8s.

Migration scheduling and phase definition

The first step in such an important migration is defining the scope of the project. What do we want to achieve? How do we want to do it? When do we want to have a “stable” solution in place?

Our specific scenario

At Typeform, we use Airflow to run almost all our data ETL jobs, as well as some interesting jobs related to ML model training, among others, so our technology stack uses a combination of Python, Bash, SQL, Docker and Spark operators.

Since we wanted to divide the migration into several phases, our idea was to start with:

  • Phase 1: Migrate Airflow, and just make it work
  • Phase 2: Migrate Python, SQL and Bash operator jobs
  • Phase 3: Migrate Docker and Spark operators

We knew that the complex parts of this journey were going to be Docker (remember the headaches of trying to run Docker in Docker, plus the complexity of accessing the AWS credentials available in the instance profile metadata?) and Spark (here we wanted to move away from classic EMR clusters and run Spark natively on K8s).

Phase 1. About the magic of combining Airflow, Kubernetes and Terraform

Before this migration, we had also completed one of our biggest projects, which consisted of migrating almost all our services to new Kubernetes data clusters. We wanted to do it in a simple but elegant way, so we decided to use the Terraform modular approach. We defined our EKS clusters using Terraform, plus a couple of modules for what we called “cluster bootstrapping”: installing all the required base namespaces, monitoring tools, ingress controllers, and so on.

That approach worked so well for us that we wanted to keep using it for Airflow.

Where Terraform meets Helm

Terraform is a powerful tool, and most of you probably know it because you use it to define your AWS or GCP infrastructure. But if we scratch the surface of this monster a little, we find something called providers and, especially interesting for our use case, the Helm provider.

Helm does a lot of the heavy lifting for complex Kubernetes deployments, and, you know, whenever you can follow the KISS principle, just follow it. We decided to use the community Helm chart for Airflow, to which we also ended up contributing a few bug fixes (one of the good things about open-source projects).
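For context, this is roughly what the provider requirements look like in a Terraform 0.12-era setup; the version constraints below are assumptions, not our exact pins:

```hcl
terraform {
  required_version = ">= 0.12"

  # Illustrative version pins; adjust to whatever your setup actually uses.
  required_providers {
    aws        = "~> 2.0"
    helm       = "~> 1.1"
    kubernetes = "~> 1.11"
  }
}
```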

Ok, but show me the code

Let’s start with an overview of our Airflow Terraform module

```
# This is a brief overview of the complete project structure
├── data_platform
│   └── k8s
│       ├── airflow.tf --> The interesting one for this article
│       ├── eks.tf
│       ├── eks_bootstraping.tf
│       ├── irsa_airflow.tf
│       ├── main.tf
│       ├── outputs.tf
│       ├── secrets.tf
│       └── variables.tf
└── modules
    └── k8s
        ├── airflow --> The interesting one for this article
        │   ├── airflow.tf
        │   ├── helm_values
        │   │   └── values.yaml
        │   ├── main.tf
        │   └── variables.tf
        ├── eks_cluster
        │   ├── ...
        │   └── ...
        └── eks_cluster_bootstraping
            ├── ...
            └── ...
```

As the code is based on a module and an instance of that module, we will focus first on the module itself and go through all the parts involved.

modules/k8s/airflow/variables.tf

This file contains all the variables required to launch this module as many times as needed (or in different clusters).

The most important variables to focus on here (a minimal sketch of the file follows the list):

  • cluster_id: used to obtain an access token and interact with the Kubernetes API server via Terraform (more on this later)
  • airflow_dns_name: the DNS name under which we expose this Airflow instance; really useful in combination with nginx ingress controllers
  • ingress_class: lets us expose each Airflow instance with different access levels, or with the specific configuration defined in a given ingress class
  • irsa_assumable_role_arn: this is where things start to get a little trickier. At Typeform’s data platform we use one of the coolest new EKS features, called IRSA, which means that a K8s service account can be integrated with IAM and assume a certain set of roles. The benefits of this approach? A LOT, starting with the fact that you no longer need to grant EKS worker nodes (instance profiles) a wide set of permissions that every running pod can use, or, even worse, deal with third-party tools like Kube2IAM or KIAM. Don’t worry about IRSA for now; it’s explained in more depth later in this article.
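As a reference, a minimal sketch of what that variables.tf could look like (the descriptions are ours, added for clarity):

```hcl
variable "cluster_id" {
  description = "Name/id of the target EKS cluster, used to fetch an auth token for the Kubernetes and Helm providers"
  type        = string
}

variable "airflow_dns_name" {
  description = "Hostname under which the Airflow webserver is exposed through the nginx ingress controller"
  type        = string
}

variable "ingress_class" {
  description = "Ingress class deciding which ingress controller (and access level) serves this Airflow instance"
  type        = string
}

variable "irsa_assumable_role_arn" {
  description = "ARN of the IAM role that the Airflow service account is allowed to assume through IRSA"
  type        = string
}
```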

Now it’s time to explain the magic under /modules/k8s/airflow/main.tf

Nothing really special here, except the nice trick of configuring the kubernetes provider by passing a single cluster_id variable, with the help of the aws_eks_cluster_auth data source. Thank you, Terraform, for making our lives so easy!! No more kubeconfig files for dealing with clusters. As long as the role (or IAM user) your Terraform run assumes is allowed to authenticate through the EKS IAM Authenticator, you’re safe :)
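A hedged sketch of that main.tf, assuming Terraform 0.12-era providers (attribute names such as load_config_file vary between provider versions):

```hcl
# Everything below is derived from the single cluster_id variable.
data "aws_eks_cluster" "this" {
  name = var.cluster_id
}

# Short-lived token issued through the EKS IAM Authenticator.
data "aws_eks_cluster_auth" "this" {
  name = var.cluster_id
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.this.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.this.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.this.token
  load_config_file       = false
}

provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.this.endpoint
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.this.certificate_authority[0].data)
    token                  = data.aws_eks_cluster_auth.this.token
    load_config_file       = false
  }
}
```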

It’s time to define the actual Airflow Helm chart in Terraform, so let’s move on to the next file:

/modules/k8s/airflow/airflow.tf

In this file, we define all the Helm values that need to be dynamic, leaving everything else to the YAML file imported from:

/modules/k8s/airflow/helm_values/values.yaml
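A simplified sketch of that airflow.tf; the chart repository, namespace and value paths are assumptions based on the community chart of that era:

```hcl
resource "helm_release" "airflow" {
  name       = "airflow"
  namespace  = "airflow"
  repository = "https://kubernetes-charts.storage.googleapis.com" # stable repo hosting the community chart back then
  chart      = "airflow"

  # Static configuration lives in the imported values file...
  values = [
    file("${path.module}/helm_values/values.yaml"),
  ]

  # ...while anything instance-specific is injected dynamically.
  set {
    name  = "ingress.web.host"
    value = var.airflow_dns_name
  }

  set {
    name  = "ingress.web.annotations.kubernetes\\.io/ingress\\.class"
    value = var.ingress_class
  }

  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = var.irsa_assumable_role_arn
  }
}
```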

Ok, that was maybe information overload, right?

So let’s break it down into smaller pieces:

Service accounts come onto the scene! This is maybe one of the most important pieces for getting IRSA to work: all pods will run using this service account, so it’s crucial to define it.
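A minimal sketch of that block in values.yaml, assuming the chart’s serviceAccount keys of that era (the role ARN is shown inline here, but in our setup it is injected from Terraform):

```yaml
serviceAccount:
  create: true
  name: airflow
  annotations:
    # Hypothetical ARN; this annotation is what IRSA keys on.
    eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/airflow-irsa
```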

The dags section can vary a lot depending on your needs, but we went for one of the simplest ways to configure and sync DAG folders into Airflow volumes: git sync (a simple container that runs a git pull from your desired repository every refreshTime seconds).
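Sketched with the chart’s dags keys of that era (the repository URL is a hypothetical placeholder):

```yaml
dags:
  git:
    url: git@github.com:your-org/airflow-dags.git  # hypothetical DAG repository
    ref: master
    gitSync:
      enabled: true
      refreshTime: 60  # seconds between git pulls
```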

As we are using a custom Airflow Docker image that already has all our requirements installed (and therefore gives us faster startup times), we disable the installRequirements feature, as it’s not needed. Resources are defined because (see the sketch after this list):

a) It’s a good practice

b) It’s a requirement for activating HPA on the Airflow workers later on
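The sketch below illustrates both points; the requests and limits are illustrative numbers, not our production values:

```yaml
dags:
  installRequirements: false  # requirements are already baked into the custom image

workers:
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "1"
      memory: "2Gi"
```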

The airflow section is another piece that can look really different depending on your particular needs, but in short: we use a custom Docker image and have Google OAuth activated as the default auth backend. An important note on the ENABLE_PROXY_FIX feature: it’s needed if you’re running your Airflow webserver behind a load balancer with ProxyV2 enabled, as in our case with an AWS Classic Load Balancer.
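Roughly, and assuming the chart’s airflow.config map of environment variables (the image name is a placeholder):

```yaml
airflow:
  image:
    repository: <registry>/typeform-airflow  # hypothetical custom image
    tag: 1.10.10
  executor: CeleryExecutor
  config:
    # ProxyFix is required when the webserver sits behind our load balancer
    AIRFLOW__WEBSERVER__ENABLE_PROXY_FIX: "True"
    # Google OAuth as the default auth backend
    AIRFLOW__WEBSERVER__AUTHENTICATE: "True"
    AIRFLOW__WEBSERVER__AUTH_BACKEND: airflow.contrib.auth.backends.google_auth
```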

The workers, web and scheduler sections share pretty much all their parameters, so let’s focus on workers, which is the special case.

As part of the first iteration, we wanted to keep using the Celery executor (instead of the Kubernetes executor) for a simple and smooth migration in Phase 1, so it made a lot of sense to enable HPA for the workers and replicate the “old” elastic load balancer for EC2 instances inside K8s. More info about how Airflow can schedule jobs can be found here.

The autoscaling section enables that HPA, based on standard or custom-defined metrics. This is also the main reason to have resource limits and requests defined for all the pods.
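As a sketch, assuming the chart’s workers.autoscaling keys (the metric and replica numbers are illustrative):

```yaml
workers:
  replicas: 2
  autoscaling:
    enabled: true
    maxReplicas: 8
    metrics:
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 75
```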

A really special mention goes to the securityContext. We had to fight issues with read permissions on the AWS session token file that EKS injects into the container, which stores the STS credentials used by IRSA and the corresponding language SDK. Without this “ugly” fix, the default mount mode (600) for the token makes it impossible to read unless you start your container running as root. More info here.
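The fix boils down to setting an fsGroup so the projected token becomes readable by a non-root user; exactly where the securityContext lives depends on the chart version, but the idea is:

```yaml
workers:
  securityContext:
    # 65534 ("nobody") is the group commonly used for this IRSA workaround:
    # kubelet adjusts group ownership of the projected token so a
    # non-root container can read it.
    fsGroup: 65534
```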

Finally, it’s time to instantiate our new module in /data_platform/k8s/airflow.tf:
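A hedged sketch of that instantiation; the hostname, ingress class and the names of the referenced modules and outputs are assumptions:

```hcl
module "airflow" {
  source = "../../modules/k8s/airflow"

  cluster_id              = module.eks_cluster.cluster_id          # hypothetical output of our EKS module
  airflow_dns_name        = "airflow.data.example.internal"        # hypothetical hostname
  ingress_class           = "nginx-internal"                       # hypothetical ingress class
  irsa_assumable_role_arn = module.airflow_irsa.this_iam_role_arn  # defined in irsa_airflow.tf (see below)
}
```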

And last but not least, let’s have some fun with the coolness of integrating K8s service accounts and IAM roles in /data_platform/k8s/irsa_airflow.tf.

Here we use a custom external module that creates a role which can be assumed by an OIDC-federated identity (our K8s service account) and accepts as many attached IAM policies as we want :)
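Our own module isn’t reproduced here, but the community terraform-aws-modules submodule for OIDC-assumable roles does essentially the same job, so here is a sketch using it as a stand-in (role name, policy resource and referenced outputs are assumptions):

```hcl
module "airflow_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
  version = "~> 2.0"

  create_role = true
  role_name   = "airflow-irsa" # hypothetical role name

  # OIDC issuer of the cluster, without the https:// scheme.
  provider_url = replace(module.eks_cluster.cluster_oidc_issuer_url, "https://", "")

  # Only the airflow service account in the airflow namespace may assume this role.
  oidc_fully_qualified_subjects = ["system:serviceaccount:airflow:airflow"]

  # Attach as many policies as needed (hypothetical policy resource).
  role_policy_arns = [
    aws_iam_policy.airflow.arn,
  ]
}
```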

With all those ingredients in place, it’s time to terraform apply and have fun with a very basic but functional Airflow deployment running on Kubernetes, with HPA autoscaling for the workers.

In the next chapters of this series, we will talk about how to migrate our current DAGs to the KubernetesPodOperator, how to use the KubernetesExecutor, Spark on K8s clusters, and many more interesting things.
