Flexible Workflow Orchestration at CDL

CDL Data Intelligence Team
Compass Digital
Jul 9, 2020

Improving the Airflow development experience using containers.

Apache Airflow’s Logo (source)

As a team, we’re always searching for insights; it only made sense that we’d want to have a deeper look inside the activity of our workflows.

Earlier this year, our Data Intelligence department here at Compass Digital Labs (CDL) decided to deploy Apache Airflow to orchestrate and automate several workflows. Before Airflow, our team used an in-house scheduler tool. We chose to migrate some of our workloads to Airflow to improve the observability of our workflows. We use Airflow for a variety of use cases:

  • Data ingestion into our Data Lake and our Data Warehouse (ETL) from a multitude of source systems
  • Automated report generation
  • Training and deploying machine learning models

Initial Platform

Our initial deployment was very minimalist. We deployed Airflow on a single virtual machine (VM) using the LocalExecutor, with Amazon’s Relational Database Service (RDS) as our metadata database. All of the workflows executed on that single VM. The architecture looked something like the following:
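For reference, a single-node setup like this boils down to a couple of settings in airflow.cfg; a minimal sketch, assuming a Postgres database on RDS (the endpoint and credentials below are placeholders, not our actual configuration):

```
[core]
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:<password>@<rds-endpoint>:5432/airflow
parallelism = 16
```

With the LocalExecutor, tasks run as subprocesses of the scheduler, which is exactly why everything competes for the same VM's resources.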

Single node Airflow Deployment running on AWS

We had all of our Airflow components (web server, scheduler, and worker) running on the same VM. With the adoption of Airflow in our team steadily growing, we inevitably ran into scaling problems.

Problems often arose when running many tasks concurrently. Airflow quickly ate up all the resources on the VM and would sometimes bring down the entire deployment when it could no longer allocate memory.

Another issue we ran into is that Airflow requires workflows to be written in Python. Members of our team had existing workflows written in other languages (mainly R) that were not easy to port to Airflow because of dependency issues.

Finding the Best Options

Using a larger VM was not an optimal solution: during times when few DAGs are scheduled to run, the extra provisioned resources sit idle and go to waste.

We also ruled out the CeleryExecutor and the KubernetesExecutor. First, we chose not to use the CeleryExecutor because of the overhead of maintaining a Celery + Redis/RabbitMQ deployment. In addition, this executor would not have solved our dependency management problem, since each Celery/Airflow worker would still need all the required dependencies installed.

As mentioned above, we also chose not to migrate to the KubernetesExecutor. The main reason was, again, language and dependency constraints. If we were to build a single container to deploy to ECR (Elastic Container Registry) for the KubernetesExecutor, each build would take 15+ minutes to install all of the R packages and dependencies.

New Platform

At a high level, our current platform continues to run the Airflow webserver and scheduler on an EC2 VM. The newest additions to our platform are a private container registry hosted on ECR and a Kubernetes cluster running on AWS Elastic Kubernetes Service (EKS). The cluster has an autoscaling Node Group and serves as the compute resource for running Airflow tasks. Now, team members can write DAGs that leverage the KubernetesPodOperator to execute their own custom Docker images, stored in ECR.

High-level architecture of our platform using Kubernetes

Kubernetes is a container orchestration platform, open-sourced by Google, that allows users to deploy, schedule, and run containers. We wanted to migrate some workflows to the KubernetesPodOperator for the following reasons:

  • Improve development lifecycle and process
  • Decouple the compute resource from the scheduling resource
  • Improve dependency management

Development Process

Each “project” or “DAG” now has its own Git repository. This may seem unnecessary, but it has helped us improve the sustainability and maintainability of our different code bases. With each project maintained in its own repository, teams can write DAGs in any language they choose, enforce style guides, and, most importantly, have a CI/CD pipeline that deploys their implementation to a private container registry.
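As an illustration, the publish stage of such a pipeline boils down to building the image and pushing it to ECR; a sketch using placeholder values for the account ID, region, and repository name (not our actual pipeline definition):

```shell
aws ecr get-login-password --region <region> \
  | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
docker build -t <account-id>.dkr.ecr.<region>.amazonaws.com/<repo>:<git-sha> .
docker push <account-id>.dkr.ecr.<region>.amazonaws.com/<repo>:<git-sha>
```

Tagging images with the Git SHA (rather than only `latest`) makes it easy to trace a running pod back to the exact commit that produced it.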

Removing Implementation Details

Now that the containers are deployed separately, the only new files contributed to our Airflow deployment repository are the Airflow DAG definitions. This leaves the repo a bit cleaner: data scientists and engineers no longer need to write their implementation logic inside Airflow, but rather just let Airflow schedule and execute the DAG, with all the implementation details abstracted away from Airflow. Our team can now develop robust, scalable automation workflows in any language (R, Golang, Rust, etc.) and deploy them without worrying about Airflow dependencies.

Dependency Management

Because no workflow dependencies need to be maintained directly on the Airflow deployment, we avoid conflicting package versions and no longer need to support specific packages for one-off DAGs. All dependencies are now isolated to each workflow’s individual Docker container.
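For example, a container for one of our R workflows might be defined along these lines (the base image tag and file names here are illustrative placeholders, not our actual setup):

```
FROM rocker/r-base:4.0.2

# Install the workflow's R package dependencies once, at image build time
COPY install_packages.R /tmp/install_packages.R
RUN Rscript /tmp/install_packages.R

# Copy the workflow code itself
COPY . /app
WORKDIR /app

ENTRYPOINT ["Rscript", "run_workflow.R"]
```

The long R package installation now happens once per image build in CI, instead of on the shared Airflow VM, and each workflow can pin whatever package versions it needs.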

Trivial Example of using KubernetesPodOperator

To conclude, here’s a simple DAG that uses just a single KubernetesPodOperator.

Example DAG using a single KubernetesPodOperator

In this example, we have an existing container image hosted in Amazon’s container registry, ECR, and a Kubernetes config file somewhere on our Airflow deployment. We use Airflow Variables to store the container’s URL and the path to our Kubernetes config file. Once these variables are stored in the Airflow metadata database, we can pass them as environment variables or command-line arguments to the Docker image that will execute on the Kubernetes cluster. Airflow takes some time to create a new pod on the EKS cluster, then runs it, and terminates the pod once it completes.

Feedback

Feedback is most welcome! If you have any questions or comments, feel free to leave them and I will do my best to answer everything I can.

Written by: Brad Bonitatibus

CDL Data Intelligence Team
Compass Digital

Data scientists and engineers who are enthusiastic about numbers and using them to drive the digital future in hospitality.