Running a Multi-Tenant Airflow Cluster

Nico Hein
Published in Apache Airflow
Jun 21, 2023

Apache Airflow is a versatile and well-maintained open-source tool for orchestrating big data and analytics workloads. In this blog post, we’ll talk about how BMG Rights Management runs a multi-tenant Airflow deployment.

At BMG we orchestrate a variety of workloads using Airflow, including royalty statement ingestion, royalty processing, marketing analytics, revenue assurance, financial analytics and reporting. If Airflow goes down, this results in data staleness for key parts of the organization. Our goal is to find the right balance between stability, operational overhead, and cost.

We work with GCP Cloud Composer instances and separate Airflow deployments by environment: one for test/dev, one for staging, and one for production. Each of the three Airflow deployments is shared by seven development teams and one analytics team.

In this blog post we will first cover the benefits of a multi-tenant setup, namely reduced cost and operational overhead, before turning to mandatory practices, guidelines, and operational challenges. Lastly, we take a brief look at continuous deployment and DAG testing.

Reduced Operational Overhead and Cost

Reduced operational overhead and cost are the main motivation for a multi-tenant setup. When BMG migrated from an on-premises Hadoop cluster shared by four development teams to GCP, going with a multi-tenant Google Cloud Composer setup was an easy decision: the teams were used to coordinating and sharing resources, and being able to use on-demand Dataproc clusters was already a blessing.

After three years, we’re still happy with this setup because it lets us hit our Service Level Agreements (SLAs). Other options, like self-hosting Airflow on Google Kubernetes Engine (GKE) or introducing another vendor, do not seem worth the cost or complexity.

To deal with Airflow’s existing limitations regarding DAG isolation and its monolithic architecture, we adopted development practices that allow us to run a fairly stable environment with over 150 DAGs. In the following sections we will focus on these practices:

  1. BMG’s Mandatory Development Practices
    1. Data Processing Isolation
    2. Usage of short-lived credentials
  2. BMG’s Recommended Development Practices
    1. Atomic and Idempotent tasks with default retry settings
    2. No external triggers
    3. Tagging

BMG’s Mandatory Development Practices

We require each development team to follow these two basic rules:

  1. Do not process data on the workers: Airflow workers do not provide full isolation of workloads, which means one workload can cause side effects for another by manipulating the worker configuration (e.g. setting secrets in environment variables that should not be shared across teams). Furthermore, on Google Cloud Composer we have limited control over the worker resources. While workers can be auto-scaled, they occasionally need to be restarted when moved to another node, and on Composer we are not in control of the shutdown behavior. This can lead to a worker restart while a non-atomic or non-idempotent task is running, and we want to avoid this somewhat unpredictable behavior. Moving data processing away from the workers therefore reduces the risks of limited isolation and simplifies worker resource management. It also simplifies cost allocation in the shared environment, covered below.
  2. Use short-lived credentials and Service Account impersonation: We use GCP Service Accounts to manage permissions. Each team has its own Service Account associated with Airflow, and a dedicated GCP project that contains resources like Dataproc and Google Cloud Storage. The Service Account attached to the Airflow environment has no permissions beyond those needed to run Airflow and the permission to impersonate the team-specific Service Accounts.
    In case the Kubernetes Pod Operator is used, each team has its own namespace with, again, a dedicated Kubernetes Service Account mapped to the team’s GCP Service Account. This setup is another prerequisite for the granular cost allocation discussed in the section on operations. A minimal sketch of both practices follows below.
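As an illustration of both rules, here is a minimal sketch of a DAG that pushes the actual processing to Dataproc and impersonates a team-specific Service Account via the Google provider’s impersonation_chain parameter. The project, region, cluster, jar, and account names are hypothetical:

import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

with DAG(
    dag_id="team_a_royalty_ingestion",  # illustrative DAG id
    schedule="@daily",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
):
    DataprocSubmitJobOperator(
        task_id="submit_spark_job",
        project_id="team-a-project",  # hypothetical team-owned GCP project
        region="europe-west1",
        job={
            "placement": {"cluster_name": "team-a-ephemeral-cluster"},
            "spark_job": {
                "main_class": "com.example.RoyaltyIngestion",
                "jar_file_uris": ["gs://team-a-artifacts/royalty-ingestion.jar"],
            },
        },
        # The environment's Service Account only needs the Service Account
        # Token Creator role on the team account it impersonates here.
        impersonation_chain="team-a@team-a-project.iam.gserviceaccount.com",
    )

The heavy lifting runs on the Dataproc cluster, so the Airflow worker only submits and monitors the job.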

BMG’s Recommended Practices

Beyond the mandatory development practices we have some guidelines that should be followed:

  1. Atomic and Idempotent tasks with default retry settings: I will not attempt to explain this topic further as anything I could write here would be less comprehensive than the DAG Writing Best Practices published by Astronomer.
  2. Do not invoke DAGs outside of Airflow: This is a simple rule that ensures Airflow downtime does not lead to missed triggers. This makes scheduling maintenance windows much easier.
    For trigger-based scheduling we use two approaches:
    a. Use a Message Queue: A message is published on a message queue topic (e.g. PubSub). A Sensor or helper DAG periodically checks for new messages and then triggers the workflow.
    b. Make use of data-aware scheduling, which was introduced in Airflow 2.4 (see the sketch after this list).
  3. Tag your DAGs: Airflow allows tagging DAGs, which makes it simple to filter them in the UI. We ask each DAG to carry the following tags:
    a. Team name (indicating responsibility)
    b. Application name
    c. Feature name
    d. Processing frameworks used (k8s, spark, dbt, BigQuery and similar)
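To illustrate the last two guidelines, here is a minimal sketch of data-aware scheduling combined with our tagging convention, assuming Airflow 2.4+; the DAG ids, dataset URI, and tag values are illustrative:

import pendulum
from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

# Any stable string identifier works as a dataset URI.
royalty_statements = Dataset("gs://team-a-data/royalty-statements/")

# Producer DAG: its final task declares the dataset as an outlet.
with DAG(
    dag_id="ingest_royalty_statements",
    schedule="@daily",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
    tags=["team-a", "royalties", "statement-ingestion", "spark"],
):
    EmptyOperator(task_id="load_statements", outlets=[royalty_statements])

# Consumer DAG: scheduled on the dataset instead of an external trigger.
with DAG(
    dag_id="process_royalty_statements",
    schedule=[royalty_statements],
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
    tags=["team-a", "royalties", "royalty-processing", "dbt"],
):
    EmptyOperator(task_id="run_processing")

Whenever a run of the producer DAG completes the task carrying the outlet, the scheduler itself triggers the consumer DAG, so no component outside of Airflow needs to call the API.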

Beyond these practices, we will now take a look at:

  1. Operations, Maintenance Windows, Updates and Upgrades
  2. Continuous Deployment to Airflow

Operations, Maintenance Windows, Updates and Upgrades

Configuration and Package Management: Keep It Simple

We install only the packages that are absolutely necessary. Because the processing is not executed on workers, there is not much need for additional packages. This greatly simplifies package management during environment upgrades.

On the configuration side, we make very few changes from the Google Cloud Composer defaults. The only two changes made are:

  1. Setting the core.dags_are_paused_at_creation variable to True. This prevents DAGs from immediately starting a backfill once they have been registered. Each DAG’s desired paused state is then set via the API during the Continuous Deployment process (see Continuous Deployment below).
  2. Setting webserver.instance_name to test, staging, or production to help developers easily see which environment they are in. This is especially helpful with Google Cloud Composer, as the generated URL for the Airflow UI provides no human-readable information about the instance.

Cost Allocation

We separate Kubernetes workloads into different namespaces. With GKE cost allocation enabled, and with dedicated GCP projects for all other resources, we can accurately attribute the cost of all workloads to the responsible teams.
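A minimal sketch of such a namespaced workload is shown below; the namespace, Kubernetes Service Account, image, and DAG names are hypothetical, and on older versions of the cncf.kubernetes provider the import path is airflow.providers.cncf.kubernetes.operators.kubernetes_pod instead:

import pendulum
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="team_a_dbt_run",  # illustrative DAG id
    schedule="@daily",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
    tags=["team-a", "finance", "reporting", "dbt", "k8s"],
):
    KubernetesPodOperator(
        task_id="dbt_run",
        name="dbt-run",
        namespace="team-a",  # team-specific namespace, used for GKE cost allocation
        service_account_name="team-a-ksa",  # KSA mapped to the team's GCP Service Account
        image="europe-docker.pkg.dev/team-a-project/images/dbt-runner:latest",
        cmds=["dbt", "run"],
        get_logs=True,
    )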

Maintenance DAG

We use the Composer cleanup DAG to remove old logs and task instances regularly and keep our database running smoothly.

Maintenance Windows

While we have agreed on specific maintenance windows, reality sometimes overrules the plan. When that happens, we coordinate with all teams to find the best time for updates and upgrades. This procedure unfortunately becomes more complex with a growing number of DAGs and might soon be one factor in favor of breaking up the instances.

Updates

To minimize the impact on production environments, we always keep at least two environments on the same version. During times when staging is ahead of production, teams can decide to deploy new DAGs at their own risk or wait until testing of the new version is completed for all existing DAGs across all teams and the production environment has been upgraded. This approach may delay the adoption of new features by a few days, but it reduces maintenance overhead and ensures high availability for critical DAGs.

Upgrades

For larger upgrades, for example the Airflow 1 to Airflow 2 migration, we spun up three additional environments: one for test, one for staging, and one for production.

In the following section we will take a brief look at our continuous deployment mechanism.

Continuous Deployment

We do not want to grant all developers write access to the GCS bucket that holds the Python DAGs and is mirrored into Google Cloud Composer’s DAG bag. Instead, we enforce continuous deployment via git to promote DAGs from test to staging and eventually to production.

To simplify this process as much as possible, we have translated the principles of Infrastructure as Code (IaC) into a lightweight framework that allows us to define DAGs, variables, and connections in YAML. This also allows multiple git repositories to be linked to the same Airflow instance while supporting the deletion of DAGs.

For example, a DAG can be expressed as follows:

- type: dag
  name: deployment-dag-resource-name
  properties:
    dag_id: dag-id
    source: path/relative/to/repository/root/dag_id.py
    location: dags/relative/file/location/in/dag/bag/dag_id.py
    is_paused: {{ deployment_internal_variable_name.is_paused }}

The Jinja templating allows for different values across environments. This way, we get Airflow resources as code without introducing the full complexity of Terraform, Pulumi, or any other Infrastructure as Code tool. Setting up a new pipeline for a team is as simple as defining a Cloud Build YAML similar to the one below:

steps:
  - name: [region]-docker.pkg.dev/[project-id]/airflow-operations/airflowcd:latest
    env:
      - 'AIRFLOW_DEPLOYMENT_TEMPLATE=dags.yaml'
      - 'AIRFLOW_DEPLOYMENT_VARIABLES=${_AIRFLOW_DEPLOYMENT_VARIABLES}'

DAG Testing

DAG testing presents frequent challenges. While tools like Breeze, which allow you to quickly spin up Airflow on a local machine via Docker Compose, are a great help, and the improved airflow dags test command introduced in Airflow 2.5.0 will certainly simplify DAG testing further, it remains a challenge.
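One thing that helps is the dag.test() method that arrived alongside that command: a DAG file can be executed directly with a local Python interpreter, running all tasks in-process without a scheduler or webserver. A minimal sketch, assuming Airflow 2.5+ (the DAG and task names are illustrative):

import pendulum
from airflow.decorators import dag, task

@dag(
    dag_id="local_test_example",
    schedule=None,
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
)
def local_test_example():
    @task
    def say_hello():
        print("hello from a locally executed task")

    say_hello()

example_dag = local_test_example()

if __name__ == "__main__":
    # Runs the whole DAG in a single process, e.g. via `python local_test_example.py`.
    example_dag.test()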

For the test and staging environments, we allow developers to impersonate the service account dedicated to their team, i.e. the same account that Airflow would impersonate. This way, permission configurations can be tested.

We do not allow our developers to attach to worker nodes and simply execute their DAGs. This leads to a loss in efficiency during deployments as developers have to wait until the DAG appears in the Webserver to trigger and test it.

Closing Thoughts

In conclusion, running a multi-tenant Airflow cluster at BMG Rights Management has provided several key benefits, including reduced operational overhead and cost.

BMG has implemented mandatory and recommended development practices to help ensure stability and reliability. These practices emphasize data processing isolation, the use of short-lived credentials, atomic and idempotent tasks, avoiding external triggers, and tagging DAGs for easy filtering and identification.

In terms of operations, maintenance windows, updates, and upgrades, BMG follows a simplified approach to limit the installation of packages to only the essentials and make minimal changes to the default configuration. Cost allocation is achieved by separating Kubernetes workloads into different namespaces and using separate GCP Projects for other workloads. The maintenance DAG is utilized to clean up logs and task instances regularly.

One desired feature in Airflow or Cloud Composer is the ability to gracefully shut down workers only once no tasks are running, so that no tasks are disrupted. This feature would further enhance the stability and simplify the maintenance process of the multi-tenant Airflow cluster.

(Credits go to Viraj Parekh for his help on this article.)


I’m a passionate Data Architect, glider pilot, and culinary enthusiast. I thrive on creating simple but complete solutions to complex problems.