Lessons in adopting Airflow on Google Cloud

Parin Porecha
Booking.com Engineering
8 min read · Apr 24, 2024

One of the responsibilities of the AdTech team in Marketing Science is to manage workflows that transfer data between marketing partners and Marketing Science’s data lake. These workflows and the data in them are diverse, covering anything from the performance data of marketing campaigns, to campaign parameters that need to be tweaked programmatically on the partner’s side, to data integrity checks that capture data availability SLIs.

These workflows were managed through a workflow management system written in Perl and MySQL, custom-built for the team’s use-cases and deployed on-premises on bare-metal boxes. As the team and its scope grew, this started to prove inadequate, and the overhead of having to develop even basic new features ourselves started to weigh on the team. In 2021, we decided to migrate these workflows to Airflow on GCP through its Cloud Composer offering. This gave us out-of-the-box all the features we previously had to develop ourselves, the support of a solid open-source community, and a multi-year reduction in our infrastructure age. A nice bonus was getting to code in Python and run it on the cloud.

As we onboarded all the different kinds of workflows to Composer over the last two years, we learned a number of lessons. This post shares them with the wider programmer community in the hope that they will be helpful to someone about to set out on the same path.

Developer Experience

As the development on our Composer instance picked up pace, we realized that the developers in the team needed a way to iterate quickly in development and testing. Having an independent Composer instance for each developer was a questionable investment, but we did need something similar: an environment that would not be shared and that would not have to go through the traditional deployment process.

The solution we adopted was to create a local Airflow environment that closely mimics the remote production environment. Google builds Composer images by bundling Airflow releases with Python libraries. They do not publicly reveal the Dockerfile they use to build these images, but they do share the list of all Python packages that are part of each image. We built a custom Airflow Docker image in a similar way, basing it on Airflow’s generic image and adding these packages on top of it to reach parity with Composer. Authentication to on-premises services was handled through service account impersonation, detailed in a section below. Here’s the Dockerfile that we ended up creating -

ARG AIRFLOW_VERSION=2.5.3

FROM apache/airflow:${AIRFLOW_VERSION}-python3.8

# Build args declared before FROM are not visible after it,
# so re-declare them here for the labels below.
ARG AIRFLOW_VERSION
ARG COMMIT

LABEL author="AdTech"
LABEL commit=${COMMIT}
LABEL AIRFLOW_VERSION=${AIRFLOW_VERSION}

COPY requirements.txt /home/airflow/requirements.txt

USER root

RUN apt-get update \
&& apt-get install -y \
build-essential \
pkg-config \
python3-dev \
musl-dev \
libc-dev \
rsync \
zsh \
findutils \
wget \
util-linux \
grep \
default-libmysqlclient-dev \
libxml2-dev \
libxslt-dev \
libpq-dev \
iputils-ping \
telnet

USER airflow

RUN pip3 install -r /home/airflow/requirements.txt --src /usr/local/src

The requirements.txt contains all the Python packages listed by GCP, locked to the same versions as the ones in our Composer image.
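For illustration, entries in that file look something like this; the package names and versions below are placeholders, while the real file mirrors, version for version, the package list GCP publishes for the exact Composer image in use:

apache-airflow-providers-google==8.12.0          # placeholder version
apache-airflow-providers-cncf-kubernetes==5.2.2  # placeholder version
google-cloud-bigquery==2.34.4                    # placeholder version
pandas==1.5.3                                    # placeholder version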

The image created through this Dockerfile is then used to power an Airflow cluster that we spin up locally through Docker Compose for Airflow.

This approach came with some Pros and Cons:

Pros:

  • Ease of development has improved a lot and code that gets accepted to the staging environment is less buggy.
  • Centralized tuning ensures uniformity in development environments across the team.
  • The same Docker image is now used in our CI pipelines, enabling tests inside a production-like environment.

Cons:

  • The image needs at least yearly maintenance to keep it in line with our Composer version upgrades.
  • GCP’s metadata server is not present in this local cluster, so custom code needs to be written to handle that.
  • Our GCP project’s network and firewalls are independent of the local machine’s network, and that disparity can’t be fixed.

Performance

  • celery.worker_concurrency
    The default value (16) turned out to be too high for us and led to a scenario where tasks got scheduled on a worker but, because the underlying resources were busy, never actually executed. Since those tasks were no longer queued, Composer believed everything was fine and did not deem it necessary to create a new worker instance through auto-scaling.
CPU usage hitting the limits but the number of instances (yellow) staying the same

The solution was to reduce it so that tasks would remain in the queued state until a worker became available to execute them. Setting it to a very low value resulted in new workers being created without any tasks getting scheduled on them, so we iteratively increased it to the average of the minimum and maximum worker counts set in auto-scaling multiplied by the number of vCPUs in each (8). This has ensured that there is always spare capacity available while keeping costs in check. It can be tuned further to reduce costs by lowering vCPU and/or memory per worker so that total usage stays at about 50–70% of the limit.

Auto-scaling being triggered in both directions based on actual CPU usage
  • scheduler.min_file_process_interval, core.dag_file_processor_timeout and core.dagbag_import_timeout
    The values of these parameters as set by the default configuration (30 seconds) caused our Scheduler to always stay near 100% CPU utilization, which left tasks stuck in the queued state and produced an ever-present “The scheduler does not appear to be running” warning whenever we visited the Webserver UI. After looking at our development patterns we realized:
    - We have a lot of DAGs.
    - A lot of them are heavy, containing about 150–200 tasks.
    - We do not add new DAGs frequently, as the adoption has matured.
    So we increased the refresh values to 180–240 seconds, giving the Scheduler enough time to parse the DAGs without doing so too frequently. This has made it possible to keep Scheduler resources limited and still support ~35 DAGs at 50–70% utilization. A quick way to double-check which values are actually in effect is shown in the sketch after this list.
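While tuning, it is handy to read the effective values back through Airflow’s own configuration API. A minimal sketch, run from a Python shell or an ad-hoc task inside the environment; the section/key names are exactly the options discussed above:

from airflow.configuration import conf

# The section/key pairs match the options discussed above; getint() returns
# the value the running scheduler and workers actually use, including any
# overrides applied on Composer or in the local cluster.
print(conf.getint("celery", "worker_concurrency"))
print(conf.getint("scheduler", "min_file_process_interval"))
print(conf.getint("core", "dag_file_processor_timeout"))
print(conf.getint("core", "dagbag_import_timeout"))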

Heavy Lifting through Dataproc

Airflow as a data processor is immensely powerful, but it is still limited by the constraints of the underlying compute. Horizontal scaling can help achieve the desired performance, but it cannot substitute for vertical capacity requirements (memory/CPU/storage). Say there’s a DAG with tasks that need double the memory a worker possesses, but this DAG only needs to run daily. We could increase the worker node’s capacity, but then it would remain idle for most of the day, costing us that spare capacity. Furthermore, the increase would be applied to every new worker instance whenever auto-scaling kicks in, even though it is unnecessary.

So instead for one such DAG of ours, we rely on Dataproc (GCP’s managed Spark) to do the actual computation, and Airflow only behaves as an orchestrator for it. This is how it works in practice:

The DAG creates a Dataproc cluster programmatically using DataprocCreateClusterOperator, tasks submit jobs through DataprocSubmitPySparkJobOperator and wait for them to complete, and when all the jobs have been processed the cluster is deleted through DataprocDeleteClusterOperator. Since all these operators are built in, no custom code needed to be written to achieve this.
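A trimmed-down sketch of such a DAG; the project, region, cluster name, machine types and job path below are placeholders, not our production values:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitPySparkJobOperator,
)

PROJECT_ID = "my-gcp-project"      # placeholder
REGION = "europe-west1"            # placeholder
CLUSTER_NAME = "adhoc-processing"  # placeholder

with DAG(
    dag_id="dataproc_heavy_lifting",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        # The cluster is sized for the job, not for the Airflow workers.
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-highmem-8"},
        },
    )

    process = DataprocSubmitPySparkJobOperator(
        task_id="process_data",
        region=REGION,
        cluster_name=CLUSTER_NAME,
        main="gs://my-bucket/jobs/process_data.py",  # placeholder path
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule="all_done",  # tear the cluster down even if a job fails
    )

    create_cluster >> process >> delete_cluster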

The resources needed by the DAG are minimal as it is only coordinating the Dataproc cluster, and since the lifespan of that cluster equals the runtime of the DAG, the costs stay low. Some teams in Booking run this setup at a much bigger scale. Airflow is getting better with every version at supporting this use-case, and since version 2.5 there’s a direct link to the created Dataproc resource in the Airflow UI.

Embedded documentation in DAGs

Booking.com, like any other organization of this scale, suffers from knowledge discovery issues. Documentation, code, monitoring dashboards and the data lake all live on different platforms, which makes it difficult to interlink knowledge and keep a central overview of it. Airflow offers a partial solution in the form of a lesser-known but powerful feature that renders documentation written in the DAG code in the UI itself. It supports Markdown as well, making it possible for those docs to be rich.

Example docs in a DAG file
Same docs being rendered in Airflow UI
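In code, such inline docs look roughly like the following sketch; the DAG id and the Markdown content here are made up:

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="partner_report_import",  # made-up example DAG
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    # Markdown passed to doc_md is rendered on the DAG's page in the UI.
    doc_md="""
### Partner report import

* **Owner:** AdTech
* **Source:** partner reporting API
* **Output:** data lake tables consumed by Marketing Science dashboards
""",
) as dag:
    ...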

Even better is the functionality to render docs that are separate from the code -

with DAG(
    dag_id='example_dag',
    schedule_interval="@daily",
    default_args=default_args,
    # A relative .md path is resolved through the DAG's Jinja template
    # search path, which includes the DAG file's own folder by default.
    doc_md="./docs/example_dag_README.md"
) as dag:
    ...

Service Account Impersonation for security

A team that needs programmatic access to the Composer instance needs access to the service account linked to it. For us this created distribution and maintenance issues around sharing the service account’s key: we would either have to share a common key with all members of the team, creating a risk of a leak, or each of us would have to create their own key, multiplying that risk. The rotation of those keys would pose a coordination challenge of its own. This was not a problem we wanted to have.

Not the most ideal way to access the SA

The solution we adopted was to allow the team’s GCP IAM group to impersonate the service account by requesting short-lived credentials. It requires the caller (a team member) to authenticate first through the standard login window instead of with a key; once authenticated, they are granted the same rights as the service account they are allowed to impersonate. The right to impersonate is given by granting the IAM role Service Account Token Creator on the target SA -

Team’s group granted Service Account Token Creator permission on the SA

Using this functionality in code required just refreshing the Application Default Credentials before using a client library. We decided to do it through gcloud at the beginning of the local developer environment bootup:

Makefile step that refreshes ADC, can be optimized to do periodic refreshes as well
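Once the ADC have been refreshed that way (gcloud auth application-default login --impersonate-service-account=<sa-email> under the hood), client libraries pick up the impersonated identity without any further changes. A minimal sketch, assuming BigQuery purely as an example client; any Google client library behaves the same:

import google.auth
from google.cloud import bigquery  # assumption: BigQuery is just an example client

# ADC were refreshed beforehand by the Makefile step, so default() now yields
# short-lived credentials for the impersonated service account.
credentials, project_id = google.auth.default()

client = bigquery.Client(credentials=credentials, project=project_id)
print(list(client.list_datasets()))  # this call now runs as the service account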

Migrating to Airflow on Google Cloud Platform brought significant improvements to our workflow management. These lessons, gained through iterative development, tuning and refining, helped us build a more agile and reliable infrastructure, which in turn made it possible to achieve product impact that would not have been possible in the older on-premises setup.

If you’d like to work on exciting challenges like these, check out our careers page.
