EXPEDIA GROUP TECHNOLOGY — DATA
Airflow — Scheduling as Code
Examples of using Apache Airflow to solve complex ETL scenarios
Introduction
In the Marketing Data Engineering team at Expedia Group™ we use Apache Airflow to run our ETL tasks. Airflow is a scheduling tool with which we define our workflows, their dependencies, and when we want them to run. The real power of Airflow comes from the fact that everything is code: the workflows, a.k.a. DAGs (Directed Acyclic Graphs), are all defined as Python scripts. It is a “Scheduling as Code” tool. Being able to control our schedules with Python code gives us a lot of flexibility to define complex logic in Airflow. Moreover, we work on big data workloads where the jobs are horizontally scalable. We have used Airflow to divide our big ETL tasks, run date-based daily and full backfills, and apply several other techniques in our team. Below are some of the problems we have solved using Airflow.
Problem 1 — Full data refresh within daily SLAs
One of our internal products, called the Marketing cube, joins data from all our advertising partners (Google, Facebook, etc.) with our core business and attribution data. We do a daily refresh of the last 30 days and a weekly full refresh of the last 2 years of data. Our problem was to perform this full refresh on Sunday within the daily SLAs.
One way was to increase the Spark executor memory and the number of executors, and let the same job handle the 2-year volume. This did not scale: the volume was 25 times larger, and we soon started getting out-of-memory errors. Moreover, any intermediate executor failure caused the failed task to rerun (and waste critical time). We needed something that would scale horizontally rather than vertically.
Since our application logic is the same but the volume is significantly higher for the weekly refresh, we decided to use Airflow to split the tasks in the weekly workflow into date range slices (we experimented with four- and six-month windows). Choosing the slice size was a test-and-learn activity that depended on data volume, Spark executor memory, number of executors, and the run time we were comfortable with (considering the SLA) for each task we were splitting.
As shown in the 2nd screen print below, we settled on 8 parallel tasks over 3-month windows. In the future, if our volumes increase further, we can easily change this to run more parallel tasks with smaller windows. So using Airflow we were able to meet daily SLAs for the full refresh and also future-proof the scheduling to handle increased volumes.
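The slicing itself is plain Python. Below is a minimal sketch of how a two-year range could be carved into eight contiguous windows; the `date_slices` helper and the concrete dates are illustrative, not our production code. In the real DAG each window would parameterise its own task (for example, a Spark job given the slice boundaries).

```python
from datetime import date, timedelta

def date_slices(start: date, end: date, num_slices: int):
    """Split [start, end) into num_slices contiguous date windows."""
    total_days = (end - start).days
    step = total_days // num_slices
    slices = []
    for i in range(num_slices):
        s = start + timedelta(days=i * step)
        # The last slice absorbs any leftover days from integer division.
        e = end if i == num_slices - 1 else start + timedelta(days=(i + 1) * step)
        slices.append((s, e))
    return slices

# Eight parallel windows of roughly three months each over a two-year range;
# in the DAG, one task per window runs the same job over a smaller slice.
windows = date_slices(date(2019, 1, 1), date(2021, 1, 1), 8)
```

Increasing the task count later is a one-line change to `num_slices`, which is exactly what makes the approach future-proof against growing volumes.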
Problem 2 — Handle upstream data delays
The Marketing cube provides a view of marketing spend across all the sources (Google, Facebook, etc.). We consume data from numerous external sources, and every now and then one or more of them may be delayed. We did not want our process to wait for data from all of them; instead, we wanted to design the workflow to process each source as soon as it was available.
To solve this we used the source as the partition column in our output Hive table. This partitioning enabled us to process each source independently whenever its upstream raw data was available. Airflow has a FileSensor operator that was a perfect fit for our use case: each leg of the workflow starts with a file sensor, and as soon as the upstream source data is available, that part of the workflow runs (2nd screen print below). This enabled our consumers to start getting our data sooner.
Just like the file sensor, Airflow also has a Hive partition sensor, which waits for a Hive partition to become available (similar to waiting on a file with the file sensor). Using these sensors enabled us, and our downstream consumers, to start processing as soon as data becomes available.
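Under the hood a file sensor simply pokes for a path at an interval until it appears or a timeout elapses. A minimal stand-in for that behaviour in plain Python is shown below; the `wait_for_file` helper is our own illustration, while in the DAG itself we use Airflow's built-in FileSensor.

```python
import os
import time

def wait_for_file(path: str, poke_interval: float = 60.0,
                  timeout: float = 3600.0) -> bool:
    """Poll for a file, much like Airflow's FileSensor pokes its filepath.

    Returns True once the file exists, or False if the timeout elapses
    first (in Airflow, the sensor task would fail or be rescheduled).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poke_interval)
    return False
```

The Hive partition sensor follows the same poke-until-present pattern, with a partition spec in place of a file path.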
Problem 3 — Monthly full download
The data from one of our partners is re-downloaded in full every month on the first Thursday. When that happens, we do not want the day’s daily downstream jobs to wait for the full download. Instead, we want to start the full download once that day’s daily data has been processed (for downstream consumption). This gives us a full day to download the entire historical data.
One option was to create a new workflow for the monthly download. Instead, we decided to reuse our daily workflow and added a branch for the full download. We used Airflow’s ShortCircuitOperator to bypass this branch every day except the first Thursday of the month. This added conditional logic to the workflow, running a part of it only when a business condition is met. It’s like an if/then statement in your schedule!
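The condition behind the short circuit is just a Python callable that returns a boolean. Here is a sketch of a “first Thursday of the month” check that such an operator could call; the helper name is ours, for illustration.

```python
from datetime import date

def is_first_thursday(d: date) -> bool:
    """True only on the first Thursday of the month.

    A ShortCircuitOperator given a callable like this would skip the
    downstream full-download branch on every other day.
    """
    # weekday() == 3 is Thursday; the first Thursday always falls on day 1-7.
    return d.weekday() == 3 and d.day <= 7
```

Returning False from the callable is what makes the operator “short-circuit”: the tasks downstream of it are skipped for that run.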
Problem 4 — Full download with a date range limitation (combining Problems 1 and 3)
The data from one of our other partners is also re-downloaded every month on the first Thursday, but they have a limitation: we can’t download more than 30 days of data in one API call. So we combined the techniques from Problem 1 (split into 30-day windows) and Problem 3 (use a short circuit to bypass the branch every day except the first Thursday of a month) to work around this limitation. As shown in the 1st screen print below, on the first Thursday we start downloading the entire data set in monthly chunks. And as time progresses, Airflow will increase the number of chunks depending on how far back we want to download.
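The 30-day API limit again reduces to date slicing, this time with a fixed maximum window instead of a fixed task count. A minimal sketch under those assumptions (the `api_chunks` helper and the example dates are illustrative):

```python
from datetime import date, timedelta

MAX_WINDOW = timedelta(days=30)  # the partner API's per-call limit

def api_chunks(start: date, end: date):
    """Split [start, end) into consecutive windows of at most 30 days each."""
    chunks = []
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + MAX_WINDOW, end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
    return chunks
```

As the history to re-download grows, the same function simply yields more chunks, so the DAG gains download tasks over time without any code change.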
Just as in a programming language, where we combine different statements into a logical program, we started combining these Airflow operators to create workflows with business rules and logic built in.
Conclusion
These examples show how flexible “Scheduling as Code” can be. With the power of a programming language we can define workflows with complex business conditions. We can easily scale existing code by splitting it into parallel tasks with smaller run windows, and with a small change we can increase the number of splits to handle any increased volume in the future.
Since all of this is code, we can track our scheduling changes in source control (like git) and auto-deploy them with a CI/CD pipeline. Gone are the days when an operations team updated the scheduler manually using a release form.
Airflow also provides many other out-of-the-box operators that we have not yet explored. We are confident that with Airflow as our scheduler and Python as its code, we will be able to handle any kind of scheduling problem that comes our way.