EXPEDIA GROUP TECHNOLOGY — DATA
Airflow — Scheduling as Code
Examples of using Apache Airflow to solve complex ETL scenarios
Introduction
In the Marketing Data Engineering team at Expedia Group™ we use Apache Airflow to run our ETL tasks. Airflow is a scheduling tool with which we define our workflows, their dependencies, and when we want them to run. The real power of Airflow comes from the fact that everything is code: the workflows, a.k.a. DAGs (Directed Acyclic Graphs), are all defined as Python scripts. It is a “Scheduling as Code” tool. Being able to control our schedules with Python code gives us a lot of flexibility to define complex logic in Airflow. Moreover, we work on big data workloads where the jobs are horizontally scalable. We have used Airflow to divide our big ETL tasks, run date-based daily and full backfills, and apply several other techniques in our team. Below are some of the problems we have solved using Airflow.
Problem 1 — Full data refresh within daily SLAs
One of our internal products, called the Marketing cube, joins data from all our advertising partners (Google, Facebook, etc.) with our core business and attribution data. We do a daily refresh of the last 30 days and a weekly full refresh of the last 2 years of data. Our problem was to perform this full refresh on Sunday within the daily SLAs.
One way was to increase the Spark executor memory and the number of executors, and let the same job handle the 2-year volume. This did not scale: the volume was 25 times larger, and we soon started getting out-of-memory errors. Moreover, any intermediate executor failure caused the failed task to rerun (and waste critical time). We needed something that would scale horizontally rather than vertically.
Since our application logic is the same but the volume is significantly higher for the weekly refresh, we decided to use Airflow to split the tasks in the weekly workflow into date range slices (we experimented with four- and six-month windows). Choosing the slice size was a test-and-learn activity that depended on data volume, Spark executor memory, number of executors, and the run time we were comfortable with (considering the SLA) for each task we were splitting.
As shown in the 2nd screen print below, we settled on 8 parallel tasks over 3-month windows. In the future, if our volumes increase further, we can easily change this to run more parallel tasks with smaller windows. So using Airflow we were able to meet daily SLAs for the full refresh and also future-proof the scheduling to handle increased volumes.
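The slicing itself is plain Python. Below is a minimal sketch of how a two-year range could be carved into eight contiguous windows; the `date_slices` helper and the concrete dates are illustrative, not our production code. In the real DAG each window would parameterise its own task (for example, a Spark job given the slice boundaries).

```python
from datetime import date, timedelta

def date_slices(start: date, end: date, num_slices: int):
    """Split [start, end) into num_slices contiguous date windows."""
    total_days = (end - start).days
    step = total_days // num_slices
    slices = []
    for i in range(num_slices):
        s = start + timedelta(days=i * step)
        # The last slice absorbs any leftover days from integer division.
        e = end if i == num_slices - 1 else start + timedelta(days=(i + 1) * step)
        slices.append((s, e))
    return slices

# Eight parallel windows of roughly three months each over a two-year range;
# in the DAG, one task per window runs the same job over a smaller slice.
windows = date_slices(date(2019, 1, 1), date(2021, 1, 1), 8)
```

Increasing the task count later is a one-line change to `num_slices`, which is exactly what makes the approach future-proof against growing volumes.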
Problem 2 — Handle upstream data delays
The Marketing cube provides a view of marketing spend across all the sources (Google, Facebook, etc.). We consume data from numerous external sources, and every now and then one or more of them may be delayed. We did not want our process to wait for data from all of them; instead, we wanted to design the workflow to process each source as soon as it was available.
To solve this we used the source as the partition column in our output Hive table. This partitioning enabled us to process each source independently whenever its upstream raw data was available. Airflow has a FileSensor operator that was a perfect fit for our use case: each leg of the workflow starts with a file sensor, and as soon as the upstream source data is available, that part of the workflow runs (2nd screen print below). This enabled our consumers to start getting our data sooner.
Just like the file sensor, Airflow also has a Hive partition sensor, which waits for a Hive partition to become available (similar to waiting on a file with the file sensor). Using these sensors enabled us, and our downstream consumers, to start processing as soon as data becomes available.
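Under the hood a file sensor simply pokes for a path at an interval until it appears or a timeout elapses. A minimal stand-in for that behaviour in plain Python is shown below; the `wait_for_file` helper is our own illustration, while in the DAG itself we use Airflow's built-in FileSensor.

```python
import os
import time

def wait_for_file(path: str, poke_interval: float = 60.0,
                  timeout: float = 3600.0) -> bool:
    """Poll for a file, much like Airflow's FileSensor pokes its filepath.

    Returns True once the file exists, or False if the timeout elapses
    first (in Airflow, the sensor task would fail or be rescheduled).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poke_interval)
    return False
```

The Hive partition sensor follows the same poke-until-present pattern, with a partition spec in place of a file path.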
Problem 3 — Monthly full download
The data from one of our partners is re-downloaded in full every month on the first Thursday. When that happens, we do not want the day’s daily downstream jobs to wait for the full download. Instead, we want to start the full download once that day’s daily data has been processed (for downstream consumption). This gives us a full day to download the entire historical data.
One option was to create a new workflow for the monthly download. Instead, we decided to reuse our daily workflow and added a branch for the full download. We used Airflow’s ShortCircuitOperator to bypass this branch every day except the first Thursday of the month. This added conditional logic to the workflow, running a part of it only when a business condition is met. It’s like an if/then statement in your schedule!
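The condition behind the short circuit is just a Python callable that returns a boolean. Here is a sketch of a “first Thursday of the month” check that such an operator could call; the helper name is ours, for illustration.

```python
from datetime import date

def is_first_thursday(d: date) -> bool:
    """True only on the first Thursday of the month.

    A ShortCircuitOperator given a callable like this would skip the
    downstream full-download branch on every other day.
    """
    # weekday() == 3 is Thursday; the first Thursday always falls on day 1-7.
    return d.weekday() == 3 and d.day <= 7
```

Returning False from the callable is what makes the operator “short-circuit”: the tasks downstream of it are skipped for that run.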
Problem 4 — Full download with a date range limitation (combining Problems 1 and 3)
The data from one of our other partners is also re-downloaded every month on the first Thursday, but they have a limitation: we can’t download more than 30 days of data in one API call. So we combined the techniques from Problem 1 (split into 30-day windows) and Problem 3 (use a short circuit to bypass the branch every day except the first Thursday of a month) to work around this limitation. As shown in the 1st screen print below, on the first Thursday we start downloading the entire data set in monthly chunks. And as time progresses, Airflow will increase the number of chunks depending on how far back we want to download.
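The 30-day API limit again reduces to date slicing, this time with a fixed maximum window instead of a fixed task count. A minimal sketch under those assumptions (the `api_chunks` helper and the example dates are illustrative):

```python
from datetime import date, timedelta

MAX_WINDOW = timedelta(days=30)  # the partner API's per-call limit

def api_chunks(start: date, end: date):
    """Split [start, end) into consecutive windows of at most 30 days each."""
    chunks = []
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + MAX_WINDOW, end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
    return chunks
```

As the history to re-download grows, the same function simply yields more chunks, so the DAG gains download tasks over time without any code change.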
Just as in a programming language, where we combine different statements into a logical program, we started combining these Airflow operators to create workflows with business rules and logic built in.
Conclusion
These examples show how flexible “Scheduling as Code” can be. With the power of a programming language we can define workflows with complex business conditions. We can easily scale existing code by splitting it into parallel tasks with smaller run windows, and with a small change we can increase the number of splits to handle any increased volume in the future.
Since all of this is code, we can track our scheduling changes in source control (like git) and auto-deploy them with a CI/CD pipeline. Gone are the days when an operations team updated the scheduler manually using a release form.
Airflow also provides many other out-of-the-box operators that we have not yet explored. We are confident that with Airflow as our scheduler and Python as its code, we will be able to handle any kind of scheduling problem that comes our way.