Data Aware Scheduling: A Case Study of Custom Ink

Published in Apache Airflow · 6 min read · Apr 10, 2024


By: Tom Stark, Michael Peteuil

The introduction of data-aware scheduling in Airflow 2.4 dramatically changed how workflows can be orchestrated. This article outlines how the feature reduced our Mean Time to Remediation (MTTR), lessened the complexity of managing DAG issues, and cut our total daily pipeline run time.

Custom Ink’s mission is to help people create a stronger sense of community by enabling them to design and order custom t-shirts and gear for their clubs, companies, charities, family reunions, and more. We leverage data for a variety of use cases, including marketing, logistics, customer relationship management, financial reconciliation, and more. The data team has always been a tiny fraction of the total tech team at Custom Ink, with just a handful of engineers and analysts. Because our team is small and Airflow is our primary orchestrator, the performance of our Airflow deployments is paramount.

Old environment

The data processing infrastructure at Custom Ink started simply, with EC2 instances running shell scripts on cron schedules. After years, that approach showed its shortcomings, and around September 2016, after testing an Airflow 1.7.2 deployment with Python 2.7, we migrated the setup to Airflow running on a single AWS EC2 instance. Fast forward to today and our team runs Airflow 2.7.3 on Astronomer. Our instance has approximately 350 DAGs, most of which run once a day. The bulk of these DAGs either extract data from internal databases and third-party sources and push the results to AWS S3, or run SQL to build tables in AWS Redshift or Google BigQuery.

The vast majority of our DAGs followed one of the following patterns:

  1. DAGs we call rawdata DAGs pull from sources, dump to AWS S3, and create tables in AWS Redshift.
  2. A custom DAG factory creates individual DAGs for fact and dimension tables, identifying dependent upstream tables from a dependency graph generated by our CI/CD processes. These tables maintain proper build ordering through ExternalTaskSensors that follow a canonical naming convention aligning each table name with a task name.

Example of the old DAG pattern. wait_for_ tasks were created with the ExternalTaskSensor.
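To make the pattern concrete, here is a simplified sketch of what one of these generated fact DAGs looked like. The DAG IDs, task IDs, schedule, SQL paths, and connection ID are illustrative rather than our actual configuration, and the real DAGs were produced by the DAG factory rather than written by hand.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="factdata_orders",
    start_date=datetime(2023, 1, 1),
    schedule="30 6 * * *",  # chosen to start after the rawdata DAG usually finishes
    catchup=False,
):
    # Canonical naming: wait_for_<upstream task>, where the task name matches the table name.
    wait_for_table_1 = ExternalTaskSensor(
        task_id="wait_for_rawdata_mysql_table_1",
        external_dag_id="rawdata_mysql",
        external_task_id="rawdata_mysql_table_1",
        execution_delta=timedelta(minutes=30),  # offset between the two cron schedules
        timeout=60 * 60,
    )
    wait_for_table_2 = ExternalTaskSensor(
        task_id="wait_for_rawdata_mysql_table_2",
        external_dag_id="rawdata_mysql",
        external_task_id="rawdata_mysql_table_2",
        execution_delta=timedelta(minutes=30),
        timeout=60 * 60,
    )
    build_orders = PostgresOperator(
        task_id="build_factdata_orders",
        postgres_conn_id="redshift",
        sql="sql/factdata_orders.sql",
    )

    # The table is only built once both sensors succeed.
    [wait_for_table_1, wait_for_table_2] >> build_orders
```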

Challenges with old patterns

One of the challenges with this arrangement was the difficulty of remediating issues when DAGs failed. In these circumstances, the failing DAG would normally retry several times while downstream DAGs began to start. The ExternalTaskSensors pointed at the failing upstream tasks would also retry, but in the process they would consume pool slots and eventually time out and fail. Gradually, the best practice evolved into turning off all DAGs, clearing failed jobs one at a time, and re-enabling DAGs incrementally, which was a time consuming process.

A second challenge was DAG schedule management. In our old setup, it was very difficult to manage schedules effectively when there were cross-DAG dependencies. We were constantly adjusting schedules based on the typical completion time of upstream DAGs. Normally, this happened only after data consumers noticed discrepancies or tables built out of order, and we would then make several schedule adjustments to start downstream jobs later in case of upstream delays. While we had some tooling to flag potential scheduling issues, this was still a difficult process to manage.

Data-aware scheduling adoption

After data-aware scheduling was released in Airflow 2.4, we quickly upgraded our platform with the intent of leveraging the new feature. While DAGs in our rawdata layer would continue to run on cron schedules, the plan was to remove most ExternalTaskSensors in other DAGs and replace their functionality by scheduling with Datasets. We accomplished this by repurposing the same dependency graph and naming conventions we had used before.

In practice, this meant that instead of having ExternalTaskSensors named wait_for_rawdata_mysql_table_1 and wait_for_rawdata_mysql_table_2 as upstream dependencies of the PostgresOperator that builds our factdata.orders table, we removed the sensors entirely. We now pass a list of Datasets as the DAG's schedule, and those Datasets are updated by the same tasks the ExternalTaskSensors previously polled. When all of the Datasets have been updated, the DAG runs, the PostgresOperator executes, and our table is built without needing sensors. While we have deviated from best practices for Dataset naming, our team decided this was the easiest way to adopt the new feature, with the intent of later refining it to use a proper URI format.
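A minimal sketch of the new arrangement is below, again with illustrative DAG IDs, task IDs, SQL paths, and connection IDs rather than our real configuration, and assuming Airflow 2.4+ where Dataset scheduling is available. The extraction and load logic is simplified to single SQL steps for brevity.

```python
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Illustrative Dataset names; as noted above, ours mirror our task/table naming
# convention rather than a proper URI scheme.
rawdata_mysql_table_1 = Dataset("rawdata_mysql_table_1")
rawdata_mysql_table_2 = Dataset("rawdata_mysql_table_2")

# Producer side: the same rawdata tasks the sensors used to poll now declare a
# Dataset as an outlet, so each successful run emits a Dataset event.
with DAG(
    dag_id="rawdata_mysql",
    start_date=datetime(2023, 1, 1),
    schedule="0 6 * * *",  # rawdata DAGs stay on cron schedules
    catchup=False,
):
    PostgresOperator(
        task_id="rawdata_mysql_table_1",
        postgres_conn_id="redshift",
        sql="sql/rawdata_mysql_table_1.sql",
        outlets=[rawdata_mysql_table_1],
    )
    PostgresOperator(
        task_id="rawdata_mysql_table_2",
        postgres_conn_id="redshift",
        sql="sql/rawdata_mysql_table_2.sql",
        outlets=[rawdata_mysql_table_2],
    )

# Consumer side: no sensors. The DAG is scheduled on the Datasets themselves and
# runs once both have been updated since its previous run.
with DAG(
    dag_id="factdata_orders",
    start_date=datetime(2023, 1, 1),
    schedule=[rawdata_mysql_table_1, rawdata_mysql_table_2],
    catchup=False,
):
    PostgresOperator(
        task_id="build_factdata_orders",
        postgres_conn_id="redshift",
        sql="sql/factdata_orders.sql",
    )
```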

How did data-aware scheduling help with these issues?

We immediately started seeing dramatic improvements in our MTTR after this work was deployed. When we encountered issues, we no longer had to scramble to shut off DAGs and clear tasks. Instead, this new feature allowed us to focus exclusively on the issue, get the task into a state where it could complete successfully, and Airflow would handle the rest. Since our downstream jobs no longer have ExternalTaskSensors, DAGs now kick off in order when their respective Datasets are pushed. While we have incomplete data, estimates suggest that this change reduced our issue remediation time by 30–50%.

Deploying new DAGs became much easier as well. Choosing a schedule is now only necessary when new rawdata jobs are deployed, since all other fact and dimension tables are built off of that layer. For DAGs downstream of rawdata, we no longer need to hunt for the right schedule; we simply have those DAGs listen for the appropriate Datasets.

While these were the two primary reasons driving our adoption of data-aware scheduling, another benefit was the improvement in our overall daily DAG run time. In our old environment, much of our task execution time was consumed by ExternalTaskSensors trying and retrying while waiting for upstream tasks to complete. With the move to data-aware scheduling, all of that wait time was suddenly gone. That alone caused our core factdata.orders table to land 2–3 hours earlier on average.

Lessons Learned

One of the biggest challenges with adopting event-driven scheduling was change management. When data-aware scheduling was rolled out, most Airflow users were still in the habit of clearing DAGs en masse to fix issues when they arose. DAGs were also being cleared to test new code deployed to production. In both of these circumstances, downstream DAGs would start, pushing Dataset updates that would feed the next DAG run the following day. This is a problem for us because we require that tables be built in a strict order. If DAGs run prematurely, the result can be missing or inaccurate data the following day, something we had to manage several times while data-aware scheduling was being adopted. Eventually, we developed tooling to stop Datasets from being pushed when users need to test or re-run tasks, and team members stopped re-running completed DAGs the way they had in the past. After additional training and a few practice cycles, these problems seldom arise anymore.
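Our tooling is internal, but one way to sketch the idea is a dedicated publish task that is the only carrier of a DAG's Dataset outlet and that skips itself when an operator-controlled toggle is set. Skipped tasks do not emit Dataset events, so flipping the toggle during testing or re-runs keeps downstream DAGs from being triggered. The Dataset, task, and Variable names here are hypothetical, not our actual implementation:

```python
from airflow.datasets import Dataset
from airflow.decorators import task
from airflow.exceptions import AirflowSkipException
from airflow.models import Variable

factdata_orders = Dataset("factdata_orders")


# Hypothetical publish task: the only task in the DAG carrying the Dataset outlet.
# If the suppression toggle is on, the task skips itself; a skipped task emits no
# Dataset event, so downstream DAGs are not triggered by the run.
@task(outlets=[factdata_orders])
def publish_factdata_orders():
    if Variable.get("suppress_dataset_events", default_var="false").lower() == "true":
        raise AirflowSkipException("Dataset event suppressed for testing/re-runs.")
```

In this sketch the publish task sits downstream of the table-building task, so a normal scheduled run still emits the Dataset event exactly as before.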

Alerting was also a challenge in this new arrangement. Previously, ExternalTaskSensors acted as a canary in the coal mine whenever we experienced problems. When tasks timed out, hung, or ran much longer than expected, the ExternalTaskSensors would inevitably fail and we would get alerted. Without these sensors, there were instances where problems slipped through the cracks and we were not alerted. To close this gap, we added cron-scheduled DAGs that check whether certain DAGs have completed successfully at least once by the time the check runs, similar to the Absolute Time checks offered by Astronomer. These checks have occasionally produced false positives, but overall they have safeguarded us against DAGs failing silently.
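Our check DAGs are internal as well, but a minimal sketch of the idea, using a hypothetical list of critical DAG IDs and a made-up check schedule, could look like the following. A failed check surfaces as a normal task failure, so existing alerting picks it up:

```python
from datetime import datetime

import pendulum
from airflow.decorators import dag, task
from airflow.exceptions import AirflowFailException
from airflow.models import DagRun
from airflow.utils.state import DagRunState

# Hypothetical list of DAGs that must have landed by the time the check runs.
CRITICAL_DAG_IDS = ["factdata_orders", "dimdata_customers"]


@dag(schedule="0 14 * * *", start_date=datetime(2024, 1, 1), catchup=False)
def freshness_checks():
    @task
    def assert_completed_today(dag_id: str):
        """Fail (and therefore alert) if the DAG has no successful run since midnight UTC."""
        midnight = pendulum.now("UTC").start_of("day")
        runs = DagRun.find(
            dag_id=dag_id,
            state=DagRunState.SUCCESS,
            execution_start_date=midnight,
        )
        if not runs:
            raise AirflowFailException(f"{dag_id} has not completed successfully today")

    # One mapped check task per critical DAG.
    assert_completed_today.expand(dag_id=CRITICAL_DAG_IDS)


freshness_checks()
```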

Conclusion

For Custom Ink, data-aware scheduling has been a welcome addition. For legacy Airflow deployments, there may be some retrofitting required to incorporate event-driven scheduling. However, in circumstances like ours, with many cross-DAG dependencies, it can deliver several positive outcomes.
