Airflow Best Practices

Matt Weingarten
Apr 18, 2022 · 5 min read


I touched on a few Airflow best practices in my post here, but since Airflow (or a service like Airflow) is such a critical part of the modern data stack, I wanted to go into more detail and add a few more tips to get the most out of your orchestration tool.

The famous pinwheel

Monitoring

Yes, the Airflow UI is an easy place to track the progress of DAGs and to see error messages when they occur. On top of that, I recommend logging messages to Slack along the way in your processing, so that if you're not actively looking at the UI, you'll still be able to see when failures occur. The Slack operators (SlackWebhookOperator, for instance) are very easy to use so long as you set up the connection properly.

Make sure that messages contain as much relevant context as they can (environment, date being run, etc.).
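For a concrete (if minimal) sketch, a notification task with the Slack provider could look something like the below. The connection ID, the environment Variable, and the DAG boilerplate are all placeholders, and the exact parameter names can vary a bit by provider version:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator

with DAG(
    dag_id="slack_notification_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    notify = SlackWebhookOperator(
        task_id="notify_slack",
        # Connection set up ahead of time; older provider versions call this http_conn_id.
        slack_webhook_conn_id="slack_webhook",
        # Include as much context as possible: environment, DAG, run date.
        message=(
            ":white_check_mark: [{{ var.value.environment }}] "
            "{{ dag.dag_id }} succeeded for run date {{ ds }}"
        ),
    )
```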

SLAs

SLAs in Airflow aren't perfect, as they track the time from the start of DAG execution to the end of the DAG (in the case of a DAG-level SLA) or to the completion of a specific task (in the case of a task-level SLA). Ideally, a task-level SLA would measure the runtime of that specific task rather than the DAG's elapsed time up to that point. There are efforts to fix this, and hopefully they land eventually.

Regardless, I do recommend setting SLAs in the same vein as the monitoring above. It's good to know when tasks are running longer than expected so you can make sure data will still be delivered. If there isn't a mandatory SLA, I recommend setting one anyway (perhaps an hour longer than a task's typical runtime, or something of the sort). I generally use task-level SLAs over DAG-level SLAs (with the expectation that the issue above will actually be resolved one of these days).
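As a rough illustration, a task-level SLA is just a timedelta on the operator; the DAG and the two-hour value here are purely illustrative:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def transform(**_):
    ...  # the actual processing


with DAG(
    dag_id="sla_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="transform",
        python_callable=transform,
        # Alert if this task hasn't finished within two hours of the DAG run's
        # start -- roughly "usual runtime plus a buffer."
        sla=timedelta(hours=2),
    )
```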

Trigger Files

DAGs run on a schedule and will attempt their first action as soon as they fire up. In the case of processing data, it's best to make sure the data is actually there before attempting to process it. In reality, data can be late for a litany of reasons, and there's no reason to launch an EMR cluster in an attempt to run a Spark job that's only going to fail and force you to resubmit the task.

Trigger files let you use a sensor to check for the data before running any subsequent tasks. Whether it's a simple .txt file or a _SUCCESS file in the case of Parquet data, having some mechanism in place to hold off until the data is actually ready removes a burden later on. That's the key takeaway here: don't process until you can process.
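Here's a hedged sketch using the Amazon provider's S3KeySensor to wait on that _SUCCESS marker; the bucket, key, and connection ID are placeholders, and the commented-out EMR dependency just shows where the downstream work would hang off of it:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="trigger_file_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Wait for the _SUCCESS marker before any downstream processing kicks off.
    wait_for_data = S3KeySensor(
        task_id="wait_for_data",
        bucket_name="my-data-bucket",
        bucket_key="input/{{ ds }}/_SUCCESS",
        aws_conn_id="aws_default",
        poke_interval=300,    # check every five minutes
        timeout=6 * 60 * 60,  # give up after six hours
        mode="reschedule",    # free up the worker slot between checks
    )

    # wait_for_data >> launch_emr_cluster  # only spin up EMR once the data has landed
```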

Tagging

When there are a lot of DAGs on a server, it might be difficult to scroll and find the one you’re interested in looking at. Tagging helps clean up this clutter, as you can filter for a specific tag in the UI and have a much easier time looking at what you need.
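Tags are just a list on the DAG object, so something like this (names purely illustrative) is all it takes to make a DAG filterable in the UI:

```python
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="daily_sales_load",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    tags=["sales", "daily", "emr"],  # filterable from the DAG list view
) as dag:
    ...
```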

Optimizing Dependencies

If multiple tasks can run in parallel, let them. This will make for a faster DAG runtime instead of processing everything sequentially. You can even extend this to EMR, which now allows for multiple jobs to run in parallel.
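In practice that just means fanning out in your dependency definitions. A minimal sketch with illustrative task names (EmptyOperator is the Airflow 2.3+ name for DummyOperator):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="parallel_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    load = EmptyOperator(task_id="load")
    transforms = [
        EmptyOperator(task_id=f"transform_{name}")
        for name in ("orders", "customers", "products")
    ]

    # The three transforms have no dependencies on each other, so they run in parallel.
    extract >> transforms >> load
```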

Incident Management

As an orchestration service, Airflow basically handles everything for your processing needs. As a result, it's important to track failures and SLA misses accordingly. You can use the callback functions (sla_miss_callback and on_failure_callback) with methods that raise an incident. If you're using a tool like PagerDuty or Opsgenie, you can store the API key as a variable and then use the API to open an incident so it's routed appropriately.
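As a sketch, an on_failure_callback that opens a PagerDuty incident could look something like this; the Variable name is a placeholder, and it assumes the PagerDuty Events v2 API (Opsgenie or another tool would follow the same pattern):

```python
import requests

from airflow.models import Variable


def page_on_failure(context):
    """on_failure_callback that opens a PagerDuty incident when a task fails."""
    task_instance = context["task_instance"]
    response = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            # API key stored as an Airflow Variable (name is a placeholder).
            "routing_key": Variable.get("pagerduty_routing_key"),
            "event_action": "trigger",
            "payload": {
                "summary": f"Airflow failure: {task_instance.dag_id}.{task_instance.task_id}",
                "source": "airflow",
                "severity": "error",
            },
        },
        timeout=10,
    )
    response.raise_for_status()


# Attach it via default_args or per task: on_failure_callback=page_on_failure
```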

Airflow CI/CD

Making changes to various parts of the Airflow infrastructure can be handled in an automated fashion with proper CI/CD. In the basic example of Airflow running on an EC2 server, you can SSH into the server and replace the DAGs folder with the newly-changed DAGs. Currently, our team handles all of these changes via Jenkins pipelines. DAGs are redeployed to the proper environment once a job is triggered, with separate jobs allowing for the creation/update of variables and connections. We also make a point of using a dev branch for the Dev environment and tags for the Prod environment, so it's easy to restore a previous state if things don't go to plan.

Automate All The Things!

If your DAG involves creating a cluster for processing, use a tool like Terraform to handle that creation so that clusters are only running when they need to run. Make sure that clusters are properly tagged to show up in the proper cost-tracking tool (this is an easy place to forget to apply said tags).

Any shell scripts for handling partition updates or the like should also be automated so that every part of the processing runs through the DAG. If these scripts are stored in S3, make sure the execution role has the proper permissions to access them.
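One way to fold such a script into the DAG itself is a BashOperator that pulls it down from S3 and runs it; the bucket and script name below are placeholders, and this assumes the AWS CLI is available on the worker:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="partition_update_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Pull the script from S3 and run it as part of the DAG; the execution role
    # needs read access to the bucket.
    update_partitions = BashOperator(
        task_id="update_partitions",
        bash_command=(
            "aws s3 cp s3://my-scripts-bucket/update_partitions.sh /tmp/update_partitions.sh "
            "&& bash /tmp/update_partitions.sh {{ ds }}"
        ),
    )
```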

Handling Ad-Hoc DAGs

Most DAGs run on some normal cadence and are easy to schedule/use. However, there are occasions where DAGs should be ad-hoc and only run when needed. For example, we once had a use case of running a DAG when we received reference data, which generally came monthly but wasn’t always guaranteed to do so.

We chose to handle this with a Lambda function. When the data arrived in S3, our Lambda would call the Airflow API and trigger the DAG (a great choice now that the API is out of experimental mode). It'd take whatever metadata it needed from the S3 file and put it into the payload it sent to the DAG. Perhaps I'll go into more depth on this in a different post.
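A hedged sketch of what that Lambda could look like, assuming the Airflow 2.x stable REST API with basic auth and a requests library bundled into the deployment package (the DAG ID and environment variables are placeholders):

```python
import json
import os

import requests  # bundled into the Lambda deployment package


def lambda_handler(event, context):
    """Triggered by S3 object-created events; kicks off the ad-hoc DAG."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    response = requests.post(
        f"{os.environ['AIRFLOW_BASE_URL']}/api/v1/dags/reference_data_load/dagRuns",
        auth=(os.environ["AIRFLOW_USER"], os.environ["AIRFLOW_PASSWORD"]),
        # Whatever metadata the DAG needs goes into conf.
        json={"conf": {"bucket": bucket, "key": key}},
        timeout=30,
    )
    response.raise_for_status()
    return {"statusCode": 200, "body": json.dumps(response.json())}
```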

Conclusion

There are a bunch of best practices when it comes to Airflow, and chances are you've seen plenty from other posts here or from other good sources of information. I decided to go into more detail on some that have been on my mind, and I'm happy to discuss any others.


