Production-grade Airflow alerting in 5 minutes with Databand

John Blust
Databand, an IBM Company
Jul 20, 2021 · 8 min read

Setting up Airflow alerting is key to creating a scalable data organization, especially as consumers demand higher-quality data more consistently. While engineers are great at spotting issues when they’re looking, they can’t be looking all the time. That’s true no matter how many dashboards you have, how many engineers you hire, and how much you streamline your architecture.

Good alerting is like an insurance policy. It gives your engineers time to remediate before your data SLAs are missed and unhealthy datasets are utilized by consumers. Having good alerting coverage for your long-running, critical Airflow workflows ensures your engineers know that something has happened that could affect data health, as soon as it happens.

Luckily, Databand makes that pretty easy to set up. In this guide, you’ll learn how to set up production-grade alerting for your key Airflow pipelines in 5 minutes or less with zero changes to your existing pipeline code. (Sounds too good to be true? Our Airflow integration is pretty nice.)

This guide consists of four steps:

  1. Identify a critical Airflow pipeline
  2. Identify a key metric
  3. Set alert definition
  4. Connect alert receiver

It’s as straightforward as it sounds. Before we get started, let’s touch on a question you might have.

Why not use Airflow’s default alerting functionality?

Airflow is great at doing what it was made for: orchestrating tasks. Airflow also has nice built-in functionality for monitoring and alerting. But as flexible as Airflow is, it wasn’t built to be data-aware.

All of that to say, there’s no straightforward way to integrate Airflow into your organization’s incident management process. You can set alerts in Airflow, but the default alert receiver would be email. That works if you use email, but if you’re like most people, alert emails can get lost in a noisy inbox. Not only that, but it’s hard to tell whether the alert you are receiving is of critical importance or an issue that can wait until you’re done with your current task.
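For reference, here’s what that baseline looks like in Airflow itself. This is a minimal sketch with a hypothetical DAG id and email address; failure notifications go out over whatever SMTP connection you’ve configured for Airflow.

```python
# A minimal sketch of Airflow's built-in alerting: email on task failure,
# configured through default_args. Requires SMTP settings in airflow.cfg.
# The DAG id and email address below are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-eng",
    "email": ["oncall@example.com"],
    "email_on_failure": True,   # send an email when a task fails
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_etl",
    default_args=default_args,
    start_date=datetime(2021, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(task_id="extract", bash_command="echo extracting")
```

It works, but the alert lands in an inbox with no severity, no routing, and no sense of whether the data behind the run is actually healthy.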

Databand solves those problems

The Databand Dashboard makes it easy to:

  • Set production-grade alerting on your Airflow environments fast
  • Integrate your alerts with your current incident management systems
  • Get flexible alerting definitions with Anomaly Detection

Databand offers three out-of-the-box critical metrics so you can set alert definitions with no-code configuration. This will get you tracking the most important metrics on your Airflow workflows right away. Those metrics are:

  1. Run Status – What state is my Run in right now (Running, Successful, Failed, Queued, etc.)?
  2. Run Duration – How long has my pipeline been running/taken to run?
  3. Delay Between Subsequent Runs – How long is the schedule interval between jobs?

All of these metrics work together to answer the larger question: Is your data uptime being met?
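To make those three metrics concrete, here’s roughly what computing them by hand against Airflow’s stable REST API (Airflow 2.x) would look like. This is purely an illustration of what the metrics mean; Databand collects them for you, and the URL, credentials, and DAG id below are hypothetical.

```python
# Illustration only: computing Run Status, Run Duration, and Delay Between
# Subsequent Runs from Airflow's stable REST API (Airflow 2.x). Databand
# tracks these automatically; the endpoint, credentials, and DAG id are
# hypothetical, and the snippet assumes at least two historical runs.
from datetime import datetime, timezone

import requests

AIRFLOW_URL = "http://localhost:8080/api/v1"  # hypothetical webserver URL
DAG_ID = "example_etl"                        # hypothetical DAG id

resp = requests.get(
    f"{AIRFLOW_URL}/dags/{DAG_ID}/dagRuns",
    params={"limit": 2, "order_by": "-start_date"},  # order_by needs Airflow 2.1+
    auth=("airflow", "airflow"),  # assumes the basic_auth API backend
)
resp.raise_for_status()
latest, previous = resp.json()["dag_runs"][:2]

# 1. Run Status: what state is the latest run in right now?
print("run status:", latest["state"])

# 2. Run Duration: how long has it been running / did it take to run?
start = datetime.fromisoformat(latest["start_date"])
end = (
    datetime.fromisoformat(latest["end_date"])
    if latest["end_date"]
    else datetime.now(timezone.utc)
)
print("run duration:", end - start)

# 3. Delay Between Subsequent Runs: gap between the last two runs' starts.
prev_start = datetime.fromisoformat(previous["start_date"])
print("delay between runs:", start - prev_start)
```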

For this guide, you’ll see how to set a production-grade alert on Run Duration for an Airflow pipeline, with no code, in less than 5 minutes (the process for the other two metrics is essentially the same). Now, let’s get started.

1. Identify your critical Airflow pipeline

Deciding which pipelines deserve alerts overlaps with a lot of questions you should be asking yourself when setting your organization’s data SLAs.

What is the purpose of a given pipeline? How important is the timeliness & freshness of the data delivery? What happens if the data isn’t delivered when expected?

For example, a pipeline that delivers data for an external consumer’s dashboard might have a higher impact on your business than a pipeline that delivers data for a nascent data science experiment. Now, both pipelines might have enough importance to deserve an alert, but the severity of those alerts can fall into four categories in Databand:

  • Critical — The house will burn down if this doesn’t get fixed ASAP.
  • High — A 2 am service call isn’t necessary, but you’ll need to fix it during normal working hours.
  • Medium — This issue’s impact on the business isn’t huge, but it will need fixing eventually.
  • Low — This issue’s impact on the business is debatable, but as long as things are quiet, you can look into it.

Referencing our previous examples, the pipeline that delivers data for an external consumer likely has a strict SLA around it. Tangible profits/losses are realized depending on whether that SLA is met. That would be classified as a critical impact pipeline, and thus requires a critical severity alert. The pipeline for the nascent data science experiment has importance for the organization’s R&D efforts but hasn’t proven profitability. So, a pipeline like that one could fall into the medium or low impact category — depending on how much buy-in the team has gotten from leadership.

At this point, you have an idea of a critical pipeline in your system that needs complete alerting coverage.

2. Identify a key metric

Which metrics matter most to a pipeline depends, once again, on the purpose of the pipeline, the consumer, and its importance. What you should be optimizing your SLAs and Airflow pipelines for can be determined by an Iron Triangle exercise.

Each point of the Iron Triangle represents a specific aspect of a data quality SLA, and together they represent a mutually agreeable definition of “good” data quality for that pipeline. When you optimize your pipeline for one aspect of your SLA, it begins to “pull” at the others, moving the definition of “good” data quality away from the center.

Simply put, data deliveries adhere to the old adage: “Fast, good, and cheap. Pick two.” If that external-facing dashboard needs accurate data delivered every day by 4 am, then the engineering cost required to ensure that delivery is met on time will be high. In that case, setting up an alert on Run Duration, a leading indicator of whether your pipeline is on track to deliver by 4 am, would be a good decision.
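As a rough illustration of why Run Duration works as a leading indicator (this is not how Databand evaluates it; the times and thresholds here are hypothetical), the idea is simply to compare how long the current run has been going against its typical duration and the time left before the deadline:

```python
# Illustration only: Run Duration as a leading indicator of a 4 am data SLA.
# All timestamps and durations below are hypothetical.
from datetime import datetime, timedelta, timezone

TYPICAL_DURATION = timedelta(minutes=45)  # hypothetical historical average

run_started_at = datetime(2021, 7, 20, 2, 30, tzinfo=timezone.utc)
deadline = datetime(2021, 7, 20, 4, 0, tzinfo=timezone.utc)  # 4 am delivery SLA
now = datetime(2021, 7, 20, 3, 40, tzinfo=timezone.utc)

elapsed = now - run_started_at

# The run is already slower than usual, and the SLA hasn't been missed yet,
# so an alert fired now gives engineers time to remediate before 4 am.
if elapsed > TYPICAL_DURATION and now < deadline:
    overrun = elapsed - TYPICAL_DURATION
    remaining = deadline - now
    print(f"ALERT: run is {overrun} over its typical duration "
          f"with {remaining} left before the SLA deadline")
```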

The hard part is over now. We’ve identified a critical pipeline and a key metric. Setting the alert and connecting it to your receiver is simple in Databand.

3. Set the alert definition

First things first, log into Databand.

[Screenshot: the Databand dashboard]

You should be on the main “Dashboard” page that looks something like this. Over in the left-hand corner, you will see your main menu.

[Screenshot: pipeline monitoring in Databand]

Click on the “Pipelines” tab. This gives you a scannable view of all your pipelines. You can easily filter your pipelines using the Projects drop-down menu or the search bar. In this case, you know the name of your pipeline so you can search for it.

Click on your critical pipeline and then click on the “Add Alert Definition” button in the top right corner.

This will bring you to the alert customization menu. Here, you can set your alert logic, name it, and set its severity level. The key metric we identified earlier was Run Duration, so you’ll click that metric.

Next to your default metrics, you’ll see your options for defining your alert logic. You can use traditional logic operators like “equals” (==), “does not equal” (!=), “less than” (<), and more.

In this situation, we’ll be using the Anomaly Detection setting. This will allow you to cut down on time spent manually checking and adjusting alert thresholds. Databand’s Anomaly Detection uses machine learning to automatically adjust the threshold based on the pipeline’s historical performance and the sensitivity settings you choose.

This is important because trends in pipeline metrics can change over time due to a variety of factors we won’t get into right now. The goal of this feature is to reduce alert fatigue by adding some dynamic flexibility to traditional alerting logic.
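If it helps to picture what a dynamic threshold does (this is an illustrative sketch, not Databand’s actual algorithm; the durations and sensitivity value are hypothetical), think of it as a bound derived from recent history instead of a hard-coded number:

```python
# Illustration only: a naive dynamic threshold built from historical run
# durations (mean plus k standard deviations). Databand's Anomaly Detection
# is configured in the UI; this is just the mental model, with fake numbers.
from statistics import mean, stdev

def duration_threshold(history_minutes, sensitivity=3.0):
    """Upper bound on run duration; runs above it would fire an alert."""
    return mean(history_minutes) + sensitivity * stdev(history_minutes)

recent_durations = [42.0, 45.5, 40.2, 47.1, 44.8, 43.3]  # hypothetical minutes
threshold = duration_threshold(recent_durations)

current_run_minutes = 61.0  # hypothetical in-flight run
if current_run_minutes > threshold:
    print(f"ALERT: duration {current_run_minutes:.1f}m exceeds {threshold:.1f}m")
```

As new runs come in, the history (and therefore the threshold) shifts with the pipeline’s actual behavior, which helps avoid the false alarms a static threshold would generate as normal performance drifts.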

In just a couple of minutes, you’ve configured a critical alert that affords you 24/7 dynamic coverage on an important pipeline. Now, you need to integrate the alert into your standard receiving systems.

4. Connect the alert receiver

As things stand, you’ll receive a notification in Databand when an alert fires. But as we mentioned before, you can’t have an engineer staring at that screen all day. You need to integrate the alert into your organization’s regular alert receiver. Luckily, Databand makes that easy, too.

Databand supports integrations with email, Slack, Opsgenie, PagerDuty, and other custom receivers. Databand’s default receiver is Slack, so we’ll be using that for this example.

To set the alert receiver as Slack, click on your User Profile in the bottom right corner. Then, click the “Settings” button.

Navigate to the “Alert Receivers” tab and click the “Add New Receiver” button in the top right corner.

In this menu, you can name your receiver, insert the Webhook URL associated with your Slack workspace, and select which channel you want the alert to be pushed to.
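If you want to sanity-check that webhook before wiring it into Databand, a Slack incoming webhook just accepts a small JSON payload over HTTP (the URL below is a placeholder for the one Slack generates for you):

```python
# Quick manual test of a Slack incoming webhook, independent of Databand.
# The webhook URL is a placeholder; use the one generated for your workspace.
import requests

webhook_url = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder
payload = {"text": ":rotating_light: Test alert: run duration exceeded threshold"}

resp = requests.post(webhook_url, json=payload, timeout=10)
resp.raise_for_status()  # Slack returns HTTP 200 with body "ok" on success
```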

After you’ve done so, you can click the “Add” button in the bottom left corner.

All done.

That’s it. Databand allows you to set production-grade alerting on your critical Airflow environments in less than 5 minutes.

To sum it all up, Databand solves multiple problems for you in this scenario:

  • You can guarantee shorter time-to-remediation with critical severity alerts that are compatible with your organization’s incident management processes.
  • You can set up alerting in your Airflow environments fast.
  • You can cut down on the time your engineers spend manually monitoring and adjusting alerts with Anomaly Detection.

All of that with no code required.

Want to learn how to set up production-grade alerts on custom metrics like Data Input Size, Data Output Size, and Record Count?

Stay tuned for the next edition by subscribing to our publication.

Need coverage on your critical assets right now? Book a product demo or start your free trial to get started with Databand.ai.

Want to learn more about Databand.ai? Request a demo with one of our experts at https://databand.ai/request-demo/
