3 Steps to Advanced Alerting on Airflow with Databand

Josh Benamram
Databand, an IBM Company
7 min read · May 5, 2020

The role of a data engineer has become critically important to the success of data-driven organizations, and Airflow has become a standard in the data engineering toolkit. It’s become so important that huge organizations like Adobe, Walmart, and Twitter have actively contributed to Airflow’s development. That’s because these organizations have invested record amounts in their data infrastructure and the creation of data products.

While the demand for engineers is high, the ideal state of a data engineer is that of anonymity. Why? Because end-users only realize the significance of the data engineer’s job when there’s a data quality problem.

A data engineer’s primary function is to create trust in their data. As a data engineer, your goal should be building a data architecture that produces data so accurately and efficiently that your data consumers forget your Slack handle and email address.

You need to be able to see issues in your pipeline performance before they affect data delivery. You need to proactively solve problems so that your data SLAs aren’t at risk.

Monitoring for Airflow and 20+ other tools on Databand.ai

Databand.ai is a unified data observability platform that helps you identify and prioritize the pipeline issues that impact your data product the most.

The Databand Platform tracks custom pipeline metadata and provides alerting, trend analysis, and debugging tools so you can compare pipeline performance and quickly discover bottlenecks.

Databand.ai can tap into metadata like runtime statuses, logging info, and data metrics, and use that info to build alerts on leading indicators of data health issues. With Databand, you can proactively maintain the integrity of your data infrastructure and the quality of your data product. Databand.ai sends alerts to your email, Slack, PagerDuty, and other channels, so your DataOps team can get ahead of pipeline issues.

Airflow monitoring is pretty painful

Airflow has a great UI for monitoring high-level schedules and task statuses, but there’s a lot of room for users to improve their pipeline visibility. In the 2020 Airflow user survey, nearly half of respondents reported that logging, monitoring, and alerting on Airflow are top areas for improvement. As power users and contributors ourselves, we’re intimately familiar with these challenges.

Early Warning Signals on Long-Running Processes

Airflow is often used for orchestrating long-running processes that take hours, days, or longer to complete.

Let’s take this example pipeline (also called a “DAG”), a three-step Python process:
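Here’s a minimal sketch of what such a DAG might look like, assuming Airflow 2.x-style imports. The task names (unit_imputation, dedup_records, create_report) match the tasks referenced later in this post, and the function bodies are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def unit_imputation(**context):
    # Placeholder: replace NaNs in the raw data set
    pass


def dedup_records(**context):
    # Placeholder: drop duplicate records
    pass


def create_report(**context):
    # Placeholder: aggregate results into the final report
    pass


with DAG(
    dag_id="market_data_report",  # hypothetical name for the example
    start_date=datetime(2020, 5, 1),
    schedule_interval="@daily",
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    impute = PythonOperator(task_id="unit_imputation", python_callable=unit_imputation)
    dedup = PythonOperator(task_id="dedup_records", python_callable=dedup_records)
    report = PythonOperator(task_id="create_report", python_callable=create_report)

    impute >> dedup >> report
```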

What makes a simple pipeline like this a long-running process? A lot of factors could be the culprit, including abnormally large volumes of “big” data, complex task logic, or database connection pooling that leaves tasks waiting in a queue.

Poor data quality and late delivery in critical pipelines spell trouble for your data product. End-users like data scientists, analysts, and business users depend on pipelines finishing within a certain window of time.

Being able to receive an early warning signal of pipeline failures or delays is a huge advantage, especially when delays like these cost organizations money.

As a real-world example, let’s say our simple pipeline above was built by a stock trading company for analyzing market data. If the job starts at midday using prior day trading data and takes 4 hours to complete, it would need to finish around 4 pm to leave time for analysis and next-day strategy planning. If the job is running late, you want that notification as early in the process as possible.

Whereas an alert later in the process could mean losing an entire day of work, an alert at 1 pm that tasks are delayed lets you start working on a resolution right away. That early notice gives your team time to identify the cause of the delay, fix the problem, and rerun the job before delivery is due.

Databand.ai can give you an early heads-up so you can intervene and fix the problem fast, without losing time and money on wasted processing. With zero changes to your project’s existing code, you can use Databand to create robust alerting on leading indicators of failures and delays.

Step 1 — Creating Leading Indicators of Problems

While you can use Databand’s production alerting system to notify on the duration of the overall pipeline, this alert is a lagging indicator of a problem. We want leading indicators — early warning signs.

A better approach is to use Databand.ai to alert on runtime properties of individual tasks in the pipeline. Databand.ai will monitor tasks as they run and send alerts if there are failures (or other status changes), if a task has an unusual number of retries, or if a task’s duration exceeds some threshold. You can also receive an alert if a task does not start at the expected time. Setting these alerts on task runtime properties will tell you when data is at risk of late delivery much earlier than traditional monitoring methods.
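To make that concrete, here is a purely illustrative sketch of the kinds of task-level, leading-indicator rules described above. This is not Databand’s configuration format; these alerts are set up through Databand’s alerting UI, and the thresholds here are hypothetical:

```python
from datetime import timedelta

# Illustrative only: leading-indicator rules for individual tasks in the
# example pipeline. In Databand these are configured in the alerting UI,
# not written as code; the thresholds below are hypothetical.
TASK_ALERT_RULES = [
    {"task": "unit_imputation", "on": "status", "alert_when": "failed"},
    {"task": "unit_imputation", "on": "retries", "alert_when_over": 2},
    {"task": "dedup_records", "on": "duration", "alert_when_over": timedelta(hours=1)},
    {"task": "create_report", "on": "start_delay", "alert_when_over": timedelta(minutes=30)},
]
```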

Step 2 — Diving into the Data

So far, you’ve learned how to set run-level and task-level alerts. With Databand helping you track those metrics, you’ll have a much stronger detection system for pipeline issues.

What are you missing?

To properly support your data consumers, it’s not enough to know that pipelines will complete on time. You also need to know that the data you’re delivering is up to quality standards. Can the end-user actually work with it?

Luckily, you can use the Databand logging API to report metrics about your data sets.

Databand can automate this process, but for this example, you’ll be defining which custom metrics are going to be reported whenever your pipeline runs.

When Airflow runs the pipeline code, Databand’s logging API will report metrics to Databand’s tracking system, where you can alert on things like schema changes, data completeness rules, or other integrity checks.

In the above screenshot, the logging API is reporting the following metrics from the pipeline:

  • avg score: a custom KPI reported from the create_report task
  • number of columns: schema info about the data set being processed
  • removed duplicates: reported from the dedup_records task
  • replaced NaNs: reported from the first task in the pipeline, unit_imputation
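For illustration, here is a minimal sketch of how metrics like these might be reported from inside the pipeline’s tasks using the log_metric helper from the open-source dbnd package. The data frames and the score column are assumptions made for the example:

```python
import pandas as pd
from dbnd import log_metric  # Databand's logging API (open-source dbnd package)


def unit_imputation(raw_df: pd.DataFrame) -> pd.DataFrame:
    # Count and replace NaNs, then report the count to Databand
    nan_count = int(raw_df.isna().sum().sum())
    clean_df = raw_df.fillna(0)
    log_metric("replaced NaNs", nan_count)
    return clean_df


def dedup_records(df: pd.DataFrame) -> pd.DataFrame:
    # Drop duplicate rows and report how many were removed
    deduped = df.drop_duplicates()
    log_metric("removed duplicates", len(df) - len(deduped))
    return deduped


def create_report(df: pd.DataFrame) -> pd.DataFrame:
    # Report schema info and a custom KPI for the final data set
    log_metric("number of columns", df.shape[1])
    log_metric("avg score", float(df["score"].mean()))  # assumes a 'score' column
    return df
```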

Databand.ai can alert on any metrics reported to the system, and you are free to create as many as needed.

Step 3 — Using Metrics as Progress Checks

Now, it’s time to add data visibility into your alerting process.
With Databand.ai’s advanced alerting and anomaly detection, you can use the same metrics tracking function to gain insight into internal task progress. This is particularly useful when you have complex pipelines with lots of tasks and subprocesses, or a few tasks with complex logic inside them.

With conditional alerts, you can create intra-task status checks by tracking whether a metric fails to report within a certain timeframe while a task is running.

In our example, the first task, unit_imputation, reports the number of replaced NaNs in the data set. Looking at historical run trends, you can expect the task to complete within about 1 hour. Based on where the metrics log is placed in the code, the metric is usually reported about 30 minutes after the task starts. You can use this expected behavior to create a conditional alert that gives you great insight into what’s happening inside your process.

Alert Logic:

IF the duration of the unit_imputation task is greater than 45 minutes AND the replaced NaNs metric is missing, THEN trigger an alert
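Rendered as pseudocode, the condition looks roughly like this (illustrative only; in practice the rule is configured through Databand’s alerting UI rather than written by hand):

```python
from datetime import timedelta


def should_alert(task_duration: timedelta, reported_metrics: set) -> bool:
    # Trigger when unit_imputation has run past the 45-minute buffer window
    # without the expected "replaced NaNs" metric having been reported.
    return task_duration > timedelta(minutes=45) and "replaced NaNs" not in reported_metrics
```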

First, let’s describe the logic behind this alert.
The NaNs metric should be reported about halfway into the task’s runtime, which typically takes 1 hour in total. The alert adds some buffer to that, requesting a notification if the task runs for 45 minutes without reporting the metric. A metric that is still missing after that long is an early warning that the pipeline is hung up on something that could lead to further delays downstream. Since the alert is set on the first task of the pipeline, you have enough time to investigate the issue and restart the process after a fix.

And what was required to set this up? Only a single use of the logging API.

Databand’s alerting framework handles the rest and takes only minutes to configure.

Airflow alerting for staying proactive, not reactive

After following these three steps, your alerting coverage includes:

  • Overall run durations
  • Task durations
  • Data quality metrics
  • Internal task progress

You’re now in a much better position to anticipate failures, delays, and data quality problems before they affect your data product.
Databand.ai makes the process easy: get your first alerts up and running with out-of-the-box metrics tracking, then dig into deeper performance insights from there.

Try it out!

Start your 30-day free trial and get deeper visibility into your critical data pipelines. Or request a demo to check out our product in action!

www.databand.ai/request-demo
