How To Analyze And Resolve Data Pipeline Incidents In Databand

Eitan Chazbani @ Databand
Databand, an IBM Company
4 min read · Oct 18, 2022

This post was written by Niv Sluzki, Engineering Manager @ Databand, an IBM Company

A data pipeline failure can cripple your downstream data flows. Whether a pipeline failed to start or quit unexpectedly, you need to know about the incident immediately.

In this post, we’re going to walk through how to analyze a failed Airflow pipeline and pinpoint the root cause of your data incidents.

Watch the video to see it in action or continue reading below.

Analyze pipeline health

Databand provides one view for all your pipelines so you can easily diagnose their health. The pipelines screen lists every pipeline being run and tracked, along with its run status and any associated alerts.

You can search for the project you want to dive into. For this example, we’ll search in the “Service 311” project to find our “service_311_get_data” pipeline.

From here, you can select the pipeline and dive into all the runs associated with the pipeline to view its run history.

Conduct root cause analysis

Databand keeps the history of all your pipeline runs and their associated alerts. The critical alerts in the run history should catch our attention.

Let’s dive into the failed run named “scheduled__2022-09” and work through the root cause analysis of why this alert triggered and the run failed.

Now we’re in the run details screen, and you can see the DAG visualization on the right side. This helps you understand the pipeline lineage and exactly which pipeline step failed.

In this Airflow DAG, you can see that the “write_to_dw” task, shown in red, failed.
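To make the lineage idea concrete, here is a minimal Python sketch of how a failed task relates to the tasks downstream of it. It is not Databand's or Airflow's implementation. Only “write_to_dw” comes from the run above; the other task names and the edges between them are hypothetical.

```python
from collections import deque

# Hypothetical DAG: each task maps to the tasks that run directly after it.
# Only "write_to_dw" appears in the run above; the rest are assumptions.
DAG = {
    "fetch_311_data": ["clean_data"],
    "clean_data": ["write_to_dw"],
    "write_to_dw": ["refresh_report"],
    "refresh_report": [],
}


def downstream_of(dag, failed_task):
    """Return every task reachable from failed_task, in breadth-first order."""
    seen = []
    queue = deque(dag.get(failed_task, []))
    while queue:
        task = queue.popleft()
        if task not in seen:
            seen.append(task)
            queue.extend(dag.get(task, []))
    return seen


# Tasks that cannot run because write_to_dw failed.
blocked = downstream_of(DAG, "write_to_dw")
print(blocked)
```

Anything returned here is work that never ran, which is a useful first estimate of how far the failure spread.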

Why?

Well, if you go to the left side, you’ll see the logs for each task. In this case, it looks like there was a permission error due to an expired AWS token.
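If you have the raw task log in hand, the same diagnosis can be roughly automated. The sketch below scans log text for the error markers AWS typically returns when temporary credentials expire. The log excerpt, including the operation name, is a made-up illustration rather than output from the run above, and the function name is our own.

```python
import re

# Markers AWS services commonly include when temporary credentials expire.
EXPIRED_TOKEN_PATTERN = re.compile(
    r"ExpiredToken|security token included in the request is expired",
    re.IGNORECASE,
)


def find_expired_token_lines(log_text):
    """Return (line_number, line) pairs that look like expired-credential errors."""
    return [
        (number, line.strip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if EXPIRED_TOKEN_PATTERN.search(line)
    ]


# Hypothetical log excerpt for illustration; not copied from a real run.
sample_log = """\
INFO - Starting task write_to_dw
ERROR - ClientError: An error occurred (ExpiredToken) when calling the PutObject operation: The provided token has expired.
INFO - Marking task as FAILED
"""

matches = find_expired_token_lines(sample_log)
for number, line in matches:
    print(f"line {number}: {line}")
```

A match points to credentials rather than code or data, so the fix is to refresh the token and rerun the task.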

Get the whole picture

At the top of the run details screen are a few tabs you’ll likely use when debugging the run failure. These tabs allow you to get the whole picture of the impact of the failed run.

Here are a few highlights of the tabs.

Metrics tab

The metrics tab shows all the metrics associated with the run. You can also use this screen to set new alerts. Select the graph icon next to a metric and scroll down to see it plotted, which shows how the metric has trended across runs.
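To illustrate what a trend-based alert checks, here is a small Python sketch that flags a metric value sitting far outside its recent history. It is an assumption-laden illustration, not how Databand computes its alerts. The row counts, the function name, and the threshold of three standard deviations are all hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag latest if it is more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    spread = stdev(history)
    if spread == 0:
        return latest != history[0]  # any change from a constant history is notable
    return abs(latest - mean(history)) / spread > threshold


# Hypothetical row counts from previous successful runs.
row_counts = [1000, 1020, 980, 1010, 990]

print(is_anomalous(row_counts, 0))     # a run that wrote nothing
print(is_anomalous(row_counts, 1005))  # a run in the usual range
```

A fixed threshold is the simplest choice. A real alert would usually also account for seasonality and for gradual drift in the metric.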

Code and run info tabs

These tabs show the code used in the pipeline execution and information about the environment the run executed in.

Affected datasets

This tab is the most important because it shows every dataset affected by the pipeline failure.

For example, you can see there are two issues/incidents with missing operations. One operation was a read operation, and the other a write operation.

You can see which operations worked successfully if you click the operations tab.
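Conceptually, a missing operation is one that earlier runs performed but the failed run did not. A minimal Python sketch of that comparison follows. The dataset names are hypothetical and the logic is only an illustration of the idea, not Databand's detection method.

```python
# Hypothetical dataset operations as (operation_type, dataset_name) pairs.
# Operations seen in a previous successful run:
expected_ops = {
    ("read", "raw_requests"),
    ("write", "staged_requests"),
    ("read", "staged_requests"),
    ("write", "dw_requests"),
}

# Operations the failed run completed before it stopped:
observed_ops = {
    ("read", "raw_requests"),
    ("write", "staged_requests"),
}


def missing_operations(expected, observed):
    """Return operations that were expected but never happened, sorted for stable output."""
    return sorted(expected - observed)


missing = missing_operations(expected_ops, observed_ops)
for op_type, dataset in missing:
    print(f"missing {op_type} on {dataset}")
```

In this made-up case the result is one read and one write, matching the shape of the two incidents described above.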

To see historical trends, select the graph dots to jump into previous runs.

Wrapping it up

For more information on how Databand can help you analyze and resolve data pipeline incidents, check out our demo center or book a demo.
