Improving Data Observability for Efficient Data Troubleshooting

Lyn Chen · Published in Data Room · 5 min read · Jun 5, 2023

As a dedicated data engineer, a significant portion of my daily routine revolves around mundane debugging issues, such as identifying a specific data row that hasn’t been updated. The root cause behind such occurrences can be diverse, ranging from unexpected interruptions in data pipelines to a downstream report failing to fetch new data, or even discrepancies caused by another application modifying the same row with different values. I once spent an entire day fixing a problem like this, only to discover that the user had never submitted the data after entering it. What a frustrating experience.

That was until I became familiar with the concept of “Data Observability” and began leveraging a powerful tool called Databand. This tool allows me to gain a comprehensive understanding of the health and state of data within my system at a glance.

“Essentially, data observability covers an umbrella of activities and technologies that, when combined, allow you to identify, troubleshoot, and resolve data issues in near real-time,” as Databand puts it.

Now let me walk you through part of the laborious workload data engineers face and show how Databand accelerates issue resolution, freeing up valuable time to build more meaningful and impactful features.

Scenario:

Consider the following example of a rough interest rate calculation flow: First, we extract and join mortgage application details and applicants’ personal information from DB2. Then, we retrieve applicants’ credit scores from a credit reporting system hosted on PostgreSQL. Using all the reference data, we search for corresponding rates in MongoDB. Finally, we summarize the information into a report for business users, notifying them of mortgage application results.
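
To make the scenario concrete, here is a minimal Python sketch of that flow under a few assumptions: the connection strings, table names (MORTGAGE_APPLICATIONS, APPLICANT_DETAILS, credit_reports), the interest_rates collection, and the column names are all hypothetical placeholders, not the actual systems behind this example.

```python
# A minimal sketch of the interest rate flow described above. All connection
# strings, table, collection, and column names are illustrative assumptions.
import pandas as pd
import ibm_db_dbi          # DB2 driver
import psycopg2            # PostgreSQL driver
from pymongo import MongoClient


def build_applicant_rates() -> pd.DataFrame:
    # 1. Extract and join mortgage applications and applicant details from DB2.
    db2_conn = ibm_db_dbi.connect(
        "DATABASE=MORTGAGE;HOSTNAME=db2-host;PORT=50000;UID=user;PWD=secret;"
    )
    applications = pd.read_sql(
        """
        SELECT a.APPLICATION_ID, a.APPLICANT_ID, a.LOAN_AMOUNT,
               p.FULL_NAME, p.BIRTH_DATE
        FROM MORTGAGE_APPLICATIONS a
        JOIN APPLICANT_DETAILS p ON a.APPLICANT_ID = p.APPLICANT_ID
        """,
        db2_conn,
    )

    # 2. Retrieve credit scores from the credit reporting system on PostgreSQL.
    pg_conn = psycopg2.connect("dbname=credit host=pg-host user=user password=secret")
    credit_scores = pd.read_sql(
        "SELECT applicant_id, credit_score FROM credit_reports", pg_conn
    )
    enriched = applications.merge(
        credit_scores, left_on="APPLICANT_ID", right_on="applicant_id"
    )

    # 3. Look up the corresponding rate for each credit score band in MongoDB
    #    (assumes a rate band exists for every score).
    rates = MongoClient("mongodb://mongo-host:27017")["pricing"]["interest_rates"]
    enriched["RATE"] = enriched["credit_score"].map(
        lambda score: rates.find_one(
            {"min_score": {"$lte": score}}, sort=[("min_score", -1)]
        )["rate"]
    )

    # 4. Summarize into a report table for business users.
    enriched.to_csv("APPLICANT_INTEREST_RATES.csv", index=False)
    return enriched
```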

Now, imagine if the business users don’t receive updated reports on time. As depicted in the diagram below, it often takes days of communication and investigation to identify the underlying cause, resulting in delays in the mortgage application process and potential revenue loss.

Even after all that communication, as shown in the picture below, data engineers still struggle to disentangle the complexity of the data pipelines, because the data sources and pipelines are hosted on multiple systems and operated with different ETL tools and programming languages, which makes the task even more time-consuming.

To address these challenges, Databand allows data engineers to promptly grasp the current status of their data. Let’s explore the measures we can take at different stages:

1. Production support escalated by business users:

· Clicking the ‘Dashboard’ button provides an overview of data observability throughout the entire company. It offers insights into top errors, failed pipeline runs, current pipeline runtimes, and runtime predictions, all presented in a single view.

· By navigating to the ‘Datasets’ tab and searching for the specific dataset, such as “APPLICANT_INTEREST_RATES”, we can access detailed information about the table, including a condensed view of data operations and modification history performed by pipelines and runs.

· Further investigation involves examining the failed run, which Databand presents as a comprehensive data flow structure. By clicking on individual tasks, such as “CREDIT_SCORE”, we can access table schemas, historical write and read metrics, and operation logs. Once we identify the root cause, we can link directly to the ETL tool interfaces to resolve the issue.

· Additionally, when encountering errors within a dataflow run, it’s essential to review the health of the data pipeline itself. Clicking on “Pipelines” shows information about each run of a specific pipeline, allowing us to evaluate runtime and failure ratios (a rough sketch of these health metrics follows this list).
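
To illustrate what those health metrics mean, here is a small sketch that computes average runtime and failure ratio from hypothetical run records. It is not Databand’s implementation or API, just the arithmetic behind the numbers the ‘Pipelines’ view surfaces.

```python
# Illustrative only: average runtime and failure ratio for a set of pipeline
# runs, computed from hypothetical run records rather than Databand's API.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PipelineRun:
    pipeline: str
    started_at: datetime
    finished_at: datetime
    state: str  # e.g. "success" or "failed"


def pipeline_health(runs: list[PipelineRun]) -> dict:
    durations = [(r.finished_at - r.started_at).total_seconds() for r in runs]
    failures = sum(1 for r in runs if r.state == "failed")
    return {
        "runs": len(runs),
        "avg_runtime_seconds": sum(durations) / len(runs) if runs else 0.0,
        "failure_ratio": failures / len(runs) if runs else 0.0,
    }
```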

2. Error handling before escalation

· Escalation can often be avoided by addressing errors promptly. Databand offers an ‘Alerts’ function that notifies stakeholders immediately when errors occur. Multiple types of alerts can be configured, including process quality, data quality, and data delay, with predefined templates that keep configuration accurate and efficient (a simplified sketch of the idea behind a data-delay check follows below).
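
As a rough illustration of what a data-delay alert checks, the sketch below flags a table whose latest update is older than an expected window. Databand configures this through its UI and templates; the table, column, and threshold here are assumptions.

```python
# Illustrative only: the idea behind a "data delay" alert, i.e. notify someone
# when a dataset has not been written to within its expected window.
# Assumes updated_at is a timestamptz column; names are placeholders.
from datetime import datetime, timedelta, timezone

import psycopg2


def check_data_delay(max_delay: timedelta = timedelta(hours=6)) -> None:
    conn = psycopg2.connect("dbname=credit host=pg-host user=user password=secret")
    with conn.cursor() as cur:
        cur.execute("SELECT MAX(updated_at) FROM credit_reports")
        last_update = cur.fetchone()[0]

    if last_update is None or datetime.now(timezone.utc) - last_update > max_delay:
        # Replace with your team's real notification channel (email, Slack, pager).
        print(f"ALERT: credit_reports is stale, last update: {last_update}")
```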

3. Proactive identification of potential mistakes during development

· Databand includes an alert type called “Schema Change”, which proves to be highly valuable. Rather than reacting to problems, we can anticipate and prevent them. When multiple engineers or teams work concurrently, unintentional changes to table schemas can ripple through downstream applications like a row of dominoes. By setting up schema change alerts in Databand, we receive notifications when modifications could affect other datasets and pipelines, enabling us to address issues proactively (see the sketch after this item).
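
To show the idea behind a schema change alert, here is a hand-rolled sketch that compares a PostgreSQL table’s current columns against a previously saved snapshot. Databand detects this automatically as part of its tracking; the table and snapshot file names below are purely illustrative.

```python
# Illustrative only: the concept behind a schema-change alert. Compares a
# table's current column list against a previously saved JSON snapshot.
import json

import psycopg2


def detect_schema_change(table: str, snapshot_path: str) -> set[str]:
    conn = psycopg2.connect("dbname=credit host=pg-host user=user password=secret")
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT column_name, data_type
            FROM information_schema.columns
            WHERE table_name = %s
            """,
            (table,),
        )
        current = {f"{name} {dtype}" for name, dtype in cur.fetchall()}

    with open(snapshot_path) as f:
        expected = set(json.load(f))

    # Columns that were added, removed, or retyped since the snapshot was taken.
    return current.symmetric_difference(expected)


# Example: alert if anything changed since the last deployment's snapshot.
changed = detect_schema_change("credit_reports", "credit_reports_schema.json")
if changed:
    print(f"ALERT: schema drift detected in credit_reports: {changed}")
```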

In today’s data architecture, characterized by hybrid cloud environments, multiple systems, and diverse development tools, managing data integration pipelines has become akin to untangling complex threads. With data observability, as depicted in the diagram below, we can proactively manage our data process lineage and efficiently handle data-related issues.

Data Observability Explanation

Let’s free our hands from trivial work to embrace meaningful development!
