By the time bad data is in your warehouse, it’s already too late (plus the leading indicators of error)

John Blust
Databand, an IBM Company
7 min read · Jun 7, 2021

By the time you realize you have a problem, it’s typically too late. That is, a small problem has now grown into a big problem — one so large, you and others actually notice. And whatever you’ve noticed is, unfortunately, only a downstream effect, and probably one of many. To catch that root issue and keep it from recurring, you have to trace the error all the way back through your data pipeline architecture … and that’s where things become tricky.

Without a firm sense of the leading indicators of data pipeline error, you’ll catch errors late. The longer errors exist in your pipeline, the more problems they cause. The more problems they cause, the busier you are addressing all the downstream issues, which keeps you from addressing the root causes. You will forever be fighting fires.

Take this simple example. If it takes an hour to process one day’s worth of data, and you notice an error three days late, you’re now backed up not just one hour, but three hours. If you’re lucky, the issue is reversible, and only internal employees are complaining. If you’re not so lucky, as is the case if your pipeline feeds an ecommerce system that’s now generated a stream of irreversible transactions, you also have customers to deal with. Problems beget more problems.

As we will argue, the cost of catching errors late is substantial. But the biggest casualty by far will be everyone’s trust in the data itself. (And, by extension, in you.)

Scenarios where leading indicators can save your pipeline & your reputation

Errors in your data pipeline architecture carry real costs. If your “consumer” is an internal sales team whose dashboards are all down, that’s one kind of cost. But if your data customer is external, the cost is greater and can cut into your brand reputation. If you’re the upstream data provider and customers suddenly can’t run their machine learning models, you’ve exported your data error to other companies like a bug, and now they’re facing the same compounding cost of error.

We might call this the catch-up multiplier effect: The longer it takes to fix a data pipeline architecture issue, the more the cleanup cost compounds.
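
To make the compounding concrete, here is a minimal back-of-the-envelope sketch in Python; the function name and the numbers are illustrative only, mirroring the examples in this article:

```python
def catch_up_hours(hours_per_day: float, days_late: int) -> float:
    """Extra processing time that accumulates while an error goes unnoticed.

    Every day the error sits in the pipeline is another day's worth of data
    that has to be reprocessed once the fix finally lands.
    """
    return hours_per_day * days_late


# One hour to process a day's worth of data, error noticed three days late:
print(catch_up_hours(hours_per_day=1, days_late=3))   # 3.0 hours of reprocessing

# Ten hours per day, noticed a day late (Scenario 1 below): the next run
# takes those 10 extra hours on top of the normal 10, i.e. 20 hours total.
print(catch_up_hours(hours_per_day=10, days_late=1))  # 10.0
```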

IBM discovered this problem in the early 2000s. In a study published by the National Institute of Standards and Technology (NIST), engineers demonstrated that an issue that takes one hour to fix when caught at the design phase takes roughly 15 hours when not caught until testing, and 100 hours when not caught until after launch.

The cost of an error compounds with time.

Herein lies the problem with how most data pipeline architectures are built: they’re linear. They’re also often built entirely without safeguards, checks, or any of the DevOps principles used to assure quality in software. Your job as the engineer is to implement those checks, alerts, and controls. The better you become at identifying leading indicators of data pipeline issues, the earlier you catch them and the smaller the price you pay.

This is precisely why we built Databand. It’s a data pipeline observability tool created for this express purpose: to help engineers identify data pipeline issues early and trace them back to their source to understand the root causes. With Databand, teams can identify which upstream data issues caused the downstream error they or someone else has noticed. Even better, they can set automatic alerts.

For example, in Databand, you can set alerts for leading indicators such as missing data, aberrant data, or suspicious values. As Databand.ai’s founders put it, “To a downstream user, every problem will appear as a data quality problem. Our job is to find what’s really causing it, and ideally catch it before anyone realizes something’s amiss.”

Next, we’ll explore two common data pipeline architecture scenarios, what can go wrong in each, and how an observability tool like Databand can help.

Scenario 1: The data arrives late (issue: freshness)

If you don’t catch it:

Let’s say your pipeline takes ten hours to process a day’s worth of data. You schedule it to run overnight, so it takes you a full day to notice a column-type change error. Now you not only have to re-run the prior day’s data, you also have to catch up on today’s, so your pipeline will take 20 hours to run instead of 10.

You could revert, but the catch-up multiplier clock is now ticking. The more individuals throughout the business rely on that data, the more people are now using corrupted data, and, worse, you don’t have a good way to communicate to them what’s happening or that they should pause and wait for new data. Meanwhile, the bad data is infiltrating all your other systems.

This happens quite often. In the case of one company processing applications, this scenario caused a month’s delay. They were expecting a certain subset of data in the monthly load, but it never showed. The engineers weren’t alerted to the issue until 72 hours later. It took another 72 hours to resolve, and due to scheduling issues for the meetings that followed, that fix was not implemented for 20 working days.

If you do catch it:

The issue is fixed the same day and you rerun the pipeline before it gets backed up. With a tool like Databand, you had created checkpoints with critical metric logging, so when the data didn’t come back as expected, you got an alert. You contacted the vendor, they reset things, and no harm was done.
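
Here is a minimal sketch of such a checkpoint in plain Python, not any particular tool’s API; the table name, SLA, and timestamps are hypothetical stand-ins:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA: a daily feed should never be more than 26 hours old.
FRESHNESS_SLA = timedelta(hours=26)


def check_freshness(table: str, landed_at: datetime, sla: timedelta = FRESHNESS_SLA) -> None:
    """Checkpoint: log the landing time as a metric and alert if the SLA is blown."""
    age = datetime.now(timezone.utc) - landed_at

    # The leading indicator itself, logged on every run so you can trend it.
    print(f"freshness table={table} landed_at={landed_at.isoformat()} "
          f"age_hours={age.total_seconds() / 3600:.1f}")

    if age > sla:
        # Fail loudly (or page someone) before downstream jobs consume stale data.
        raise RuntimeError(f"{table} violated its freshness SLA: last data is {age} old")


# Usage: pull `landed_at` from your warehouse or object-store metadata, then:
check_freshness("raw.vendor_feed", landed_at=datetime.now(timezone.utc) - timedelta(hours=5))
```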

You can run ongoing checks on the following dimensions of data quality health at any point in your pipeline:

  • Fitness — is this data fit for its intended use?
  • Lineage — where did this data come from? When? Where did it change? Is it where it needs to be?
  • Governance — can you control it?
  • Stability — is the data complete and available in the right frequency?

You may also hear the above referred to as the following; a minimal sketch of checks along these lines appears after the list:

  • Freshness — did it arrive on time? (SLAs, durations, data landing times)
  • Completeness — did the full data arrive? (Nulls, counts, schemas)
  • Accuracy — did true and correct data arrive? (Distributions, skew, domain-specific metrics)
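
A minimal sketch of what such completeness and accuracy checks might look like on a pandas DataFrame; the column names, thresholds, and minimum row count are made up for illustration, and this is plain pandas, not Databand’s API:

```python
import pandas as pd

# Illustrative expectations: in practice these come from a baseline you trend over time.
EXPECTED_COLUMNS = {"application_id", "submitted_at", "amount"}
MIN_ROWS = 10_000
MAX_NULL_RATE = 0.01


def check_batch_health(df: pd.DataFrame) -> list[str]:
    """Return a list of problems; an empty list means the batch looks healthy."""
    problems = []

    # Completeness: schema, counts, nulls.
    missing_cols = EXPECTED_COLUMNS - set(df.columns)
    if missing_cols:
        problems.append(f"missing columns: {sorted(missing_cols)}")
    if len(df) < MIN_ROWS:
        problems.append(f"row count {len(df)} below expected minimum {MIN_ROWS}")
    null_rate = df.isna().mean().max() if len(df) else 1.0
    if null_rate > MAX_NULL_RATE:
        problems.append(f"worst-column null rate {null_rate:.2%} exceeds {MAX_NULL_RATE:.2%}")

    # Accuracy: a cheap domain check on a key numeric column.
    if "amount" in df.columns and (df["amount"] < 0).any():
        problems.append("negative amounts found")

    return problems
```

Run a check like this right after ingestion, before any transformations, so a bad batch never gets the chance to propagate.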

Scenario 2: The data is on time, but is missing values (issue: completeness)

If you don’t catch it:

Lots of systems these days are connected through HTTP APIs, and we all know you can receive a 200 (success) response with no actual data. The “success” refers to the connection, not the transfer, to say nothing of an accurate transfer. Most systems reliant on APIs simply weren’t built to check for data health, which means there’s really no check against the most common source of errors: users. It’s all too easy for an administrator to mistype a value, which throws off 100 entries that then propagate throughout the system, and the catch-up multiplier effect takes hold.
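
As a sketch of guarding against exactly that, here’s a plain-Python check using the requests library; the expected fields are hypothetical:

```python
import requests

EXPECTED_FIELDS = {"id", "status", "updated_at"}   # hypothetical payload schema


def fetch_records(url: str) -> list[dict]:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()   # catches non-2xx responses, but a 200 can still be empty

    records = resp.json()
    # "Success" only means the connection worked; verify the transfer itself.
    if not records:
        raise ValueError(f"{url} returned 200 but no records")

    missing = EXPECTED_FIELDS - set(records[0])
    if missing:
        raise ValueError(f"{url} records are missing fields: {sorted(missing)}")

    return records
```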

Depending on how your other connected systems are built, you may catch this error eventually, but not before some systems have accepted and processed all the wrong values. If you don’t catch those values when they are first deposited, you’re left looking for downstream effects, like a dashboard showing the wrong numbers (and causing difficult conversations). Or, as in one actual scenario, a payroll system that stopped sending paychecks to everyone, including the data engineers.

If you do catch it:

With an observability tool, you got an alert before the data was transferred, early in the data pipeline architecture, and you simply hit pause. You checked the error, traced it back to its source, corrected the mis-set values, and ran the pipeline again.

Furthermore, you can use these alerts to validate the data’s existence at every stage. If a check finds that data which should exist doesn’t, that stage is rerun. If the data exists but contains an error, you can set conditional actions, which can include paging you, whether for a wrong schema, skewed metrics, or what have you.
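
One way to express that pattern is as a plain-Python wrapper around a pipeline stage; the retry count, the page_on_call hook, and the stage callable are hypothetical placeholders:

```python
from typing import Callable, Optional

import pandas as pd


def page_on_call(message: str) -> None:
    """Hypothetical alert hook: wire this to PagerDuty, Slack, email, etc."""
    print(f"ALERT: {message}")


def run_stage_with_checks(
    stage: Callable[[], Optional[pd.DataFrame]],
    expected_columns: set[str],
    max_retries: int = 1,
) -> pd.DataFrame:
    """Run a stage, rerun it if its output is missing, and page if the schema is wrong."""
    for attempt in range(max_retries + 1):
        df = stage()

        if df is None or df.empty:
            # The data should exist but doesn't: rerun the stage.
            page_on_call(f"stage produced no data (attempt {attempt + 1}), rerunning")
            continue

        missing = expected_columns - set(df.columns)
        if missing:
            # The data exists but is wrong: stop the pipeline and page someone.
            page_on_call(f"stage output missing columns: {sorted(missing)}")
            raise ValueError(f"schema check failed: {sorted(missing)}")

        return df

    raise RuntimeError("stage produced no data after all retries")
```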

How can you catch more leading indicators?

As we covered in the scenarios above, you can catch more of the leading indicators of data pipeline architecture issues in four ways (a sketch of the second, trend baselines, follows the list):

  1. Introducing checkpoints
  2. Establishing trend baselines
  3. Tracking data lineage and establishing the root causes of issues
  4. Measuring metrics for pipeline executions and data input
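
As a sketch of the second item, here is a minimal trend-baseline check on daily row counts; the history and tolerance are made-up numbers, and in practice you would load the history from your metrics store:

```python
import statistics

# Hypothetical trailing week of daily row counts.
recent_row_counts = [102_340, 99_871, 101_556, 100_204, 98_990, 101_733, 100_412]
TOLERANCE = 0.20  # flag anything more than 20% away from the trailing baseline


def check_against_baseline(todays_count: int, history: list[int], tolerance: float = TOLERANCE) -> None:
    baseline = statistics.median(history)  # robust to the occasional odd day
    deviation = abs(todays_count - baseline) / baseline

    print(f"row_count today={todays_count} baseline={baseline:.0f} deviation={deviation:.1%}")

    if deviation > tolerance:
        # A leading indicator: the load "succeeded", but the volume is suspicious.
        raise RuntimeError(f"row count deviates {deviation:.1%} from baseline; investigate upstream")


check_against_baseline(todays_count=61_250, history=recent_row_counts)  # raises: about 39% below baseline
```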

When you build your data pipeline architecture with proper tracking, it becomes transparent and decodable. It ceases to be a black box. You can catch issues when they first occur, anywhere in your data pipeline, and act before those issues spread and the catch-up multiplier effect causes the problem to creep into other departments.

As they say, the best data engineers are invisible. If you’re doing your job, everyone else is doing theirs. The more alerts you have, and the better leading indicators you can detect, the sooner you’re able to catch issues. And if you do that, everyone can trust the data — and you.

Want to catch bad data before it reaches your consumers? Request a demo to see Databand.ai in action!
