Your Data Keeps Breaking Silently: Isolated Incidents or a New Category of Problems? (Part 2)

Manu Bansal
Lightup Data
Mar 4, 2021 · 9 min read

Those hidden data outages you keep experiencing?

The ones that slip past your monitoring tools and impact your business?

They are not the isolated, one-off issues they appear to be.

And you don’t need to keep developing ad-hoc solutions to resolve them.

They are a new category of problem, and they can be resolved with a single solution.

This article will explain how we arrived at these conclusions.

Discovering a New Category of Data Quality Problems

This is part two in our series on hidden data outages.

In part one, we began to investigate a unique class of issue where data would break silently and impair the performance of a product, a decision-making process, or an ML model.

We learned two things about these issues, which we came to call “data outages”.

First, they are universal. Every organization is experiencing them in some form.

Second, they are elusive. They routinely slip past standard IT/APM monitoring tools and only get noticed when they cause significant business harm.

But from a structural perspective these outages still looked very different from each other. The outages we found all damaged different KPIs, and impaired different product functions, decision-making processes, and ML models.

This left us with a few questions:

“Are these isolated incidents, or a new, unknown category of problems?”

“Will these outages always elude us, or can we proactively catch them all?”

And, most important:

“Do we have to keep building point solutions to resolve them individually, or can we build a generalizable solution that is capable of resolving them all?”

We continued our investigation to answer these questions.

This article will present the rest of our investigation, and demonstrate that:

  • There is now a data plane driving modern applications. By making their applications data-driven, organizations have created a new component — the data plane — that is logically decoupled from the underlying IT infrastructure.
  • This new data plane can break independently of infrastructure. This component can generate bad data even when infrastructure is healthy, resulting in the common signature that all of these outages share — “good pipes, bad data”.

To demonstrate these points, we will discuss:

  • The anti-pattern of data stakeholders mitigating point problems.
  • Why these disparate outages are all instances of the same category.
  • Why this new category of problems is so elusive.
  • A hint of a solution recipe: dedicated monitoring of the data plane for detecting issues that would cause data outages.
  • Why we refer to these issues as “data outages”.

To begin, let’s pick our investigation back up where we left it.

The Anti-Pattern: Stakeholders Mitigating Point Problems with Point Solutions

At this point in our investigation, we had spoken with dozens of organizations, and learned they were all experiencing hidden data outages that looked similar.

And yet, organizations were classifying these outages in many different ways — typically based on the type of data that had broken and caused the issue. Broken analytical data became an analytics problem, broken APIs became a software problem, and broken ML models became a data science problem.

Data is now everywhere, and can break anywhere.

To illustrate the point, consider the examples from part one of this series.

  • The airline that started selling $5,000 tickets for $50 because of a currency conversion data entry error classified their issue as a business operator problem.
  • The financial services company that started returning inaccurate credit score estimates due to an external API feeding data in the wrong schema classified their issue as a data integration problem.
  • The trading application that stopped sending trading notifications because of gaps in the implementation of a new feature classified the issue as a software bug.
  • The ridesharing company that started labeling most of their legitimate contractor transactions as fraud because of a bad parameter feeding their models classified their issue as a data science problem.
  • And — in a new example — when an organization’s KPI dashboard started showing bizarre numbers due to faulty data feeding their charts, they classified the issue as an analytics problem.

And after organizations classified these outages in this manner, they typically assigned the problem to whatever stakeholder was consuming the broken data.

But because data consumers are so diverse and disconnected from each other, these problems looked very fragmented and siloed — even when they occurred within the same organization.

As such, it’s no surprise that the specific stakeholders impacted by these seemingly siloed issues were building point solutions to resolve them.

This approach looked very familiar. It was how I used to solve these types of hidden data outages before Lightup — with ad hoc, isolated, stopgap fixes.

But now we had something I didn’t have before Lightup — a broad perspective on these outages. Because we had collected a wide range of examples of these outages, we were now able to look at each of these problems and their solutions from a different — more unified — vantage point.

And from this new perspective, a new picture of these outages came together.

A Broader Perspective: Diverse Data Outages as Different Instances of the Same Problem

When reviewed together, these outages look technically very similar to each other. The underlying problem is always some kind of break in some piece of data that was feeding the product, process, or model through the infrastructure.

And the organizations that experience these outages have a lot of data that can break in a lot of ways. They are all data-driven organizations with data-driven products. They all leverage a complex data ecosystem with many different data flows and interactions. Data flows into their product, data flows out of their product, and data feeds back in a loop with a decision maker or an ML model on the way.

The complex data-driven ecosystem that drives modern applications.

A hidden data outage can strike any of the data assets interacting with an application, with the same result — a bad user experience and a negative impact on the business. And every outage that slips through shares the same signature:

No matter who was impacted by the problem, the problem always traces back to a data quality issue with the underlying data source.

These outages can thus be attributed to the source data asset that broke, rather than to the stakeholder the break impacted. And with this change in attribution, these outages stop looking like diverse problems. Instead, they can all be classified as diverse presentations of the same problem — a broken data asset in the application ecosystem.

This new perspective snaps many other answers into place — including why these outages are so elusive in the first place.

The Tricky Common Denominator: Good Pipes, Bad Data

Remember: In each of these cases, the outage is never caught by IT/APM monitoring tools. The underlying data quality outage is always wholly independent of infrastructure health.

The IT infrastructure supporting the data is always healthy, even when it’s carrying unhealthy data. No matter what kind of data broke, or where in the ecosystem that data appears, it is always the same problem — a broken feature, process, or ML model caused by bad data moving through healthy infrastructure.

In other words, it’s always a problem of “good pipes with bad data”.

Or, to stretch the analogy, “good barrels holding bad data”.

These outages keep eluding detection because the data breaks somewhere IT/APM tools aren’t looking, for many different reasons, including:

  • Human error in recording data.
  • Error in a third-party service responsible for collecting data.
  • Unannounced changes in the format of an external data feed.
  • Logical error in a data transformation module.
  • A specification mismatch between a data source and a data target.
  • Semantic errors in a data retrieval query.
  • And many more.
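To make one of these failure modes concrete, here is a minimal, illustrative sketch (not Lightup’s implementation) of a schema check that would catch an unannounced change in the format of an external data feed — like the credit score example above — before the bad data flows downstream. The field names and the `check_record` helper are hypothetical.

```python
# Hypothetical contract for an external feed. In the credit score example,
# the provider silently changed the schema; a check like this surfaces that.
EXPECTED_SCHEMA = {
    "user_id": int,
    "credit_score": float,
    "currency": str,
}

def check_record(record: dict) -> list[str]:
    """Return a list of data quality violations for one incoming record."""
    violations = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(
                f"wrong type for {field}: got {type(record[field]).__name__}"
            )
    for field in record:  # fields the contract never promised
        if field not in EXPECTED_SCHEMA:
            violations.append(f"unexpected field: {field}")
    return violations
```

A check like this lives in the data plane, not the infrastructure layer: the pipes can be perfectly healthy while every record fails it.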

The sources of these hidden outages suggest that there is a new layer in the stack of a data-dependent application that is completely independent of the IT infrastructure and application software layers: the data plane. This layer can fail independently and hurt the user experience or the business just as badly as failures in other layers.

The data plane: Where data can break independently of the infrastructure that carries it.

Conceptualizing this new data plane explains why these outages slip past IT/APM alarms and scans. These tools are only monitoring the health of the infrastructure. They are not monitoring the health of what is inside the infrastructure — the data the pipes are carrying.

And it also explains why these outages are only noticed when they have harmed the business. Downstream business KPIs are the only part of these data quality outages that organizations typically have any visibility into and actively monitor.

This data plane is a new thing that none of the organizations we spoke to had any visibility into, and we only saw it once we put all of these pieces together.

And finding the data plane opened the next stage in Lightup’s development.

Hint of a Solution Recipe: Dedicated Data Plane Monitoring

This data plane is the layer where data is held and moved, and it encapsulates all of the data assets interacting with the organization’s product and stakeholders. It’s the common layer where we can detect all of the data issues we encounter, regardless of who they are impacting.

A data plane monitoring solution can track the data assets in this layer, just like IT monitoring tools track virtual machines and containers, and APM tools track application endpoints. And with this approach, we can build a general-purpose monitoring solution for the data plane and apply it to the entire data ecosystem to broadly and rapidly detect these outages.
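One simple way to sketch what “tracking a data asset” could mean in practice is a rolling-window anomaly check on a health metric such as daily row count. This is an illustrative assumption about how such a monitor might work, not a description of Lightup’s actual product behavior; the `is_anomalous` helper and the 3-sigma threshold are hypothetical.

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], current: float, n_sigmas: float = 3.0) -> bool:
    """Flag `current` if it falls outside n_sigmas of the historical band."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu  # flat history: any change is suspect
    return abs(current - mu) > n_sigmas * sigma

# Example: daily row counts for a table feeding a KPI dashboard.
row_counts = [10_120, 9_980, 10_050, 10_210, 9_900]
assert not is_anomalous(row_counts, 10_100)  # a normal day passes
assert is_anomalous(row_counts, 1_200)       # a silent pipeline break is flagged
```

The same check can be pointed at null rates, distinct counts, or freshness lags across every asset in the data plane — which is what makes the approach general-purpose rather than a point solution.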

These issues are best thought of as “data outages”. The same way you can experience an “outage” in some component of your infrastructure, we now see that you can also experience an “outage” in some component of your data itself.

And just as you don’t deploy a separate IT monitoring solution for each team or make IT monitoring the problem of individual stakeholders, with the right solution, you won’t have to solve data quality outages in an ad hoc way.

Because these diverse outages are all the same data issue showing up at multiple points in the data plane, they can all be solved once. And in the process, we can create visibility across the data plane and data pipelines in order to map — and resolve — their cascading impact.

To sum it up: Data breaking from issues in this data plane represents an entirely new kind of problem — one that requires a completely new set of tools to solve.

The End of the Investigation: The Beginning of the Lightup Platform

Ultimately this investigation demonstrated two conclusive points regarding these business-critical data outages.

  1. Modern applications have created a new component — the data plane — that is logically decoupled from the underlying infrastructure. These outages are all caused by some broken data asset within this data plane.
  2. Data assets within this data plane can break even when infrastructure is healthy. This gives each of these outages the same signature — “good pipes, bad data” — which represents a new category of problems.

In future articles, we will discuss:

  • The true cost of data outages emanating from those cascading effects.
  • Technical subcategories of data quality issues causing outages.
  • The template of an ideal data quality solution.
  • Challenges we see organizations face in building their own data quality tools.

If you want to resolve your organization’s data outages, reach out today.

Lightup brings order to data chaos. We give organizations a single, unified platform to accurately detect, investigate, and remediate data outages in real time.

To see if Lightup can solve your data outages, take the right next step.

  • Learn more by visiting lightup.ai.
  • Schedule a demo to see our solution in action.
  • Or, directly start a free trial now.



CEO & Co-founder of Lightup, previously a Co-founder of Uhana. Stay connected: linkedin.com/in/manukmrbansal/.