Your Data Keeps Breaking Silently: Isolated Incidents or a New Category of Problems? (Part 1)

Manu Bansal
Lightup Data
Mar 4, 2021

You never saw it coming.

Your monitoring tools have been silent.

But your ride sharing application just broke.

You have a feature that gives users a precise estimate of when their ride will arrive.

Your app calculates this time by pulling data from external sources like Google Maps.

But suddenly, your app starts giving your users inaccurate wait times.

Something is off and you need to find out what it is — fast.

So, you open up your monitoring dashboards.

Server. Application. APM. You check them all out.

They all look clean, but the problem is still there in the product. Your customers are still complaining, and now your business leaders have joined in.

Why do your applications keep breaking without setting off infrastructure alarms?

With no clear lead on what the problem could be, you just start guessing.

You pull data from different points in the product data pipeline.

Finally, you see it.

Your data broke.

Your app stopped parsing external data from Google Maps properly.

You had no clue this data interface was broken, and you don’t know why.

You only know this: Whatever happened, you need to fix it.

So, you assemble a stopgap solution.

It works… for now.

But soon your app’s data breaks again, in a new place, and the cycle repeats.

A New Perspective on Hidden Data Breaks

This scenario of elusive data breaks is becoming all too familiar.

I constantly encountered data issues like these at the last startup I built.

And conversations with data leaders in many organizations, across many verticals, showed that these business-critical data issues have been appearing everywhere over the last few years.

These conversations gave us a fresh perspective and led us to develop a new, more effective approach to solving these problems.

This is the first article in a two-part series that will present our finding that these issues represent a fundamentally new category of data quality problem that we have come to call “data outages” (for reasons we will explore in part two).

In this article, we will demonstrate that:

  • These issues are universal. These business-critical data outages appear to be a widespread problem that is affecting companies across a diverse range of verticals and industries.
  • These issues are elusive. They remain hidden from infrastructure monitoring tools and are only noticed when they escalate to the point that they cause significant business impact.

To demonstrate these points, we will discuss:

  • My own experience encountering these product-critical data outages.
  • Five diverse examples of how other companies appear to be experiencing similar problems.
  • The common pattern that appears in each of these seemingly isolated outages, and what this pattern implies.

To begin, we will explore how these outages manifested at my previous startup — Uhana, acquired by VMware in 2019.

Hidden Data Outages: Unexpected Issues at Uhana

Uhana developed a predictive data platform for telecom organizations.

It gave service providers real-time visibility into their customers' experience with their mobile network service. The solution included a predictive data pipeline that monitored end-user service levels in real time and used that data to generate predictions around lost coverage, bad user experience, and the like.

Most of the time, the solution worked perfectly.

But sometimes, its pipeline broke and generated bad predictions.

And these product breaks always came out of nowhere.

Uhana’s teams had done everything possible to prevent them, and to maintain the health and integrity of the pipeline. They had followed best practices and deployed real-time infrastructure monitoring tools.

But these tools never raised alarms around these problems, and it only became clear that something had happened when the pipeline's predictions went so far askew that end users noticed and questioned the results.

Even after the problem became apparent, the team’s IT and APM dashboards gave no answers on what caused the break. They always had to guess the source, test their hypothesis, and troubleshoot each problem manually.

The whole process was ad-hoc, reactive, frustrating, and it distracted Uhana’s teams from building pipelines, predictive models, and analytics techniques.

Worst of all — the pipeline continued to serve out bad data the whole time the teams were digging through its issues. That was unacceptable, and led them to build failsafe mechanisms as a stopgap solution to invoke when they noticed data discrepancies.

It worked, but it was far from the ideal solution.

There had to be a better way to find and fix these elusive problems.

And it seemed like these data outages probably were not unique to Uhana.

A Common Issue: Hidden Data Outages are Breaking Products Everywhere

At Lightup, we have assembled a world-class team that is dedicated to investigating and solving these data outages.

We began reaching out to other companies and data teams to see if anyone else was experiencing similar problems.

Across products, companies, and industries, the same story emerged.

Everyone was experiencing a similar type of unexpected product break.

Here are five examples that demonstrate the breadth of the issues we found.

Example 1: An airline inadvertently started selling $5,000 tickets for $50.

The issue slipped past all of their monitoring tools and guard rails.

They only noticed the issue after they had been selling high-priced tickets at a 99% discount for hours. They honored these tickets and lost a large amount of money.

They eventually learned that the individual responsible for manually updating the dollar-to-euro conversion rate had put the decimal point in the wrong place. This propagated the wrong value through their portals and created the glitch fare.
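A mistake like this could plausibly have been caught by a sanity check at the point of entry. The sketch below is purely illustrative — the function name, threshold, and rates are assumptions, not the airline's actual system. It flags a manual rate update that deviates too far from the previous value, which is exactly how a decimal-point slip tends to show up:

```python
def rate_update_is_suspicious(new_rate: float, previous_rate: float,
                              max_relative_change: float = 0.10) -> bool:
    """Flag a rate update that deviates too far from the last known value.

    A decimal-point slip typically shows up as a roughly 10x jump,
    far beyond normal day-to-day currency movement.
    """
    if new_rate <= 0 or previous_rate <= 0:
        return True  # non-positive exchange rates are never valid
    relative_change = abs(new_rate - previous_rate) / previous_rate
    return relative_change > max_relative_change


# A typo like 0.092 instead of 0.92 is a ~90% drop and gets flagged:
assert rate_update_is_suspicious(0.092, previous_rate=0.92)
# An ordinary daily move passes:
assert not rate_update_is_suspicious(0.93, previous_rate=0.92)
```

The 10% threshold here is an arbitrary example; in practice it would be tuned to the normal volatility of the quantity being updated.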

Example 2: A financial services company unknowingly started selling inaccurate credit score estimates.

Their monitoring tools never caught their problem either.

They only noticed something was wrong when their clients — banks — did. The banks were writing loans to their customers, qualifying them with credit score estimates from the financial services company. Borrowers with 700+ scores started defaulting within three months. That had never happened before. The banks realized the credit score estimates they were basing their loans on were inaccurate.

They eventually traced what happened. They based their credit scores on data from the Credit Bureau, but the Credit Bureau had changed the format of the data it supplied. The financial services company did not know this and processed the data incorrectly, which in turn produced the inaccurate scores they sold to the banks.
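Format changes from an upstream provider can often be caught before any processing happens with a basic schema check on incoming records. This is a minimal sketch; the field names (`consumer_id`, `score`, `report_date`) are hypothetical, not the company's actual schema:

```python
# Assumed (hypothetical) schema for incoming bureau records.
EXPECTED_SCHEMA = {"consumer_id": str, "score": int, "report_date": str}

def find_schema_violations(record: dict) -> list:
    """Return a list of problems with one incoming record."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"bad type for {field}: {type(record[field]).__name__}")
    return problems

# An upstream format change that turns scores into strings is caught
# at ingestion instead of silently producing wrong credit estimates:
bad = find_schema_violations(
    {"consumer_id": "a1", "score": "712", "report_date": "2021-03-01"}
)
assert bad == ["bad type for score: str"]
```

Checks like this run at the data interface itself, which is precisely where infrastructure monitoring has no visibility.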

Example 3: A trading application quietly stopped sending trading notifications.

They sent these notifications to users when stock prices moved significantly and it was a good time to make a trade.

Eventually, they noticed that a subset of their users had been making far fewer trades — and thus generating less revenue — for an entire month.

They eventually realized they had a badly formatted data field in a subset of their notifications. This caused the app to send fewer notifications to these users, who then made fewer trades, and generated less revenue.

Example 4: A transaction processing company did not see their monetized analytical data turn to garbage.

They processed one billion transactions per week and collected rich data on shopping patterns for every transaction. They then monetized this data by selling it to relevant third-party organizations. But about once a month, their analytical models would begin producing mislabeled transactions.

Their monitoring tools did not catch this. They only realized they were selling tens of millions of instances of bad data every month when their customers realized there was something wrong with the information they purchased.

They eventually found the source of the problem. In every incident, one or more pieces of data were malformed in one or more fields. Sometimes the merchant category would come out null. Other times the country code might be populated incorrectly. In each case, the malformed data caused transactions to be mislabeled and rendered them useless for their algorithms.
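Breaks like null merchant categories can be surfaced by tracking per-field null rates on each batch of data and alerting when a rate spikes. The following is an illustrative sketch with made-up field names and an assumed 5% alert threshold:

```python
from collections import Counter

def null_rates(records: list, fields: list) -> dict:
    """Fraction of records in which each field is missing or None."""
    counts = Counter()
    for record in records:
        for field in fields:
            if record.get(field) is None:
                counts[field] += 1
    return {field: counts[field] / len(records) for field in fields}

batch = [
    {"merchant_category": "grocery", "country_code": "US"},
    {"merchant_category": None,      "country_code": "US"},
    {"merchant_category": None,      "country_code": "DE"},
    {"merchant_category": "fuel",    "country_code": None},
]
rates = null_rates(batch, ["merchant_category", "country_code"])
# merchant_category is null in 2 of 4 records; country_code in 1 of 4.
alerts = [f for f, rate in rates.items() if rate > 0.05]  # assumed 5% threshold
```

At a billion transactions a week this would run on samples or streaming aggregates rather than full batches, but the idea is the same: watch the data itself, not just the infrastructure carrying it.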

Example 5: A ridesharing company suddenly began labeling most of their legitimate contractor transactions as fraud.

They had been targeted by a sophisticated fraud pattern, where malicious actors would siphon funds from their drivers. In response, they created fraud detection models to block illegitimate usage.

But all of a sudden, their fraud detection model began producing a 99% false positive rate and blocking legitimate usage. They detected the problem only after blocked users started to complain. Because the root cause took a month to determine, they implemented a stopgap solution that bypassed the fraud detection model. The company was caught between a rock and a hard place: they chose to eat the cost of compensating their drivers for fraudulent transactions during that time rather than block legitimate users from the service.

They eventually realized what broke in their fraud detection model. The model was trained on analytical data and parameters from the product, and it was very sensitive to changes in that data. In this case, one critical configuration parameter started to come out empty because of a change in the data source, and the fraud detection model had not been updated to point to the new source.
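A degenerate input like an empty configuration parameter can be caught with a guard that validates required model inputs before scoring. A minimal sketch, assuming hypothetical parameter names rather than the company's real configuration:

```python
def missing_model_inputs(params: dict, required: list) -> list:
    """Return the names of required model inputs that are missing or empty."""
    bad = []
    for name in required:
        value = params.get(name)
        if value is None or value == "" or value == []:
            bad.append(name)
    return bad

# Refuse to score (or fall back to a safe default) when a critical
# parameter comes out empty instead of silently flagging everyone as fraud:
config = {"risk_threshold": "", "feature_source": "payments_v2"}
assert missing_model_inputs(config, ["risk_threshold", "feature_source"]) == ["risk_threshold"]
```

A guard like this turns a month-long silent failure into an immediate, attributable alert at the model boundary.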

On the surface, these all appeared to be isolated, unrelated incidents.

But a pattern began to emerge when they were aggregated and reviewed one after another.

Looking Deeper: The Common Pattern Among These Hidden Data Outages

All of these incidents share a set of common characteristics.

  1. They all impacted the functionality of a data-driven product, or impaired either a data-driven decision-making process or an ML model.
  2. They all slipped past existing monitoring tools. In every example, the company had set up standard IT and APM dashboards, guard rails, and alerts. But none of these measures noticed any degradation in the infrastructure supporting the product or decision-making process.
  3. They all were only noticed when they created significant harm to business KPIs. The actual harm varied. Some directly impacted revenue generation. Others damaged user experience. Still others led to poor decisions or broken ML models. But in each case, the company only realized something was wrong after it was too late.
  4. Ultimately, these issues were always traced back to a break in some of the data feeding into or out of the product.

These four characteristics hold for every example given here, and for countless more problems that affect data-driven products. They point to a new way of framing the data outages at the root of unexpected product breaks.

No matter how random, isolated, or non-related they appear to be, they all might be symptoms of the same underlying problem.

And if they all speak to the same problem, they might all be addressed with the same solution.

The Search for the Solution to Unexpected Data Breaks

Up to this point, Lightup’s investigation demonstrated two things about these business-critical data outages:

  1. That these are universal issues affecting a diverse range of verticals and industries.
  2. That they are elusive, remaining hidden from infrastructure monitoring tools until they cause significant business impact.

However, a big question remained that drove the rest of this investigation.

“Are these outages all just one-off data breaks that must be resolved with ad-hoc point solutions — or do they all represent the same category of problem that can all be resolved with the same generalized solution?”

Part two of this series will present the answer to this question by walking through the remainder of this investigation.

But if you want to immediately resolve your organization’s data outages, reach out today.

Lightup brings order to data chaos. It gives organizations a single, unified platform to accurately detect, investigate, and remediate business-critical data outages in real time.

To see if Lightup can solve your data outages, take the right next step.

  • Learn more by visiting lightup.ai.
  • Schedule a demo to see Lightup in action.
  • Or, directly start a free trial now.



CEO & Co-founder of Lightup, previously a Co-founder of Uhana. Stay connected: linkedin.com/in/manukmrbansal/.