The Data Outage Tax

Poor data quality is hurting you more than you know.

Manu Bansal
Plumbers Of Data Science
13 min read · Apr 5, 2021


Photo by Jp Valery on Unsplash

Data outages — occurrences of bad data — are draining businesses and hurting the productivity of data practitioners. But how bad is the problem? Should you care? And how do you even map out the cost of bad data? This article offers a framework.

Let’s start with a few facts.

Gartner [1]: “Recent Gartner research has found that organizations believe poor data quality to be responsible for an average of $15 million per year in losses.”

Forrester [2]: “Nearly one-third of analysts spend more than 40 per cent of their time vetting and validating their analytics data before it can be used for strategic decision-making.”

IBM, via Harvard Business Review [3]: “…$3.1 trillion, IBM’s estimate of the yearly cost of poor data quality, in the US alone, in 2016. While most people who deal in data every day know that bad data is costly, this figure stuns.”

Data quality issues are creating alarming levels of topline harm for businesses [image by author].

This data around the cost of data outages comes from trusted sources, but it can still be hard to make the case that your data outages are causing this level of harm.

After all, your data outages often look like small, isolated, one-off issues that produce limited impact for a subset of your organization. On the surface, they don’t look like a big, systematic problem — let alone a problem that’s causing big, systematic harm to your business’ topline metrics.

This makes it hard to build a business case around data quality solutions that address data outages in a big, systematic manner. After all, no matter how much you trust Gartner’s, Forrester’s, and IBM’s numbers, you have likely lacked a framework for understanding the cost of these issues in your unique context.

We provide a framework in this article. It will enable you to better understand the cumulative, year-over-year damages caused by your data outages. It will help you begin to put some real numbers behind your own data outages. And it will give you the foundation for a viable business case for data quality solutions.

To do so, this article will demonstrate:

  • Data outages are their own category of problems, even though they appear to be isolated, one-off issues.
  • Data outages frequently cause a direct impact on topline metrics, and the damage they cause is only amplified by the ad hoc way that organizations remediate them.
  • The cumulative impact of these issues creates a cascading “tax” on organizations that drains their resources — often to a degree that organizations are not even aware of.

First Things First: What are Data Outages?

A data outage is a condition where data breaks in a way that severely impacts products, productivity, and decisions. In turn, those data issues hurt user experience and the business.

A data outage often appears as a break in the performance of a data-driven product, decision-making process, or ML model. These breaks are elusive and do not set off alarms from existing infrastructure monitoring tools. Instead, they only become known when they generate significant business impact and organizations scramble to find and remediate the source of the damage.

Data outages can appear in many forms, but they are all ultimately traced back to some sort of data quality issue. Real-world examples of data outages — that we have seen in client engagements or lead conversations — include:

  • An airline starts to sell $5,000 tickets for $50 because of an error in manually updating their dollar-to-euro conversion rate.
  • A trading application starts to send fewer notifications to trigger sales because of a badly formatted data field in a subset of their notifications.
  • A rideshare company starts to block legitimate users because of a small change in one data source that fed their fraud detection model.
  • A financial services company starts to generate inaccurate credit scores because of a format change in the data they receive from the Credit Bureau.
  • A job search engine starts to generate bad data or data drift that causes a drop in Click-Through-Rate (CTR).
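Outages like these can often be caught with simple assertions on the data itself rather than on the infrastructure. Below is a minimal sketch of such a guardrail — a value-range check on a field — using the airline fare example. All names, records, and thresholds here are hypothetical illustrations, not any specific product’s API.

```python
# Hypothetical "good pipes, bad data" guardrail: flag records whose
# value falls outside an expected band, before they reach customers.

def check_value_range(records, field, lo, hi):
    """Return the records whose `field` falls outside [lo, hi]."""
    return [r for r in records if not (lo <= r[field] <= hi)]

# Illustrative fares; the $50 fare models a broken dollar-to-euro rate.
tickets = [
    {"route": "SFO-LHR", "price_usd": 5000},
    {"route": "JFK-CDG", "price_usd": 50},
    {"route": "LAX-FRA", "price_usd": 4800},
]

# Trans-Atlantic fares are expected to stay within a sane band.
anomalies = check_value_range(tickets, "price_usd", 500, 10_000)
print(anomalies)  # only the mispriced $50 fare is flagged
```

In practice a check like this would run inside the pipeline on every batch, with the band derived from historical values rather than hard-coded.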

These seemingly isolated, ad hoc issues actually represent a unified category of problems. In each of those data outages — and every other example we have seen — a data asset has broken in the “dataplane” that is created by the various forms of data moving in and out of the product, process, or model. This dataplane exists independent of IT infrastructure and application software layers, and assets within the dataplane can break even when infrastructure is healthy.

This gives every data outage the same signature where broken data is moving through healthy infrastructure — what we call “good pipes, bad data” — and reveals them to be the same category of problems, no matter how different they might appear on the surface. (We dig deeper into the new category of data outages in a previously published two-part series. You can read part one here and part two here.)

When I was building Uhana (now part of VMware), I frequently experienced elusive data outages first-hand, where the data would break anywhere in our predictive pipelines without setting off infrastructure alarms.

At Lightup, we have spent the last two years investigating these data quality issues in greater depth. We have spoken with dozens of enterprises about their own data outages and worked with many of them to solve these issues.

Through these conversations and our hands-on work solving them in the real world, we have learned a lot about data outages — in particular, the many levels of harm they cause, and the reasons why businesses often do not notice or measure the full, cumulative nature of this harm on their topline metrics.

The rest of this article presents a framework to understand the true impact of these data outages, and to present the business case for tackling them in a systematic, dedicated and urgent manner.

To demonstrate these points, we will explore:

  • The three ways data outages can damage topline metrics.
  • Why data outages cause more damage than organizations think.
  • How ad hoc remediation creates millions of dollars of productivity loss.
  • How to calculate the hidden “tax” that these data outages generate.

Worse Than They Look: How Data Outages Directly or Indirectly Damage Topline Metrics

Data outages occur in many forms and appear to generate many different types of impact. However, every one of these outages ultimately causes significant damage to topline metrics.

Many data outages directly impact topline metrics, primarily revenue:

  • An airline loses 99% of their revenue on each sale when they start to sell $5,000 tickets for $50 because of an error in manually updating their dollar-to-euro conversion rate.
  • A trading application receives fewer fees when they start to send fewer notifications to trigger sales because of a badly formatted data field in a subset of their notifications.
  • A rideshare company loses rides and/or pays compensation when they start to block legitimate users because of a small change in one data source that fed their fraud detection model.
  • A financial services company pays damages when they generate inaccurate credit scores because of a format change in the data they receive from the Credit Bureau, and they then sell those inaccurate credit scores to banks who issue loans that default.

Other data outages don’t look as harmful because their impact appears one or two layers removed from topline metrics. But ultimately, they create a cascade of problems that can be quantified as a loss in revenue, profit, user acquisition, user retention, or other topline KPIs.

  • When a job search engine generates bad data or data drift that causes a drop in Click-Through-Rate (CTR) they will ultimately experience an indirect loss in Cost-Per-Click (CPC) revenue.
  • When an organization loses the data fidelity of a Know-Your-Customer (KYC) request issued to a third party, they will also experience a drop in user acquisition rates, which ultimately generates a loss in revenue that’s proportional to the Lifetime Value (LTV) of the users they did not acquire.

Finally, some data outages do not produce any traceable, quantifiable damage to topline metrics, but it’s clear they will impact core business performance in some significant way.

  • It’s hard to quantify the topline damage caused when an application masks failed user transactions due to data delay or data drop, but it’s clear the outage will delay remediation.
  • There’s no measurable impact when a new app version releases, begins dropping events and under-represents user engagement metrics, but it’s clear this outage will impair decision-making.

Data quality issues can harm a business through direct impact on topline metrics, indirect impact on stakeholder experience, and unrecognized latent impact [image by author].

In sum: Organizations now derive their competitive advantage from analytical decision-making and data-driven products. Bad data can lead to broken product experience, false conclusions, and counter-productive decisions being made by product teams and executive leadership. When the data engineering and analytics pipeline serves out bad data from an outage, it will fail to deliver the value it was built to deliver, and chip away at the topline of every business — often without raising alarms.

Hidden Data Outages: Why Organizations are Often Bleeding More Than They Know

Even when organizations know that they might be suffering damage to their topline metrics due to data outages, they rarely know just how many of these outages they are experiencing. This is due to the elusive nature of these outages, which hides their impact in three ways.

First, as discussed in our previous series defining data outages, this category of problems does not trigger infrastructure monitoring tools. They all carry the common signature of “good pipes, bad data” where the broken data causing the outage is carried by healthy infrastructure.

Because of this, most organizations believe that because they have established good software testing, good infrastructure monitoring, and good application endpoint monitoring, they will know if they stop carrying good data to their products, processes, or ML models.

But this is not true. An organization could experience many data outages — each of which is causing damage to their topline metrics — all while thinking everything is working just fine.

Second, even though data outages cause damage to KPIs that are being monitored, those outages often create either a latent, lagged, or slowburn impact that isn’t immediately visible.

When data outages create a latent impact, they often create drastic damage that only applies to a limited slice of users, and thus hides in rolled-up KPIs.

Examples of data outages that create a latent impact are:

  • An airline selling tickets at a 99% discount to their trans-Atlantic segment only.
  • A trading application sending fewer notifications to their Android users only.
  • A financial services company generating inaccurate credit scores for a specific geography only.

These data outages are very vertical — unlike IT issues where problems are often horizontal — and can hide in rolled-up KPIs for a long time before the organization notices that something is out-of-whack.
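The hiding effect is easy to see with a toy calculation. In this sketch (all numbers are invented for illustration), a 99% price drop on one small segment barely moves the company-wide average fare, but jumps out the moment the KPI is sliced by segment:

```python
# Hypothetical fares: 1,000 normal domestic sales plus 40 sales on a
# trans-Atlantic segment whose price broke to $50.
fares = {
    "domestic":       [220, 240, 210, 230] * 250,  # 1,000 sales, avg $225
    "trans-atlantic": [50] * 40,                   # 40 broken-price sales
}

# Rolled-up KPI: one average over all sales.
all_sales = [p for seg in fares.values() for p in seg]
rolled_up = sum(all_sales) / len(all_sales)

# Sliced KPI: one average per segment.
by_segment = {seg: sum(p) / len(p) for seg, p in fares.items()}

print(round(rolled_up))              # ~218: barely below the normal 225
print(by_segment["trans-atlantic"])  # 50.0: the outage is unmistakable
```

The rolled-up average moves by about 3%, well within normal noise, while the per-segment view shows a 78% collapse — which is why vertical outages can hide for so long.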

When data outages create a lagged impact, they create damage to data that isn’t being used in real-time, and thus the damage doesn’t appear right away.

For example, if an organization starts collecting bad data today but doesn’t use that data for three weeks, they won’t notice the problem until almost a month after it begins.

When they finally do see the problem it may be too expensive to fix, or it might actually be impossible to fix. If the organization’s customer data integration broke and they didn’t collect data at all for those three weeks then they can’t go back in time to collect that lost data. It’s gone for good.

And when a data outage creates a slowburn impact, the damage will either take time to register or it will totally escape attention because of the Shifting Baseline Syndrome.

For example, a data outage might cause a 1% drop in Daily Active Users (DAU) every day. This level of KPI degradation will not catch the attention of anyone monitoring it until a week or two has passed and it has created a catastrophic level of harm to the business.
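The compounding arithmetic behind that example is worth spelling out. A sketch, with an illustrative baseline of one million DAU:

```python
# Back-of-envelope: a 1% daily DAU loss compounds even though no single
# day looks alarming. The baseline figure is purely illustrative.
baseline = 1_000_000
dau = baseline
daily_loss = 0.01

for day in range(14):
    dau *= (1 - daily_loss)

remaining = dau / baseline
print(f"DAU after 14 days: {remaining:.1%} of baseline")
# ~86.9% — roughly a 13% hit in two weeks from a "tiny" daily drop
```

No single day’s 1% dip trips an alert tuned for sudden changes, yet two weeks in, the cumulative loss is catastrophic — the shifting baseline at work.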

Organizations can bleed from these latent, lagged, or slowburn data outages for a long time before they realize anything is wrong — or even worse, they can experience data outages that remain so small that they stay under the radar and silently, indefinitely bleed topline metrics.

Why data outages are so expensive — they go unnoticed in infrastructure monitoring tools, they often show up with a lag, and they always end up needing a digital war-room to root-cause and resolve [image by author].

Finally, even when an organization identifies a data outage, they typically do not see the full scope of the problem. They approach the outage as an isolated point problem and make solving it the responsibility of whatever stakeholder consumes the data that broke.

But that initial, identified point problem is often just one piece of the total problem. That point problem was caused by a data asset breaking, but organizations do not leverage a collection of well-separated data assets. Instead, their functions ride on top of a data plane with a complex topology that has multiple data pipelines — each arranged in its own specific lineage — that crisscross in fragile ways.

Data-driven organizations use this data plane as their nervous system. It is highly interconnected and one problem in one section can generate impact in many other far-reaching sections.

This means when a data asset breaks and creates a visible point problem it often creates other problems the organization does not know about.

In sum: The full scope of damage caused by that data outage will be the sum total of all harm caused by each of those point problems — most of which organizations do not know they should be looking for, and whose impact will remain unknown.

Adding Insult to Injury: Why Ad Hoc Remediation Amplifies the Damage Caused by Outages

Organizations create additional layers of topline harm when they approach data outages as point problems to be solved by individual stakeholders creating point solutions.

To create these point solutions, the individual stakeholders typically have to work through a long-winded process that crosses many teams and consumes significant resources.

The typical mitigation sequence of data outages looks like this.

  1. An internal consumer of data notices the data outage because some data-dependent element of their work is no longer making sense.
  2. That data consumer then wastes hours or days (in)validating the data at hand to trace the issue back to its root cause. She has to follow a long, cross-functional war-room exercise to determine where the data actually broke within the pipeline — collection, transformation, or serving — and who produced the broken data asset.
  3. She then takes the problem to the producer of the broken data asset. They have to spend time repairing the source of the problem and then repairing the broken data by going back in time — whenever possible — throwing away broken data, and backfilling good data across the pipeline.

The whole process is time-consuming, resource-intensive, and creates significant productivity loss for data engineers and business stakeholders.

It also places significant pressure on multiple teams. This sequence is often triggered when the data outage attracts the attention of company directors and executives who demand the problem be fixed immediately, resulting in teams putting in throw-away work and taking shortcuts that accumulate huge technical debt.

In sum: These topline damages are rarely considered when calculating the total impact of data outages, but they layer on top of the direct and indirect harm caused by the outage itself… and that harm continues to accumulate every minute that teams attempt to remediate the outage.

Adding Up Your Data Outage Tax: The True Cost of Damage Caused by Hidden Data Breaks

It’s clear that data outages create a much greater impact than they initially present, and that you can build a strong business case around investing in data quality solutions or other systematic approaches to resolve them.

To do so, you simply have to define the total impact of your data outages. When you define this impact, consider the three sources of topline damage we have outlined in this article:

  1. The direct and indirect damage your data outages are causing to your topline metrics. You can calculate the direct damage pretty easily. The indirect damage can be hard to quantify, but you can create estimates around impaired decision making and similar issues. For reference: Every enterprise data outage we have seen that creates direct impact has caused a few million dollars worth of topline damages, on its own.
  2. The other data outages caused by every broken data asset you find. To do so, you have to first map out every other feature, process, or model impacted by every broken data asset, which can be an expensive exercise in and of itself. For reference: From the organizations we’ve worked with, a typical data outage causes problems for at least 4–5 other data assets downstream, and you can assume each of your own outages follows suit.
  3. The additional costs created by remediation. We have heard some engineers and analysts report that 40% of their time goes to debugging data outages. For reference: This creates millions of dollars of annual productivity loss for a pizza-box-sized analyst team.
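The three components above can be stacked into a back-of-envelope calculator. Every input below is a placeholder — the per-outage damage, outage frequency, team size, and salary figures are assumptions to be replaced with your own numbers:

```python
# Rough sketch of the "data outage tax": direct/indirect topline damage,
# downstream breakage, and remediation productivity loss, summed.

def data_outage_tax(direct_damage_per_outage, outages_per_year,
                    downstream_assets_per_outage, downstream_damage_per_asset,
                    analyst_count, loaded_cost_per_analyst,
                    debug_time_fraction):
    direct = direct_damage_per_outage * outages_per_year
    downstream = (outages_per_year * downstream_assets_per_outage
                  * downstream_damage_per_asset)
    remediation = analyst_count * loaded_cost_per_analyst * debug_time_fraction
    return direct + downstream + remediation

tax = data_outage_tax(
    direct_damage_per_outage=2_000_000,  # "a few million dollars" per outage
    outages_per_year=3,
    downstream_assets_per_outage=4,      # "4-5 other data assets" downstream
    downstream_damage_per_asset=100_000, # placeholder per-asset damage
    analyst_count=8,                     # a pizza-box-sized team
    loaded_cost_per_analyst=200_000,     # placeholder loaded cost
    debug_time_fraction=0.40,            # "40% of their time" debugging
)
print(f"${tax:,.0f} per year")
```

Even with these deliberately modest placeholders, the annual total runs well into the millions — before counting the outages you never detect.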

Stack that up and you realize that Gartner gave a conservative estimate when they stated enterprises experience $15 million per year in losses from poor data quality. The total for many enterprises is likely much higher — and the more data-driven the enterprise, the greater those losses will be.

Your own organization is likely bleeding millions of dollars per year in topline harm from your data outages, despite all the investment of time, attention, and budget put into remediating these issues. And then there’s the frustration it causes us — data engineers, data analysts, and data scientists — playing catch-up all the time.

We can do better than paying the data outage tax one day at a time.

Let’s bring order to data chaos!

References:
[1] S. Moore, How to Create a Business Case for Data Quality Improvement (2018), Gartner
[2] M. Goetz, G. Leganza, E. Miller, J. Vale, Data Performance Management is Essential to Prove Data’s ROI (2018), Forrester
[3] T. Redman, Bad Data Costs the U.S. $3.1 Trillion Per Year (2016), IBM via Harvard Business Review



CEO & Co-founder of Lightup, previously a Co-founder of Uhana. Stay connected: linkedin.com/in/manukmrbansal/.