It’s Time to Rebuild Data Quality with a Clean Slate: Too Much Has Changed

Manu Bansal
Lightup Data
Jul 1, 2021

Data has changed dramatically in the last 10 years, and transformed the data quality problem — and it’s time that data quality tools caught up.

Your application just broke.

It’s mis-predicting credit scores, or selling flight tickets at a ridiculous discount, or blocking legitimate users from ride-sharing, or reporting outrageous retail sales numbers on a dashboard.

You trace the symptom back to a problem in your data pipeline. Your IT and API monitoring tools didn’t catch the issue — and neither did your data quality tools.

Problems like these are costing organizations an average of $15M per year. They are sparking new conversations about the data quality problem — now known by the shiny new name “data observability” — and the birth of new tools that are designed to catch the issues that legacy tools keep missing.

In this piece, we explore why this is happening now.

To do so, we’ll answer a few questions:

  • What has changed about the data quality problem in the new data stack?
  • What has made legacy tools fundamentally insufficient?
  • Why are we now forced to reconsider a problem that appeared solved for decades?

Ultimately, we will argue that solving the data quality problem that traditionally needed one-off interactive analyses now requires a radically new approach: continuous automated in-place monitoring. The need is urgent and the timing is ripe for disruption.

Here’s why.

Three Events that Made Legacy Data Quality Tools Insufficient

To understand the series of developments that completely changed data quality requirements, we need to look back at the evolution of the data landscape since the incumbent data quality tools made their appearance.

Informatica Data Quality was released in 2001. Talend was released in 2005. Comparable industry-leading tools arrived in the same window.

At that time, data was consumed in small, static batches, and most problems were caused by manual errors (like leaving out the area code when entering a phone number). As such, the data quality problem could be solved with one-off interactive analysis of data before consumption to spot malformed records.
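That legacy workflow can be pictured as a one-off scan for malformed records. Here is a minimal sketch (the record layout and US-style phone format are illustrative assumptions, not from any particular tool):

```python
import re

# A record is "well-formed" if the phone number includes a 3-digit area code.
AREA_CODE_PATTERN = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")

def find_malformed(records):
    """One-off pre-consumption scan: return records whose phone number
    lacks an area code (e.g. '555-1234' instead of '650-555-1234')."""
    return [r for r in records if not AREA_CODE_PATTERN.match(r["phone"])]

records = [
    {"id": 1, "phone": "650-555-1234"},
    {"id": 2, "phone": "555-1234"},  # area code left out during manual entry
]
print([r["id"] for r in find_malformed(records)])
```

The key property of this style: the full dataset is loaded into the tool, inspected once, and violations are reviewed record by record.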

Legacy tools like IDQ and Talend were designed to do exactly this, and they worked well.

But then, three events in the last decade transformed all things data, and changed our requirements for data quality tools.

Event 1: The Birth of Big Data and ETL
ETL for big data began in 2006 with Hadoop and steadily penetrated the mainstream Fortune 500 enterprise segment over the following decade.

Event 2: The Birth of Cloud
Mainstream cloud adoption began in 2006 with AWS’ public launch and hit prime time in the data stack when Redshift became generally available in 2013.

Event 3: The Birth of the Cloud Data Warehouse and ELT
Cloud Data Warehouses (CDWs) made scalable data warehousing accessible to everyone. Snowflake was founded in 2012 followed by Databricks in 2013. Along with offerings like Redshift and BigQuery, CDWs became the centerpiece of data stacks and drove the shift from ETL to ELT style pipelines.

These events created today’s world of “big data” and fundamentally changed the data quality problem — and made legacy tools insufficient — across eight factors.

Eight Factors That Changed the Data Quality Problem

The tectonic shifts in the big data landscape transformed the data quality problem by ushering in fundamentally new factors:

Factor 1 - Increased Data Volumes: Big data lakes and warehouses hold data volumes so enormous that it is too expensive, too slow, or outright infeasible to load complete datasets into a separate data quality tool for one-off interactive analysis. Moreover, inspecting individual violations would be impractical even if computation were not a bottleneck.

Factor 2 - Increased Data Cardinality: Applications now utilize thousands of data tables with hundreds or thousands of columns each that can’t be inspected manually.

Factor 3 - Increased Data Stochasticity: There is now so much data volume, variety, and interdependence that some degree of imperfection is always to be expected. Hard rules are too restrictive; data quality tests need to be statistical, evaluating data health against a non-ideal baseline.
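The difference between a hard rule and a statistical test can be sketched in a few lines. In this hedged example (the metric, window, and 3-sigma threshold are illustrative assumptions), a column's daily null rate is compared against a baseline learned from history instead of being required to be exactly zero:

```python
from statistics import mean, stdev

def is_anomalous(history, today, k=3.0):
    """Statistical data quality test: flag today's metric (e.g. a column's
    null rate) only when it deviates from the historical baseline by more
    than k standard deviations. Contrast with a hard rule like
    'null rate must be 0', which would fire every single day."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) > k * sigma

# Null rate has hovered around 2% for two weeks.
baseline = [0.020, 0.021, 0.019, 0.022, 0.020, 0.018, 0.021,
            0.020, 0.019, 0.022, 0.021, 0.020, 0.019, 0.021]
print(is_anomalous(baseline, 0.023))  # within normal variation
print(is_anomalous(baseline, 0.150))  # a real incident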

Factor 4 - Continuous Flow of Data: Data now arrives every hour or minute and must be used right away, necessitating near-real-time, automated issue detection on a cadence.

Factor 5 - Disaggregated Processing Pipelines: We now have automated ELT pipelines made up of several best-of-breed tools working together at different stages, each introducing its own novel failure modes.

Factor 6 - Dynamic Data Shapes: Data is now entrenched in product and analytics pipelines, and data models evolve continuously alongside software with CI/CD. Data quality checks need to keep up with these evolving shapes.

Factor 7 - Dataflow Topology/Lineage: Data pipelines now have a dozen stages and many branches, adding a spatial dimension to data quality issues through cascading effects.

Factor 8 - Timeseries Problems: Data flows continuously in small batches, adding a temporal dimension to issues and demanding quality checks against historical reference points.
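The temporal dimension means a metric can be "normal" at one time of day and alarming at another. As a hedged sketch (the hourly row-count metric and tolerance are illustrative assumptions), a check can compare each batch against historical values for the same hour, so a nightly lull is not mistaken for an outage:

```python
def seasonal_check(counts_by_hour, current_hour, current_count, tolerance=0.5):
    """Temporal quality check: compare this hour's row count against the
    same hour on previous days. counts_by_hour maps hour-of-day to a
    list of historical counts for that hour."""
    history = counts_by_hour[current_hour]
    baseline = sum(history) / len(history)
    return abs(current_count - baseline) <= tolerance * baseline

# Traffic at 03:00 is normally low; 900 rows at 03:00 is healthy,
# but the same 900 rows at 12:00 would signal a partial feed.
counts = {3: [1000, 950, 1050], 12: [20000, 21000, 19500]}
print(seasonal_check(counts, 3, 900))
print(seasonal_check(counts, 12, 900))
```

The same absolute value passes at 3 a.m. and fails at noon, which is exactly the behavior a fixed threshold cannot express.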

These technical changes — alongside cultural changes like the desire for API-first design — have added up and created a new data quality problem that must be solved in a new way.

Data is now consumed in large, high-cardinality flows, and most problems are now created by machine errors (like software bugs or failed data transformations). As such, to solve the data quality problem today, you need to continuously monitor data quality metrics in-place to detect issues at scale and pinpoint the software operations at their root.
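"In-place" monitoring means pushing the metric computation down to where the data lives and pulling back only an aggregate, rather than exporting the dataset into a separate tool. In this hedged sketch, sqlite3 stands in for a warehouse connection (the table and query are illustrative assumptions):

```python
import sqlite3

# sqlite3 stands in here for a warehouse connection (Snowflake, BigQuery, ...).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 19.99), (2, NULL), (3, 42.50), (4, NULL);
""")

# In-place check: the aggregate runs where the data lives; only a single
# number crosses the wire, not the dataset itself.
(null_rate,) = conn.execute(
    "SELECT AVG(CASE WHEN amount IS NULL THEN 1.0 ELSE 0.0 END) FROM orders"
).fetchone()
print(f"null rate for orders.amount: {null_rate:.2f}")
```

Run on a cadence, a query like this yields a timeseries of metric values that the statistical checks described above can evaluate, without ever moving the underlying data.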

It has become painfully obvious that too much has changed, that legacy import-and-inspect tools do not work in the new world of data, and that we need to rethink — and retool — the data quality problem from a clean slate, based on the new premise of automated monitoring.

Why Now is the Time to Revisit Data Quality

Data quality has been a thorn in our side since we started the big data journey. We have been dealing with ever more failure points and fragmented data-plane monitoring, and support for data quality has grown thinner even as companies depend on data more than ever. But it has been unclear what a solution could look like. Until now.

After a period of heavy flux in the big data jungle transitioning from ETL to ELT, a new and stable data stack has emerged. This stack is a disaggregated “open data ecosystem” built with best-of-breed components on top of the data lake or warehouse — the centerpiece of the new stack.

The warehouse, lake, or lakehouse, and the stack built on this architecture, comes with its own challenges: far fewer data integrity checks and constraints are enforced than in traditional databases, and the data pipeline has more components than ever before, each of which can fail independently.

Despite the challenges, the new ELT stack is way more structured than the big data architectures that preceded it. The warehouse or the lake offers a central point of integration where data quality can be observed across the entire pipeline as data progresses successively from raw dumps to finished assets. And it’s a substrate that is scalable by design. A data quality solution that leverages the data lake or the warehouse can deliver continuous, comprehensive, in-place data quality monitoring out-of-the-box.

It’s a radical rethinking of data quality monitoring, but there has never been a better time to solve this problem. The issue is urgent, with our dependence on data at its highest point in history, and the problem is finally tractable for the first time in more than a decade.

There is light at the end of the tunnel.

Lightup brings order to data chaos. We give organizations a single, unified platform to accurately detect, investigate, and remediate data outages in real-time.

To see if Lightup can solve your data outages, take the right next step.

Note: A modified version of this article was originally published by The New Stack, under the title “The Data Quality Problem and Its Impact on Application Performance”.



CEO & Co-founder of Lightup, previously a Co-founder of Uhana. Stay connected: linkedin.com/in/manukmrbansal/.