Data Quality — A Primer

Astasia Myers
Published in Memory Leak
5 min read · Oct 13, 2020


Data is becoming more critical for running a highly functional business, from ad hoc data analysis to Business Intelligence (BI) to Machine Learning (ML). Data quality ensures data is fit for consumption and meets the needs of data consumers. Historically, individuals did not have data quality tools and had to identify data issues manually. Over the past two years, we have seen the emergence of numerous data quality solutions, and we believe data quality is a core component of the modern data stack.

To be of high quality, data must be consistent and unambiguous. You can measure data quality along dimensions including accuracy, completeness, consistency, integrity, reasonability, timeliness, uniqueness, format, validity, and accessibility.
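To make a few of those dimensions concrete, here is a minimal sketch of how completeness, uniqueness, and timeliness might be scored on a single table. It assumes a pandas DataFrame with hypothetical order_id, amount_usd, and created_at columns; commercial tools compute far richer versions of these metrics across entire warehouses.

```python
import pandas as pd

# Hypothetical orders table; column names are illustrative, not from the article.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4, 5],
    "amount_usd": [20.0, None, 35.5, 10.0, 99.0],
    "created_at": pd.to_datetime(
        ["2020-10-01", "2020-10-02", "2020-10-02", "2020-10-03", "2020-10-05"]
    ),
})

# Completeness: share of non-null values per column.
completeness = orders.notna().mean()

# Uniqueness: share of rows whose primary key is not a duplicate.
uniqueness = 1 - orders["order_id"].duplicated().mean()

# Timeliness: how long since the most recent record arrived.
lag = pd.Timestamp.now() - orders["created_at"].max()

print(completeness, uniqueness, lag, sep="\n")
```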

Data quality issues can come in many forms. Often they are the result of database merges or systems/cloud integration processes in which data fields that should be compatible are not, due to schema or format inconsistencies. Data pipelines can break upstream, so data that is used to train a model is not updated. Pipeline breaks can corrupt data by changing the unit or format of the data. There can be “silent failures” (issues you don’t know you have), like a data owner changing the input metric unit from thousands to millions. Some refer to periods of time when data is partial, erroneous, missing, or otherwise inaccurate as “data downtime.”
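A silent unit change like thousands to millions shows up as a sudden order-of-magnitude shift in the values, in either direction. As an illustration, the sketch below compares the latest batch’s median against recent history; the function name and the factor-of-50 threshold are assumptions made for the example, not a standard.

```python
import statistics

def check_unit_shift(history, latest, factor=50):
    """Flag a likely silent unit change when the latest batch's median is
    wildly out of line with recent history. The name and the factor-of-50
    threshold are illustrative assumptions."""
    baseline = statistics.median(history)
    current = statistics.median(latest)
    if baseline == 0:
        return False
    ratio = current / baseline
    return ratio > factor or ratio < 1 / factor

# Revenue reported in thousands for weeks, then suddenly entered in raw units.
history = [120, 135, 128, 150, 142]   # prior weekly batches
latest = [128_000, 131_000, 149_000]  # this week's batch
print(check_unit_shift(history, latest))  # True -> raise an alert
```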

Currently, most companies do not have processes or technology to identify “dirty data.” Typically, someone must spot the error. Then the data platform or engineering team must manually identify the error, often by sampling data and doing parity matching, and fix it. It is error-prone, time-consuming, tedious work (taking up to 80% of data scientists’ time), and it’s the problem data scientists complain about most.
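For a sense of what that manual work looks like, here is a small parity-matching sketch: sample keys from a source-of-truth table and check that the same rows, with the same values, exist in the warehouse copy. The tables and column names are hypothetical.

```python
import pandas as pd

# Hypothetical source-of-truth table and its warehouse copy; names are illustrative.
source = pd.DataFrame({"user_id": [1, 2, 3, 4],
                       "plan": ["free", "pro", "pro", "free"]})
warehouse = pd.DataFrame({"user_id": [1, 2, 3],
                          "plan": ["free", "pro", "free"]})

# Sample keys from the source (in practice a small fraction of a huge table)
# and look up the same rows in the warehouse.
sample = source.sample(n=3, random_state=0)
merged = sample.merge(warehouse, on="user_id", how="left", suffixes=("_src", "_wh"))

missing = merged["plan_wh"].isna()                    # rows that never landed
mismatched = merged["plan_src"] != merged["plan_wh"]  # rows whose values drifted

print(merged[missing | mismatched])
```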

Companies need high data quality in order to depend on their data, and the perils of bad data are numerous. While the caustic observation “garbage in, garbage out” has plagued analytics and decision-making for generations, it carries a special warning for Machine Learning (ML) since the time it takes to develop a model is significant. If an ML engineer spends time training and serving an ML model built with bad data, the incorrect model will be ineffective in production and can have negative secondary implications for user experience and revenue. An O’Reilly survey found that those with mature AI practices (as measured by how long they’ve had models in production) cited a “lack of data or data quality issues” as the main bottleneck holding back further ML adoption.

Data quality is foundational to a business’s human and machine decision-making. Dirty data can result in incorrect values in dashboards and executive briefings. Importantly, once confidence in the data erodes, teams don’t trust the figures, which can lead to indecision that slows business progress. Additionally, we’ve heard about bad data leading to product development decisions that have cost corporations millions of dollars in engineering effort. Machine-made decisions based on bad data can lead to biased or inaccurate actions.

https://profisee.com/data-quality-what-why-how-who/

The individuals responsible for managing data quality are typically the VP of Data, the Director of Data Engineering, or the Chief Data Officer. We believe they are feeling the pain of data quality more now than ever because 1) the volumes of data are rapidly increasing; 2) environments are more complex, affecting data flows and making it harder to pinpoint the step in the pipeline that has the issue; 3) architectures have changed significantly as businesses shift databases and storage to the cloud (e.g. Snowflake, Redshift, BigQuery, etc.); and 4) data ownership is becoming less clear as teams shift towards data meshes (distributed data ownership). With data meshes, leaders appreciate that high data quality is imperative because data producer teams and data consumer teams have a contract between them to make sure the data is cleaned, catalogued, and reliable. Recognizing that fixing data issues manually leads to toil, delay, and error, these buyers are looking for automated solutions to mitigate their data quality challenges.

There are now a handful of startups building products to identify data issues and then help teams develop workflows to respond to those events. Products evaluate freshness, distribution, volume, schema, and lineage across and within data sources. They provide observability, troubleshooting, and incident response. Below we have identified 14 data quality solutions.
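To ground those signals, the sketch below computes freshness, volume, and schema for one table, using an in-memory SQLite database as a stand-in for a warehouse. The table, thresholds, and alert logic are illustrative assumptions; real products run checks like these continuously against warehouse metadata.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical events table in an in-memory SQLite database; in practice these
# checks run against warehouse metadata (Snowflake, Redshift, BigQuery, ...).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT, loaded_at TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [(1, "a", "2020-10-12 08:00:00"), (2, "b", "2020-10-12 09:30:00")])

now = datetime(2020, 10, 13)  # fixed "now" so the example is deterministic

# Freshness: time since the last row landed.
(last_loaded,) = conn.execute("SELECT MAX(loaded_at) FROM events").fetchone()
lag = now - datetime.fromisoformat(last_loaded)

# Volume: row count versus an expected floor (the floor is an assumption).
(row_count,) = conn.execute("SELECT COUNT(*) FROM events").fetchone()

# Schema: column names and types, to diff against a last-known-good snapshot.
schema = [(name, ctype) for _, name, ctype, *_ in conn.execute("PRAGMA table_info(events)")]

print(lag, row_count, schema)
if lag > timedelta(hours=6) or row_count < 1000:
    print("alert: table looks stale or under-filled")
```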

We break data quality offerings down along three axes: 1) internal vs. external data, 2) rule-based vs. ML-based, and 3) pre-production vs. post-production data checking. Data quality vendors can focus on validating third-party data before it enters internal systems or focus on internally generated data. To identify errors, some solutions take a rule-based approach while others apply ML to identify anomalies. Finally, data can be checked pre-production, for example through table diffing or ETL regression testing when code changes, or in production by checking data distributions across databases.
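The rule-based vs. ML-based split can be shown in a few lines. Below, a hand-written threshold and a simple z-score test both flag a collapsed daily load; the numbers are made up, and actual ML-based products learn seasonality and use far richer models than a z-score.

```python
import statistics

# Daily row counts for a table; the last value is today's collapsed load (made up).
daily_row_counts = [10_120, 10_480, 9_990, 10_210, 10_305, 10_150, 4_020]
today, history = daily_row_counts[-1], daily_row_counts[:-1]

# Rule-based: a hand-written threshold someone has to know to set and maintain.
rule_violation = today < 8_000

# ML-style: flag today if it deviates strongly from what history predicts.
# (A z-score stands in for the learned models real products use.)
mean, stdev = statistics.mean(history), statistics.stdev(history)
z_score = (today - mean) / stdev
anomaly = abs(z_score) > 3

print(rule_violation, round(z_score, 1), anomaly)  # True -36.7 True
```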

We believe there will be convergence over time, with the winner supporting internal and external data quality as well as pre-production and production data validation and observability. We also hypothesize that data quality companies will move into data lineage and data cataloging over time.

Data quality solutions are becoming a core piece of the data stack. We are excited to watch as the ecosystem evolves with the new players taking a more data-centric approach. If you or someone you know is working on a data quality startup or adjacent offering, it would be great to hear from you. Comment below or email me at amyers@redpoint.com to let us know.


Astasia Myers
General Partner @ Felicis, previously Investor @ Redpoint Ventures, Quiet Capital, and Cisco Investments