Implementing Data Quality at Scale: Investigating Validation & Testing for Large Data Sets

Martin Arroyo
99P Labs
Nov 18, 2022

Validation is a critical step in data processing pipelines. Following ingestion and one or more transformations, it is imperative that we validate our expectations of the data sets that are produced by these pipelines. But how exactly can we test our data to ensure its validity? What kinds of checks should there be? And at what points should we run these checks?

Additionally, what if our data set is both large and continuously streaming new data? How does that impact our data quality monitoring and measurement efforts? These are some of the questions we will investigate and attempt to answer.

This article is the first in a series on implementing data quality checks at scale. Be sure to follow us to stay up-to-date with this series and others from the team at 99P Labs.

What exactly are data quality and validation? Why is this important?

Data quality is a measurement of the condition of a given data set, based on factors like accuracy, completeness, consistency, timeliness, and reliability. It measures how well-suited a data set is to serve its intended purpose.

Validation, in this context, refers to testing data outputs to confirm that they accurately represent the constructs we intend to measure. This includes confirming our expectations of the data shape, structure, and completeness.

But why are data quality and validation so important? Well, for one, erroneous data can cost businesses a lot of money. Poor data quality has been estimated to cost businesses $9.7 million per year on average, and IBM puts the total cost in the US alone at $3.1 trillion annually. Other estimates place the impact at roughly 8–12% of total annual revenue.

Organizations typically find data error rates between 1% and 5%, but this figure can be as high as 30% (or more) for others. To calculate the data error rate, we take the number of fields where errors were observed and simply divide by the total number of fields under test:

Data Error Rate = (number of fields with errors) / (total number of fields under test)
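
For example, here is a minimal Python sketch of the calculation. The field counts are made up purely for illustration:

```python
# Minimal illustration of the data error rate calculation.
# The field counts below are hypothetical, for demonstration only.
fields_with_errors = 1_250     # fields where a check found an error
total_fields_tested = 50_000   # total fields inspected

data_error_rate = fields_with_errors / total_fields_tested
print(f"Data error rate: {data_error_rate:.2%}")  # -> Data error rate: 2.50%
```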

Poor data quality is costly. But there is one cost of bad data that can't truly be quantified: trust. Bad data can lead to spurious insights, and decisions based on those insights can have disastrous consequences.

Establishing trust is difficult. It is even harder to win back once it has been damaged. Consider the average cost of a data breach, which in 2022 is $9.44 million in the US. While that figure includes legal and regulatory fees, technical activities, and other factors like customer turnover, it still cannot capture the full cost of the breach of trust.

Aside from financial and reputational costs, bad data wastes time and undermines analytics efforts by introducing errors into data models. At a minimum, such errors create extra roadblocks that the data team must spend time troubleshooting and correcting. At worst, they lead to the aforementioned spurious insights and, consequently, ill-informed decision-making.

Data quality and validation are critical because bad data costs time, money, and trust.

What have we done so far?

In previous posts, we have explored possible data quality frameworks and metrics, including running some experiments using open source tools like Great Expectations. Our last post about data quality frameworks, Weighing the Value of Data Quality Checks, surfaced several key findings and recommendations.

Our primary consideration for a data quality framework, based on research done so far, remains that our output should be useful, easy to understand, and easy to access for our stakeholders. This would look like a set of meaningful metrics that users of our data could view to quickly determine quality at both the table and column levels. The ideal output would describe the quality of the given data at a glance.

With respect to tools, the consensus so far is to leverage Trino (a distributed query engine) over our data lake to create queries that profile our data and perform validation checks. Great Expectations is quickly becoming ubiquitous among data quality monitoring tools, but the jury is still out on whether it's the best fit for our needs. We found it to be a good tool for automating data quality checks in general, but we're not yet convinced its overhead is justified for our use case. Since it is still under active development, we may revisit it in future posts.
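
To make the Trino approach concrete, here is a rough sketch of the kind of profiling query we have in mind, issued through the open source trino Python client. The connection details, table, and column names are placeholders for illustration, not our actual configuration:

```python
# Sketch of a table profile run through Trino via the trino-python-client.
# Host, catalog, schema, table, and column names are all hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # placeholder host
    port=8080,
    user="profiler",
    catalog="hive",
    schema="telematics",
)

profile_sql = """
SELECT
    count(*)                AS row_count,
    count(trip_id)          AS non_null_trip_ids,
    count(DISTINCT trip_id) AS distinct_trip_ids,
    min(recorded_at)        AS earliest_record,
    max(recorded_at)        AS latest_record
FROM trips
"""

cur = conn.cursor()
cur.execute(profile_sql)
print(cur.fetchone())
```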

While we have yet to implement a data quality framework, our research so far has helped shape the direction that we want to go and has given us an idea of what the final product should include.

What challenges are we currently facing?

From a technical standpoint, our data volume presents the biggest challenge to overcome at this time. This creates difficulties with profiling the data and performing even basic validation checks, such as checking for the presence of null values or duplicate rows.
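
As an illustration of what we mean by basic checks, the sketch below expresses a null check and a duplicate-row check as Trino SQL against a hypothetical trips table; the table and column names are made up, and the helper accepts any DB-API cursor such as one from the trino client:

```python
# Two basic validation checks expressed as Trino SQL.
# The `trips` table and `trip_id` column are hypothetical.
NULL_CHECK = """
SELECT count(*) AS null_trip_ids
FROM trips
WHERE trip_id IS NULL
"""

DUPLICATE_CHECK = """
SELECT trip_id, count(*) AS occurrences
FROM trips
GROUP BY trip_id
HAVING count(*) > 1
"""

def run_basic_checks(cursor):
    """Run each check with a DB-API cursor and return the result rows."""
    results = {}
    for name, sql in {"nulls": NULL_CHECK, "duplicates": DUPLICATE_CHECK}.items():
        cursor.execute(sql)
        results[name] = cursor.fetchall()
    return results
```

Even queries this simple become expensive at our scale, which is why sampling and partitioning (discussed below) matter so much.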

The velocity with which we ingest new data also presents challenges to implementing an effective data quality framework. Our telematics data streams in near-real time, which means that any potential framework would need to accommodate the rate of ingestion and processing.

Another challenge is understanding the data from a semantic perspective well enough to create reasonable expectations, which could then be validated. Data quality is a team sport, and we will require help from domain experts to establish reasonable and effective data validation checks.

How are we planning to overcome these challenges?

The first step in implementing a data quality framework would be to enable data profiling. Knowing more about the properties and distributions of our data will be critical in establishing baseline measurements, as well as continued monitoring. This is a prerequisite for implementing validation and testing.

To overcome the challenge of profiling a large volume of data, we will use statistical sampling techniques to select a representative subset of our data to profile. As mentioned earlier, Trino will be leveraged to perform the profiling due to its ability to query large amounts of data in a distributed manner. To deal with data velocity issues, we will investigate partitioning methods to include as part of our overall sampling strategy.
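
For example, Trino's TABLESAMPLE clause can be combined with a partition filter so that a profiling query samples only a fraction of the rows from a recent slice of the data. This is a sketch of the idea, with hypothetical table, column, and partition names:

```python
# Sketch of a sampled, partition-restricted profiling query in Trino SQL.
# Table, column, and partition names are hypothetical.
SAMPLED_PROFILE_SQL = """
SELECT
    count(*)                 AS sampled_rows,
    approx_distinct(trip_id) AS approx_distinct_trip_ids,
    avg(speed_kph)           AS avg_speed_kph
FROM trips TABLESAMPLE BERNOULLI (1)    -- roughly a 1% random row sample
WHERE ingest_date >= date '2022-11-01'  -- restrict to recent partitions
"""
```

Bernoulli sampling selects individual rows at the given probability but still reads the underlying data, while Trino's SYSTEM sampling is cheaper because it samples whole splits at the cost of a less uniform sample; which mode fits our data best is one of the things we plan to evaluate.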

Once we have implemented a profiler, we will then work on creating our validation checks. There are general validation checks that can be used across all (or most) of our data sets, particularly those that focus on validating structure and completeness. However, this stage will largely be a collaborative endeavor, as we will need the help of domain experts and other knowledgeable parties to create checks that are specific to a given data set.
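
As an illustration of what a general, data-set-agnostic check might look like, here is a hedged Python sketch that validates structure (expected columns are present) and completeness (a minimum row count) using any DB-API cursor, such as one from the trino client. The table, columns, and thresholds shown are hypothetical examples, not our actual rules:

```python
# Illustrative shape of a reusable, table-agnostic validation check.
# The expectation values below are hypothetical examples.
from dataclasses import dataclass

@dataclass
class TableExpectation:
    table: str
    required_columns: frozenset
    min_row_count: int

def validate(cursor, expectation: TableExpectation) -> list:
    """Run structure and completeness checks with a DB-API cursor and
    return a list of human-readable failures (empty list = pass)."""
    failures = []

    # Structure: confirm every required column exists.
    cursor.execute(f"SHOW COLUMNS FROM {expectation.table}")
    actual_columns = {row[0] for row in cursor.fetchall()}
    missing = expectation.required_columns - actual_columns
    if missing:
        failures.append(f"{expectation.table}: missing columns {sorted(missing)}")

    # Completeness: confirm the table meets a minimum row count.
    cursor.execute(f"SELECT count(*) FROM {expectation.table}")
    (row_count,) = cursor.fetchone()
    if row_count < expectation.min_row_count:
        failures.append(
            f"{expectation.table}: {row_count} rows, "
            f"expected at least {expectation.min_row_count}"
        )

    return failures

# Example expectation for a hypothetical table:
trips_expectation = TableExpectation(
    table="trips",
    required_columns=frozenset({"trip_id", "vehicle_id", "recorded_at"}),
    min_row_count=1,
)
```

Checks specific to a given data set, such as valid value ranges for a sensor reading, are where domain experts come in.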

What’s next?

Our next step is to enable data profiling which, as mentioned earlier, is a prerequisite for implementing a data quality framework. First, we will explore sampling and partitioning methods for large data sets to help us overcome volume and velocity issues. Then we will specify and implement our data profiler, leveraging Trino and other distributed technologies.

Final Thoughts

Data quality and validation are important because poor data costs time, money, and trust. Testing our data and ensuring its validity requires knowledge of the characteristics of the data (via profiling) as well as its semantic qualities (which can be uncovered with help from domain experts). Given both the volume and velocity of our data, it is imperative to leverage statistical sampling and distributed computing to profile and validate it efficiently.

We hope you have found this informative and that you’ll continue to follow along as we work on implementing a data profiler for a large scale data set in the next part of this series!


Data Engineer @ 99P Labs | Data Analytics Instructor @ COOP Careers