Weighing the Value of Data Quality Checks
Written by: Nithin Santhanam
Our team has been exploring adding data quality checks to our data platform. We've been trying to establish our goals, limits, and expectations, some of which were discussed in Part 1 of our Data Quality blog posts. We've traditionally used open-source tools in our data stack, and data quality has been an area where we've asked ourselves: "Is this a large enough and important enough area for us to invest in a data quality product? Or can we develop something that works for us in-house?" I wanted to explore how data quality would look on our platform and how we should approach the topic as a whole, so I ran a few small experiments to build a proof of concept and validate some of our thinking: a hypothetical in-house data quality tool and all the components I would need to consider for it. In this post, I discuss the ideas, components, and findings from that research.
Goals:
We want to execute data quality checks at scale on all of our datasets to ensure some baseline standards. We want to provide data scientists with understandable data quality metrics that give them confidence in using our data. We also want to use data quality to help us identify pipeline or data issues as they arise. Lastly, we want to be able to expand any data quality tooling in a flexible and scalable manner as more features are chosen in the future. For the proof of concept, I decided to reduce the problem to data completeness, a data quality check on the nulls present in the data. However, the topics discussed below apply to all the potential data quality checks we'd hope to do with this proposed tool.
Problem Definitions and Questions:
Flexibility and Customization:
Data quality, while ideally an objective measurement, leaves a lot of room for interpretation and preference. For example, if we want to know the completeness of a table, we could go through every single column, count how many nulls exist in each, and record that. However, as most data scientists know, we often only care about a few of the most critical columns, and an irrelevant column that is almost entirely null may not affect our perception of the table's data quality. For this reason, I implemented a feature in my experiment called "columns of interest." Its purpose is to narrow down which columns for any given table should influence the completeness data quality metric for that table. If users choose to, they can include all of a table's columns; that's their prerogative. When data, goals, or priorities change, the columns of interest or other metadata properties could change too, so we need to provide that flexibility and anticipate the needs that could arise.
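As a rough sketch of what that flexibility could look like, the snippet below declares columns of interest per table as plain configuration. The table names, column names, and structure are all assumptions made for illustration, not a finalized design.

```python
# Hypothetical per-table data quality configuration.
# Table and column names are illustrative only.
DATA_QUALITY_CONFIG = {
    "trips": {
        # Only these columns influence the completeness metric for this table.
        "columns_of_interest": ["trip_id", "start_time", "end_time", "vehicle_id"],
    },
    "vehicle_telemetry": {
        "columns_of_interest": ["vehicle_id", "timestamp", "speed"],
    },
}


def columns_of_interest(table_name: str) -> list[str]:
    """Return the columns that should drive completeness for a table.

    An empty list could be interpreted downstream as "use every column,"
    if that is the default behavior we choose.
    """
    return DATA_QUALITY_CONFIG.get(table_name, {}).get("columns_of_interest", [])
```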
Actions to Take:
Additionally, we must determine the action we will take when we find inadequate data quality:
Potential Actions:
- Observe and report data quality purely for monitoring purposes
- Monitor and set thresholds that notify interested parties of abnormalities, or send scheduled reports (a minimal threshold check is sketched after this list)
- Flag data for later investigation by adding a flag column or routing low-completeness data to another bucket to be explored later
- Check findings with subject matter experts
- Edit and fix the data on the spot where possible
- Interpolate, using strongly educated guesses to edit and resolve the issue on the spot
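For the threshold option above, the check itself could be as simple as comparing a computed completeness value against a configured floor. The sketch below assumes a hypothetical `send_alert` helper and an arbitrary threshold; it only illustrates the shape of that action, not a production alerting design.

```python
COMPLETENESS_THRESHOLD = 0.95  # illustrative floor; in practice tuned per table


def send_alert(message: str) -> None:
    """Placeholder for a real notification channel (email, Slack, pager, etc.)."""
    print(f"[data-quality alert] {message}")


def check_completeness(table_name: str, completeness: float) -> None:
    """Notify interested parties when completeness drops below the threshold."""
    if completeness < COMPLETENESS_THRESHOLD:
        send_alert(
            f"{table_name}: completeness {completeness:.1%} is below "
            f"the {COMPLETENESS_THRESHOLD:.0%} threshold"
        )
```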
For my proof of concept, I chose the first option, monitoring and reporting the data quality to a table, but all of these options are possible if we find valid reasons to execute them. My goal at this stage is to understand the aspects of the data quality process, so reporting to a dashboard makes the most sense.
Ideal Data Quality Output:
As mentioned in our initial goals, we want to provide metrics that assist a data scientist in understanding and describing our datasets. What data quality output would accomplish this and give a consumer confidence in or understanding of the data? What do they need to know to decide whether to use the data, and how will this help us improve our data?
Metrics to Report:
At what level of detail do we want to report data quality metrics? For example, when reporting completeness, we could report the null counts for each column, or we could report a single value that summarizes the entire dataset, such as the sum of all nulls found divided by the number of columns we are looking at. Ideally, we might develop a visualization with drill-down options: someone who just wants a data quality summary can see that, but they can also drill down to finer granularity if they want detail at the column level. I believe it is important to have one or two summary-level metrics that describe the general properties of a dataset for easy understanding.
Example:
Assumption: col1-col6 are in our columns of interest.
For the sample data above, we could report the completeness of each of these columns (and, in cases with even more columns, potentially report every one):
col1: 100%, col2: 100%, col3: 20%, col4: 60%, col5: 100%
Or we could give a completeness summary of the entire table:
col1_nulls + col2_nulls + col3_nulls + col4_nulls + col5_nulls + col6_nulls = total nulls
completeness for table = (total cells - total nulls) / total cells = (30 - 8) / 30 = 22/30 ≈ 0.733
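To make the two reporting styles concrete, here is a minimal pandas sketch that computes both the per-column and the whole-table completeness for a toy DataFrame. The values are invented purely for illustration and are not our actual data.

```python
import pandas as pd

# Toy data, invented solely to illustrate the two reporting styles.
df = pd.DataFrame({
    "col1": [1, 2, 3, 4, 5],
    "col2": ["a", "b", None, "d", None],
    "col3": [None, None, 3.0, None, None],
})
columns_of_interest = ["col1", "col2", "col3"]
subset = df[columns_of_interest]

# Per-column completeness: share of non-null cells in each column.
per_column = subset.notna().mean()

# Table-level completeness: non-null cells over all cells of interest.
table_level = subset.notna().to_numpy().mean()

print(per_column)  # col1 1.0, col2 0.6, col3 0.2
print(f"table completeness: {table_level:.3f}")  # (15 - 6) / 15 = 0.600
```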
Both metric philosophies have their uses and describe the data from different perspectives. Big data cases, for example, will have far more rows and/or columns, so a few nulls here and there will be relatively negligible compared to the small example above.
Where in the Pipeline would Data Quality Checks best fit?
Our first option was to integrate the data quality checks into the data processing pipeline. Great Expectations is a well-known framework with a lot of support that operates within the processing pipeline; it is flexible along several dimensions and is a good template to follow when creating our own tool. I created a small job that calculates the totals required for data quality calculations for each date partition. A user defines the columns of interest, which determine which columns matter for data quality. For a specific date, I counted the number of nulls found and the number of rows in that partition, along with relevant metadata like the date partition, dataset name, table name, and uuid. These calculations run on the data for every date partition under a table and accumulate the values for each into a single table. The uuid column means that if the data quality for any of these partitions changes from a data update or insert, we can rerun that specific partition and overwrite the new values in the table. We can then aggregate these values into completeness data quality metrics for whichever date range we want by summing the total nulls and total rows in that range and computing the percent completeness to display on a dashboard. This data quality table is queryable on its own for someone to explore, or it can be the backend data for a dashboard, reducing the amount of real-time calculation needed when computing the metrics.
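As a rough sketch of that job, the PySpark snippet below computes the totals for a single date partition and attaches a deterministic uuid so that reruns of the same partition produce the same identifier. The paths, names, and uuid scheme are assumptions made for illustration; the real job would pull its configuration from wherever we keep table metadata.

```python
import datetime
import uuid

from pyspark.sql import Row, SparkSession, functions as F

spark = SparkSession.builder.appName("completeness-totals").getOrCreate()

# Illustrative inputs; in practice these would come from job configuration.
dataset_name = "mobility"
table_name = "trips"
partition_date = datetime.date(2020, 6, 23)
columns_of_interest = ["trip_id", "start_time", "end_time"]

# Read one date partition from the data lake (path is hypothetical).
df = spark.read.parquet(
    f"s3://data-lake/{dataset_name}/{table_name}/date={partition_date}"
)

# Count rows and nulls across the columns of interest in a single pass.
null_exprs = [
    F.sum(F.col(c).isNull().cast("int")).alias(f"{c}_nulls")
    for c in columns_of_interest
]
totals = df.agg(F.count(F.lit(1)).alias("total_rows"), *null_exprs).collect()[0]

record = Row(
    # A name-based uuid keeps the identifier stable across reruns of this partition.
    uuid=str(uuid.uuid5(uuid.NAMESPACE_URL, f"{dataset_name}.{table_name}.{partition_date}")),
    dataset_name=dataset_name,
    table_name=table_name,
    date_partition=str(partition_date),
    total_rows=totals["total_rows"],
    total_nulls=sum((totals[f"{c}_nulls"] or 0) for c in columns_of_interest),
)

# Append the record to the accumulating data quality table (location is hypothetical).
spark.createDataFrame([record]).write.mode("append").parquet(
    "s3://data-lake/data_quality/completeness"
)
```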
To illustrate the reasoning behind the data quality architecture, refer to the figure below:
The example above represents completeness data quality data stored in a cloud storage service. As shown, each partition of data has its own data quality totals calculated while the data is processed, and they are written to a table with a unique identifier (uuid) referring to that date partition, along with other relevant metadata. If an update is found on an existing partition, we append the update to the table as the latest entry for that uuid. We can then track the data quality for a specific date partition over time by exploring that uuid, or we can retrieve the latest update for all partitions and ignore the historical updates that have accumulated. Since every date partition has its own identifier and values, we can sum the values across any date or date range of interest to get the aggregate value for that range as a whole. By doing it this way, we've already calculated all the simple pieces, so a querying tool is only left with simple operations like addition and division, which reduces the potential real-time delay of running more complex queries. The example above shows the data quality calculation if we wanted to determine the data quality between 06/23/20 and 06/27/20, but we could have chosen any date(s) or date ranges. The same logic holds for any metric we implement, or if we choose to calculate completeness by column instead of for the table as a whole: instead of the single "nulls" value in the table, it would show col1_nulls, col2_nulls, etc., and we could aggregate those values across all date partitions of interest. This general framework accomplishes the goals we set out with in a flexible, scalable way while reducing dashboard computation.
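Continuing the sketch, aggregating those per-partition totals into a date-range metric is then just a matter of keeping the latest entry per uuid and summing. The pandas snippet below assumes the accumulated table has the columns described above plus a hypothetical `written_at` timestamp marking when each entry was appended; it is illustrative rather than the exact query a dashboard would run.

```python
import pandas as pd


def completeness_for_range(
    dq: pd.DataFrame, start: str, end: str, n_columns_of_interest: int
) -> float:
    """Aggregate per-partition totals into completeness for a date range.

    Expects the accumulated data quality table with columns:
    uuid, date_partition, written_at, total_rows, total_nulls.
    """
    in_range = dq[dq["date_partition"].between(start, end)]

    # Keep only the latest entry per partition uuid, ignoring older history.
    latest = (
        in_range.sort_values("written_at")
        .groupby("uuid", as_index=False)
        .last()
    )

    total_cells = latest["total_rows"].sum() * n_columns_of_interest
    total_nulls = latest["total_nulls"].sum()
    return (total_cells - total_nulls) / total_cells


# Example usage against a DataFrame loaded from the data quality table:
# score = completeness_for_range(dq, "2020-06-23", "2020-06-27", n_columns_of_interest=3)
```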
In a table, the data stored above would look something like the following:
or like this if we choose the completeness-by-column approach…
Using the values accumulated in these tables, we could potentially generate data quality scores that summarize the data quality in a partition or across date ranges. The exact score algorithm can be unique to the dataset, table, or both. We can easily use the tables above to do simple aggregations. In the case of the summary metrics in the first table, we can compare data quality across tables and datasets. Additionally, even if we add columns, we can easily integrate them into this process with minimal effort by adding them to our columns of interest.
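As one hypothetical example of such a score, a table owner might weight the completeness of a few critical columns more heavily than the rest. The column names and weights below are arbitrary and only show that the per-column totals are enough to support this kind of custom scoring.

```python
# Hypothetical weighted completeness score built from per-column completeness.
# Column names and weights are arbitrary; each table could define its own.
COLUMN_WEIGHTS = {"trip_id": 0.5, "start_time": 0.3, "end_time": 0.2}


def weighted_completeness(per_column_completeness: dict[str, float]) -> float:
    """Combine per-column completeness values (0-1) into one weighted score."""
    return sum(
        weight * per_column_completeness[col]
        for col, weight in COLUMN_WEIGHTS.items()
    )


# Example: weighted_completeness({"trip_id": 1.0, "start_time": 0.9, "end_time": 0.6}) -> 0.89
```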
Our second deployment option is to query our database using Presto and execute data quality checks outside the actual data processing. Many current data quality products use Presto to create data quality queries for whatever metrics they want. Presto is a querying technology that is widely used in many big tech companies' data stacks, so many products are designed to be implemented alongside it. There is a lot of logic behind doing it this way: Presto is the backend of many dashboarding technologies and enables customizable queries in a distributed, quick manner. You can then output the query results to dashboards or write them to storage at some location. For this post, I decided to leave this deployment option alone and focus on a proof of concept of the first deployment option, given the presence of mainstream products and my solid overall understanding of how it would work.
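To contrast with the in-pipeline job, the sketch below shows roughly what a completeness check through Presto could look like: a single query computed directly against the live table. The table, partition column, and column names are placeholders, and the snippet only builds the SQL; it would be executed through whatever Presto client or dashboard connector the team already uses.

```python
# Placeholder identifiers; the real names would come from table metadata.
table = "hive.mobility.trips"
columns_of_interest = ["trip_id", "start_time", "end_time"]

# Build a single Presto query that returns per-column and overall completeness.
per_column_exprs = ",\n  ".join(
    f"1.0 - CAST(COUNT_IF({col} IS NULL) AS DOUBLE) / COUNT(*) AS {col}_completeness"
    for col in columns_of_interest
)
total_null_expr = " + ".join(f"COUNT_IF({col} IS NULL)" for col in columns_of_interest)

query = f"""
SELECT
  {per_column_exprs},
  1.0 - CAST({total_null_expr} AS DOUBLE) / (COUNT(*) * {len(columns_of_interest)})
    AS table_completeness
FROM {table}
WHERE date_partition BETWEEN DATE '2020-06-23' AND DATE '2020-06-27'
"""

# The query string could then be run through any Presto client and the results
# pushed to a dashboard or logged to a metadata store.
print(query)
```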
General Deployment Options Summary:
Deploy a job that will create a query-able data quality dataset that would be accumulated over time:
- The job executes on every day’s data and records totals on that day to be accumulated later in dashboarding
- Use primary keys to overwrite existing data for a particular day if any day's data is re-run
- Containerized job that can be deployed in a batch or streaming context
- Can be deployed when a date partition is updated or on a set schedule
- Benefits of having tangible ready-to-go Data Quality datasets?
- Great Expectations is an example of a data quality tool that runs on the actual data itself
Query Data Quality calculations on existing datasets through Presto and dashboarding tools?
- No explicitly defined data quality dataset, but pretty lightweight
- No additional storage is required for the data quality dataset
- We can run these queries as often as we want and record the output in order to log values over time
- Running in real time retrieves values from the current version of the dataset, so keeping track of updates isn't required
- Write output to a dashboard or some metadata collection location
- Presto has connections to many existing Data Quality products that could be implemented
Closing Thoughts:
As stated at the beginning, this exploration and problem definition was not intended to reach any definite conclusions immediately. We hope that continuing to explore data quality ideas and architecture will help us determine which philosophies, definitions, expectations, and technologies we want to build into our data platform. I went through the process of creating a small data quality job that executes data quality calculations on data in our data lake, and I tried to think through the job structure that would provide the maximum flexibility while accomplishing all or most of our goals. I believe I was able to create an adequate data quality dataset that would give us insight. I think the framework of accumulating small units of values that can later be easily consolidated into meaningful insight is valuable, and I'd like to use it in other contexts, particularly when large amounts of data are fed to a model and the developer wants to weigh each of those values differently, or in similar cases. From working with dashboards and tools that require complex or large queries, I've seen dashboards take a decent amount of time to run, and I believe reducing the work that has to happen during user interaction is important. Every data use case and application is a little different, and the context will matter in determining the best course of action.
Ultimately, I can't entirely justify dedicating a portion of the pipeline to creating an additional dataset for data quality in this manner. We would be creating a new data quality dataset for every data table we have, with minimal additional gain. I think it would make more sense to have a tool that queries the dataset using Presto and executes the data quality checks that way. If we want to create an in-house tool, I believe it should run on the data using Presto rather than reading the actual data itself. Logistically, the developers would need to figure out how they would like to log the data quality results and have some way of looking back historically through the data quality trends in their data.
In terms of the metrics we want to provide, one thing our team will look into is discussing with different groups of data scientists what they are looking for from data quality and what types of metrics they feel would accurately represent those needs.
Through this small experiment, our team gained some insight into the various aspects of data quality we were exploring. Hopefully, we can continue to build on these insights and determine the exact data quality architecture we want to integrate into our platform. We appreciate you reading through this part of our journey, and we would love to hear any feedback. And as always, be sure to follow 99P Labs to stay up to date on all our research.