Timeless Obstacle for Data Products: Data Quality

Seckin Dinc · Data And Beyond · Mar 21, 2023

Data products are the future! Today we are surrounded by AI- and ML-generated data products that support our lives, and we have reached a stage where we can't live without them. Because we depend on them so heavily, the quality of these services becomes vital; autonomous cars are an obvious example.

Whether we serve data as a product or use data to build data products, the main ingredient of a data product is always the data itself. In this sense, the quality of the data defines the quality of the product we serve.

In most scenarios, we don't think about data quality measures before building data products. If we are lucky, we detect the known-unknown data quality issues before it is too late. But if we are unlucky or unprepared, we may not get the chance to react at all. History holds plenty of examples of catastrophic data quality mistakes.

What is data quality?

Data quality refers to the development and implementation of activities that apply quality management techniques to data, to ensure the data is fit to serve the specific needs of an organization in a particular context. Data that is fit for its intended purpose is considered high-quality data.

I love analogies for explaining complex topics, so let's look at data quality, and how to detect issues, through a real-life example.

Detecting data quality issues is very similar to how health systems work. First, you notice symptoms that are outliers against your baseline, e.g. waking up tired for the last two weeks. With this knowledge, you go to your family doctor, who asks questions to generate more data points for the relevant metrics and narrow down which part of your body isn't working properly, e.g. morning headaches, constant thirst. The doctor then tries to understand your baselines, your habits, and the recent changes in your life to identify the root cause of the symptoms, e.g. you bought a PlayStation 5 three weeks ago, which led to late-night eating and drinking and sleepless nights. With this knowledge, the doctor diagnoses that you are damaging your digestive system, which affects your kidneys and liver and causes the initial symptoms. To get back to a healthy, quality life, you need to fix the balance between your PlayStation 5 and the rest of your life.

Data quality metrics

Every successful business, product, and organization has core metrics for evaluating its ongoing processes. Without metrics and continuous monitoring, we are forced to walk in the dark with no sense of the direction we are heading.

In this regard, we should monitor critical data quality metrics in our businesses. Below are the most common data quality metrics; a minimal code sketch showing how some of them might be computed follows the list.

Accuracy refers to the correctness of the data. For example, the monthly sales numbers in Salesforce and in the data warehouse don't match.

Completeness refers to whether all the required data elements are present. For example, the sales data you collect from each country is missing for specific regions during the weekends.

Consistency refers to whether the data is internally coherent and agrees across different sources. For example, a customer's email opt-in preference differs between databases, causing inconsistent service.

Timeliness refers to whether the data is up to date. For example, your warehouse system requires near-real-time storage information, but the data arrives with a one-day delay.

Uniqueness refers to whether each entity is represented only once. For example, your data warehouse holds multiple conflicting records for the same customer.

Validity refers to whether the data conforms to the format and rules of the concept it is supposed to represent. For example, customer email addresses are missing the "@" character.
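To make these metrics concrete, here is a minimal sketch of how a few of them could be computed with pandas. The dataset and column names (customer_id, email, order_total, updated_at) are assumptions invented for the example, and accuracy and consistency are left out because they need a second source to compare against.

```python
import pandas as pd

# Hypothetical customer-orders extract; the schema is an assumption
# made up for this example.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@example.com", "b@example.com", "b@example.com", "no-at-sign"],
    "order_total": [100.0, None, 250.0, 80.0],
    "updated_at": pd.to_datetime(
        ["2023-03-20", "2023-03-20", "2023-03-20", "2023-03-01"]
    ),
})

# Completeness: share of non-null values per column.
completeness = df.notna().mean()

# Uniqueness: share of rows that are not duplicates of an earlier
# record for the same customer.
uniqueness = 1 - df.duplicated(subset="customer_id").mean()

# Validity: share of emails containing an "@" character.
validity = df["email"].str.contains("@").mean()

# Timeliness: share of rows updated within the last day
# (relative to a fixed "now" so the example is reproducible).
now = pd.Timestamp("2023-03-21")
timeliness = ((now - df["updated_at"]) <= pd.Timedelta(days=1)).mean()

print(completeness, uniqueness, validity, timeliness, sep="\n")
```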

Known-unknown and unknown-unknown data quality issues

Even though we know which metrics to monitor, we can't easily tell which issues caused a metric to change. It is not a straightforward question to answer. Broadly, we can group data quality issues into two categories:

Known-unknown data quality issues refer to problems that are either known or predictable to a certain degree. During the development stage, we write various types of tests to detect them. Afterward, we create rule-based systems to generate alarms. The best strategy for this problem space is to identify data quality metrics and continuously collect and monitor them.
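As a sketch of what such a rule-based system might look like, the snippet below compares collected metric values against agreed thresholds and emits an alarm for every breach. The metric names and threshold values are assumptions for illustration only.

```python
# Illustrative alert thresholds agreed with data producers;
# both the metric names and the values are assumptions.
RULES = {
    "completeness": 0.99,  # at least 99% of values present
    "uniqueness": 1.00,    # no duplicate customer records
    "validity": 0.999,     # at most 0.1% malformed emails
}

def check_rules(metrics: dict) -> list:
    """Return an alert message for every metric below its threshold."""
    return [
        f"ALERT: {name}={value:.3f} is below threshold {RULES[name]}"
        for name, value in metrics.items()
        if name in RULES and value < RULES[name]
    ]

# These alerts could then be routed to Slack, PagerDuty, etc.
for alert in check_rules({"completeness": 0.97, "uniqueness": 1.0, "validity": 0.95}):
    print(alert)
```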

Unknown-unknown data quality issues are the most dangerous ones, since we have no information about them and they are not predictable at all. When dealing with unpredictable issues, the best tool in our hands is our baselines: if our data points start to drift away from them, it is worth diving in deeper. The best strategy for this problem space is to put data observability solutions in place.
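Here is one minimal way such a baseline check could look, assuming we track a simple daily row count for an ingested table and flag anything that drifts several standard deviations away from a rolling baseline:

```python
import pandas as pd

def drift_alerts(series: pd.Series, window: int = 30, z_threshold: float = 3.0) -> pd.Series:
    """Flag points that sit more than z_threshold standard deviations away
    from a rolling baseline built on the previous `window` observations."""
    baseline_mean = series.rolling(window).mean().shift(1)
    baseline_std = series.rolling(window).std().shift(1)
    z_scores = (series - baseline_mean) / baseline_std
    return z_scores.abs() > z_threshold

# Hypothetical daily row counts for an ingested table: steady growth,
# then a sudden drop on the last day.
daily_rows = pd.Series([10_000 + day * 10 for day in range(60)] + [2_000])
print(drift_alerts(daily_rows).iloc[-1])  # True: the drop is flagged
```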

Data quality strategy: prevention vs detection

Implementing company-wide data quality requires a mindset shift from data producers all the way up to the C-level. It is not a simple task that you can drop into a couple of teams' sprints and be done with for eternity. Data quality measures should be defined and aligned with all the data producers, and then all the data consumers should be trained on how to work with these measures.

Even when the whole organization is ready to apply these measures, a big question remains: how are you going to prioritize the implementation? Are you going to start with prevention or detection?

Data quality prevention refers to the process of proactively designing and implementing measures to prevent errors, inconsistencies, and other issues from occurring in the first place, rather than just detecting and correcting them after they occur. It involves implementing best practices and procedures to ensure that data is accurate, complete, consistent, and reliable from the outset.
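As one possible illustration of prevention, the sketch below validates every record at the point of entry and rejects it before it can land in the warehouse. The field names and rules are assumptions made up for the example.

```python
import re

EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def validate_order(record: dict) -> None:
    """Reject a record before it lands in the warehouse.
    The field names and rules are illustrative assumptions."""
    if record.get("customer_id") is None:
        raise ValueError("customer_id is required")
    if not EMAIL_PATTERN.fullmatch(record.get("email", "")):
        raise ValueError(f"invalid email: {record.get('email')!r}")
    if record.get("order_total", -1.0) < 0:
        raise ValueError("order_total is required and must be non-negative")

# A valid record passes silently; a bad one never reaches storage.
validate_order({"customer_id": 1, "email": "a@example.com", "order_total": 10.0})
```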

Data quality detection refers to the process of assessing the accuracy, completeness, consistency, and reliability of data in a dataset. It involves identifying errors, inconsistencies, and missing values in data and taking appropriate measures to rectify them. The goal of data quality detection is to ensure that data is fit for use and can be trusted for decision-making purposes. Data quality detection typically involves several steps, including data profiling, data cleansing, and data validation.
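A toy version of those three steps might look like this in pandas; the data and the expectations are, again, assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@example.com", None, "no-at-sign", "a@example.com"],
    "amount": [10.0, 20.0, 30.0, 10.0],
})

# 1. Profiling: understand what the data currently looks like.
print(df.describe(include="all"))
print(df.isna().sum())

# 2. Cleansing: drop exact duplicates and rows with missing emails.
cleaned = df.drop_duplicates().dropna(subset=["email"])

# 3. Validation: flag rows that still violate expectations,
#    e.g. to quarantine them or send them back to the producer.
invalid = cleaned[~cleaned["email"].str.contains("@")]
print(invalid)
```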

Conclusion

With this article, I set up the foundations of the data quality topic. In the upcoming articles, I will walk you through the great products, tools, and libraries in the data domain that address data profiling, data quality, data observability, and more.

Thanks a lot for reading 🙏

If you liked the article, check out my other articles.

If you want to get in touch, you can find me on LinkedIn and Mentoring Club!
