DQ Dimension Measurements are Meaningless

Sandesh Gawande
iceDQ
Published in
5 min readMar 27, 2023

Data Quality Dimensions are usually measured in Accuracy, Completeness, Consistency, Freshness, Validity, and Uniqueness. While these dimensions sound theoretically appealing; in the real world, measurements of such DQ dimensions are of little or no use. Is the data quality dimension report as shown below of any use? Let’s try to answer this question in this article. But keep in mind we are not talking about DQ concepts but measuring and reporting of DQ dimensions.

DQ Dimension Measurement Basic Issues:

You have a production CRM system. It has a CUSTOMER table with an EMAIL column. Of course, email is an important column. To measure the completeness of DQ dimensions for customer email the following query will be used.

Profiling Queries — Completeness dimensions measurement

Based on the above data, the Completeness Dimension measurement for CUSTOMER EMAIL_ID is COMPLETENESS =
TOTAL_COUNT / INCOMPLETE_COUNT = 88.88%

Now the million-dollar question: Is the fact above of any use?

Theoretically, yes. Now you know that 11.12% of the email IDs are empty. You can ask someone to find the reasons and fix them. Since CUSTOMER is master data, you can put this information to some use. As your team fixes the data you will expect the rate of completeness to increase over time. This will make you happy.

1st problem: The metric gave you a summary number but does not tell you the exact customer records that have issues. There is nothing that will help you take corrective action.

2nd problem: The EMAIL_IDs are null but maybe not all customers in the list are important. Maybe there is a source that does not and will not provide records with an email ID. So, DQ dimensions don’t take business context into consideration.

The CUSTOMER table will have additional columns such as FIRST_NAME, ZIP_CODE, STREET, FIRST_CONTACT_DATE, etc.

As we go one step up and aggregate the completeness measurement for each column in the CUSTOMER table and check the values, then you get a completeness value of 33%. What does it really mean? Should you worry about it? YES/NO? You will most likely say “yes” because the aggregate value is much less than the earlier value.

You can even aggregate the metrics across multiple tables and columns as nothing stops you from doing so.

3rd problem: Not all attributes in the CUSTOMER table are important from the business point of view. Aggregation of the metrics creates a confusing metric that looks good but is not actionable.

4th problem: Also, there is a meta question as to what the percentage % numbers really mean by themselves. Companies try to solve this issue by artificially creating a limit. Anything above 90% is good? What does that really mean? How did someone come up with that answer? I challenge anyone to come up with a logical reason for the number.

Let’s take some real-life use cases and see if these measurements are of any relevance to you.

Real-life Scenarios and Data Quality Dimensions:

Salary Calculation Example: Consider the case of a monthly salary process that takes inputs such as leaves, bonuses, and fixed salaries. If a data quality professional claims that the bonus data’s accuracy dimension has 99% then what does it really mean?

Should I stop the salary payments? Should I stop the salary of one or a few affected individuals? For the month is a bonus even in consideration?

You decide if you can really apply the DQ dimension metrics in such scenarios!

Healthcare Example: In the healthcare industry, data quality is critical for accurate reporting of diseases. For instance, incorrect patient data will result in the wrong treatment and will put the patient’s health at risk.

A company has 2 million patients in their network. They do at least one annual checkup with their physician and a lab. This checkup averages 5 tests. Each health condition is provided, and the ICD-10 code.

There is a minimum of 10,000,000 annual records (2,000,000 patients x 5 tests). Now Imagine my DQ expert measures DQ dimension and reports 99.99998% accuracy of data.

Should I be happy because there are only 2 bad patient records? Well, tell the CEO that when the patent sues the company for incorrect medical treatment.

Therefore, making a statement like “our data is 99% accurate” without any context is irrelevant, as the context is critical to understanding the impact of data quality on business processes.

Conclusion

One of iceDQ’s clients mentioned to us that DQ measurements are of no use to them, as in a healthcare world percentage points or thresholds of quality have zero meaning. It’s either a Go or No-go.

Therefore, at iceDQ, we believe it is important to consider the context while reporting data quality issues. It is not about the DQ dimension score but providing a Go, No-Go decision at the level of a business process or an entity. We believe one must connect business processes and entities to the data elements that affect the business outcome.

  1. Location: This dimension refers to the specific location of the data issue within the organization’s data. It can be at the entity level (such as a customer record or a product), record level (such as a specific instance of a customer or product), or data element level (such as a specific field within a record).
  2. Impacted business processes: This dimension refers to the business processes that are affected by the data issue. For example, a data issue in customer records may impact the sales process, customer service process, or billing process.
  3. Impacted Entity: In case customer data has issues related to client reporting. Now, does this impact all the customers or just a few of them? Otherwise, the whole process can come to a standstill. Report the actual entity that is affected and not just say stop everything.
  4. The magnitude of the problem: This dimension refers to the extent of the data issue. It can be measured in terms of the number of records or data elements impacted, the frequency of occurrence, or the financial impact on the organization.
  5. Criticality: This dimension refers to the importance of the impacted business process to the overall goals of the organization. For example, a data issue that impacts the organization’s ability to invoice customers accurately may be considered critical, while a data issue that impacts a less important process may be considered less critical.

By considering these data issue dimensions, organizations can prioritize their data quality efforts and focus on the issues that have the greatest impact on their business processes and overall goals.

--

--