Four Shades of Data Quality

Uzi Blum
Team Taxfix
6 min read · Mar 28, 2022

VP of Data Uzi Blum explains why data users might not trust their data, describes the four layers of data quality, and shares best-practice actions to bring each layer to a higher level of quality.

We are facing a common problem

If you’ve been working with users who consume data, you’ve probably heard statements such as, “There’s a problem with the quality of the data”, “I don’t trust the numbers I see”, or “The numbers are not reliable”. As data professionals, we work very hard to make data available and to build sophisticated data analyses. However, if there is no trust in the data and the end results, our users will eventually stop using them, and we will have wasted a lot of effort for nothing.

Reason for gaps in data trust

I will be using the term ‘data user’ throughout this article, so let me first define it. A data user is anyone who interacts with data. Most are business users who consume reports or insights to make decisions; some are more autonomous, creating their own analyses via drag-and-drop interfaces or even writing SQL. A data user can also be a data professional, such as a data engineer or data analyst, who processes data, builds data assets, or creates dashboards and insights to be consumed by business users.

There are several common complaints we hear from data users:

  • “It looks like the data is not complete, but I do not have a way to know if this is a technical problem or a business problem.”
  • “The numbers we see in one report are not consistent with another report.”
  • “Yesterday’s numbers are missing.”
  • “The definition of the metric or calculation is wrong” or “we have a different definition for that attribute” (for example, ‘Users’, ‘Customers’, ‘Visitors’, ‘Paying Customers’, ‘Net Revenue’).

Not all of these problems originate from the same source, and fixing only some of them will not increase trust for long. We are facing an “AND” problem: the data must be trustworthy in every layer at once, and problematic quality in even a single layer is enough to label all of our efforts as “Unreliable Data”.

As data professionals and leaders, it is time we understand the different layers of data quality and start handling them in a holistic way.

The four layers of data quality

There are likely many approaches for handling data quality holistically; here’s my four-layer method:

1 — Row layer

The row layer represents data from a single-record point of view and focuses on specific attributes of each record and their validity. Data quality issues in this layer are often caused by how the data arrives from the source system and by the stability of the data integration.

Actions: In this layer, our main activities focus on monitoring the data, correcting values, and filtering problematic records. Here are a few common actions:

  1. Validate that a specific column meets a certain set of expected values. Notice any null or empty fields and consider how to treat those cases (ignore, remove, or replace with default values; see below).
  2. Validate that a column’s value meets a certain length. For fields containing manual input, use a lookup table of matching names to correct typos (e.g. Goggle → Google).
  3. If some values do not meet the previously defined rules, one can replace them with default values, trim them, adjust them to the best-fitting value, complete them based on rules derived from other attributes, or filter out the record.
  4. Send monitoring alerts, ideally to the data source owner, so the issue can be fixed at the source.
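The actions above can be sketched in a few lines of pandas. This is only an illustration; the column names, the expected-value set, and the typo lookup are all invented for the example:

```python
import pandas as pd

# Hypothetical raw records; column names are illustrative only.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "channel": ["Google", "Goggle", None, "Facebook"],
    "country": ["DE", "DE", "FR", "XX"],
})

EXPECTED_COUNTRIES = {"DE", "FR", "ES", "IT"}  # assumed valid set
TYPO_LOOKUP = {"Goggle": "Google"}             # manual-input corrections

# Correct known typos via a lookup table (action 2).
df["channel"] = df["channel"].replace(TYPO_LOOKUP)

# Replace missing values with a default value (action 3).
df["channel"] = df["channel"].fillna("unknown")

# Filter out records that fail the expected-value rule (action 1),
# keeping the rejects so the source owner can be alerted (action 4).
rejected = df[~df["country"].isin(EXPECTED_COUNTRIES)]
df = df[df["country"].isin(EXPECTED_COUNTRIES)]
```

In practice, the rejected records would be routed to an alerting channel rather than silently dropped.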

A holistic approach to row data validation is composed of both: a) monitoring of real data and b) unit tests on mock data. Unit tests cover the different scenarios and allow us to simulate cases that do not yet exist in our production data; the challenge is that the data needs to be prepared and mocked. Real-data testing confronts reality and will alert or take action on data scenarios that are not always under our control. As unit tests are more complex to compose and create, I recommend using them where truly needed, in more complex logic and transformations.
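A unit test on mock data might look like the following minimal sketch. The business rule itself (net revenue equals gross minus refunds, floored at zero) is invented here purely to show the pattern of mocking scenarios that production data does not yet contain:

```python
# Minimal sketch of a unit test on mocked data; the business rule
# and function name are hypothetical.

def net_revenue(gross: float, refunds: float) -> float:
    """Assumed rule: net revenue is gross minus refunds, never negative."""
    return max(gross - refunds, 0.0)

def test_net_revenue():
    # Mocked scenarios, including an edge case that may not
    # yet exist in production data.
    assert net_revenue(100.0, 30.0) == 70.0
    assert net_revenue(50.0, 80.0) == 0.0  # over-refund edge case

test_net_revenue()
```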

2 — Processed layer

We typically refer to the processed layer as a view of several records and their behaviour in aggregated form. Data quality issues here can arise from flaws in the data processing logic, from system stability, or from business logic changes that were not incorporated into the process, such as how the data is aggregated or changes in its granularity.

Actions: In this layer, one can monitor the data, filter unexpected results, and fix the data on the fly. Here are some common issues data teams can monitor and/or fix:

  1. Duplication of records or granularity key
  2. Source and target do not have the same aggregated values
  3. Different source systems do not yield the same results
  4. Referential integrity issue, where a primary key is missing while a related foreign key exists
  5. Data is not available on time or the full data set is not complete (in case of batch processing)
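The first two checks in the list can be sketched as a reusable function. The tables, key, and value columns below are hypothetical, and the target deliberately contains a duplicated row to show both checks firing:

```python
import pandas as pd

def check_processed(source: pd.DataFrame, target: pd.DataFrame,
                    key: str, value: str) -> list[str]:
    """Run two standard checks: duplicate rows on the granularity key
    in the target, and source vs. target aggregated totals."""
    issues = []
    dups = int(target[key].duplicated().sum())
    if dups:
        issues.append(f"{dups} duplicate row(s) on key '{key}' in target")
    if source[value].sum() != target[value].sum():
        issues.append("source and target aggregated totals differ")
    return issues

# Hypothetical tables: the target accidentally loaded order 3 twice.
source = pd.DataFrame({"order_id": [1, 2, 3],
                       "amount": [10.0, 20.0, 30.0]})
target = pd.DataFrame({"order_id": [1, 2, 3, 3],
                       "amount": [10.0, 20.0, 30.0, 30.0]})

issues = check_processed(source, target, key="order_id", value="amount")
```

A standard set of such checks can then be applied to every new model before it goes live.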

It’s advisable to include a standard set of quality tests for every new model and data process before it goes live. Some advanced organisations have tools that automatically cover a set of standard validation rules.

When building a solution to monitor or fix data quality issues, try to take action before they reach the end users. If this is not possible, making sure business users are aware of any issues with the data is essential to avoid wrong decisions or confusion. Failing to alert business users proactively about data reliability or availability is one of the causes of losing trust in the data team. Common practices include providing daily feedback that the data is ready to be used, or an indication of when the data was last updated.
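A last-updated indication can be as simple as the following sketch, where the staleness threshold is an assumption to be tuned per dashboard:

```python
from datetime import datetime, timedelta, timezone

def freshness_banner(last_updated: datetime, max_age: timedelta) -> str:
    """Return a message for a dashboard header telling users whether
    the data is fresh enough to rely on; the threshold is illustrative."""
    age = datetime.now(timezone.utc) - last_updated
    stamp = last_updated.strftime("%Y-%m-%d %H:%M UTC")
    if age > max_age:
        return f"WARNING: data last updated {stamp} and may be stale"
    return f"Data last updated {stamp}"
```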

3 — Metric layer

The metric layer focuses on the company’s metrics, definitions, understanding, and volatility control.

Actions: In this layer, the main activities are:

  1. Establish a well-defined glossary for all the company’s metrics — For many years, I used to call our glossary a data dictionary until our Data Product Manager corrected me. I will dedicate another post to discussing why and how to build a glossary.
  2. Set business alerts to monitor business logic — For example, the range of total activities or revenue in a day needs to be below or above certain thresholds. Another example is that a specific metric trend cannot increase more than x% in a certain period of time.
  3. Education — educate data users on how the data is processed (so they can identify potential issues) and on the glossary: that it exists and how to use it.
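The business alerts in point 2 can be sketched as a small check over daily figures. The metric, thresholds, and maximum growth rate below are all invented for illustration:

```python
def metric_alerts(daily_revenue: list[float], lower: float, upper: float,
                  max_growth_pct: float) -> list[str]:
    """Compare today's figure against an expected range and against
    yesterday's value; all thresholds here are illustrative."""
    alerts = []
    today, yesterday = daily_revenue[-1], daily_revenue[-2]
    if not lower <= today <= upper:
        alerts.append(f"revenue {today} outside expected range "
                      f"[{lower}, {upper}]")
    if yesterday > 0 and (today - yesterday) / yesterday * 100 > max_growth_pct:
        alerts.append(f"revenue grew more than {max_growth_pct}% day over day")
    return alerts
```

A sudden jump from 100 to 500 would trigger both the range check and the trend check, while a move from 100 to 110 would trigger neither.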

4 — Context layer

The Context layer focuses on how the data user interacts with the data. In this case, imagine a data user reviewing a report showing the monthly active users (MAUs) for a specific country or only for new customers who joined this year. The graph shows MAUs, but without carefully reading the title or the filters — assuming these are visible — one can easily get confused as to why this report shows different MAUs compared to the country-specific report.

Actions: In this layer, it’s all about education and standards. For instance:

  1. Set a standard for creating dashboards and reports that provide clear context. For example, a dashboard should contain certain titles and info icons explaining the segment and main filter. Another example is to make sure the filters are visible and apply to all relevant charts.
  2. Education — educate the data users on “how to read the data/report”. For example, by making sure they know the context and hidden filters.
  3. Data source visibility — make sure data users have visibility into the quality of the data source. Is it based on the verified, centralised DWH, or on a “quick and dirty” raw analysis that did not pass the previous layers’ quality checks?

Final thoughts

In summary, data quality and data trust are exposed to challenges at different levels. Depending on your role and responsibilities, whether you are a Data Engineer, Analytics Engineer, Data Analyst, or Data Product Manager, your scope and focus of data quality might be limited to a single layer. However, even if you put a lot of effort into one layer and make it very solid, I suggest you expand your scope of interest and encourage the owners of the other layers to do the same. We all need to be aware of the many shades of data quality to produce actionable insights.

Learn more about our open Data & Insights roles on our career site.

Enthusiast about innovation and great ideas. Data-driven everything.