Seven Dimensions of Data Quality

Jean-Georges Perrin
Published in AbeaData
4 min read · May 6, 2024

In this reference article, I want to revisit data quality and list its seven dimensions, with examples to make them palatable and a few fun facts.

I hope this article will become a collective work with more examples and anecdotes, so use the comment section.

The seven dimensions of Data Quality is also an #AntiMemeMonday card.

Back in 2017, I gave my first talk about data quality at Spark Summit in San Francisco, CA. I introduced Cactar (Consistency, Accuracy, Completeness, Timeliness, Accessibility, and Reliability) as an acronym for six data quality dimensions relayed in this Medium article. Although there is no official standard, the EDM Council added a 7th one. Check out the cheat sheet card on LinkedIn.

Here are the seven data quality dimensions.

Accuracy (Ac)

Accuracy measures how faithfully data reflects its authoritative source. An accuracy problem means the data is provided but incorrect. Accuracy can be assessed by comparing the data to original documents and trusted sources, or by confirming it against business rules.

Examples:

  • Fractional quantities are rounded, losing some precision.
  • A customer is 24 years old, but the system identifies them as 42 years old.
  • A supplier address is valid, but it is not their address.

Fun fact: a lot of accuracy problems come from the data input. If you have data entry people on your team, reward them for accuracy, not only speed!
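To make accuracy concrete, here is a minimal Python sketch that compares stored values against a trusted source. It assumes you actually have such a reference to compare against; the function name `find_inaccurate` and the sample records are made up for illustration.

```python
from decimal import Decimal


def find_inaccurate(records, reference, fields):
    """Return (key, field, stored, expected) for each value that differs
    from the trusted source."""
    mismatches = []
    for key, record in records.items():
        trusted = reference.get(key)
        if trusted is None:
            continue  # no authoritative record to compare against
        for field in fields:
            if record.get(field) != trusted.get(field):
                mismatches.append(
                    (key, field, record.get(field), trusted.get(field))
                )
    return mismatches


# Hypothetical data: the system says 42 where the source document says 24,
# and a fractional quantity was rounded along the way.
system = {"C001": {"age": 42, "quantity": Decimal("2")}}
source = {"C001": {"age": 24, "quantity": Decimal("2.5")}}

for mismatch in find_inaccurate(system, source, ["age", "quantity"]):
    print(mismatch)
```

Note that this only catches errors when a trusted source exists. A valid address that belongs to someone else passes every format check and still needs this kind of comparison.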

Completeness (Cp)

Data is required to be populated with a value (aka not null, not nullable, required). Completeness checks if all necessary data attributes are present in the dataset.

Examples:

  • A missing invoice number when it is required by business rules or law.
  • A record with missing attributes.
  • A missing expiration month in a credit card record.

Fun fact: a primary key is always a required field.
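A completeness check can be sketched in a few lines of Python. This is only an illustration: the field names and the rule that a blank string counts as missing are my assumptions, not a standard.

```python
def missing_fields(record, required):
    """Return the required fields that are absent, None, or blank."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


# Hypothetical list of required attributes for an invoice record.
REQUIRED = ["invoice_number", "card_number", "expiration_month"]

invoice = {"invoice_number": "", "card_number": "4111111111111111"}
print(missing_fields(invoice, REQUIRED))
# ['invoice_number', 'expiration_month']
```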

Conformity (Cf)

Data content must align with required standards, syntax (format, type, range), or permissible domain values. Conformity assesses how closely data adheres to standards, whether internal, external, or industry-wide.

Examples:

  • The customer identifier must be five characters long.
  • The customer address type must be in the list of governed address types.
  • Supplier address is filled with text but not an identifying address (invalid state/province, postal codes, country, etc.).
  • Invalid ISO country codes.

Fun fact: ISO 3166-1 country codes are 2 or 3 letters long (like FR and FRA for France). If you mix up the two in the same dataset, it’s not a conformity problem; it’s a consistency problem.
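Conformity rules like the ones above translate naturally into small checks. Here is a hedged Python sketch: the list of governed address types is invented, and the country check only validates the shape of a code (two or three uppercase letters), not whether ISO has actually assigned it.

```python
import re

ALPHA2 = re.compile(r"[A-Z]{2}")
ALPHA3 = re.compile(r"[A-Z]{3}")

# Hypothetical governed list; yours would come from your data governance team.
GOVERNED_ADDRESS_TYPES = {"billing", "shipping", "legal"}


def conformity_issues(record):
    """Return a list of human-readable conformity problems for one record."""
    issues = []
    if len(record.get("customer_id", "")) != 5:
        issues.append("customer_id must be five characters long")
    if record.get("address_type") not in GOVERNED_ADDRESS_TYPES:
        issues.append("address_type is not in the governed list")
    country = record.get("country", "")
    if not (ALPHA2.fullmatch(country) or ALPHA3.fullmatch(country)):
        issues.append("country is not shaped like an ISO country code")
    return issues


good = {"customer_id": "AB123", "address_type": "billing", "country": "FR"}
bad = {"customer_id": "AB1", "address_type": "home", "country": "France"}

print(conformity_issues(good))  # []
print(len(conformity_issues(bad)))  # 3
```

Both FR and FRA pass this check, which is the point: each conforms on its own, and mixing them is a matter for the next dimension.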

Consistency (Cs)

Data should retain consistent content across data stores. Consistency ensures that data values, formats, and definitions in one group match those in another group.

Examples:

  • Numeric formats converted to characters in a dump.
  • Within the same feed, some records have invalid data formats.
  • Revenues are calculated differently in different data stores.
  • Strings are shortened from a max length of 255 to 32 when they go from the website to the warehouse system.

Fun fact: I was born in France on 05/10/1971, but I am a Libra (October). When expressed as strings, date formats are transformed through a localization filter. So, being born on October 5th makes my date representation 05/10/1971 in Europe, but 10/05/1971 in the U.S.
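The date ambiguity is easy to reproduce. This short Python sketch parses the same string with a day-first and a month-first format, using the standard `datetime.strptime`; the variable names are mine.

```python
from datetime import datetime

raw = "05/10/1971"

# Same string, two locale conventions.
european = datetime.strptime(raw, "%d/%m/%Y").date()  # day first
american = datetime.strptime(raw, "%m/%d/%Y").date()  # month first

print(european.isoformat())  # 1971-10-05
print(american.isoformat())  # 1971-05-10
print(european == american)  # False
```

One common way to sidestep the problem is to exchange dates between systems in an unambiguous format such as ISO 8601 (1971-10-05) and localize only for display.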

Coverage (Cv)

All records are contained in a data store or data source. Coverage relates to the extent of data that exists at the source but is absent from a dataset.

Examples:

  • Every customer must be stored in the Customer database.
  • The replicated database has missing rows or columns from the source.
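A coverage check between a source and its replica boils down to set differences. Here is a minimal Python sketch, assuming both sides fit in memory as lists of dictionaries and share a key column; the function name `coverage_gaps` and the sample rows are hypothetical.

```python
def coverage_gaps(source_rows, replica_rows, key="id"):
    """Report rows and columns present in the source but absent from the replica."""
    source_keys = {row[key] for row in source_rows}
    replica_keys = {row[key] for row in replica_rows}
    source_columns = set().union(*(row.keys() for row in source_rows))
    replica_columns = set().union(*(row.keys() for row in replica_rows))
    return {
        "missing_rows": sorted(source_keys - replica_keys),
        "missing_columns": sorted(source_columns - replica_columns),
    }


# Hypothetical customer tables.
source = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Grace", "email": "grace@example.com"},
    {"id": 3, "name": "Linus", "email": "linus@example.com"},
]
replica = [
    {"id": 1, "name": "Ada"},
    {"id": 3, "name": "Linus"},
]

print(coverage_gaps(source, replica))
# {'missing_rows': [2], 'missing_columns': ['email']}
```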

Timeliness (Tm)

The data must represent current conditions; the data is available and can be used when needed. Timeliness gauges how well data reflects current market/business conditions and its availability when needed.

Examples:

  • A file delivered too late or a source table not fully updated for a business process or operation.
  • A credit rating change was not updated on the day it was issued.
  • An address is not up to date for a physical mailing.

Fun fact: Forty-five million Americans change addresses every year.
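Timeliness is often checked by comparing a record's last update against a maximum acceptable age. The Python sketch below assumes timezone-aware timestamps and a freshness threshold you define yourself; `is_stale` and the one-day threshold are illustrative choices, not a rule.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated, max_age, now=None):
    """Return True when the record is older than the allowed age."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_updated > max_age


# Hypothetical case: a credit rating issued one morning is still not
# refreshed in our system the next day.
issued = datetime(2024, 5, 6, 9, 0, tzinfo=timezone.utc)
checked = datetime(2024, 5, 7, 10, 0, tzinfo=timezone.utc)

print(is_stale(issued, timedelta(days=1), now=checked))  # True
print(is_stale(issued, timedelta(days=2), now=checked))  # False
```

The right threshold depends on the business process: a day may be fine for a mailing address and far too long for a credit rating.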

Uniqueness (Uq)

How much data can be duplicated? Uniqueness supports the idea that no record or attribute is recorded more than once. Each record and attribute should be one-of-a-kind, aiming for a single, unique data entry (yeah, one can dream, right?).

Examples:

  • Two instances of the same customer, product, or partner with different identifiers or spelling.
  • A share is represented as equity and debt in the same database.

Fun fact: data replication is not bad per se; involuntary data replication is!
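Here is a small Python sketch for spotting the same customer stored under different identifiers. It assumes that case and whitespace are the only differences between duplicates, which is optimistic: real spelling variations usually call for fuzzy matching, which this example does not attempt. All names are hypothetical.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase the name and collapse whitespace so trivial variants match."""
    return " ".join(name.casefold().split())


def find_duplicates(customers):
    """Group customer ids by normalized name; keep groups with more than one id."""
    groups = defaultdict(list)
    for customer_id, name in customers:
        groups[normalize(name)].append(customer_id)
    return {name: ids for name, ids in groups.items() if len(ids) > 1}


# Hypothetical customer list: C001 and C002 are the same company.
customers = [
    ("C001", "Acme Corp"),
    ("C002", "  ACME   corp"),
    ("C003", "Globex"),
]

print(find_duplicates(customers))
# {'acme corp': ['C001', 'C002']}
```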

Going further

Let’s agree that those seven dimensions are pretty well-rounded. As an industry, it’s probably time to say: good enough.

However, we can always add examples, so please add your examples and fun facts in the comment section to enrich this article. I'll add the best ones to the article in future updates!

Jean-Georges Perrin
AbeaData

#Knowledge = 𝑓 ( ∑(#SmallData, #BigData), #DataScience U #AI, #Software ). Lifetime #IBMChampion. #KeepLearning. @ http://jgp.ai