What is Data QoS, and why is it critical?

Jean-Georges Perrin
ProfitOptics
Oct 3, 2023


In this article, I am introducing the notion of data quality of service (Data QoS), which is the result of combining Data Quality (DQ) with Service-Level Agreements (SLA). I will start by explaining the concept, and I will then drill down to describe the elements composing the Data QoS, focusing first on Data Quality and then on Service-Level Indicators. Finally, I will explain how I grouped them.

Quality of Service (QoS) is a well-established concept in network engineering. QoS is the measurement of the overall performance of a service, such as a telephony, computer network, or cloud computing service, particularly the performance seen by the network's users. In networking, several criteria are considered to quantitatively measure QoS, such as packet loss, bit rate, throughput, transmission delay, availability, and more. This article applies QoS to data engineering.

Data Quality of Service (Data QoS)

As your need for observing your data grows with the maturity of your business, you will realize that the number of attributes you want to measure brings more complexity than simplicity. That's why, back in 2021, I came up with the idea of combining data quality dimensions and service-level indicators into a single table, inspired by Mendeleev's (and many others') work on classifying the chemical elements.

Inspired by Mendeleev's periodic table of the chemical elements, the Data QoS table represents the finest-grained elements used for measuring data quality and service levels for data.

Let's have a look at data quality first.

Data Quality is not enough

Regarding data, the industry standard for trust has often been limited to data quality.

I have felt this way for a long time. In 2017, at Spark Summit, I introduced Cactar (Consistency, Accuracy, Completeness, Timeliness, Accessibility, and Reliability) as an acronym for six data quality dimensions, relayed in this Medium article. Although there is no official standard, the EDM Council added a seventh one.

Here are the seven data quality dimensions.

The seven data quality dimensions are on the Data QoS table.

Accuracy (Ac)

The measurement of the veracity of data against its authoritative source: the data is provided but incorrect. Accuracy refers to how precise data is, and it can be assessed by comparing it to the original documents and trusted sources or confirming it against business rules.

Examples:

  • A customer is 24 years old, but the system identifies them as 42 years old.
  • A supplier address is valid, but it is not their address.
  • Fractional quantities are rounded up or down.

Fun fact: a lot of accuracy problems come from the data input. If you have data entry people on your team, reward them for accuracy, not only speed!
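As a sketch, an accuracy check can compare incoming records against a trusted reference keyed by identifier. The field names and sample values below are hypothetical:

```python
# Trusted reference and incoming records, keyed by a hypothetical customer ID.
authoritative = {"C001": {"age": 24}, "C002": {"age": 31}}
records = [{"id": "C001", "age": 42}, {"id": "C002", "age": 31}]

def accuracy_issues(records, source):
    """Return (id, field, found, expected) for values that disagree with the source."""
    issues = []
    for rec in records:
        for field, expected in source.get(rec["id"], {}).items():
            if rec.get(field) != expected:
                issues.append((rec["id"], field, rec.get(field), expected))
    return issues

print(accuracy_issues(records, authoritative))
# [('C001', 'age', 42, 24)]
```

This flags the 24-year-old customer recorded as 42 in the first example above.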

Completeness (Cp)

Data is required to be populated with a value (aka not null, not nullable). Completeness checks if all necessary data attributes are present in the dataset.

Examples:

  • A missing invoice number when it is required by business rules or law.
  • A record with missing attributes.
  • A missing expiration month in a credit card number.

Fun fact: a primary key is always a required field.
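A completeness check can be as simple as listing required attributes that are missing or null; the required-field set below is a hypothetical example:

```python
REQUIRED = {"invoice_number", "amount", "customer_id"}  # hypothetical business rule

def completeness_issues(record):
    """Return the required fields that are missing or null in a record."""
    return sorted(f for f in REQUIRED if record.get(f) is None)

rec = {"invoice_number": None, "amount": 99.90, "customer_id": "C001"}
print(completeness_issues(rec))  # ['invoice_number']
```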

Conformity (Cf)

Data content must align with required standards, syntax (format, type, range), or permissible domain values. Conformity assesses how closely data adheres to standards, whether internal, external, or industry-wide.

Examples:

  • The customer identifier must be five characters long.
  • The customer address type must be in the list of governed address types.
  • Merchant address is filled with text but not an identifying address (invalid state/province, postal codes, country, etc.).
  • Invalid ISO country codes.

Fun fact: ISO country codes are two or three letters (like FR and FRA for France). If you mix the two up in the same dataset, it's not a conformity problem; it's a consistency problem.
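The examples above can be turned into executable rules. A minimal sketch, with a hypothetical five-character identifier format and a small governed list of ISO 3166-1 alpha-2 country codes:

```python
import re

CUSTOMER_ID = re.compile(r"^[A-Z0-9]{5}$")  # hypothetical format rule
COUNTRY_CODES = {"FR", "DE", "US"}          # governed alpha-2 domain values

def conforms(customer_id, country):
    """Check a record against format and domain rules."""
    return bool(CUSTOMER_ID.match(customer_id)) and country in COUNTRY_CODES

print(conforms("AB123", "FR"))     # True
print(conforms("ABC1234", "FRA"))  # False: ID too long, alpha-3 code
```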

Consistency (Cs)

Data should retain consistent content across data stores. Consistency ensures that data values, formats, and definitions in one group match those in another group.

Examples:

  • Numeric formats converted to characters in a dump.
  • Within the same feed, some records have invalid data formats.
  • Revenues are calculated differently in different data stores.
  • Strings are truncated from a maximum length of 255 to 32 when they go from the website to the warehouse system.

Fun fact: I was born in France on 05/10/1971, but I am a Libra (October). When expressed as strings, dates are transformed through a localization filter. So, being born on October 5th makes my date representation 05/10/1971 in Europe, but 10/05/1971 in the U.S.
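The date fun fact is easy to reproduce: the same date, serialized through two localization conventions, yields two different strings, which is exactly the kind of inconsistency to watch for across data stores:

```python
from datetime import date

birth = date(1971, 10, 5)
eu = birth.strftime("%d/%m/%Y")  # day/month/year, common in Europe
us = birth.strftime("%m/%d/%Y")  # month/day/year, common in the U.S.
print(eu, us)  # 05/10/1971 10/05/1971
```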

Coverage (Cv)

All records are contained in a data store or data source. Coverage relates to the extent of the data: it identifies data that should be present but is absent from a dataset.

Examples:

  • Every customer must be stored in the Customer database.
  • The replicated database has missing rows or columns from the source.
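A coverage check on a replicated table can be sketched as a set difference over primary keys; the identifiers below are made up:

```python
# Primary keys at the source vs. in the replica (hypothetical values).
source_ids = {"C001", "C002", "C003"}
replica_ids = {"C001", "C003"}

missing = sorted(source_ids - replica_ids)  # present at source, absent in replica
print(missing)  # ['C002']
```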

Timeliness (Tm)

The data must represent current conditions; the data is available and can be used when needed. Timeliness gauges how well data reflects current market/business conditions and its availability when needed.

Examples:

  • A file delivered too late or a source table not fully updated for a business process or operation.
  • A credit rating change was not updated on the day it was issued.
  • An address is not up to date for a physical mailing.

Fun fact: Forty-five million Americans change addresses every year.
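A timeliness check often boils down to comparing the last update timestamp with a freshness threshold. A minimal sketch, assuming a hypothetical one-day window:

```python
from datetime import datetime, timedelta, timezone

def is_timely(last_updated, max_age):
    """True if the data was refreshed within the allowed window."""
    return datetime.now(timezone.utc) - last_updated <= max_age

stale = datetime.now(timezone.utc) - timedelta(days=3)
print(is_timely(stale, timedelta(days=1)))  # False
```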

Uniqueness (Uq)

How much data can be duplicated? It supports the idea that no record or attribute is recorded more than once. Uniqueness means each record and attribute should be one-of-a-kind, aiming for a single, unique data entry (yeah, one can dream, right?).

Examples:

  • Two instances of the same customer, product, or partner with different identifiers or spelling.
  • A share is represented as equity and debt in the same database.

Fun fact: data replication is not bad per se; involuntary data replication is!
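A uniqueness check can count occurrences of what should be a unique key; the identifiers below are illustrative:

```python
from collections import Counter

customer_ids = ["C001", "C002", "C001", "C003"]
duplicates = [k for k, n in Counter(customer_ids).items() if n > 1]
print(duplicates)  # ['C001']
```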

Let's agree that those seven dimensions are pretty well-rounded. As an industry, it's probably time to say: good enough. Of course, it completely ruins my Cactar acronym (and its great backstory).

But I still feel it is not enough. Data quality does not answer questions about end-of-life, retention period, and time to repair when broken. Let's look at service levels.

Service-levels complement quality

As much as data quality describes the condition of the data, service levels give you precious information on the expectations around availability, delivery, and more.

Here is a list of service-level indicators that can be applied to your data and its delivery. You will have to set some objectives (service-level objectives or SLO) for your production systems and agree with your users and their expectations (set service-level agreements or SLA).
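To make this concrete, here is a hypothetical set of service-level objectives for a single dataset, keyed by the indicator abbreviations used in the table; the values and the dictionary shape are an assumption, not a standard:

```python
# Hypothetical service-level objectives for one dataset, keyed by the
# indicator abbreviations from the Data QoS table.
slo = {
    "Av": 0.999,    # availability: 99.9% of probes succeed
    "Th": 5_000,    # throughput: records per second
    "Er": 0.001,    # error rate: at most 0.1% of records
    "Fy": "daily",  # frequency of update
    "Ly": "PT1H",   # latency: ISO 8601 duration, one hour
    "Re": "P7Y",    # retention: seven years
}
print(sorted(slo))  # ['Av', 'Er', 'Fy', 'Ly', 'Re', 'Th']
```

Once objectives like these are written down, agreeing on them with consumers turns them into SLAs.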

The service levels on the Data QoS table.

Availability (Av)

In simple terms, the question is: Is my database accessible? A data source may become inaccessible for various reasons, such as server issues or network interruptions. The fundamental requirement is for the database to respond affirmatively when you use the JDBC’s connect() method.
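The JDBC connect() call is one way to probe availability; as a language-neutral sketch, the same idea can be expressed as a TCP probe, which only tells you that the endpoint accepts connections:

```python
import socket

def is_available(host, port, timeout=2.0):
    """A minimal availability probe: can we open a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# A non-resolvable host (.invalid is a reserved TLD) is reported as unavailable.
print(is_available("db.example.invalid", 5432))  # False
```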

Throughput (Th)

Throughput is about how fast I can access the data. It can be measured in bytes or records per unit of time.
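A throughput measurement can be sketched by timing how many records a (hypothetical) fetch function returns per second:

```python
import time

def records_per_second(fetch, n_records):
    """Time n_records calls to a fetch function and return the rate."""
    start = time.perf_counter()
    for _ in range(n_records):
        fetch()
    elapsed = time.perf_counter() - start
    return n_records / elapsed

# With a no-op stand-in for a real fetch, the rate is simply very high.
rate = records_per_second(lambda: None, 10_000)
print(rate > 0)  # True
```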

Error rate (Er)

How often will your data have errors, and over what period? What is your tolerance for those errors?
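An error rate is simply the fraction of records in error over an observation window; the tolerance threshold below is a made-up objective:

```python
def error_rate(n_errors, n_records):
    """Fraction of records in error; 0.0 for an empty window."""
    return n_errors / n_records if n_records else 0.0

TOLERANCE = 0.001  # hypothetical objective: at most 0.1% of records
print(error_rate(3, 1_000), error_rate(3, 1_000) <= TOLERANCE)
# 0.003 False
```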

General availability (Ga)

In software and product management, general availability means the product is now ready for public use, fully functional, stable, and supported. Here, it applies to when the data will be available for consumption. If your consumers require it, it can be a date associated with a specific version (alpha, beta, v1.0.0, v1.2.0…).

End of support (Es)

The date at which your product will not have support anymore.

For data, it means that the data may still be available after this date, but if you have an issue with it, you won’t be offered a fix. It also means that you, as a consumer, will expect a replacement version.

Fun fact: Windows 10 is supported until October 14, 2025.

End of life (El)

The date at which your product will not be available anymore. No support, no access. Rien. Nothing. Nada. Nichts.

For data, this means that the connection will fail or the file will not be available. It can also be that the contract with an external data provider has ended.

Fun fact: Google Plus was shut down in April 2019. You can’t access anything from Google's social network after this date.

Retention (Re)

How long are we keeping the records and documents? There is nothing extraordinary here; as with most service-level indicators, it can vary by use case and legal constraints.

Frequency (of update) (Fy)

How often is your data updated? Daily? Weekly? Monthly? A linked indicator to this frequency is the time of availability, which applies well to daily batch updates.

Latency (Ly)

Measures the time between the production of the data and its availability for consumption.
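Latency, as defined above, can be computed directly from the two timestamps; the times below are invented:

```python
from datetime import datetime, timedelta

produced_at = datetime(2023, 10, 3, 6, 0)    # data produced upstream
available_at = datetime(2023, 10, 3, 6, 45)  # data visible to consumers
latency = available_at - produced_at
print(latency, latency <= timedelta(hours=1))  # 0:45:00 True
```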

Time to detect (an issue) (Td)

How fast can you detect a problem? Sometimes a problem is breaking, like your car not starting on a cold morning, or slow, like the data feeding your SEC (U.S. Securities and Exchange Commission) filings being wrong for several months. How fast do you guarantee the detection of the problem? This service-level indicator is also called "failure detection time."

Fun fact: squirrels (or some similar creature) ate the gas line on my wife's car. We detected the problem because the gauge dropped quickly after only a few miles. Could you even drive the car to the mechanic?

Time to notify (Tn)

Once you see a problem, how much time do you need to notify your users? This is, of course, assuming you know your users.

Time to repair (Tr)

How long do you need to fix the issue once it is detected? This is a very common metric for network operators running backbone-level fiber networks.

Of course, there are a lot more service-level indicators that will come over time. Agreements follow indicators; agreements can include penalties. You see that the description of the service can become very complex.

Representation

To represent the elements, I needed to identify precisely each element on two axes:

  • Time (or period).
  • Group.

Each element received additional attributes, as shown in the following illustration.

Each element has a name, an abbreviation, a group, an order in this group, and a category.

Periods & time-related

The periods are time-sensitive elements. Some elements are pretty obvious, as "end of life" is definitely after "general availability."

General availability comes before the end of support, which comes before the end of life.

Classification of some elements is more subtle: when data arrives in your new data store, you will check accuracy before consistency, and you can check uniqueness only once you have a significant amount of data. These elements have no strict chronological link, but they are checked in sequence.

Checking accuracy, consistency, and uniqueness happens in sequence.

Grouping

The second classification to find was about grouping. How can we group those elements? Is there a logical relation between the elements that would make sense?

This is what I came up with:

  • Data at rest (R).
  • Data in motion (M).
  • Performance (P).
  • Lifecycle (C) of the product itself.
  • Behavior (B) of the data includes retention, refresh frequency, availability time, and latency.
  • Time (T)-related indicators.

Why does it matter?

There are a lot of benefits to the classification and definition of the elements forming the Data QoS, as in the service-level indicators and the data quality dimensions.

Definitions we can agree on

The first step of the Information Technology Infrastructure Library (ITIL) is to set up a common vocabulary among the stakeholders of a project. Although ITIL might not be adequate for everything, this first step is crucial. Data QoS offers an evolutive framework with consistent terms and definitions.

Compatibility with data contracts

The data contract needs to be built on standardized expectations. It's obvious for the data retention period, as you would probably not call it duration, safekeeping, or something else. However, latency and freshness are often interchanged; let's go with latency.

Setting the foundation

Data QoS is not carved in stone, even if it could be compared to the Rosetta stone. It supports evolution and innovation while delivering a solid base.

Takeaways

In this article, I share my strong feeling, developed over the years: data quality is insufficient. Although data quality is becoming increasingly normalized, it still lacks service levels. Service levels can have a profusion of indicators and are open by nature. Combining data quality and service levels creates a richer set of dimensions and indicators, grouped under Data QoS.

The representation of Data QoS can be a Mendeleev-like periodic table featuring each element in a neighboring context.

Do not hesitate to start a conversation and see how Data QoS can help your organization.

From the EDM Council, here are the definitions of the Data Quality dimensions: accuracy, completeness, conformity, consistency, coverage, timeliness, and uniqueness.

From my previous work on data quality and SLA: Meet Cactar, Data SLA.
