Data Quality of Service (DataQoS) as Code — Reusable Component

--

Abstract

This article introduces the concept of Data Quality of Service (Data QoS) as Code, a novel approach that merges data quality and service quality metrics into a unified model. The Data QoS concept developed by Jean-Georges Perrin serves as the foundation. Data QoS as Code integrates the principles of Everything as Code (EaC) into the original concept and into data quality management, allowing both business-defined threshold values and monitoring capabilities to be encoded directly within the specification. The experimental specification demonstrates the application of this model through examples and definitions for four of the nineteen indicators, highlighting its potential as a reusable component. Furthermore, the integration of Data QoS as Code into data contract and data product specifications is proposed to enhance interoperability between standards. The significance of Data QoS as Code is underscored by its submission for presentation at the AI on Pi Day 2024 event, showcasing its relevance and potential impact on the future of data management practices.

Converting the DataQoS concept into a specification

Experimentation around the DataQoS as Code concept started as part of the Open Data Product Specification, as described in the previous article. After giving it a little more thought, it was decided that for the next steps of defining it more accurately, it would make sense to continue with it as an independent “standard”. DataQoS as Code might not be very useful independently, but it could be injected into other standards, such as data contract and data product standards, as a reusable component.

The original DataQoS concept defined by Jean-Georges Perrin combines data quality and service quality indicators into one holistic model. Data QoS as Code builds on this, merging the principles of data quality and service-level agreements into an integrated framework. The approach leverages concepts from network Quality of Service (QoS), where service performance is monitored through metrics such as packet loss, throughput, and availability.

https://medium.com/profitoptics/what-is-data-qos-and-why-is-it-critical-c524b81e3cc1

By adopting the Everything as Code philosophy, Data QoS as Code introduces a method for automating, scaling, and securing data monitoring and management. It utilizes a vendor-neutral, YAML-based specification to facilitate this, offering a streamlined and efficient solution for handling complex data quality and service-level requirements.

Data Quality indicators

As discussed in the previous article, data quality can already be covered with existing tools like Monte Carlo and SodaCL, among other options. For the indicators defining the data quality part of DataQoS as Code, the practices defined in the Data Product Specification are followed and applied, slightly modified. This DQ part of the concept did not seem too difficult to progress with.

But then the other side of DataQoS as Code, service quality, required more attention and some discovery of the possible options to consider.

OpenSLO for service quality indicators

While developing the concept, I stumbled upon OpenSLO when looking for “SLA as code” examples.

OpenSLO is a service level objective (SLO) language that declaratively defines reliability and performance targets using a simple YAML specification.

The approach is to define SLOs as Code and enable decoupling from vendor-specific practices.

This standard facilitates the integration of reliability and performance targets directly into development and operations workflows. By adopting a simple and accessible format, OpenSLO aims to make SLOs more manageable and understandable, allowing teams to define, track, and meet their service reliability goals effectively.

The OpenSLO initiative is designed to support modern development practices, including integration into Git workflows, making it easier for developers to incorporate SLOs into their continuous integration and deployment pipelines. As an open specification, it is developed and maintained through a collaborative and transparent process, ensuring that it remains relevant and accessible to a wide range of users and applications.
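
For orientation, a minimal OpenSLO SLO definition looks roughly like the sketch below. The service name, Prometheus queries, and target value are illustrative assumptions, not taken from any DataQoS material:

    apiVersion: openslo/v1
    kind: SLO
    metadata:
      name: orders-api-availability
    spec:
      service: orders-api                # illustrative service name
      indicator:
        metadata:
          name: availability
        spec:
          ratioMetric:                   # SLI as a ratio of good events to all events
            counter: true
            good:
              metricSource:
                type: Prometheus         # illustrative metric source
                spec:
                  query: sum(rate(http_requests_total{status!~"5.."}[5m]))
            total:
              metricSource:
                type: Prometheus
                spec:
                  query: sum(rate(http_requests_total[5m]))
      timeWindow:
        - duration: 28d
          isRolling: true
      budgetingMethod: Occurrences
      objectives:
        - displayName: 99% of requests succeed
          target: 0.99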

This discovery led me to explore how service level indicators could be standardized in DataQoS as Code with OpenSLO.

Refreshed Open DataQoS as Code

At this point, the earlier, less refined example of the Open DataQoS as Code specification, created as part of the Open Data Product Specification (ODPS), was refreshed to focus only on this specific, possibly reusable component. All the surrounding parts of ODPS were removed, and the result is just the DataQoS as Code specification. The next version of Open DataQoS as Code is still available via the same GitHub-rendered page built with Slate:

https://open-data-product-initiative.github.io/DataQoS-as-code

Previously, while DataQoS as Code was defined as part of ODPS, three indicators of the original DataQoS concept were not included since they already had a place in ODPS. In this refreshed approach, all 19 indicators of the DataQoS concept are to be included in the specification. The indicators cover both data quality and service quality aspects.

Open DataQoS as Code Specification aims

‘Open’ refers to the openness of the standard. Any connotations with open data (a different thing) are not intentional, intended, or desirable. At this point, it became obvious that the specification’s aims had to be defined more clearly:

Define DataQoS with YAML as a machine-readable vendor-neutral open specification

YAML was selected as the markup language for practical reasons, and it also seems to be the industry’s de facto standard in this type of application. The aim is to decouple DataQoS from vendor-specific solutions and offer an open, standardized model for applying DataQoS. This practice is gaining more and more of a foothold in the data economy industry. Emerging standards like the Data Contract Specification and OpenSLO are supported by companies that base their services and products on the standards. Instead of keeping a standard as a vendor secret, these companies have decided to offer the specifications openly on GitHub or on websites with dedicated domains. Tooling around the emerging standards also tends to be open and offered as open source; examples include the CLI tools for OpenSLO and the Data Contract Specification. Following this trend, Open DataQoS as Code is published openly under an open source license. No CLI tool has been developed yet.

Define data quality and service quality with 19 indicators as a holistic yet flexible reusable component

The Open DataQoS as Code component will include the 19 indicators of the DataQoS concept, which define both data quality and service quality aspects. It is not expected that all use cases require all 19 indicators, or even that all use cases require both service and data quality indicators. When applying the Open DataQoS as Code standard, users select which indicators to use. Some might use it only to define threshold values and monitoring rules for data quality, while others use just the service quality aspect. In some use cases, both aspects are applied by including both data quality and service quality indicators in the data offering specification.

Extend the DataQoS concept with Everything as Code to enable monitoring and define the business-driven threshold requirements

The original DataQoS concept does not take a stand on the emerging Everything as Code philosophy. It is focused on defining the concept that others can apply. The Open DataQoS as Code component offers a standardized format to describe both business-defined threshold values and monitoring rules for the indicators. The latter, the monitoring rules, are standardized following the Everything as Code philosophy.

The first indicators defined as DataQoS as Code

In the initial experimental draft of the specification, some of the 19 indicators were already standardized, and the aim is to do the same for the rest. It is not yet known whether all 19 indicators can be standardized with the available solutions and practices (emerging standards and the other methods described). Perhaps some indicators will require more profound innovations.

The first four indicators defined in the Open DataQoS as Code specification include two data quality indicators and two service quality indicators:

  1. Availability (Service Quality)
  2. Completeness (Data Quality)
  3. Conformity (Data Quality)
  4. Error rate (Service Quality)

Currently, data quality and service quality indicators have slightly different schemas. Ideally, they should follow the same pattern, but for now all DQ indicators follow one schema, while service quality indicators follow a slightly different one.

Service Quality Indicator Schema — Availability as an example

The availability service quality indicator is now defined following the OpenSLO specification. Both the objectives and ratioMetric elements follow OpenSLO. This approach enables this part to be defined in an external YAML file and included when needed, but also to be used as-is directly in an OpenSLO-compatible monitoring system.

The other element, objectives, defines the business-driven threshold metric values. Keeping the aimed minimum service level values and the monitoring rules close to each other enables easier validation of the service level.
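
A sketch of the resulting indicator definition is shown below. The exact field names should be checked against the published specification; the Prometheus queries and the threshold value are illustrative:

    - dimension: serviceQuality
      name: availability
      objectives:
        # business-driven threshold values, in OpenSLO objectives format
        - displayName: Availability over the measurement window
          target: 0.99
      monitoring:
        type: OpenSLO                    # the standard used under 'spec'
        spec:
          # OpenSLO-compatible content, usable as-is in a monitoring system
          ratioMetric:
            counter: true
            good:
              metricSource:
                type: Prometheus         # illustrative metric source
                spec:
                  query: sum(rate(http_requests_total{status!~"5.."}[5m]))
            total:
              metricSource:
                type: Prometheus
                spec:
                  query: sum(rate(http_requests_total[5m]))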

The other service level indicators now added to the Open DataQoS as Code specification follow the same pattern. Notice that the type attribute under the monitoring object defines the standard used under spec. At the moment, the only option defined is OpenSLO, but this could be extended to include other models as well. In the above example, availability can be calculated from the two values retrieved with the good and total queries. The result can then be measured against the given objective to validate whether the required availability level has been reached.

Data Quality Indicator Schema — Conformity as an example

Data quality indicators have a similar schema, with small adjustments for practical reasons.

This schema still contains the “cut and paste” rules to be injected into the monitoring tool, but the objectives are not part of the spec element. To validate conformity at the required level (defined in objectives), the contents of spec can be executed in Soda and the result compared to the requirements.
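
A sketch of such a conformity indicator is below. The dataset name orders and the column checks are illustrative assumptions, while the content under spec is standard SodaCL that Soda can execute directly:

    - dimension: dataQuality
      name: conformity
      objectives:
        # business-defined requirement: no format violations allowed
        - displayName: All values conform to the agreed formats
          target: 0
      monitoring:
        type: SodaCL                     # the tool the 'spec' content is written for
        spec:
          # cut-and-paste SodaCL checks
          checks for orders:
            - invalid_count(email) = 0:
                valid format: email
            - invalid_count(country_code) = 0:
                valid length: 2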

Discussion

Towards a unified schema

As described above, data quality and service quality indicators have slightly different schemas at the moment. Ideally, all indicators should follow the same schema; as a result, the specification would be simpler to adopt and follow. Different schemas are applied now for a simple reason: whatever is under an indicator’s spec element should be “cut and paste” compatible with existing tools or standards (such as OpenSLO). Take the conformity data quality indicator discussed above as an example: there, anything under spec is direct input to the SodaCL system.

A bridge between different standards

As an independent standard, the Open DataQoS as Code specification might not have much value on its own, but developing it in “isolation” and with a tight focus can make progress faster. It is also possible that components like Open DataQoS as Code could act as a bridge between specifications such as the Open Data Product Specification and the Open Data Contract Standard. Both specifications could utilize the same reusable component, which would increase the interoperability between the standards.

In the above cases, using the DataQoS as Code component does not require including both data quality and service quality indicators. Instead, only the suitable data quality indicators might be included, for example, in a data contract. Still, as a result, both the data product specification and the data contract would be compatible for that part.
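
Purely as an illustration of the idea (no embedding format has been defined yet), a data contract could reuse just the data quality indicators of the component along these lines; the fragment below is hypothetical:

    # hypothetical data contract fragment reusing DataQoS as Code
    dataset: orders
    dataQoS:
      - dimension: dataQuality
        name: completeness
        objectives:
          - target: 98                   # at least 98% of values present
        monitoring:
          type: SodaCL
          spec:
            checks for orders:
              # fails if more than 2% of customer_id values are missing
              - missing_percent(customer_id) < 2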

The above connects the DataQoS as Code concept to my larger PhD research aim, which is to identify and define methods to increase interoperability among the emerging standards in the data economy.

Is there a demand for DataQoS as Code?

Further development is needed to include all Data QoS indicators in the specification; so far we have only scratched the surface. For now, specification development is paused while feedback is collected. There is no reason to polish the specification if it does not resonate with anyone and there is no demand for it. If the specification gains traction, shows potential value for practitioners, and other interested developers emerge, development will continue.

I submitted a talk proposal about this to AI on Pi Day 2024 to get some discussion started. Let me know if you have an (online) event suitable for this kind of topic that is looking for speakers.

Feedback needed

Take a look at the experimental specification and leave a comment or contact me directly on LinkedIn.

https://open-data-product-initiative.github.io/DataQoS-as-code

--

Jarkko Moilanen (PhD)
Exploring the Frontier of Data Products

API, Data and Platform Economy professional. Author of "Deliver Value in the Data Economy" and "API Economy 101" books.