“Data QoS as Code” fitted into the Open Data Product Specification: a trial

--

Abstract

In exploring emerging data contract and data product standards, I found innovations like the “as code” approach for data quality in the Data Contract Specification. However, it lacks a detailed model. Data Quality of Service (Data QoS) offers a holistic framework for data quality and service-level indicators. To give this a broader home, I integrated it with the Open Data Product Specification (ODPS), a metadata model containing Data Quality and SLA objects. Since the Data Contract Specification lacks SLA support, I opted for ODPS as the broader context.

Combining Data QoS, “as code”, and ODPS, I created draft documentation for the Open Data Product Specification, injecting Data QoS as code and providing a sample “hello world” example. While a logical model exists, a list of available tools and services for implementation is still pending. This integration aims to leverage the strengths of each standard for comprehensive data management and quality assurance.

Context

In my exploration of emerging data contract and data product standards I have come across some innovations which have intrigued me and inspired me to take the idea a bit further. One of those innovations is that the Data Contract Specification selected an “as code” approach for data quality. “As code” is an approach worth exploring, as “everything is becoming code”. Yet the data quality object itself in the Data Contract Specification lacks details, a clear and concise model, a foundation to build upon.

This is where another concept, labelled Data QoS, might come in handy. It offers a model to present data quality and service-level indicators as one holistic framework. But describing just a small fraction of a data product as a model is hardly useful. Thus I decided to add a third ingredient to the soup. The Open Data Product Specification is a holistic metadata model for data products which also contains objects for data quality and SLA. The Data Contract Specification does not include SLA in its latest version, and thus I chose ODPS as the larger context to use.

With these ingredients I started to model what the combination of all three might be, so that the result takes the best of each ingredient. In short, I injected the combination of Data QoS and the “as code” parcel into the Open Data Product Specification.

Ingredients of the soup

In the result I have combined elements and concepts taken from:

  1. the Data QoS model explained by Jean-Georges Perrin,
  2. the Data Contract Specification, and
  3. the Open Data Product Specification (the author is its maintainer).

The proposed approach which combines the three is what I have labelled Data QoS as Code. It is a construct which has not been implemented anywhere yet, but it offers an approach to combine Data QoS with the “Everything as Code” philosophy and one of the emerging data product metadata model specifications.

Data QoS

Data Quality of Service (Data QoS) is a concept merging Data Quality (DQ) with Service-Level Agreements (SLA). It draws parallels from Quality of Service (QoS) in network engineering, which measures service performance. QoS criteria include packet loss, throughput, and availability. Data QoS addresses the complexity of measuring data attributes as businesses evolve. Inspired by Mendeleev’s periodic table, the author proposes combining DQ and SLA elements into a unified framework for better data observation. This approach aims to simplify data management amid growing business needs.

[Illustration: the Data QoS periodic-table-style overview of data quality and service-level indicators. Source: https://medium.com/profitoptics/what-is-data-qos-and-why-is-it-critical-c524b81e3cc1]

In the illustration above, the darker boxes are data quality related and the airBaltic-green colored boxes are service-level indicators. Service levels provide vital insights into data availability and condition, complementing data quality assessments. Employing service-level indicators (SLIs) helps gauge performance expectations for data delivery. To ensure efficient production systems, service-level objectives (SLOs) should be established in line with user expectations and outlined in service-level agreements (SLAs).

Data Contract Specification

Data contracts serve as bridges between data providers and consumers, defining data exchange parameters. They specify data structure, format, semantics, quality, and usage terms. Implementable through output ports or other technologies, data contracts ensure data consistency across various platforms like AWS S3, Google BigQuery, and Databricks. They facilitate communication between teams, clarifying data interpretation and expectations.

Created collaboratively, data contracts guide development processes, including code generation, testing, and access control. The Data Contract Specification, using YAML format, aims for platform-neutral compatibility. The Data Contract CLI aids in creating, validating, and enforcing data contracts. While termed a “contract,” it’s more of a specification owned by providers, aiding in tracking data usage through data usage agreements.

There are at least two emerging data contract standards: the Open Data Contract Standard (under the Linux Foundation) and the Data Contract Specification. The latter has applied “Everything as Code” in its Data Quality object, and that is why I picked it as one of the ingredients and a source of inspiration for modelling “Data QoS as Code”.
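To give a feel for the “as code” style, below is a rough sketch of how data quality rules can be expressed in the spirit of the Data Contract Specification. It is illustrative rather than normative: the exact keys may differ between versions, and the embedded SodaCL checks are simplified.

```yaml
# Illustrative sketch of quality-as-code in the style of the Data Contract Specification.
# Keys and the embedded SodaCL checks are simplified and may not match a given version exactly.
dataContractSpecification: 0.9.3
id: orders-data-contract
info:
  title: Orders
  version: 1.0.0
quality:
  type: SodaCL               # quality rules are expressed as code for a named engine
  specification:
    checks for orders:
      - row_count > 0                  # dataset must not be empty
      - missing_count(order_id) = 0    # every row must have an identifier
```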

Open Data Product Specification

The Open Data Product Specification is a vendor-neutral, open-source, machine-readable data product metadata model. It defines the objects and attributes as well as the structure of digital data products. The work is based on existing standards (schema.org), best practices and emerging concepts like Data Mesh. The specification is built on experiences gained from over 300 data product cases. The latest production version is 2.1.

Both the current production and development versions of ODPS contain Data Quality and SLA objects. In this exercise I take examples from the development version, which is in YAML, as ODPS is transitioning from JSON to YAML as the default to align with data contract practices. Another reason for YAML as the default is that it is the common markup format used in configurations and the like, and that is after all what a data product description is: a blueprint of a data product.

Below is an imaginary and somewhat simplified example of the SLA and Data Quality objects defined according to ODPS.
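The sketch is indicative only: it follows the spirit of the ODPS development version, but apart from the observability object and the monitoringScriptURL attribute discussed next, the attribute names are assumptions rather than exact ODPS keys.

```yaml
# Indicative sketch in the spirit of ODPS; most attribute names are assumptions
SLA:
  uptime:
    unit: percent
    value: 99.9
  responseTime:
    unit: milliseconds
    value: 200
  observability:
    statusPageURL: https://status.example.com/orders-data-product   # dashboards for humans
dataQuality:
  completeness:
    unit: percent
    value: 98
  accuracy:
    unit: percent
    value: 95
  monitoringScriptURL: https://github.com/example/orders-dq-checks   # monitoring service or code
```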

The latest version of ODPS has separate objects for SLA and Data Quality. The approach is for the most part traditional and defines the indicators and their objective values. Still, a slight “as code” direction can be found in it. For the SLA, the model contains an observability object with attributes that point to services where, for example, uptime statistics can be seen. That is still intended for humans, not machines. Similarly, the Data Quality object contains a monitoringScriptURL attribute, which points to a data quality monitoring service or code. If the given URL leads to actual code rather than a running service, this can be labelled a simple “as code” implementation. Yet the ODPS model still lacks clear “as code” elements, and SLA and Data Quality remain separate objects.

What if we combine SLA and data quality into one

What if we do not separate SLA and Data Quality into separate objects, but put those indicators under one object, namely DataQoS, and add the “as code” philosophy in the parts where it is feasible? To explore the answer to the question I started to model the result logically before jumping into YAML experiments.

Modelling the Data QoS as code Object

Initially I took the 19 indicators from the Data QoS model and started to model how to apply them. Each of them is an object in the model, and each has two parts (objects): objective and monitoring. The objective is the target state of the indicator, for example availability from the SLA, for which the business sets the minimum level to maintain. The other part, monitoring, is the “as code” part and defines the rules for verifying the objective (and detecting failure).

The simplified logic can be visualized as in the sketch below, which describes the approach I started to apply to all 19 indicators. This approach was selected after some experimentation. The aim was to find a single replicable model (schema) fitting all suitable cases and to let the business elements coexist next to the “as code” parts.
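As a rough sketch, one indicator could look like this. The structure follows the objective/monitoring split described above, while the attribute names and the monitoring engine reference (Prometheus here) are purely illustrative and not fixed by any specification.

```yaml
# Illustrative sketch of one Data QoS indicator with the objective/monitoring split.
# Attribute names and the monitoring engine reference are assumptions.
dataQoS:
  availability:
    objective:                 # business-set target state of the indicator
      unit: percent
      value: 99.5
    monitoring:                # "as code" rules verifying the objective
      type: prometheus         # assumed monitoring engine
      specification: |
        # alert when measured 30-day availability drops below the objective
        avg_over_time(up{job="orders-data-product"}[30d]) < 0.995
```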

Selecting suitable indicators (16) for the “as code” approach

I ended up not applying the above model to all 19 suggested indicators, since not all of them necessarily require “as code” monitoring. Indicators like general availability, end of support, and end of life are information which in the Open Data Product Specification belongs at the document level, defining general features of the data product. Eventually I ended up having 16 of the suggested indicators in the trial version of the DataQoS object.
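Those excluded lifecycle indicators could then sit as plain document-level attributes, roughly as sketched below; the attribute names are assumptions for illustration, not taken from the ODPS schema.

```yaml
# Illustrative only: lifecycle information kept at the document level, outside DataQoS
product:
  details:
    name: Orders data product
    generalAvailability: 2024-06-01
    endOfSupport: 2026-06-01
    endOfLife: 2026-12-31
  dataQoS: {}   # the 16 remaining indicators, each with objective and monitoring parts, go here
```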

Injecting the Data QoS as code object into the Open Data Product Specification

The final step in this trial is to inject it into an existing standard. The above object could be injected into any standard. In order to verify how the described model would fit into the Open Data Product Specification, I forked the development version of ODPS and created a stand-alone example, since the change is so dramatic and it is not certain whether it will be applied to the standard. Providing a complete example of the result enables more detailed discussion.

The ODPS documentation has a structure which allows showing the descriptions of the objects and attributes in parallel with an example implementation in YAML. The sample documentation most likely contains inconsistencies and even plain errors. It is, after all, a draft trial to see whether it can be done and whether it would even make sense.

You can look at the specification draft and the example YAML here: https://open-data-product-initiative.github.io/DataQoS-as-code/#data-qos

Discussion

Scattered “as code” content

In the trial the “as code” parts are scattered inside each indicator, which might make it a bit cumbersome in practice. In comparison, for example the Data Contract Specification structure follows the pattern that all data quality rules are inside one attribute. The benefit of the scattered approach is that each indicator can be monitored with a different service. In reality a data product provider most likely does not have separate monitoring services for each indicator but uses one for many.

The trial model also does not enable easy separation of data quality and service quality indicators. That could be achieved by adding an additional string attribute indicating which of the two an indicator is.

Serve both business and monitoring

Keeping the indicator threshold level definition near the monitoring parts seems logical. Without threshold levels set by the business, verification of the desired quality level is impossible, and those threshold values must be defined somewhere. In addition, in some cases an actual legal agreement is generated between provider and consumer, in which the promised quality levels are discussed in detail. The proposed model enables pulling the business-set threshold values from the data product blueprint.

Should we use triplets or fixed object names

In the trial the Data QoS indicator objects are fixed. The model contains objects named exactly as they are defined in the Data QoS model discussed in the beginning. This approach fixes the schema of the Data QoS object and assures interoperability, but at the same time makes it somewhat rigid. Fixing the objects also makes it more difficult to misuse the standard.

The alternative approach is similar to what the Open Data Contract Standard has. In it, the Data QoS objects would be triplets of property (the indicator name), unit (the measurement unit), and value (the objective value of the indicator). If that approach were applied, then at least the property string values should be fixed into an enum. If that is not done, we can say goodbye to interoperability, as everyone could define the properties freely.
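To make the difference concrete, the two alternatives could look roughly like this; the names are illustrative, and the triplet form only mimics the Open Data Contract Standard style rather than copying it exactly.

```yaml
# Alternative A: fixed object names (the approach used in the trial)
dataQoS:
  availability:
    objective:
      unit: percent
      value: 99.5
---
# Alternative B: triplets of property / unit / value
# (property values should ideally be restricted to an enum of the agreed indicators)
dataQoS:
  - property: availability
    unit: percent
    value: 99.5
```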

Lack of “as code” systems for all indicators

Due to lack of time, I did not construct a thorough list of systems on the market offering an “as code” solution for each indicator. It remains open, on a practical level, how we could implement the proposed model right now with off-the-shelf products. As the Everything as Code philosophy becomes more popular, it could result in a growing set of solutions that fit here as well. Until then, the only approach available is to build everything as custom software. Eventually it should be possible to build a table listing all indicators in one column and the names of suitable services in the next. Then we could take this into use at scale.

--

Jarkko Moilanen (PhD)
Exploring the Frontier of Data Products

API, Data and Platform Economy professional. Author of "Deliver Value in the Data Economy" and "API Economy 101" books.