Exploring the Frontier of Data Products: Seeking Insights on Emerging Standards

--

I’m on a fascinating journey, delving into the evolving world of data products and their emerging standards. My research so far has introduced me to a range of intriguing concepts like the Open Data Contract Standard, Data Contract Descriptor, Data Product Descriptor Specification, and the Open Data Product Specification. While I am familiar with DCAT, my focus is on the newer, cutting-edge standards that are shaping our industry. This blog post is just an initial ice-breaker into a subject that has become the core of my second PhD research.

If development of the emerging standards continues along more or less separate tracks, interoperability will only become more difficult. Of course, over time the standards will stabilize and mature, which makes it possible for someone to implement conversion software between them. Still, difficulties should be expected if the standards become more vendor-specific and possibly evolve further away from each other. The industry yearns for unified standards, but lacks the initiative and drivers to pursue them without slowing business-related development. There is a need for unified data commodity metadata standard(s).

What could be done? Standards are not the sexiest topic for most, but they are certainly critical in scalable and managed data economy endeavors. At the end of the post, I will offer two initial options to aim for. Before taking a stab at the future, let’s have a look at why we need standards.

Data governance requires standards

Data governance needs to be based on standards for several critical reasons. Standards ensure that data across an organization is consistent and trustworthy, which is essential for making informed decisions and maintaining the integrity of data processes. With established standards, data governance helps prevent the misuse of sensitive business information or customer data; this is particularly important in the digital age, where data security is paramount. Standards in data governance facilitate interoperability and seamless data sharing, especially in systems like electronic health records (EHR), where data movement is crucial. With standardized governance, organizations can add meaningful context and understanding to their data, which helps in translating raw data into actionable insights. Standards provide a framework for regulating data usage within an organization, ensuring compliance with legal and ethical requirements. In short, standards in data governance play a pivotal role in enabling interoperability and fostering data reuse, automation, and discovery.

Data sharing history and change agents

The evolution of data sharing is a journey marked by significant milestones in both scientific and business realms. Initially, data sharing was fundamental to scientific advancement, involving simple information exchanges among researchers. The digital era ushered in business-to-business data sharing, a crucial step for developing machine learning models and advanced analytics.

Open data, rooted in mid-20th-century freedom of information legislation, has significantly influenced transparency in scientific research and government. The Data Catalog Vocabulary (DCAT) emerged as a key player in enhancing web-based data catalog interoperability.

The rise of data marketplaces, predominantly cloud-based, revolutionized the efficiency of data sharing by connecting data consumers and suppliers. By the 2020s, data sharing had evolved from a technical activity to a strategic business element, crucial for innovation and competitive advantage. Recent years have seen the emergence of new data architectures like Data Mesh and Data Fabric, further transforming the data economy and management strategies.

More recently, progress on generative AI has ignited yet another flame and is expected to have a significant impact on the data economy as well. The old standards do not seem able to provide the support needed for the new practices. None of the standards are AI-ready; all of them need updating or refactoring.

The effects of AI entering all parts of the data economy and value creation are still unknown territory, but since AI is destined to change whole societies, it would be a miracle if the data economy and related standards were not affected.

Emerging new metadata models for describing data commodities

In a broader sense, a data contract can be defined as an agreement between a data producer and its consumer(s), detailing the structure, format, semantics, quality, and terms of use for exchanging data. This concept is critical for effective data collaboration, ensuring that both parties understand and agree on how the data is to be provided and used. The listed standards approach standardization from two angles: contract and data product. Although those might look separate at first glance, they are quite closely connected.
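To make that definition concrete before diving into the individual standards, here is a minimal, purely illustrative sketch of the kind of metadata such an agreement could carry. The field names below are my own shorthand and do not come from any of the standards discussed in this post.

```yaml
# Illustrative only: a generic data contract sketch, not any specific standard
dataset: customer_orders
version: 1.0.0
schema:                       # structure and format of the exchanged data
  - name: order_id
    type: string
    required: true
  - name: order_total
    type: decimal
semantics:                    # meaning and context of the data
  order_total: "Total order value in EUR, including VAT"
quality:                      # quality expectations
  completeness: "order_id populated in 100% of rows"
  freshness: "data no older than 24 hours"
termsOfUse:                   # legal and operational conditions
  allowedPurpose: "Internal analytics only"
  retention: P12M
```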

I’m reaching out to this knowledgeable community for insights. Are there other emerging standards in data products that I should be exploring? Any new developments or initiatives that are gaining traction?

Now let’s have a closer look at the listed emerging standards that aim to change the way we understand data products and the related contract metadata.

Open Data Contract Standard

The roots of this emerging standard are in the practical implementation of Data Mesh. Formerly known as the data contract template, this standard was used to implement Data Mesh at PayPal.

The team working to build an implementation of Data Mesh realized the need for a resource descriptor. The number of elements needed in this descriptor kept growing. That’s when they decided to restructure the format and adopt a data contract approach. Some months later they open-sourced a version of the template. After that, the template was taken to a broader community, the AIDA User Group, where it started its incubation process. Although the AIDA User Group is a fantastic organization, it is not suited for developing open source and open standards. That’s where the Linux Foundation came into the dance. On November 30th, 2023, the AIDA User Group and the Linux Foundation AI & Data joined forces to create Bitol.

The technical structure of the Open Data Contract Standard (ODCS) is built around a versioned and evolving framework, encompassing a common data model, standard data practices, a clear architecture for implementation, and a detailed sectional approach to data contracts. It is characterized by the following elements.

Firstly, while not explicitly described as part of ODCS, similar standards like the Open Contracting Data Standard (OCDS) suggest that ODCS might also define a common data model for the disclosure of data and documents throughout the contracting process. Secondly, ODCS aims to establish reliable communication between data producers and consumers by setting standard data practices. This likely involves the definition of data formats, protocols, and guidelines for data exchange. Thirdly, the technical architecture of ODCS includes the principles and guidelines for implementing data contracts, ensuring that they align with the standardized framework and facilitate effective data transactions. Finally, ODCS includes eight distinct sections in a data contract. These sections are designed to create a robust legal and technical foundation for data products, encompassing various aspects of data exchange and usage.
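As a rough illustration of that sectional approach, a contract in this style could look something like the sketch below, with a handful of the sections represented. The field names are paraphrased from memory and should be treated as assumptions rather than the exact ODCS vocabulary.

```yaml
# Illustrative sketch only; field names approximate the spirit of ODCS, not the exact specification
datasetDomain: seller              # fundamentals / demographics
quantumName: seller-payments       # assumed name field for the data product
version: 1.1.0
status: current
description:
  purpose: "Views built on top of the seller payment tables"
dataset:                           # dataset and schema section
  - table: payments
    columns:
      - column: payment_id
        logicalType: string
        isNullable: false
quality:                           # data quality section
  - code: nullCheck
    column: payment_id
slaProperties:                     # service-level agreement section
  - property: latency
    value: 4
    unit: h
stakeholders:                      # stakeholders and roles
  - username: jdoe
    role: data-product-owner
price:                             # pricing section
  priceAmount: 0
  priceCurrency: EUR
```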

Data Contract Descriptor

The data contract descriptor, as found on DataContract.com, is a tool or framework designed to define the parameters for exchanging data between a data provider and their clients or users. The data contract descriptor specifies the structure and format of the data to be exchanged. This includes how the data is organized, its layout, and the formats acceptable for exchange. It also outlines the semantics, meaning the context and meaning of the data, and the quality expectations. This ensures that the data is not only technically correct but also meaningful and usable for the end users.

The descriptor includes terms of use, which define the legal and operational conditions under which the data can be used. This might cover aspects like data privacy, usage rights, and limitations. By establishing standard data practices, the data contract descriptor aims to create reliable communication between data producers and consumers, setting common ground for expectations and responsibilities in the data exchange process. The data contract descriptor can be seen as an element in formalizing and streamlining data exchanges, ensuring clarity, consistency, and compliance with agreed-upon standards and practices. Like many data serialization scenarios, it can be described in either JSON or YAML, depending on the specific requirements and context.
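For illustration, a descriptor along these lines might look like the following YAML sketch. The keys mirror the general shape described above (structure, semantics, quality, terms of use); they are my approximation and are not guaranteed to match the specification on DataContract.com verbatim.

```yaml
# Illustrative sketch; the keys approximate the descriptor's shape, not the exact specification
dataContractSpecification: 0.9.3   # assumed version marker
id: urn:datacontract:checkout:orders
info:
  title: Orders
  version: 1.0.0
  owner: checkout-team
terms:                             # terms of use
  usage: "Analytics and reporting"
  limitations: "No re-selling of raw data"
models:                            # structure and format
  orders:
    fields:
      order_id:
        type: string
        description: "Unique order identifier"   # semantics
quality:                           # quality expectations
  - description: "order_id is never null"
```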

While the first two emerging standards selected the contract as a starting point, the other two standards started from the data product concept, which is one of the key pillars of Data Mesh.

Data Product Descriptor Specification

The Data Product Descriptor Specification is a comprehensive framework that provides a detailed, structured, and standardized approach to describing data products, fostering better management, interoperability, and utilization in data-driven environments.

The Data Product Descriptor Specification plays a crucial role in the data mesh paradigm by detailing the attributes and usage of data products. This specification is designed to define the characteristics, behavior, requirements, and usage of data products in a data mesh environment. It aims to provide clear and structured information about data products, making them more manageable and understandable.

The Data Product Descriptor Specification focuses on collating crucial pieces of information related to various aspects of data products, ensuring all relevant details are comprehensively covered under a unified specification. It is a vendor-neutral, open-source, machine-readable data product metadata model. This approach ensures wider applicability and adaptability across different platforms and tools, avoiding vendor lock-in.

The specification often employs JSON or YAML to declaratively define a data product in all its components. This makes the specification adaptable and easy to integrate with modern data systems and processes. A data product, as defined in this specification, is a logical unit that includes all components necessary for processing and storing domain data for analytical or data-intensive use cases, enhancing the utility and accessibility of the data.
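As an indicative example of such a declarative definition, a data product described in this style might look something like the sketch below, expressed here in YAML. The section and field names follow my reading of the specification and should be double-checked against the official DPDS documents.

```yaml
# Illustrative sketch of a DPDS-style descriptor; section and field names are assumptions
dataProductDescriptor: 1.0.0
info:
  fullyQualifiedName: urn:org.example:dataproducts:tripExecution
  domain: transportManagement
  name: tripExecution
  version: 1.2.0
  owner:
    id: trip-execution-team@example.com
interfaceComponents:               # how consumers interact with the product
  outputPorts:
    - name: trips
      description: "Curated trip events for analytical use"
internalComponents:                # what is needed to build and run the product
  applicationComponents:
    - name: trip-ingestion-job
  infrastructureComponents:
    - name: trip-storage-bucket
```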

A crucial aspect of DPDS is its focus on automation within Data Mesh environments. By providing a structured set of metadata, it facilitates the automated handling and management (including deployment) of data products, enhancing efficiency and scalability. DPDS includes guiding values and an operating model to support federated governance structures. This helps in aligning data product management with broader organizational goals and governance frameworks.

DPDS has been compared with other specifications like DCAT, revealing differences in structure and linkages. DPDS features a flat structure, which may differ from the linked structure seen in specifications like DCAT, discussed above.

The specification also includes a detailed description of the functional area the data product represents, its purpose, and relevant business-related information. This helps in aligning the data product with specific business needs and goals.

Open Data Product Specification

Open Data Product Specification (ODPS) is a framework designed to standardize the creation, publication, and management of data commodities, especially in data marketplaces. ODPS originally set out to model, in machine-readable format, all the metadata needed to create an optimized experience for customers looking for suitable data for their business needs. In later versions, ODPS expanded to include DataOps and other data product backend-related structures and metadata.

Its uniqueness lies in several key aspects. ODPS is based on metadata models that are not tied to specific providers. This allows for a more standardized and scalable approach to data product creation and management across various platforms. The specification treats each data product as an independent unit. This principle helps in ensuring that each data product is self-contained and can be utilized independently, enhancing modularity and flexibility. Initially starting as an idea, the Open Data Product Specification has matured into a governed, sustainable standard. This evolution signifies its growing acceptance and utility in the field of data management.

ODPS includes technical specifications for machine-readable data product metadata models. It defines objects, attributes, and the structure of digital data products, making them more accessible and usable by machines. A unique feature of ODPS is the pricing element, which contains 11 standardized pricing plan models, from freemium to value-based pricing.
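To give a feel for this, a product described in the ODPS style might carry details and a pricing plan block roughly like the sketch below. The attribute names are recalled from the specification and may not be exact; treat them as illustrative.

```yaml
# Illustrative ODPS-style sketch; attribute names may not match the specification exactly
product:
  details:
    en:
      name: "Pedestrian and Bicycle Traffic"
      productID: pedestrian-bicycle-001
      visibility: public
      status: production
  pricingPlans:
    en:
      - name: "Premium subscription"
        priceCurrency: EUR
        price: 50.00
        billingDuration: month
      - name: "Freemium"
        priceCurrency: EUR
        price: 0.00
```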

Features of new emerging standards

Here are a few observations based on a surface-level, gut-feeling analysis. A precise and analytical comparison will be part of my research: I will analyze the standards and also conduct interviews to get to the core of each standard and the reasons it was created. But for now, here are some observations to consider.

Emerging standards overlap

While two of the standards have selected the contract as their point of view and the other two approach the commodity (data product and services), they overlap in some parts. For example, the Data Contract Descriptor has pricing-related attributes, but no model for standardized pricing plans, which constitute the core of the Open Data Product Specification.

Emerging standards are using lightweight description languages

The emerging standards are using JSON and YAML. Both formats are widely used for data serialization and have their own advantages. YAML, being a superset of JSON, can include JSON objects. It is known for its human-readable format and is often preferred for configuration files due to its readability and user-friendliness. In contexts where clarity and ease of understanding are crucial, YAML is a common choice. JavaScript Object Notation (JSON) is an open standard format that derives from JavaScript data formats. It is widely used in web applications and scenarios where data needs to be exchanged between a server and a web application. JSON is often preferred for its compatibility and ease of use in programming environments. The choice between JSON and YAML depends on factors like ease of use, readability, compatibility with existing systems, and specific use cases. Both formats are capable of describing data contracts and data products effectively.
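A small, hypothetical example of that superset relationship: the same terms object can be written in YAML's block style or in JSON syntax, and a YAML parser accepts both.

```yaml
# Block-style YAML
terms:
  usage: "Analytics and reporting only"
  noticePeriod: P3M
---
# The same object written in JSON syntax, which a YAML parser also accepts
{"terms": {"usage": "Analytics and reporting only", "noticePeriod": "P3M"}}
```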

Emerging standards have an organizational approach rather than pure data

While DCAT has evolved into a detailed and in-depth method for describing data and cataloging it to enable data exchange, the emerging standards describe the data as well, but have a strong focus on defining the rules and conditions, the legal framework, and service-level agreement attributes. This is clearly the case with the Open Data Contract Standard and the Data Contract Descriptor. These two emerging standards have a stronger organizational approach to data, which is aligned with the principles and foundations of Data Mesh.

Emerging standards differ in governance

Two of the standards have a defined and documented governance model, which covers decision making, the release process, and the structure of the standardizing team or organization. Bitol has a Linux Foundation-defined governance model, and ODPS has a governance model described publicly alongside the standard. As was mentioned, Bitol works under the umbrella of the Linux Foundation, while ODPS is organized under Open Collective.

In some cases, having a defined governance model is considered a must-have for a standard to be accepted by the organization funding application development. This is especially the case when the funding comes from the public sector; in those cases, a governance model seems to increase the credibility of the standard. Any standard that is to become credible and widely adopted in the long run needs a good, solid, and trusted host. Both the host and the governance model are seen as signs of continuity, and that builds trust in the standard.

On the other hand, having and applying a governance model often also means that development will be slower and might not be agile enough for the needs of the industry. This aspect of standardization seems to be crucially important to the people behind the emerging standards. Perhaps the current standardization processes and models do not fit current needs (but that is another topic to research). Whatever the future holds for data product related standards, it has to enable agility and speed.

Path forward and goal?

As discussed above, the longer the emerging standards keep developing separately, the more interoperability issues we will have. In my opinion, we should be seeking ways to bring the relevant parties together and, possibly with compromises, find a more commonly approved standard to apply. Based on initial discussions with several practitioners, we could set a goal, one we would be ready to accept as the next step towards a more unified standard or set of standards.

Option 1: Monolith single standard

Option 2: Family of Standards

Obviously, a more in-depth analysis of the pros and cons of each goal has to be done, and that is what I am starting now. Another question is whether the monolith even makes sense at all. What is also clear, even from glancing at the mentioned emerging standards, is that they already overlap. Thus, in the drawings, the dashed boxes around the approaches overlap as well.

It is also possible that the initial step would be a Family of Standards and that in the future we could have one monolith standard. But this is the journey I am going to commit to now. I will discuss my research phases in the following posts.

--

Jarkko Moilanen (PhD)

API, Data and Platform Economy professional. Author of "Deliver Value in the Data Economy" and "API Economy 101" books.