Modern Data Products Are Programmable

--

For some years now, the “everything as code” model has been establishing itself as standard practice among software development companies. These businesses have realized that expressing assets and resources as code to automate tasks, streamline processes, and reduce maintenance work is the best way to increase flexibility and achieve resilience. My claim is that the data economy, including anything related to data products, will be no different, and the role of Everything as Code in it will only grow. I gave this topic a few hours of my time and describe below what it means for data products.

Open dialog between maintainers

As the maintainer of the Open Data Product Specification, I also participate in the work of sibling standards such as the Data Contract Standard and the Data Product Descriptor Standard (DPDS). I do that in order to learn, but also to seek opportunities to create the interoperability that is needed at the tooling level. I wrote an article about that some time ago.

The DPDS can be seen as “a rival” to the Open Data Product Specification, but I do not see it that way. The two specifications started from somewhat different foundations and needs. The future will tell whether they merge or stay separate.

I have had multiple discussions with the maintainer of DPDS, and egoism is absent on both sides. We have agreed to cooperate since we share similar goals that go beyond the individual specifications. I love the open atmosphere of sharing and exchanging viewpoints that we have managed to create.

What about Everything as Code and data products?

One of the topics in the discussions I have had with the other maintainers of the above specifications has been Everything as Code and automation. One practical approach to defining what EaC can be is to look at it from a programmable/programmability point of view.

Cambridge Dictionary defines (IT) programmable as: “used to describe a computer or machine that is able to accept instructions to do a range of tasks, rather than just one”

In this article I refuse to stay inside rigidly defined boundaries on what is programmable or programmability and stretch a little in order to explore the concept of programmable data products.

The term “programmable” might taste like a relic, since it has been around for a while and many of us became familiar with it through the success of ProgrammableWeb. ProgrammableWeb was an information and news source about the Web as a programmable platform, the place to go if you had API-related needs. It was shut down after 17 years in operation. Yet the term programmable is still valid, and I chose it as the lens for this article to address the future of data products. Let’s begin with the APIs.

Programmability with the help of APIs

A few years ago, business networks were seen as “dumb pipes” for data transmission, managed manually via command line interfaces (CLI). Today, networks must be agile and scalable, able to deploy quickly and adapt to changing needs. Manual CLI efforts are outdated.

Automation and network programmability are key to modern enterprise networks. Automation manages infrastructure centrally, while programmability replaces manual tasks with software interfaces, enhancing efficiency, service speed, and adaptability.

Network programmability is crucial for IoT, cloud, 5G, and edge computing. The industry is moving towards controller-centric environments, with vendors offering programmable interfaces (APIs) for network management. As APIs evolve from cloud-based to integrated networking infrastructure, new applications can now be developed similarly to software code. This shift, known as “infrastructure as code,” enables more efficient and rapid automation of networking infrastructure, allowing software applications to swiftly program the network.

APIs in data product specifications

This API-first approach is visible in the Data Product Descriptor Specification. In DPDS, a data product exposes its interfaces to external agents through entities called ports, which are grouped by functional roles. In the drawing taken from the DPDS website, look at the greenish donut around the core.

There are five types of ports supported by the DPDS: Input ports collect source data for internal transformation, Output ports share the generated data reliably, Discovery ports provide information about the data product’s static role in the architecture, Observability ports offer insights into the data product’s dynamic behavior, and Control ports manage local policies and governance operations. Each data product can have one or more ports of each type.
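To make the port model concrete, a descriptor fragment along these lines could declare one port of each role. This is an illustrative sketch, not copied from the DPDS documentation: the field and port names are hypothetical and should be verified against the official DPDS schema.

```yaml
# Illustrative DPDS-style descriptor fragment (hypothetical names and values).
dataProductDescriptor: "1.0.0"
info:
  name: "tripExecution"
  version: "1.2.3"
  domain: "logistics"
interfaceComponents:
  inputPorts:
    - name: "tripEvents"          # collects source data for transformation
  outputPorts:
    - name: "tripStatus"          # shares the generated data reliably
  discoveryPorts:
    - name: "productInfo"         # static role in the architecture
  observabilityPorts:
    - name: "metrics"             # insight into dynamic behavior
  controlPorts:
    - name: "policyAdmin"         # local policies and governance operations
```

The point of the grouping is that an external agent never touches the product’s internals; everything flows through one of these five typed interfaces.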

Programmability as injections and modifications to match business needs

Commonly used platforms such as Zapier adopted this long ago. Zapier provides the framework, and you can adjust and “modify” it to your specific needs with small code snippets and other low-code/no-code features. In short, you can override and extend the default behavior of the platform-provided functionalities.

New Relic is cloud-based software that lets websites and mobile apps track user interactions, and lets service operators track software and hardware performance. New Relic is a programmable platform: you can write custom code to add new features and create unique visualizations.

Data Product SLA monitoring as code

Speaking of monitoring and tracking, modern data products have service-level entities (SLA, SLO). Customers want some indication of the service level, and thus the Open Data Product Specification has long had an SLA object in which the data product owner can define the given SLA. In ODPS version 3.0, Everything as Code was expanded to cover the SLA as well, and monitoring can now be defined as code inside the spec element. The schema of the SLA object is the same as in Data Quality (discussed later in the article).

In the example below, Prometheus is hypothetically used to monitor the data product SLA, and the monitoring part is added as pure code that can be handed to the Prometheus platform for execution. In conclusion, the data product metadata contains the business-level SLA commitments, but also the rules to monitor those commitments.
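A sketch of what such an SLA object could look like follows. This is illustrative only: the ODPS field names approximate the 3.0 structure and should be checked against the specification, and the embedded Prometheus alerting rule (job name, thresholds) is a hypothetical example.

```yaml
# Illustrative ODPS-style SLA object with monitoring "as code" (hypothetical).
SLA:
  - dimension: uptime
    displaytitle:
      - en: Availability
    objective: 99.9               # the business-level commitment
    unit: percent
    monitoring:
      type: Prometheus            # the engine assumed to execute the spec
      spec: |-
        groups:
          - name: data-product-sla
            rules:
              - alert: UptimeBelowSLA
                expr: avg_over_time(up{job="data-product-api"}[30d]) < 0.999
                for: 5m
                labels:
                  severity: critical
                annotations:
                  summary: "Availability dropped below the 99.9% SLA"
```

The declarative part (objective, unit) states the promise; the spec part carries the executable rule that verifies it.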

Programmability as instructions

Likewise, in the data economy, data quality monitoring can now be defined as machine-readable injections. Those injections are not code per se, at least not always, but rather instructions that the actual data quality platform can automatically use when executing data quality assessment tasks. I admit that this can also be seen as pure configuration rather than programmability if you take a rigid stand. Examples of platforms supporting such Data Quality “as code” are Soda (with its SodaCL check language) and Monte Carlo.

The programmability here is that you do not just define Data Quality levels for the data product, but also define the functionality to measure, validate, and monitor them. Support for this was added in the Open Data Product Specification 3.0 release. In ODPS you can now define the target levels for DQ indicators and optionally add a “spec” part that contains the rules “as code”.
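As an illustrative sketch (the field names approximate the ODPS 3.0 Data Quality object and should be verified against the specification; the dataset and column names in the SodaCL checks are hypothetical):

```yaml
# Illustrative ODPS-style Data Quality object (hypothetical values).
dataQuality:
  - dimension: completeness
    displaytitle:
      - en: Completeness
    objective: 98                 # the target level for the DQ indicator
    unit: percent
    monitoring:
      type: SodaCL                # rules below follow SodaCL check syntax
      spec: |-
        checks for trip_events:
          - missing_count(trip_id) = 0
          - missing_percent(vehicle_id) < 2%
          - duplicate_count(trip_id) = 0
```

As with the SLA, the objective states the promised level, while the spec carries the rules a DQ platform can execute to verify it.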

Discussion

Admittedly, the “spec” part in both Data Quality and SLA is still very vendor-specific, and we could use a vendor-neutral standard for this part, supported by multiple vendors.

We should be able to define the Data Quality rules in machine-readable format “as code” at the data product level and then use the rules in various vendor solutions. The same of course for the SLA.

I would also expect “as code” to materialize in the pricing plans. ODPS contains 12 standardized data product pricing plan models to use. The Everything as Code part could be, for example, dynamic pricing, with the actual values in the pricing plans being code-driven. In the example below you can see how static pricing plans can be defined in ODPS today, but the “as code” injection part remains to be done.
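A static pricing plan sketch along these lines (illustrative; the field names approximate ODPS and should be checked against the specification), with a comment marking where a future “as code” part could be injected:

```yaml
# Illustrative ODPS-style static pricing plan (hypothetical values).
pricingPlans:
  en:
    - name: Premium subscription
      priceCurrency: EUR
      price: 50.00
      billingDuration: month
      unit: recurring
      maxTransactionQuantity: 100000
    # A future "as code" extension could replace the static price above
    # with a spec element computing dynamic prices, e.g. by demand or usage.
```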

The Data Product Descriptor has done a marvelous job of API standardization at the data product level, while the Open Data Product Specification has focused on applying EaC to Data Quality and SLA.

As I said in the beginning, the maintainers of both specifications have a lively and open dialog; perhaps we will see innovations slipping from one side to the other in the future.

--

Jarkko Moilanen (PhD)
Exploring the Frontier of Data Products

Open Data Product Specification igniter and maintainer (Linux Foundation project). Author of business-oriented data economy books. AI/Product Lead professional.