Data Mesh — A Technical Implementation — Data Product

Paul Cavacas
Apr 18, 2023 · 11 min read


This series follows my trials and tribulations in creating a vendor-independent Data Mesh. It will follow along as I build out the underlying system and will focus primarily on the technical aspects of the implementation. There are numerous other articles that go deep into the social side of Data Mesh, which is extremely important; this series concentrates on the technical side. We will be using a number of technologies and vendors, but because of the way the underlying Mesh is being built, these vendor-specific parts should be swappable at will.

It assumes that you are familiar with the general pillars and principles of Data Mesh. There are many other resources that provide a general overview of what Data Mesh is. Additionally, a number of articles go into deep detail about the social aspects, from company and team layout to communication and roll-out plans.

Design

List of high-level features being implemented:

  • Standard definition for Data Products
  • Standard inputs for Interoperability
  • Standard outputs for Consumer Access
  • Track and score different areas of the Data Products, such as SLAs, Usage, Cost, etc.
  • Testability of Data Products
  • Lifecycle of Data Products
  • APIs to control entire Data Mesh

The first two parts of the series (this one and the next) will focus on the standard definition of a Data Product and the general layout of the API layer.

Technologies Used

  • Python for APIs
  • Azure for Storage
  • Snowflake for standard Database
  • Power BI for Visualization

As mentioned, these are the technologies being used, but the design is being done in such a way that the actual technology choices are irrelevant.

Data Product Definition

From a technical perspective the Data Product is the central component, so the first order of business is to define what a Data Product will look like and what features it should contain. This needs to be done in a way that allows for customization and adaptation as we progress, so we will be storing the Data Products as JSON. Instead of inventing a standard from scratch, let’s see if we can use something that already exists. To that end, I’m going to use the Open Data Mesh standard defined at Interface Components — Open Data Mesh.

The interface defined there is broken into 4 main sections, with one of them further broken down into 5 subsections. The 4 main sections are info, interfaceComponents, applicationComponents, and infrastructuralComponents.
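As a first sketch of what the API layer will do with these definitions, here is a small Python helper that loads one from disk and checks that the four top-level sections are present. The file name is just an example, and treating all four sections as required is my assumption rather than something the standard mandates:

import json

# The four top-level sections of the Open Data Mesh definition described above.
REQUIRED_SECTIONS = {
    "info",
    "interfaceComponents",
    "applicationComponents",
    "infrastructuralComponents",
}


def load_data_product(path: str) -> dict:
    """Load a Data Product definition and verify its top-level structure."""
    with open(path, "r", encoding="utf-8") as f:
        definition = json.load(f)
    missing = REQUIRED_SECTIONS.difference(definition)
    if missing:
        raise ValueError(f"Definition is missing sections: {sorted(missing)}")
    return definition


# Hypothetical file name for the sample Data Product used throughout this article.
product = load_data_product("orders.json")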

As we progress the definition may adapt and change to meet various needs and what is presented here is the starting point to provide a general focus.

info section

The first section defines the general information about the Data Product. It contains things like the name, description and version, along with the domain and owner of the Data Product.

To start, all of the Data Product definitions (JSON files) are going to be created by hand, but I’m envisioning an application that will allow easy creation of these definitions, and possibly the automatic creation of at least part of a Data Product from existing artifacts like databases or files.
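As a small taste of that automation, below is a sketch that derives the column portion of a definition from an existing table using a standard DB-API cursor. The connection setup is left out, and mapping the driver-specific type codes onto proper dataType values is deliberately skipped here:

def columns_from_cursor(cursor, table_name: str) -> list:
    """Derive a Data Product column list from an existing table using the
    DB-API cursor.description metadata. Translating the driver-specific
    type_code into the definition's dataType values is left for later."""
    cursor.execute(f"SELECT * FROM {table_name} WHERE 1 = 0")  # metadata only, no rows
    return [
        {"name": name, "dataType": str(type_code)}
        for name, type_code, *_ in cursor.description
    ]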

One reason for choosing the Open Data Mesh standard is that it allows for easy extension of the defined schema, and I can already envision a few potential additions to the info section, such as a score or rating system for the Data Products, as well as a few additional general classification types.

Below is an example of the info section:

"info": {
"name": "orders",
"displayName": "Sales Orders",
"description": "Contains all of the orders for the company.",
"version": "1.0.0",
"domain": "Sales",
"owner": {
"name": "John Doe",
"id": "jdoe@company.com",
"x-team": "Operations Team",
"x-team-id": "operations@company.com"
},
"x-securityClassification": "Internal",
"x-type": "Producer Aligned Data",
"x-score": {
"x-userScore": 4,
"x-temporalScore": 3,
"x-qualityScore": 2,
"x-usageScore": 3,
"x-stabilityScore": 5
}
}

You can see in the above example the basic information for the data product: name, displayName, description, domain, and version, which should all be pretty self-explanatory. Additionally, the owner of the Data Product is defined in this section. You can see that the first extension to the Data Product has been added with the x-team attributes. I believe it is important to have a true single owner, as well as the team, in case the original owner leaves or changes responsibilities.

There are 3 additional extensions in the x-securityClassification, x-type, and x-score attributes. I’m not sure at this point whether they will be truly useful, as they are really just additional ways to classify and categorize the Data Products.

Security Classification — the general security level that should be applied to the Data Product, so that consumers know how this data should be treated.

Type — there are 2 main types, Producer Aligned Data and Consumer Aligned Data. Producer Aligned Data is typically more technical and used as building blocks for other Data Products; a lot of the time these Data Products will be single tables or views. Consumer Aligned Data is targeted more at end-user consumption and is typically used to answer questions and drive business results.

Score — a section that will be used in the eventual end-user displays of the Data Mesh. This is a compound scoring system that consumers can use to understand how “good” this Data Product is. You can see that it contains several subcomponent scores to rate different aspects of the Data Product.
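As a sketch of how the Mesh might roll these subcomponents up into a single headline number for display, the equal weighting below is my own assumption and not part of the standard:

def overall_score(x_score: dict) -> float:
    """Average every x-*Score subcomponent into one headline number.

    Equal weighting is an assumption; quality or stability could be weighted
    more heavily once we see how consumers actually use the scores.
    """
    subscores = [value for key, value in x_score.items() if key.endswith("Score")]
    if not subscores:
        return 0.0
    return round(sum(subscores) / len(subscores), 1)


# Using the sample info section above: (4 + 3 + 2 + 3 + 5) / 5 = 3.4
print(overall_score({
    "x-userScore": 4,
    "x-temporalScore": 3,
    "x-qualityScore": 2,
    "x-usageScore": 3,
    "x-stabilityScore": 5,
}))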

interfaceComponents

This section contains the bulk of the Data Product, as it holds all of the detailed definitions about the data the Data Product interacts with. It is further broken down into 5 types of ports; inputPorts, outputPorts, discoveryPorts, observabilityPorts, and controlPorts.

Each of the port definitions defines a way to look at and inspect the Data Product and will be covered below.

inputPorts

The Open Data Mesh standard defines several alternatives for how data is described in this section. Its primary use is to describe the data that feeds into this Data Product, whether it is an API, a database, another Data Product, or even just a general written description.

This section is important to help establish the lineage and interoperability of the Data Product within the Mesh.

Below is one example of what this section might look like:

    "interfaceComponents": {
"inputPorts": [
{
"entityType": "inputPort",
"name": "invoices",
"version": "1.0.0",
"promises": [
{
"platform": "SnowflakeAccountIdentifier",
"servicesType": "mesh",
"contract": {
"schema": [
{
"table": "INVOICES",
"displayName": "Invoices",
"columns": [
{
"name": "ORDER_NUMBER",
"dataType": "NUMBER"
},
{
"name": "PRODUCT_ID",
"dataType": "NUMBER"
}, ]
{
"name": "QUANTITY",
"dataType": "NUMBER"
},
{
"name": "AMOUNT",
"dataType": "NUMBER"
},
}
]
}
}
]
}
]
}

There is a lot to unpack in that small sample, and as I mentioned, different schemas can be used in this section depending on where the data is coming from (API vs. database vs. file, etc.). You can see that it defines a name, a version, and the fact that this is an input port. Additionally, multiple input ports may exist; for example, if Data Products A and B both feed in to create Data Product C, then you would have 2 input ports defined, one for A and one for B.

Beneath that, in the promises section, you can see that this defines a service type of mesh, meaning that this Data Product is being sourced from another Data Product. As I mentioned, Snowflake happens to be the underlying database being used here, so it has the account name for the Snowflake environment that contains this data.

Beneath this connection information you can see a table and column definition explaining where this information comes from. Minimal information is captured about the source, but we want to capture enough so that we can perform contract checks against the incoming data to ensure that upstream providers haven’t changed on us unexpectedly.
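A minimal sketch of such a contract check is shown below. It assumes the live column list has already been fetched from the upstream platform with whatever connector is appropriate (for Snowflake that could be a DESCRIBE TABLE query); the helper and variable names are mine:

def check_input_contract(contract: dict, live_columns: dict) -> list:
    """Compare the columns declared in an input port contract against the live
    schema and return a list of human-readable violations."""
    violations = []
    for table in contract["schema"]:
        for column in table["columns"]:
            live_type = live_columns.get(column["name"])
            if live_type is None:
                violations.append(f"{table['table']}.{column['name']} is missing upstream")
            elif live_type.upper() != column["dataType"].upper():
                violations.append(
                    f"{table['table']}.{column['name']} changed type: "
                    f"expected {column['dataType']}, found {live_type}"
                )
    return violations


# Contract from the sample input port above, trimmed to two columns.
contract = {
    "schema": [
        {
            "table": "INVOICES",
            "columns": [
                {"name": "ORDER_NUMBER", "dataType": "NUMBER"},
                {"name": "PRODUCT_ID", "dataType": "NUMBER"},
            ],
        }
    ]
}
# Pretend the upstream table renamed PRODUCT_ID without telling us.
live = {"ORDER_NUMBER": "NUMBER", "PRODUCT": "NUMBER"}
for problem in check_input_contract(contract, live):
    print(problem)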

outputPorts

The outputPorts section defines how consumers will be able to interact with our Data Product. Here you may have multiple output ports, one for each type of output that you support, for example Snowflake, CSV, JSON, API, Power BI, etc.

The information in this section contains all of the tables, fields, and other attributes about the data being exposed.

        "outputPorts": [
{
"entityType": "outputPort",
"name": "snowflake",
"description": "",
"version": "1.0.0",
"promises": [
{
"platform": "SnowflakeAccountIdentifier",
"servicesType": "sqlTable",
"api": {
"services": [],
"schema": {
"databaseName": "SALES_DATABASE",
"databaseSchemaName": "PUBLIC",
"tables": [
{
"name": "SALESORDERS",
"displayName": "Sales Orders",
"tableType": "VIEW",
"x-tests": [
{
"test-category": "ML",
"test-type": "RowCountByDay",
"test-severity": "Warning",
"date-field": "Create_Date",
"date-offset": -1
}
]
}
]
}
},
"deprecationPolicy": {
"description": "Currently no deprecation policy exists changes happen automatically and replace the existing table"
},
"slo": [
{
"type": "loadDate",
"max": 47,
"unit": "hours"
},
{
"type": "factDate",
"max": 47,
"unit": "hours"
},
{
"type": "completenessPercent",
"max": 100,
"unit": "percent"
},
{
"type": "uptimePercent",
"max": 100,
"unit": "percent"
}
]
}
]
}

At the top of this section you can see the type of port being defined, in this example a Snowflake output port. Beneath that you see connection information about where this data resides: the account connection information for Snowflake as well as the name of the database and table. Originally I had the columns and their definitions in this section as well, but that would have meant repeating the column information for each of the defined output ports, so I removed them; we will find them further on in the Discovery section. One enhancement that needs to be made here is the ability for an output port to override the default columns, so that if the Data Product defines 10 columns but the CSV output should only return 8, that can be expressed.

Below the table you can see another planned extension in the x-tests section. This section will be used to define various types of tests, either business rules (using Great Expectations) or, in this case, an ML test. These tests will be run against the Data Product’s data to ensure that the data is as expected. The type of test being defined here is a RowCountByDay, meaning that we expect a similar distribution in the number of rows loaded each day; that count may vary a lot by day of the week or season of the year, which is why we are using an ML test, so that these trends are picked up.
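To make the idea concrete, here is a deliberately simplified stand-in for that check, flagging a day whose row count falls outside a band learned from recent same-weekday history. The three-sigma band is my own simplification; the real ML test would model trend and seasonality properly:

from statistics import mean, stdev


def row_count_outlier(history: list, todays_count: int, sigmas: float = 3.0) -> bool:
    """Return True when today's count falls outside mean +/- sigmas * stdev of
    recent same-weekday row counts."""
    if len(history) < 2:
        return False  # not enough history to judge
    centre, spread = mean(history), stdev(history)
    return abs(todays_count - centre) > sigmas * spread


# Row counts from the last few Mondays, then a suspiciously small load today.
previous_mondays = [10250, 9980, 10400, 10120]
print(row_count_outlier(previous_mondays, todays_count=1200))  # True -> raise a Warning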

Below that you can see various SLOs being defined, which the Mesh will also automatically check to ensure that the Data Product is meeting the expected SLOs.
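For example, checking the loadDate SLO could look something like the sketch below, assuming the Mesh can find out when the output table was last loaded; the helper name is mine:

from datetime import datetime, timezone


def load_date_slo_met(last_loaded: datetime, slo: dict) -> bool:
    """Check an slo entry of type loadDate: the last load must be no more than
    slo['max'] hours old."""
    if slo["type"] != "loadDate" or slo["unit"] != "hours":
        raise ValueError("This sketch only handles the loadDate/hours SLO")
    age_hours = (datetime.now(timezone.utc) - last_loaded).total_seconds() / 3600
    return age_hours <= slo["max"]


# The sample output port above promises a load at most 47 hours old.
slo = {"type": "loadDate", "max": 47, "unit": "hours"}
print(load_date_slo_met(datetime(2023, 4, 17, 6, 0, tzinfo=timezone.utc), slo))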

discoveryPorts

Discovery Ports are used to provide information about the Data Product that should be presented to consumers: details about the Data Product’s tables, fields, and relationships. To prevent duplication of work, the column information from the discovery port will be concatenated with the output ports to provide a complete picture of the Data Product.

The structure of this section is similar to the other ports that we have looked at so far, and an example can be seen below:

        "discoveryPorts": [
{
"entityType": "discoveryPort",
"name": "catalog",
"description": "",
"version": "1.0.0",
"promises": [
{
"platform": "catalog",
"servicesType": "catalog",
"api": {
"services": [],
"schema": {
"tables": [
{
"name": "SALESORDERS",
"displayName": "Sales Orders",
"x-lineage": ["inputport:invoices"],
"columns": [
{
"name": "ORDER_NUMBER",
"displayName": "Order #",
"description": "The unique internal order number for this order",
"dataType": "DECIMAL",
"precision": 15,
"scale": 4
},
{
"name": "PRODUCT",
"displayName": "Product",
"description": "Which product was ordered",
"dataType": "string",
"data-length": 150
},
{
"name": "QUANTITY",
"displayName": "Quantity",
"description": "How many of the product was ordered",
"dataType": "integer"
},
{
"name": "AMOUNT",
"displayName": "Per Unit Amount",
"description": "The dollar amount for each unit ordered",
"dataType": "decimal",
"precision": 15,
"scale": 2
},
{
"name": "LINE_TOTAL",
"displayName": "Total $",
"description": "The total for this line of the order (Quantity * Per Unit Amount)",
"dataType": "decimal",
"precision": 15,
"scale": 2 },
]
}
]
}
}
}
]
}
],

In this section you can see that the platform, name, and type are all defined as catalog, because this port is serving up the data that is needed for the Data Catalog that will be provided by the Mesh.

Each table and column is defined in this section, along with details about its type and description.
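The concatenation mentioned earlier, joining an output port table with its columns from the discovery port, could look something like this sketch (the function is mine and follows the sample structures above):

def table_with_columns(data_product: dict, table_name: str) -> dict:
    """Combine an output port table definition with the matching column list
    from the discovery port so consumers see one complete description."""
    interfaces = data_product["interfaceComponents"]

    def find_table(ports: list) -> dict:
        return next(
            table
            for port in ports
            for promise in port["promises"]
            for table in promise["api"]["schema"]["tables"]
            if table["name"] == table_name
        )

    output_table = find_table(interfaces["outputPorts"])
    discovery_table = find_table(interfaces["discoveryPorts"])
    return {**output_table, "columns": discovery_table["columns"]}


# e.g. table_with_columns(product, "SALESORDERS") for the definition loaded earlier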

observabilityPorts & controlPorts

We will cover these ports in more detail in a later section. The main purpose of the observability port is to provide a view into the Data Product to ensure that it is being used and tracked. It will provide things like usage details along with trace, log and cost information.

The control port is used to provide various admin functions for the data product.

Both of these are pivotal to the Federated Computational Governance aspect of Data Mesh.

applicationComponents

Application components are used to describe various aspects of running this Data Product. The first thing we will use this for is to trigger the process of loading a batch Data Product.

The Mesh that is being built will know about Airflow and will be able to trigger DAGs written in Airflow to cause the Data Product’s data to be updated and refreshed.

Here is the general structure of this section:

    "applicationComponents": [
{
"entityType": "application",
"applicationType": "dag",
"platform": "Airflow",
"x-command": "DBT_Sales_Invoices",
"buildInfo": {},
"deployinfo": {}
}
]

You can see that it defines information about the platform it is run against, in this case an Airflow DAG, as well as an extension, x-command, holding the actual DAG name that is to be run.
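A sketch of how the Mesh could act on this component by calling Airflow's stable REST API (Airflow 2.x) is below. The base URL and credentials are placeholders, and proper secret handling belongs in the API layer:

import requests

AIRFLOW_URL = "https://airflow.example.com"  # placeholder base URL


def trigger_refresh(application_component: dict) -> None:
    """Trigger the DAG named in x-command via Airflow's stable REST API."""
    if application_component.get("platform") != "Airflow":
        raise ValueError("This sketch only knows how to trigger Airflow DAGs")
    dag_id = application_component["x-command"]
    response = requests.post(
        f"{AIRFLOW_URL}/api/v1/dags/{dag_id}/dagRuns",
        json={"conf": {}},
        auth=("mesh-service-account", "********"),  # placeholder credentials
        timeout=30,
    )
    response.raise_for_status()


trigger_refresh({"platform": "Airflow", "x-command": "DBT_Sales_Invoices"})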

infrastructuralComponents

Infrastructural components are used to define the different components that make up this Data Product. This section can be used in combination with a CI/CD pipeline to set up and deploy the pieces that are needed for the Data Product.

Below is an example section:

    "infrastructuralComponents": [
{
"entityType": "infrastructure",
"platform": "snowflake",
"infrastructureType": "dbt",
"repo": "Snowflake-Sales_DB",
"file": "dbt/models/publicviews/salesorder.sql",
"dependsOn": "",
"provisionInfo": {},
"description": "Combines invoices information"
},
{
"entityType": "infrastructure",
"platform": "airflow",
"infrastructureType": "dag",
"repo": "Airflow",
"file": "templates/dags/engineering_salesorders/salesorders.py",
"provisionInfo": {},
"description": "Runs DBT command to update the sales order table from the definition"
}
]

Here you can see that 2 components are needed for the Data Product: one is a dbt model and the second is the Airflow DAG that we mentioned above. You can see that it defines what type of component is being created or configured, along with the code repo and file used for each.
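To illustrate how a CI/CD pipeline might consume this section, here is a sketch that turns the components into an ordered provisioning plan. Deploying dbt models before DAGs is my assumption about the ordering, not something the definition enforces:

def provisioning_plan(infrastructural_components: list) -> list:
    """Produce a human-readable provisioning plan, deploying dbt models before DAGs."""
    order = {"dbt": 0, "dag": 1}  # assumed ordering: data models first, then orchestration
    planned = sorted(
        infrastructural_components,
        key=lambda component: order.get(component["infrastructureType"], 99),
    )
    return [
        f"deploy {c['infrastructureType']} from {c['repo']}/{c['file']} on {c['platform']}"
        for c in planned
    ]


for step in provisioning_plan([
    {"infrastructureType": "dbt", "platform": "snowflake",
     "repo": "Snowflake-Sales_DB", "file": "dbt/models/publicviews/salesorder.sql"},
    {"infrastructureType": "dag", "platform": "airflow",
     "repo": "Airflow", "file": "templates/dags/engineering_salesorders/salesorders.py"},
]):
    print(step)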

At this point these last 2 sections, applicationComponents and infrastructuralComponents, are more speculative in nature about their actual usage than true final implementations, and as we build out the features of the Mesh we will adapt these and other sections as needed.

The main difference between the two is that the application components section covers the more externally facing, day-to-day operations of the Data Product, while the infrastructural components section covers more of the internals needed to set the Data Product up.

Summary

I hope that the above definition of what I see as the core features of a Data Product has piqued your interest and that you continue along in the series. If you agree or disagree with any of these points, please let me know, as I’m just starting to explore the creation of the Data Mesh and I’d be happy to pivot or adjust based on others’ feedback.

Part 2 — Data Mesh — Technical Implementation — API | by Paul Cavacas | Apr, 2023 | Medium

Part 3 — Mesh Implementation — API — Data Product Services | by Paul Cavacas | Apr, 2023 | Medium
