Alchemesh console: The core concepts
We have announced the launch of our framework to support the Data Mesh, and we can now begin this new adventure together!
The idea is to share with you, as we develop, our reflections and the technical choices we make.
The goal is to share through these articles our interpretation of the Data Mesh, present our development approach, get feedback on our choices, and most importantly, try to think together about the challenges of implementing the Data Mesh.
Alchemesh console: Standardize interfaces to facilitate acculturation and understanding
As we have introduced, one of the objectives of the framework, and particularly the console, is to provide support and a framework to help different stakeholders understand, interact with, and adopt the Data Mesh.
Our solution should be a means to convey the concepts of the Data Mesh! This is a major challenge for us, especially with an approach as broad as that of the Data Mesh.
Many concepts come into play: data product, data domain, data contract, polysemy, addressability, truthfulness, ownership, autonomy, etc.
Many questions arise: What are the interactions between the different concepts? Which component should carry which information? And so on.
In such a context, it is difficult to ensure that everyone shares a common minimum understanding and to minimize the risk of over-interpretation or misalignment among stakeholders. Additionally, it is important to define clearly delimited spaces that enable teams to grasp the concepts and make strong proposals through feature requests.
For us, it was a natural choice to address these questions through the console by standardizing the definition of core Data Mesh concepts and their interactions, all translated into the interface.
Alchemesh: Core concepts modeling
⚠️ The version we are presenting here corresponds to what we defined during the MVP design phase; it is necessarily subject to change as we develop and implement new features. ⚠️
Users
Users are the central actors who interact in the mesh. In our framework, we distinguish between several personas:
- Data product developer: Covers a wide spectrum of skill sets, from generalist developers with general programming skills to specialist data engineers.
- Data product consumers: Covers multiple roles that have one thing in common: they need to access and use data to do their job (e.g., data scientists, data analysts, application developers).
- Data product owner: Responsible for delivering and evangelizing successful data products for their specific domains.
- Data platform developer: Responsible for delivering the platform services as a product, with the best user experience.
- Data platform owner: Accountable for building and operating the data platform as a product, working with the data platform developers on the data product experience plane services.
Data domains
Domain data ownership is the foundation for scaling in a complex system such as today's enterprises. Domain-driven design (DDD)'s strategic design embraces modeling based on multiple models, each contextualized to a particular domain, called a bounded context.
A bounded context is “the delimited applicability of a particular model [that] gives team members a clear and shared understanding of what has to be consistent and what can develop independently.”
We support three types of data domains:
- Source-aligned domain: Analytical data reflecting the business facts generated by the operational systems; these domains are responsible for providing the truths of their business as source-aligned domain data.
- Aggregate domain: Analytical data that aggregates data from multiple upstream domains.
- Consumer-aligned domain: Analytical data transformed to fit the needs of one or more specific use cases. This is also called fit-for-purpose domain data.
In addition to clarifying the role of a domain in relation to the data products it produces, this typing also enables federated data governance to define computational policies to properly govern the mesh (e.g., a rule that devalues data products from a source-aligned domain that do not rely on any source system) or to help prioritize the reorganization of domains.
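To make this concrete, here is a minimal sketch, in Python with hypothetical names (the console's actual model may differ), of how a domain type could be encoded and how a computational policy like the one above could use it:

```python
from dataclasses import dataclass
from enum import Enum


class DomainType(Enum):
    SOURCE_ALIGNED = "source-aligned"
    AGGREGATE = "aggregate"
    CONSUMER_ALIGNED = "consumer-aligned"


@dataclass
class DataDomain:
    name: str
    domain_type: DomainType


def should_devalue(domain: DataDomain, source_systems: list) -> bool:
    """Hypothetical computational policy: flag the data products of a
    source-aligned domain that are not backed by any source system."""
    return domain.domain_type is DomainType.SOURCE_ALIGNED and not source_systems
```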
Technical teams
Depending on the size of certain data domains, an organization may decide to define multiple cross-functional teams to manage sets of data products. To address this need, we decided to have a concept of a technical team, bringing together people contributing to the same scope within a domain.
We distinguish between several kinds of teams:
- Data product team: A stream-aligned team responsible for the end-to-end delivery of the services (ingestion, consumption, discovery, etc.) required by the data product.
- Platform team: Its purpose is to enable stream-aligned teams to deliver their work with substantial autonomy.
- Governance group: An enabling team whose key role is to facilitate decision making around global policies. These policies are then implemented computationally and adopted by data product teams.
Source system
In the case of source-aligned data domains, the operational and analytical worlds are unified within the same domain and this is reflected in the cross-functional teams. It is important that the console materializes this connection.
The intention is clearly not to manage operational tasks within the data mesh platform, but it is vital to materialize this connection to bridge the gap between the two worlds beyond just doing so organizationally.
Data product
With domain ownership (supported by a technical team), the domain-oriented data is shared as a product directly with data users.
Data as a product introduces a new unit of logical architecture called the data quantum, which controls and encapsulates all the structural components needed to share data as a product.
By adopting a product approach, we will communicate the state of our offering:
- Lifecycle state: Where the data product is in its lifecycle — whether it is in development, in discovery, stable, or being decommissioned.
- Maturity level: A product considered stable but with little historical usage does not have the same maturity as a stable data product that has been used by many consumers for several years.
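As an illustration, the lifecycle state and maturity level could be modeled as simple enumerations; the values below are assumptions, not the console's definitive list:

```python
from enum import Enum


class LifecycleState(Enum):
    DISCOVERY = "discovery"
    IN_DEVELOPMENT = "in-development"
    STABLE = "stable"
    DECOMMISSIONED = "decommissioned"


class MaturityLevel(Enum):
    EMERGING = 1      # stable but with little historical usage
    ESTABLISHED = 2   # adopted by several consumers
    TRUSTED = 3       # heavily used over several years
```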
Input ports
In the context of source-aligned data products, data will need to be consumed from an operational system to make it available as input for the internal processing pipeline of the data product. This integration is done via an input port (a platform component dedicated to this integration, provided by the platform or implemented by the domain teams).
To give a concrete example, suppose that operational data is available in a Kafka topic and needs to be made available on a GCP project. The input port could involve provisioning a GCS bucket and a NiFi dataflow that ingests data from the Kafka topic.
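For this Kafka-to-GCP example, the input port declaration might look something like the following sketch; the field names are illustrative, not the console's actual schema:

```python
# Illustrative declaration of an input port; the real schema may differ.
input_port = {
    "name": "orders-operational-feed",
    "component": "kafka-to-gcs",               # platform component implementing the integration
    "source": {
        "kafka_topic": "orders",
        "bootstrap_servers": "kafka.internal:9092",
    },
    "target": {
        "gcs_bucket": "acme-orders-landing",    # bucket provisioned for the data product
        "nifi_dataflow": "orders-ingestion",    # NiFi flow ingesting from the topic
    },
}
```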
Semantic model
Here, we describe the semantic models that the data product will offer.
This is a machine- and human-readable model definition that captures the domain model of the data: how the data product models the domain, what types of entities the data includes, the properties of those entities, etc.
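A semantic model definition could then look like this minimal, machine- and human-readable sketch (the structure is hypothetical):

```python
# Hypothetical semantic model definition for an "order" entity.
semantic_model = {
    "name": "order",
    "description": "An order placed by a customer.",
    "entities": {
        "Order": {
            "properties": {
                "order_id": {"type": "string", "description": "Unique order identifier"},
                "customer_id": {"type": "string"},
                "amount": {"type": "decimal", "unit": "EUR"},
                "placed_at": {"type": "timestamp"},
            },
        },
    },
}
```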
Output ports
These models will be exposed as assets through an output port. Simply put, an output port is a pair consisting of a storage system (object storage, columnar table, streaming topic, etc.) associated with a proxy that allows access via different protocols and languages (SQL, REST API, GraphQL, etc.).
One of our stances on this matter is that an output port will not necessarily expose all the models managed by the data product.
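In other words, an output port can be thought of as a (storage, proxy) pair exposing a chosen subset of the data product's models. A sketch under assumed names:

```python
from dataclasses import dataclass, field


@dataclass
class OutputPort:
    name: str
    storage: str                 # e.g. "bigquery-table", "gcs-bucket", "kafka-topic"
    access_protocols: list       # e.g. ["sql", "rest"]
    exposed_models: list = field(default_factory=list)  # not necessarily all models


analytics_port = OutputPort(
    name="orders-analytics",
    storage="bigquery-table",
    access_protocols=["sql"],
    exposed_models=["order"],    # only the models this port commits to exposing
)
```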
Code
This is the core work of a data product developer, who is often too detached from the data they produce in legacy data tools and architectures. The data mesh places the code that creates the value of a data product at its center, and that is naturally what we do. This logic starts from the inputs and generates the output assets.
In the data product, it is the responsibility of data product developers to properly define their data product, consume and expose data via the standard ports, and maintain the related metadata.
In return, everything that happens inside (the code) is completely left to the discretion of the team: a Dagster job, an Airflow DAG, a Kestra flow, a simple Python job in a Lambda… The choice and responsibility lie with the owner (this is what we call autonomy).
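Whatever the orchestrator, the shape of that code stays the same: read from the input ports, apply the team's own logic, and publish to the output ports. A deliberately tool-agnostic sketch (the helper names are placeholders for whatever the chosen stack provides):

```python
def run_data_product(read_input, write_output):
    """Tool-agnostic sketch: the internals are entirely up to the team.
    `read_input` and `write_output` stand in for whatever the chosen
    stack (Dagster, Airflow, Kestra, a plain Lambda, ...) offers."""
    raw = read_input("orders-operational-feed")                  # consume via the input port
    curated = [row for row in raw if row.get("amount", 0) > 0]   # any transformation logic
    write_output("orders-analytics", curated)                    # expose via the output port
```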
Infrastructure
A data product may depend on infrastructure that needs to be provisioned to carry out its processing (object storage, an intermediate dataset, etc.) and that is not related to how the code is executed, how data is ingested, or how data is exposed. This interface lets the data product specify to the platform what it needs.
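The infrastructure interface might then be a simple declaration of the resources the data product needs; the fields below are illustrative:

```python
# Illustrative infrastructure request, independent of code execution,
# ingestion, and exposure.
infrastructure_request = {
    "object_storage": [
        {"name": "orders-intermediate", "retention_days": 30},
    ],
    "datasets": [
        {"name": "orders_staging", "description": "Intermediate dataset for processing"},
    ],
}
```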
Metadata
Asset
We consider an asset to be the instantiation of a data product model via an output port.
Once the data product is deployed and functional, the code must maintain certain status information to inform its consumers of its state:
- Overall state: operational, in incident, down
- State of the assets: their technical data quality (accuracy, completeness, timeliness, validity) and their freshness.
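This status information could be reported by the data product's code in a form like the following sketch (the states and metric names are assumptions):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class OverallState(Enum):
    OPERATIONAL = "operational"
    IN_INCIDENT = "in-incident"
    DOWN = "down"


@dataclass
class AssetStatus:
    asset: str
    accuracy: float              # technical data quality dimensions
    completeness: float
    timeliness: float
    validity: float
    last_refreshed_at: datetime  # freshness


status_report = {
    "overall_state": OverallState.OPERATIONAL,
    "assets": [
        AssetStatus("orders-analytics/order", 0.99, 0.97, 0.95, 1.0,
                    datetime.now(timezone.utc)),
    ],
}
```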
Data contract
We have our data product in our data domain, owned by a technical team, with data consumed from an operational system via an input port and exposing the value of the data product generated by the code via output ports. Great!
But before consuming this data product, I, as a consumer, would like to know what I am committing to, and as a producer, who commits to consuming from me! This is where data contracts come into play.
Output Port
A data contract applies to an output port of a data product, not to the entire data product. There are several reasons for this:
- Expectations differ between a streaming flow and an object stored in a data lake (in terms of response time, update frequency, accuracy, etc.).
- Not all output ports carry the same models, so the commitment to consumption is not the same.
Access type
Depending on the nature of the data product, access to it will not be authorized in the same way. We support three types:
- Restricted access: This means that the owner of the data product must review and validate any access request.
- Internal access: This means that all requests from within the same domain are auto-approved; otherwise, they require the owner’s validation.
- Public access: This means that all requests are automatically approved without review or validation from the owner.
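The approval rule behind each access type can be expressed very simply; here is a sketch that assumes a requester is identified by its domain:

```python
from enum import Enum


class AccessType(Enum):
    RESTRICTED = "restricted"
    INTERNAL = "internal"
    PUBLIC = "public"


def is_auto_approved(access_type: AccessType, requester_domain: str,
                     product_domain: str) -> bool:
    """Sketch of the approval rule: public requests are auto-approved, internal
    ones only within the same domain, restricted ones never."""
    if access_type is AccessType.PUBLIC:
        return True
    if access_type is AccessType.INTERNAL:
        return requester_domain == product_domain
    return False  # RESTRICTED: always requires the owner's review
```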
Versioning and Lifecycle State
Data contracts are versioned and have a lifecycle state to inform about their status and to provide warnings in case of deprecation or changes.
Service Level Agreements
A data contract is a commitment to a service that we will provide and, more specifically, how we will provide it. Currently, we define the following commitments:
- Uptime
- Update frequency
- Response time
Terms
It is also a commitment on how the data product will be consumed in terms of:
- Usage
- Billing
- Notice period to adapt your consumption
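Put together, the commitments of a data contract could be captured in a structure like this one; the field names and units are assumptions:

```python
from dataclasses import dataclass


@dataclass
class ServiceLevelAgreements:
    uptime_percent: float       # e.g. 99.5
    update_frequency: str       # e.g. "hourly"
    response_time_ms: int       # e.g. 500


@dataclass
class Terms:
    usage: str                  # how the data may be used
    billing: str                # how consumption is billed
    notice_period_days: int     # time given to adapt your consumption


@dataclass
class DataContract:
    output_port: str            # the contract applies to one output port
    version: str
    lifecycle_state: str        # e.g. "active", "deprecated"
    sla: ServiceLevelAgreements
    terms: Terms
```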
Data quality test
As you may have seen with the assets within the data product, we distinguish between data quality tests that we call technical and those we call business. The former have a purely technical meaning, regardless of consumer expectations, and are defined by the technical teams.
The latter, defined within a data contract, aim to carry a business meaning that validates the value we introduce and commit to for consumers (duplicated rows may have a technical impact on storage costs and compute time without necessarily impacting the value we deliver).
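The same dataset can therefore carry both kinds of checks; a sketch of the distinction, with hypothetical test names:

```python
def technical_no_duplicate_rows(rows: list) -> bool:
    """Technical test, defined by the team: duplicated rows cost storage and
    compute, whether or not consumers care."""
    seen = {tuple(sorted(row.items())) for row in rows}
    return len(seen) == len(rows)


def business_amounts_are_positive(rows: list) -> bool:
    """Business test, defined in the data contract: a negative amount would
    break the value promised to consumers."""
    return all(row.get("amount", 0) > 0 for row in rows)
```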
State
The data contract is responsible for verifying its own state to allow the system to compare it with the commitments. It maintains the state of:
- SLAs
- Usage
- Billing
- Data quality test results
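A minimal, self-contained sketch of comparing the observed state against the contract's commitments (the key names are assumptions):

```python
def sla_breaches(committed: dict, observed: dict) -> list:
    """Sketch: compare the state maintained by the contract with its SLA
    commitments. Both dicts hold 'uptime_percent' and 'response_time_ms'."""
    breaches = []
    if observed["uptime_percent"] < committed["uptime_percent"]:
        breaches.append("uptime below commitment")
    if observed["response_time_ms"] > committed["response_time_ms"]:
        breaches.append("response time above commitment")
    return breaches


print(sla_breaches(
    committed={"uptime_percent": 99.5, "response_time_ms": 500},
    observed={"uptime_percent": 99.1, "response_time_ms": 320},
))  # -> ['uptime below commitment']
```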
Data contract access request
The data contract is ready; now, it’s time to request access to subscribe to it! This is the role of the access request, which will include:
- Who wants to consume?: Data product, Technical team, Single user, or Data domain
- What is the purpose?
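An access request then only needs to capture who wants to consume and for what purpose; a sketch with the consumer types listed above:

```python
from dataclasses import dataclass
from enum import Enum


class ConsumerType(Enum):
    DATA_PRODUCT = "data-product"
    TECHNICAL_TEAM = "technical-team"
    SINGLE_USER = "single-user"
    DATA_DOMAIN = "data-domain"


@dataclass
class AccessRequest:
    data_contract: str           # the contract (output port) being subscribed to
    consumer_type: ConsumerType
    consumer_id: str
    purpose: str                 # what the data will be used for
```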
Platform components
I won’t go into detail on this part, not because it’s uninteresting, but because it deserves a dedicated article in my opinion.
The important thing here is that we want to use these resources to offer interfaces between data product developers and platform teams (Data Product Experience Plane and Infrastructure Utils Plane). These interfaces support a self-service platform, ensuring developer autonomy while offering decentralization through platform components implemented and provided by the platform (our famous LEGO bricks).
Conclusion
There you have it — we’ve covered the core concepts that the console will support to enable teams to implement their data mesh. Let’s not forget one thing: we are still at the very beginning of development, aiming for an MVP with the basic concepts to start introducing the data mesh! Many concepts essential for a data mesh at scale and in the long run, such as polysemes, feedback loops, computational policies, etc., are still missing. We’ll get there!
The concepts are in place; the next step is the north star architecture of Alchemesh!