Alchemesh: Data mesh implementation on GCP (theory)
In the previous post on Alchemesh, we introduced the core concepts that our framework will manage, along with an initial version of the associated data model.
To advance the development of Alchemesh, we decided to move forward with two parallel work streams:
- Alchemesh Console: Building the console (user interface) and storing the core Data Mesh concepts (implementing our data models).
- Alchemesh Platform Components: Deploying an initial version of a Data Mesh by offering a self-service platform, and implementing our first platform “Lego” components.
This separation makes sense from our perspective since the console is intended to serve as an interface for any Data Mesh, regardless of how it’s deployed (integration can be done declaratively via the interface or API endpoints). Similarly, our platform components should be usable independently of Alchemesh or even a Data Mesh itself!
Ultimately, these two streams will converge and connect through the Alchemesh controller.
As you might have guessed from the title, this article aims to introduce the second stream. Our objectives for this stream are threefold:
- Develop our first platform components.
- Validate the structure of our Data Mesh concept models (data product, data contract, data contract access request, data domain, technical team, etc.) through a real-world example.
- Deploy an initial Data Mesh around which we can iterate to further evolve Alchemesh.
In this article, we present what we consider a technical scoping of the implementation we plan to carry out. Please keep in mind that this is just a preliminary overview — there will certainly be adjustments during the implementation, and we’re bound to make some mistakes in interpreting certain services or functionalities!
Data mesh on GCP
For this initial implementation, we decided to build our first Data Mesh on GCP. This choice is primarily due to technical convenience, as two out of the three contributors have extensive experience working with GCP.
Additionally, we see a strong alignment between the concepts managed by Alchemesh and those present in GCP.
⚠️ Alchemesh is not intended to be locked into any specific cloud provider. Our goal is to implement Data Mesh solutions on other cloud platforms in the future, as well as explore open-source stacks.
For this implementation, we will focus on the foundational elements that define our Data Mesh:
- Translating data domains
- Materializing technical teams
- Implementing data products
- Deploying central services managed by the self-serve data platform
- Establishing basic governance principles: identity management and access control
Architecture
As we will see in the following sections, the core component of our architecture will be Dataplex, which will enable us to materialize a large number of our concepts.
Self-serve data platform
The data platform manages all the centralized components within a dedicated GCP folder, ensuring governance, autonomy, and scalability of the mesh.
We will need to set up three GCP projects:
Lakehouse
Its primary role is to centrally manage the implementation of data governance using Dataplex:
- Define data taxonomy and access rules for various attributes.
- Expose the status of output ports’ entities using tag templates.
- Represent data domains and data products through the declaration of lakes, zones, assets, and entities.
- Manage data contract access requests via the permission framework associated with lakes and zones.
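To make the taxonomy point above more concrete, here is a minimal Python sketch using the Data Catalog policy-tag API (package google-cloud-datacatalog), which is what Dataplex/BigQuery column-level controls build on. The project ID, location, and tag names are illustrative assumptions, not final conventions.

```python
# Sketch only: declare a data taxonomy and a policy tag in the lakehouse project.
# Project ID, location, and names are hypothetical.
from google.cloud import datacatalog_v1

client = datacatalog_v1.PolicyTagManagerClient()

taxonomy = client.create_taxonomy(
    parent="projects/alchemesh-lakehouse/locations/europe-west1",
    taxonomy=datacatalog_v1.Taxonomy(
        display_name="data-sensitivity",
        activated_policy_types=[
            datacatalog_v1.Taxonomy.PolicyType.FINE_GRAINED_ACCESS_CONTROL
        ],
    ),
)

# Policy tags from this taxonomy can later be attached to columns to drive
# column-level access rules.
pii_tag = client.create_policy_tag(
    parent=taxonomy.name,
    policy_tag=datacatalog_v1.PolicyTag(display_name="pii"),
)
print(taxonomy.name, pii_tag.name)
```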
Workload Identity Manager
To decouple a data product as much as possible from the domain or technical team that owns it, we centralize the declaration of the service accounts associated with workloads (i.e., data products) in a dedicated project. This ensures that a change of ownership won’t impact permissions: the identity associated with the workload remains unchanged, even if the GCP project changes.
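As a sketch of that centralization, assuming a hypothetical alchemesh-workload-identity project and account naming, a data product’s service account could be created through the IAM API like this:

```python
# Sketch only: create a data product's service account in the dedicated
# workload-identity project. Project and account IDs are hypothetical.
from googleapiclient import discovery

iam = discovery.build("iam", "v1")
account = iam.projects().serviceAccounts().create(
    name="projects/alchemesh-workload-identity",
    body={
        "accountId": "dp-customer-orders",
        "serviceAccount": {"displayName": "Data product: customer-orders"},
    },
).execute()
print(account["email"])  # identity reused by all workloads of the data product
```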
Orchestrator
We will deploy Dagster, our data orchestrator, on a GKE cluster in “control plane” mode. Dagster will be aware of all the declared data products (i.e., code locations) and will manage the scheduling of assets across the mesh.
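To give an idea of what a code location amounts to, here is a minimal Dagster sketch of the kind of module each data product would expose to the control plane; the asset name and logic are placeholders.

```python
# Sketch only: a data product's Dagster code location, discovered and scheduled
# by the central control plane. Names and logic are placeholders.
from dagster import Definitions, asset


@asset
def customer_orders():
    """Materialize the data product's output port (placeholder logic)."""
    return []


defs = Definitions(assets=[customer_orders])
```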
Data domains
A data domain will be provisioned in GCP by a dedicated platform component consisting of several resources:
- A GCP group, which will include all the members of the data domain.
- A folder under the analytics folder, named after the data domain, which will group all resources associated with the domain. By default, this folder will contain two GCP projects:
  - lakehouse-proxy: designed to host all the GCP assets (BigQuery tables, GCS buckets / BigLake tables) that make up the data products.
  - orchestrator-proxy: designed to host the deployment of the data product orchestration code. It will contain a GKE Autopilot cluster with all the necessary configuration to integrate with the Dagster control plane.
- A Dataplex lake with default permissions assigned to the data domain (linked to the associated GCP group).
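As a minimal sketch, assuming the lakehouse project and region used earlier, the domain’s lake could be created with the Dataplex Python client (package google-cloud-dataplex); names are illustrative.

```python
# Sketch only: create the Dataplex lake materializing a data domain.
# Project, location, and domain names are hypothetical.
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

operation = client.create_lake(
    parent="projects/alchemesh-lakehouse/locations/europe-west1",
    lake_id="customer-domain",
    lake=dataplex_v1.Lake(
        display_name="Customer domain",
        description="Data domain lake managed by Alchemesh",
    ),
)
lake = operation.result()  # create_lake returns a long-running operation
print(lake.name)
```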
Technical team
A technical team will be provisioned with the following resources:
- A GCP group, which will include all members of the technical team.

For data product developer teams, we will additionally provision, within the domain:
- An infrastructure GCP project: This project will host all the transformations associated with a data product (orchestrated by the orchestrator), as well as the infrastructure components required to execute these transformations, such as:
  - Dataproc clusters
  - Internal storage elements for the transformations (e.g., GCS buckets, intermediate BigQuery tables, etc.)
  - Other required resources.
- A dedicated namespace on the orchestrator’s GKE cluster, with a Dagster “agent” acting as an intermediary with the control plane.
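A rough sketch of the namespace provisioning, using the official Kubernetes Python client; the namespace name is illustrative, and the Dagster agent would then be deployed into it (for example via Helm).

```python
# Sketch only: create the team's dedicated namespace on the orchestrator cluster.
from kubernetes import client, config

config.load_kube_config()  # assumes credentials for the orchestrator GKE cluster
client.CoreV1Api().create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name="team-customer-analytics"))
)
```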
Data product
A data product is materialized as follows:
- Dataplex zone: Used to represent a data product.
- Dataplex assets: Associated with the zone corresponding to the data product and used to declare the output ports of the data product.
- Dataplex entities: Linked to the assets (i.e., output ports), representing the implementation of our data product models for each given output port. Data taxonomy attributes will be associated with these entities to qualify the columns.
- Service account: Dedicated to the data product and used by all workloads associated with it (Dagster orchestrator, transformations, etc.).
- Dagster code location: Contains the code used to orchestrate and implement the materialization of the data product’s output port entities.
- Provisioning of the necessary infrastructure: All infrastructure required for the data product to deliver value will be set up in the infrastructure project associated with the data product developer team that owns the data product.
- Tags (based on the tag templates): Attached to the entities to manage their status (e.g., freshness, data quality, etc.).
- Permissions on the Dataplex zone: Granted to the technical team (associated GCP group) that owns the data product.
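Concretely, the zone and asset part of this list could look like the following sketch with the Dataplex client; all identifiers are illustrative, and permissions, tags, and the Dagster code location are handled separately.

```python
# Sketch only: materialize a data product as a Dataplex zone and declare one
# output port as an asset backed by a BigQuery dataset. Names are hypothetical.
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
lake_name = "projects/alchemesh-lakehouse/locations/europe-west1/lakes/customer-domain"

# The zone represents the data product inside its domain's lake.
zone = client.create_zone(
    parent=lake_name,
    zone_id="customer-orders",
    zone=dataplex_v1.Zone(
        type_=dataplex_v1.Zone.Type.CURATED,
        resource_spec=dataplex_v1.Zone.ResourceSpec(
            location_type=dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION
        ),
    ),
).result()

# Each output port is declared as an asset attached to the zone.
output_port = client.create_asset(
    parent=zone.name,
    asset_id="orders-by-day",
    asset=dataplex_v1.Asset(
        resource_spec=dataplex_v1.Asset.ResourceSpec(
            type_=dataplex_v1.Asset.ResourceSpec.Type.BIGQUERY_DATASET,
            name="projects/customer-domain-lakehouse-proxy/datasets/orders_by_day",
        )
    ),
).result()
print(zone.name, output_port.name)
```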
Data contract access request
We will adapt the data contract access request concept to define the following information:
- Subject: Identifies who will be authorized.
  - Data domain: Associated GCP group.
  - Technical team: Associated GCP group.
  - Nominative user: Associated GCP identity (email).
  - Data product: Identity of the associated GCP service account.
- Purpose: Must align with a predefined set of purposes (for example, a list of purposes defined in the organization’s consent strategy).
- Column-level condition (optional): A subset of the available columns that the subject wishes to access.
- Row-level condition (optional): A query condition that narrows the scope of the rows the subject wants to access (for example, a date range or a specific location).
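As a rough sketch, the adapted concept could be modeled along these lines; the field names, enum values, and example values are illustrative, not the final Alchemesh schema.

```python
# Sketch only: a possible model for the adapted data contract access request.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SubjectType(Enum):
    DATA_DOMAIN = "data_domain"        # associated GCP group
    TECHNICAL_TEAM = "technical_team"  # associated GCP group
    USER = "user"                      # GCP identity (email)
    DATA_PRODUCT = "data_product"      # GCP service account identity


@dataclass
class DataContractAccessRequest:
    subject_type: SubjectType
    subject_identity: str                # group email, user email, or service account
    purpose: str                         # must match a predefined purpose
    columns: Optional[list[str]] = None  # column-level condition
    row_condition: Optional[str] = None  # e.g. "country = 'FR' AND order_date >= '2024-01-01'"


request = DataContractAccessRequest(
    subject_type=SubjectType.DATA_PRODUCT,
    subject_identity="dp-marketing@alchemesh-workload-identity.iam.gserviceaccount.com",
    purpose="marketing_analytics",
    columns=["order_id", "order_date", "country"],
    row_condition="country = 'FR'",
)
```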
This definition will allow for the provisioning of a validated access request with the following resources:
- Access to the output port (i.e., Dataplex asset).
- Data masking rules, where applicable.
- A column-level access rule, if defined.
- Row-level access rules, to filter rows according to the consent purpose and to the custom conditions defined in the access request.
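For instance, the row-level part could be provisioned as a BigQuery row access policy. A minimal sketch, assuming the illustrative names used earlier and a hypothetical consented_purposes column on the output port table:

```python
# Sketch only: provision a row-level access rule for a validated access request
# as a BigQuery row access policy. Table, grantee, and columns are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="customer-domain-lakehouse-proxy")

ddl = """
CREATE OR REPLACE ROW ACCESS POLICY marketing_fr_only
ON `customer-domain-lakehouse-proxy.orders_by_day.orders`
GRANT TO ('serviceAccount:dp-marketing@alchemesh-workload-identity.iam.gserviceaccount.com')
FILTER USING ('marketing_analytics' IN UNNEST(consented_purposes) AND country = 'FR')
"""
client.query(ddl).result()
```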
With this, we have a clear definition of what we want to implement for our Data Mesh on GCP, as well as a clear understanding of the association between Alchemesh concepts and GCP resources!
We are now ready to move forward with the implementation of our platform components and attempt to deploy our Data Mesh!