Build Data Product in Data Mesh using Templates
Data Mesh Overview
Several data architecture patterns have evolved in the data analytics space over the last few decades.
The first was the Data Warehouse, which emerged in the late 1980s and is still a popular architecture pattern. Then came the Data Lake in the late 2000s, when the problem of Big Data appeared and analytical systems were required to handle high velocity, volume and variety of data. With the advent of cloud computing, we saw Cloud Data Platforms, or Modern Data Platforms, which became popular in the mid-2010s. Then came the Data Lakehouse, a compromise that attempts to bring together the strengths of both models, i.e. the Data Warehouse and the Data Lake.
The most recent is the Data Mesh architecture, a domain-oriented and decentralized pattern. A data mesh is an architectural and organizational framework which treats data as a product. In this pattern, data products are developed by the teams that best understand the data, following an organization-wide set of data governance standards. Once data products are deployed to the data mesh, distributed teams in an organization can discover and access the data relevant to their needs more quickly and efficiently. This article focuses on building data products using templates or recipes, but before we get there, let’s understand the role of the various archetypes in a Data Mesh.
Data Mesh Team Archetypes
A number of archetypes or functions can exist in a Data Mesh, but the following are the most common:
- Data Producers Team: A domain team aligned with a business unit and responsible for creating and maintaining data products that generate business value.
- Data Consumers Team: A domain team aligned with a business unit and responsible for consuming data products for various analytic applications. They can also use these data products to create data products of their own.
- Data Governance Team: They are responsible for defining mesh-wide data policies and standards to facilitate data interoperability and ensure data protection.
- Data Platform/Enablement Team: They are responsible for providing self-service data infrastructure platforms and templates/recipes that help autonomous domain teams develop, run and manage data products. Domain teams use these components to build and deploy their data products. The data platform team also promotes best practices and introduces tools and methodologies which reduce the cognitive load on distributed teams when they adopt new technology.
Challenges for Domain Teams
In the Data Mesh world, domain teams have to be cross-functional, responsible for both application products and data products. However, domain teams typically do not have deep data engineering expertise: before Data Mesh adoption they relied on the central data engineering team and so never built up those data skills. A ready-made template or recipe can therefore be very useful. Let’s discuss the benefits of templates in more detail in the next section.
Benefits of Templates
Templates are crucial for the adoption of the Data Mesh architecture pattern. They have mostly been overlooked by organizations, but they are now gaining traction because of the following benefits:
- Data Producers: The most important benefit of templates is that they help domain teams, who typically do not have much data engineering expertise, to onboard their data by bridging technical gaps.
- Speed-to-market: Templates speed up the data product development process. A domain team can create a data product very quickly by cloning the code, providing a few configuration details and executing it.
- Data Products: Data products are consistent and follow the same standards throughout the organization. This consistency is also important for ensuring that data products are interoperable and that consumers can use them appropriately.
- Data Consumers: Data products built by multiple domain teams using standard templates provide a harmonious, well-defined experience for their consumers, which makes consumption and exploration convenient.
Template or Recipe in Data Mesh
Various kinds of templates can accelerate and standardize the build and consumption process in a Data Mesh. In this article we focus on Data Pipeline Templates built in the GCP environment.
Data Pipeline Templates: These templates can be built using various data analytics services such as Dataflow, Dataproc, BigQuery, Pub/Sub etc., and can be used to set up the required source, target and data ingestion pipelines for building data products. Examples include a Cloud Spanner to BigQuery template, a GCS to BigQuery template and a Pub/Sub to BigQuery template.
Please refer to Data Mesh Self Service — Ingestion Pattern from Spanner to BigQuery, where Sanchit Malhotra explains the Spanner to BigQuery data ingestion pattern and a self-service template for it, which different domains can use to build data products in BigQuery.
Other kinds of templates, which are not discussed in this article, include:
- Infrastructure Provisioning Templates.
- Data Pipeline Orchestration Templates.
- Data Product Discoverability and Marketplace Templates.
- Data Observability and Control plane Templates.
Reference Architecture in GCP
Below is a reference architecture in GCP for building a Data Product using Data Pipeline Templates.
The Data Platform/Enablement team publishes production templates, comprising patterns and recipes, in a git repository. The domain team discovers the templates/recipes made available by the Data Enablement team in a self-service environment and identifies the template(s) that meet their needs. The domain team then copies the relevant Data Pipeline Templates into their project and auto-builds their ingestion pipeline after updating the required settings in the configuration file.
*Please note that all these templates are custom and have to be built as per the organization’s requirements.
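As a rough illustration of the configuration file mentioned above, the sketch below shows how a template might load and validate the settings a domain team provides before building a pipeline. Everything in it is an assumption made for illustration: the key names (`template`, `source_uri`, `target_table`, `schedule`), the example values and the `load_config` helper are hypothetical, and a real template would define its own settings.

```python
import json
from dataclasses import dataclass, fields

# Hypothetical configuration a domain team might fill in. Every key and
# value here is an assumption for illustration, not a real template's schema.
EXAMPLE_CONFIG = """
{
  "template": "gcs_to_bigquery",
  "source_uri": "gs://example-domain-bucket/orders/*.csv",
  "target_table": "example-project.sales.orders"
}
"""

REQUIRED_KEYS = ("template", "source_uri", "target_table")


@dataclass(frozen=True)
class PipelineConfig:
    template: str
    source_uri: str
    target_table: str
    schedule: str = "0 0 * * *"  # assumed default: run daily at midnight


def load_config(raw: str) -> PipelineConfig:
    """Parse the JSON configuration and fail early on obvious mistakes."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise ValueError(f"missing required settings: {', '.join(missing)}")
    # BigQuery tables are addressed as project.dataset.table.
    if data["target_table"].count(".") != 2:
        raise ValueError("target_table must look like project.dataset.table")
    # Ignore keys the template does not know about.
    known = {f.name for f in fields(PipelineConfig)}
    return PipelineConfig(**{k: v for k, v in data.items() if k in known})


config = load_config(EXAMPLE_CONFIG)
print(config)
```

Failing fast on a missing or malformed setting keeps the feedback loop short for domain teams that are new to data engineering.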
Building Data Products using Templates
A typical user experience for a domain team using these templates would be as follows:
- Explore and find the right template for their use case. Ideally this is done through a self-service platform, but it might also require consulting the data platform/enablement team to verify the choice.
- Clone the required template, which could be stored in a version control tool such as git.
- Update the configuration as required and deploy the solution in their environment.
- Verify the data pipeline that was built using the template.
- Customize the auto-built solution if required.
- Publish the data product for the relevant data consumers.
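To make the "update the configuration and deploy" step a little more concrete, here is a minimal sketch of how a cloned template might turn a domain team's settings into a pipeline definition. It uses Python's standard `string.Template` for placeholder substitution. The statement text is loosely modeled on a BigQuery load statement and is purely illustrative; the placeholder names, the `render_pipeline` helper and the example values are all assumptions, and a real template might instead generate a Dataflow or Dataproc job.

```python
from string import Template

# Hypothetical pipeline definition that a "GCS to BigQuery" template might
# ship. The placeholders ($target_table, $source_uri) are assumed names.
PIPELINE_DEFINITION = Template(
    "LOAD DATA INTO `$target_table`\n"
    "FROM FILES (format = 'CSV', uris = ['$source_uri']);"
)


def render_pipeline(settings: dict) -> str:
    """Fill the template's placeholders with the domain team's settings."""
    try:
        return PIPELINE_DEFINITION.substitute(settings)
    except KeyError as err:
        # substitute() raises KeyError when a placeholder has no value.
        raise ValueError(f"configuration is missing setting {err}") from None


# Example settings a domain team might provide (illustrative values only).
settings = {
    "target_table": "example-project.sales.orders",
    "source_uri": "gs://example-domain-bucket/orders/*.csv",
}
rendered = render_pipeline(settings)
print(rendered)
```

Because `substitute` refuses to render when a placeholder has no value, an incomplete configuration is caught before anything is deployed, which supports the verification step in the list above.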
Templates Best Practices
- Templates should be customizable; one size doesn’t fit all.
- Templates should be enhanced and updated periodically.
- Templates should have proper documentation that explains which template suits which use case and includes a getting-started guide for domain teams.
- Templates should be built following organizational security guardrails.
- The Data Platform team should provide mechanisms to collect feedback from domain teams and act on it wherever possible. For example, suppose a data ingestion template supports the CDC (change data capture) process using a date column only, but some domains want to drive it with a sequence number column. This should go to the Data Platform team as a feature enhancement request so that the template can be improved for a wider audience.
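The CDC example in the last point can be sketched as follows. This is a simplified, in-memory illustration of watermark-based incremental extraction, not the behavior of any real template: the function name `extract_changes`, the column names and the sample rows are all hypothetical, and this approach does not capture deleted rows. The point is that making the change-tracking column a parameter lets one template serve both the date-column and the sequence-number domains.

```python
from datetime import date
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def extract_changes(
    rows: List[Row], watermark_column: str, last_watermark: Any
) -> Tuple[List[Row], Any]:
    """Return the rows changed since last_watermark and the new watermark."""
    changed = [row for row in rows if row[watermark_column] > last_watermark]
    # If nothing changed, keep the previous watermark.
    new_watermark = max(
        (row[watermark_column] for row in changed), default=last_watermark
    )
    return changed, new_watermark


# Hypothetical source rows carrying both kinds of change marker.
ROWS = [
    {"order_id": "A", "updated_on": date(2024, 1, 1), "seq_no": 101},
    {"order_id": "B", "updated_on": date(2024, 1, 2), "seq_no": 102},
    {"order_id": "C", "updated_on": date(2024, 1, 3), "seq_no": 103},
]

# One domain tracks changes by date ...
by_date, next_date = extract_changes(ROWS, "updated_on", date(2024, 1, 1))
# ... while another uses a sequence number with the same template code.
by_seq, next_seq = extract_changes(ROWS, "seq_no", 102)

print([row["order_id"] for row in by_date], next_date)
print([row["order_id"] for row in by_seq], next_seq)
```

In a real template the column name would most likely come from the configuration file rather than from a function argument.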
Summary
In this article we saw the need for and the benefits of using templates in a Data Mesh architecture. We also covered how templates can help in building data products seamlessly in GCP, along with some best practices for templates.
References:
For more information on how to build a Data Mesh on Google Cloud, please see the Google Cloud Data Mesh Whitepaper.
Hopefully this article helps you understand a few key concepts around Data Mesh and templates. Please share your feedback, and share the article if you find it useful. Thanks for reading!
Disclosure: The opinions expressed in this article are mine alone and do not necessarily reflect the views of my clients or Google.