The 4 Data Mesh Principles to Create a Data-Oriented R&D
Implementing data mesh principles to scale the data layer of your organization.
At Yotpo, we build a platform serving e-commerce organizations that includes multiple products, such as Reviews and Ratings, Loyalty, Referrals, Visual Marketing, and SMS Marketing.
Working in a multi-domain, microservice-oriented organization can create a maze-like architecture that complicates product development. New data sources are constantly created as part of engineers’ routine work.
Let’s take the example of a developer at Yotpo implementing a usage summary report across the different product lines. It’s a classic use-case for implementing a data pipeline to aggregate across multiple sources.
Now the developer, who may not be familiar with the organization’s data architecture, has to face new kinds of problems:
Are all the required data sources available?
Where are they stored?
How do we set up the ETL process to execute periodically?
How do we monitor it?
All these gaps may result in heavy reliance on the data platform engineering team, which can cause delivery delays, a weakened sense of ownership, and poor product quality, since the data engineers lack the domain understanding of the data itself.
So it seems there’s a tradeoff: either rely on an external, siloed team of data experts to leverage the power of the data tools, or stay completely independent but stripped of the advantages of the data platform.
To solve this dilemma, we need a change in approach. Zhamak Dehghani from ThoughtWorks introduced the concept of “Data Mesh”, one of the new fundamentals of data engineering in the third decade of the 21st century.
The basic idea of Data Mesh is to decentralize the monolithic approach to data that’s reflected by a single data lake/data warehouse and a single data group, by having the developer teams consider data as a product they serve to the organization.
If we want to move ownership of data pipelines and data assets to the dev teams, we have to empower them to use data products.
We can take an example from another field that recently shifted ownership from a specialized team to the generalist developer: DevOps, where the responsibility of deploying a service in production has gradually moved to generalist developers and SREs.
In this blog post, we will discuss some of the practices we use in Yotpo to deliver the data tools as a self-serve data infrastructure, and how we labeled data as a first-class citizen in our tech-stack.
If you are an engineer or a product manager who wants to scale your organization’s data infrastructure, this blog post applies to you.
These are the key principles that a self-service/scalable/distributed data platform relies on:
The multi-domain organization generates many data sources. Some of them might be similar in nature, such as a users table that exists in various products. The common monolithic organizational data architecture contradicts the common distributed application architecture: we usually create a single data lake, with various data sources curated from different domains. This can make for an overwhelming and confusing experience for the data platform user.
The most fundamental way to solve this pain is a data catalog. Tools such as AWS Glue Data Catalog or the open-source Hive Metastore provide a simple mechanism to manage the metadata of tables that would otherwise be lost in the hierarchical structure of the data lake.
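The core idea of a data catalog can be sketched as a mapping from logical table names to physical metadata, so consumers never hard-code lake paths. A minimal illustration in Python (names and structure here are hypothetical; real metastores like Hive or Glue track far more, e.g. partitions and serialization info):

```python
# Toy sketch of a data catalog: logical table names resolve to
# physical metadata. Hypothetical structure for illustration only.

catalog = {}

def register_table(db, table, location, schema):
    """Register a table's physical location and schema under a logical name."""
    catalog[f"{db}.{table}"] = {"location": location, "schema": schema}

def lookup(db, table):
    """Resolve a logical name to its metadata, as a query engine would."""
    return catalog[f"{db}.{table}"]

register_table(
    "reviews", "users",
    location="s3://lake/reviews/users/",
    schema={"user_id": "bigint", "email": "string"},
)

print(lookup("reviews", "users")["location"])  # s3://lake/reviews/users/
```

The point of the indirection is that a pipeline can query `reviews.users` without knowing, or breaking on, the physical layout of the lake.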
Another super-useful set of tools is data exploration tools, such as Apache Atlas or Lyft’s Amundsen. These tools provide an index of the different tables, adding the ability to categorize, document, and track lineage of the data. More importantly, they enable data governance, which is a mandatory practice when establishing self-service data infrastructure. It’s also a required step in order to comply with the industry’s privacy standards.
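At its simplest, the lineage tracking these tools provide boils down to a directed graph of datasets and the tables derived from them. A toy sketch (hypothetical model; Atlas and Amundsen track columns, owners, jobs, and much more):

```python
# Toy sketch of table-level lineage as a directed graph:
# edges point from a derived table back to its source tables.
# Hypothetical model for illustration only.
from collections import defaultdict

upstream = defaultdict(set)

def record_lineage(target, sources):
    """Record that `target` is derived from the given source tables."""
    upstream[target].update(sources)

def all_upstream(table, seen=None):
    """Walk the graph to find every table a given table depends on."""
    seen = set() if seen is None else seen
    for src in upstream[table]:
        if src not in seen:
            seen.add(src)
            all_upstream(src, seen)
    return seen

record_lineage("usage_summary", {"reviews.events", "loyalty.events"})
record_lineage("reviews.events", {"reviews.raw"})

print(sorted(all_upstream("usage_summary")))
# ['loyalty.events', 'reviews.events', 'reviews.raw']
```

Walking such a graph is exactly what makes governance questions answerable: which pipelines break if this table changes, and which raw sources feed this report.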
At Yotpo, we use Hive Metastore, which is easily integrated with Spark, Airflow, EMR, Redshift and Databricks. For data exploration and governance, we use Apache Atlas along with Apache Ranger.
If we want our developers to ease into owning and maintaining the data in their domain, we must make the use of data tools painless and simple. To do so, we present a straightforward solution for each of the following:
We want our developers to be acquainted with the data. To do so, we must offer a scalable, performant and easy to use query engine that will be accessible for all users via IDE, SQL client integration or any other simple interface.
We must make sure the pipelines use a standard, simple-to-use ETL tool that suits the organization’s sources and targets. The chosen ETL tool must be intuitive, reliable, and integrate with all the critical data sources, in order for it to be well adopted among non-data developers.
At Yotpo, we use Metorikku, an in-house open source development. Metorikku generalizes Spark batch and streaming jobs, using a simple descriptive YAML file with SQL steps.
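The pattern Metorikku generalizes — a pipeline described as ordered, named SQL steps over registered inputs — can be illustrated with a toy runner over an in-memory SQLite database. This is a sketch of the pattern only, not Metorikku’s actual configuration schema (real Metorikku jobs are YAML files executed by Spark):

```python
# Toy illustration of the "ETL as declarative SQL steps" pattern.
# Step names and structure are made up for the sketch; Metorikku's
# real jobs are YAML configurations run on Spark.
import sqlite3

steps = [
    {"name": "active_users",
     "sql": "CREATE TABLE active_users AS "
            "SELECT user_id FROM events WHERE action = 'login'"},
    {"name": "summary",
     "sql": "CREATE TABLE summary AS "
            "SELECT COUNT(*) AS n FROM active_users"},
]

def run_pipeline(conn, steps):
    """Execute each declared SQL step in order, materializing its result."""
    for step in steps:
        conn.execute(step["sql"])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "login"), (2, "login"), (3, "view")])
run_pipeline(conn, steps)
print(conn.execute("SELECT n FROM summary").fetchone()[0])  # 2
```

Because each step is plain SQL, a developer who knows nothing about Spark internals can still read, review, and own the pipeline — which is precisely why this style is adoptable by non-data developers.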
Scheduling and Workflow Management
This is an inseparable part of executing ETLs. To enable developers to manage their jobs, we must offer generators for building pipelines. The generators must use the standard data stack and require only a small set of variables to operate.
At Yotpo we use Apache Airflow to schedule and manage data pipelines. In addition, we are using custom generators that support our standards, so that any developer, even one who has no Python experience, can effortlessly create and maintain new DAGs.
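A generator of this kind can be as simple as rendering a DAG file from a handful of variables. A simplified, hypothetical sketch (the template and command below are illustrative, not our actual generators; only the templating runs here — executing the result would require Airflow):

```python
# Hypothetical sketch of a DAG generator: a few variables in, a
# ready-to-deploy Airflow DAG file out. Plain templating over the
# standard data stack; real generators would add validation, alerts, etc.
from string import Template

DAG_TEMPLATE = Template('''\
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG("$dag_id", start_date=datetime(2021, 1, 1),
         schedule_interval="$schedule", catchup=False) as dag:
    run_etl = BashOperator(
        task_id="run_etl",
        bash_command="run_metorikku_job $config_path",  # illustrative command
    )
''')

def generate_dag(dag_id, schedule, config_path):
    """Render a DAG file from the minimal set of variables a developer supplies."""
    return DAG_TEMPLATE.substitute(
        dag_id=dag_id, schedule=schedule, config_path=config_path)

dag_source = generate_dag("usage_summary", "@daily", "jobs/usage_summary.yaml")
print(dag_source)
```

The developer supplies three values and never touches Python; the generator guarantees the DAG follows the organization’s standards.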
It’s advised to avoid a monolithic approach and to distribute the pipelines physically or logically, using different deployments or permissions.
The next step of the evolution of data engineering comes hand-in-hand with a mature and robust infrastructure.
To successfully implement data mesh architecture, we must set high standards for the following:
Raw and materialized data must always be available. A single table can be used in multiple pipelines that all rely on its availability. Historically, availability has been handled with a combination of immutable table versions and a data catalog. The more modern approach handles updates as ACID events.
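The combination of immutable versions and a catalog can be sketched as a pointer swap: a writer publishes a complete new snapshot, then atomically flips the catalog entry, so readers always see a consistent table. A simplified model (hypothetical names; formats like Hudi implement this at the file and commit level):

```python
# Simplified model of availability via immutable table versions:
# a writer publishes a full new snapshot, then atomically swaps the
# catalog pointer. Readers never observe a half-written table.
# Hypothetical sketch; ACID table formats do this at file level.

snapshots = {}   # (table, version) -> immutable snapshot data
catalog = {}     # table -> currently published version

def publish(table, version, rows):
    snapshots[(table, version)] = tuple(rows)  # write new immutable snapshot
    catalog[table] = version                   # atomic pointer flip

def read(table):
    """Readers always resolve through the catalog pointer."""
    return snapshots[(table, catalog[table])]

publish("users", 1, [{"id": 1}])
publish("users", 2, [{"id": 1}, {"id": 2}])  # readers switch only once complete
print(len(read("users")))  # 2
```

Old versions remain intact, so a pipeline that started reading version 1 is unaffected by the publication of version 2.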
At Yotpo, we use Apache Hudi to manage and update the Parquet files in our data lake.
One of the major advancements in recent years is the use of real-time data processing, enabled by frameworks such as Apache Flink and Spark Structured Streaming. A distributed architecture requires highly durable data flows across multiple domains.
At Yotpo, we’ve pushed this principle forward by creating a zero-latency data lake using Change Data Capture and Apache Hudi.
Modern infrastructure is heavily reliant on message brokers, such as Kafka, or orchestration platforms, such as K8S. We must be certain these are mature and stable enough as a shared infrastructure. It’s important to support auto-scaling, DR procedures, and present out-of-the-box monitoring.
In order to create the paradigm shift, we must incentivize our developers to adopt the data tools. This will inevitably create more domain-specific data across the data-mesh, and get developers accustomed to data quality standards.
Let’s take for example our implementation of Change Data Capture (CDC) using Debezium.
Enabling CDC over the different microservice databases helps create complex distributed workflows, without having to write tedious, repetitive PubSub producers.
After we incorporated CDC into our infrastructure, we realized how high the demand for it was across our R&D. This alone naturally exposed our developers to a world of data products.
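On the consuming side, handling Debezium-style change events mostly comes down to dispatching on the operation type. A minimal sketch (the envelope of `op`/`before`/`after` follows Debezium’s documented event format; the handler, table, and fields are made up for illustration):

```python
# Minimal sketch of consuming Debezium-style change events into a toy
# materialized view. The op/before/after envelope follows Debezium's
# documented format; the handler logic and fields are illustrative.
import json

state = {}  # primary key -> current row

def apply_change(event):
    payload = json.loads(event)["payload"]
    op, before, after = payload["op"], payload["before"], payload["after"]
    if op in ("c", "r", "u"):      # create / snapshot read / update
        state[after["id"]] = after
    elif op == "d":                # delete
        state.pop(before["id"], None)

apply_change(json.dumps(
    {"payload": {"op": "c", "before": None,
                 "after": {"id": 1, "plan": "free"}}}))
apply_change(json.dumps(
    {"payload": {"op": "u", "before": {"id": 1, "plan": "free"},
                 "after": {"id": 1, "plan": "pro"}}}))
print(state[1]["plan"])  # pro
```

A single generic consumer like this can materialize any microservice’s table downstream, which is exactly what spares teams from writing per-service producers.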
Developing and incorporating these kinds of solutions is one of the best ways to motivate software engineers to equip their stack with data tools.
Building a data platform is a journey, with many milestones along the way. The platform is constantly evolving, use cases change, and the need to withstand growing scale and complexity keeps rising. Following these guidelines will ensure you are on the right path, regardless of the current technology trend.