How a Data Mesh pattern can solve data lake bottlenecks

Mani Kolbe
Contino Engineering
9 min read · May 31, 2022


A decentralised approach to data architecture for large enterprises.

Most companies today are sitting on a goldmine of information generated by their users, products and services. Data is the by-product of every digital action we take. There is no question that this is invaluable for improving decision making, accelerating innovation and maximising customer experience. Yet the sheer size of large organisations makes it hard to realise the full potential of this data: siloed data, non-standardised skillsets, misaligned teams and the cost of proprietary data warehouses are just some of the obstacles.

Data lakes have grown popular as a way to address these challenges. A data lake is a central location that allows you to store all of your data at scale. That data can then be used for reporting, advanced analytics, machine learning and more.

The data lake architecture addresses many of the challenges mentioned above. It brings the data to a central location and standardises the data skillsets. It allows you to enforce company-wide policies. A data lake scales horizontally to fulfil almost any need at a reasonable cost.

However, as time passes, the central data lake team tends to become a bottleneck. It has to maintain ever-growing data storage, ensure the reliability of data pipelines and answer analytical questions from the business. Business units that could create value from the data must now go through the central data lake team, which has no real business knowledge of the data it looks after. This eventually leads to a lack of trust and an inability to leverage the data assets.

What can we do here? Enter the Data Mesh!

What is a Data Mesh?

A data mesh is a decentralised approach to data architecture. It is a new way of sourcing, managing, and accessing data for analytics use cases at scale. In a data mesh, data doesn’t sit together in a centralised pool. Instead, it is broken down into distinct ‘data products’ that are owned and managed by the domain teams closest to them.

This concept was originally introduced by Zhamak Dehghani, a principal technology consultant at ThoughtWorks. According to Zhamak, the data mesh is a “sociotechnical approach to share, access and manage analytical data in complex and large-scale environments — within or across organisations”. By sociotechnical here, we mean combining an understanding of how people and data in an organisation are organised with a technology architecture that complements these structures. This is similar to the concept of microservices in the software world. Data mesh architecture considers each group of human experts that manages a particular set of data as a “domain”. These domains are responsible for producing and maintaining “data products” that are then consumed by anyone inside or outside the organisation in a self-serve manner.

Let's consider the example of an e-commerce company. The fulfilment team generates the order history and shipment data. The marketing team relies on this data to measure the success of their campaigns. Here, the fulfilment team is the data producer and the marketing team is the consumer. In a data mesh model, the fulfilment team takes ownership of producing a user-friendly dataset instead of leaving that to a centralised data lake team that doesn't understand the domain. The dataset is made available to the rest of the company by registering it in a central data registry, and anyone interested can access it in a self-service manner.

The Four Foundational Principles

Domain Ownership

Organisations today are decomposed based on their business domains. Such decomposition localises the impact of continuous change and evolution, for the most part, to a domain.

A data mesh gives the data sharing responsibility to each of the business domains. Each domain becomes responsible for the data it is most familiar with: the domain that is the first-class user of this data or is in control of its point of origin.

When data is owned and controlled by the teams closest to it, the number of steps and handoffs between data producers and consumers is reduced. This enables agility by reducing cross-team synchronisation and removing the centralised bottleneck of data teams, warehouses and lake architecture.

This is a domain-driven design (DDD) approach to data architecture.

Data as a Product

Operational teams usually perceive their data as a by-product of running the business, leaving it to someone else, e.g. the data team, to pick it up and recycle it into useful datasets.

In contrast, data mesh domain teams apply product thinking to their data, striving for the best possible data user experience. The aspiration is to remove friction and make data as user-friendly as any other consumer product. Data-as-a-product means that the consumers of that data are treated as customers to be kept happy.

In the e-commerce example earlier, the fulfilment team needs to make sure the data it publishes is meaningful and user-friendly to its consumers. This includes adding documentation describing the data, providing sample queries for common use cases, and gathering feedback from consumers.
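
As an illustration, product thinking can be made concrete with a machine-readable descriptor published alongside the dataset. The structure and field names below are hypothetical, purely to show the kind of information a data product might carry:

```python
# A hypothetical data product descriptor the fulfilment team could publish
# alongside the dataset. Field names are illustrative, not a standard schema.
order_history_product = {
    "name": "order_history",
    "domain": "fulfilment",
    "owner": "fulfilment-data-products@example.com",
    "description": "One row per order, including shipment status and timestamps.",
    "location": "s3://fulfilment-data-products/order_history/",
    "schema": [
        {"column": "order_id", "type": "string"},
        {"column": "order_ts", "type": "timestamp"},
        {"column": "shipment_status", "type": "string"},
    ],
    "sample_queries": [
        "SELECT shipment_status, COUNT(*) FROM order_history GROUP BY shipment_status",
    ],
    "sla": {"freshness": "updated daily by 06:00 UTC"},
    "feedback_channel": "#fulfilment-data (Slack)",
}
```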

The Self-Service Data Platform

The data mesh calls for a dedicated data platform team that is domain-agnostic and focused on building the platform tooling and infrastructure. This team provides a product of its own; the consumers of that product are the data product developers in the domain teams.

The infrastructure for the data mesh (compute, storage, security, cloud services etc.) is fully or partially controlled by this platform team. Organisations can decide on the level of control for the platform team. For example, the data platform team may not even host the infrastructure directly, but rather provide an infrastructure-as-code framework (e.g. Terraform modules) for provisioning infrastructure that adheres to the global policies.
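
The article mentions Terraform modules as one option; as a rough analogue in Python, here is a minimal sketch of a platform-owned AWS CDK stack that bakes global policies into domain infrastructure. The stack, bucket and database names are illustrative, not prescribed by the data mesh:

```python
# Minimal sketch of a platform-provided infrastructure template (AWS CDK v2, Python).
# The platform team maintains this stack; domain teams instantiate it with their
# domain name. Resource names and policy choices are illustrative.
from aws_cdk import App, Stack, aws_s3 as s3, aws_glue as glue
from constructs import Construct


class DomainDataProductStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, *, domain_name: str, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # Global policies (encryption, versioning, no public access) are enforced
        # here, so every domain inherits them automatically.
        s3.Bucket(
            self,
            "DataProductBucket",
            bucket_name=f"{domain_name}-data-products",
            encryption=s3.BucketEncryption.S3_MANAGED,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            versioned=True,
        )

        # A Glue database per domain keeps data products catalogued from day one.
        glue.CfnDatabase(
            self,
            "DataProductDatabase",
            catalog_id=self.account,
            database_input=glue.CfnDatabase.DatabaseInputProperty(
                name=f"{domain_name}_data_products"
            ),
        )


app = App()
DomainDataProductStack(app, "FulfilmentDataProducts", domain_name="fulfilment")
app.synth()
```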

The goal is to allow domain teams to own the code, data and policy that make up a data product and to focus on creating business value from data, rather than getting bogged down in the maintenance of tooling or infrastructure.

The data platform must provide capabilities including (but not limited to) the following:

  • Registry and self-service portal
  • Monitoring, logging, alerting
  • Observability
  • Scalability
  • Cost structure
  • Data storage
  • Unified access control
  • Auditing
  • Federated Identity Management
  • Connectivity with analytical engines
  • Usability (e.g. SQL)

A typical data mesh design will have a central registry. Data producers can publish datasets to this registry and data consumers can search for data in it. Publishing a data product doesn't mean pushing a physical copy into a central place; it simply creates a pointer to the dataset in the registry.

This virtualisation of data (not the centralisation of data) is an important construct in the data mesh architecture. It allows data producers to use a federated platform with a common language, such as SQL, to provide a virtualised layer for “publishing” data products.

The “self-service” aspect gives domain teams automated means to operationalise their data products without manual, hand-crafted assistance from centralised data lake experts. The platform team should ensure that easy-to-use capabilities are in place for this: for example, a web portal or API for publishing and subscribing to data products and for configuring granular access. The implementation complexity is handled by the platform team and is not the concern of domain teams.
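
To make this concrete, here is a minimal sketch of what “publishing” could look like behind such an API, assuming the central registry is backed by the AWS Glue Data Catalog in a governance account. The account ID, names and columns are hypothetical:

```python
# Minimal sketch: "publishing" a data product by registering a pointer (a Glue
# table definition) in the central governance account's catalog. The data itself
# stays in the producer's S3 bucket. Account IDs, names and columns are illustrative.
import boto3

CENTRAL_CATALOG_ID = "111122223333"  # hypothetical central governance account

glue = boto3.client("glue")

glue.create_table(
    CatalogId=CENTRAL_CATALOG_ID,
    DatabaseName="fulfilment_data_products",
    TableInput={
        "Name": "order_history",
        "Description": "One row per order, owned by the fulfilment domain.",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "order_ts", "Type": "timestamp"},
                {"Name": "shipment_status", "Type": "string"},
            ],
            # Pointer to the producer-owned data; nothing is copied.
            "Location": "s3://fulfilment-data-products/order_history/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
        "TableType": "EXTERNAL_TABLE",
    },
)
```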

Federated Computational Governance

The fourth principle calls for a federated system. It counters an undesirable consequence of domain-oriented decentralisation: incompatibility and disconnection between domains. It allows global governance to be built: rules that apply to all data products and their interfaces.

Federated computational governance operating model

Governance is carried out by a cross-functional team of representatives from the domains, as well as platform experts and subject matter experts from security, compliance, legal, etc. Its objective is to ensure that secure, compliant, privacy-respecting and usable data is available across the board with managed risk.

This is enforced at the platform layer, ensuring standards are upheld without impacting flexibility or limiting how individual domains can use data. With data volumes scaling into petabytes, the question becomes how to enforce that governance; this is where the “computational” part comes into play: conformity to regulations and standards is automated.

For example, the platform team may deploy an AWS Lambda function that triggers whenever a new file is created. This function scans the file using machine learning and pattern matching techniques to identify and alert on personally identifiable information (PII).
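
A rough sketch of such a function, using Amazon Comprehend for ML-based PII detection and an SNS topic for alerts. The topic ARN and size handling are simplified assumptions; a real implementation would need to cope with large files and non-text formats:

```python
# Rough sketch of a governance Lambda: triggered by S3 object creation, it scans
# the new object for PII with Amazon Comprehend and raises an alert via SNS.
# Topic ARN, size handling and file formats are simplified for illustration.
import boto3

s3 = boto3.client("s3")
comprehend = boto3.client("comprehend")
sns = boto3.client("sns")

ALERT_TOPIC_ARN = "arn:aws:sns:eu-west-1:111122223333:data-governance-alerts"  # hypothetical
MAX_BYTES = 100_000  # DetectPiiEntities only accepts a limited amount of text


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read a sample of the new object (assumes UTF-8 text such as CSV or JSON lines).
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read(MAX_BYTES)
        text = body.decode("utf-8", errors="ignore")

        # Machine-learning based PII detection.
        entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]

        if entities:
            types = sorted({e["Type"] for e in entities})
            sns.publish(
                TopicArn=ALERT_TOPIC_ARN,
                Subject="Possible PII detected in data product",
                Message=f"s3://{bucket}/{key} may contain PII: {', '.join(types)}",
            )
```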

Reference Architecture on AWS

Let's look at how to implement a system on AWS that allows you to apply data mesh principles. The diagram below shows a high-level view of the architecture.

Data Mesh on AWS

As you can see, we have split the domains into separate AWS accounts. This is a common pattern in enterprises for giving autonomy to lines of business (domains). Each domain can be a data provider, a consumer, or both. Domains capture and store their data in S3 buckets, and this data is catalogued using AWS Glue.

There is a central governance account to which all other accounts are linked. This linkage is done using AWS Lake Formation. The central account stores only the metadata of the datasets exposed by producers; the data itself remains owned by the producers. AWS Lake Formation facilitates requesting and vending access to datasets, giving consumers self-service access to the data lake for a variety of analytics use cases.
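
Vending access from the central governance account can be as simple as a Lake Formation permission grant. A minimal boto3 sketch, with hypothetical account IDs and names:

```python
# Sketch: from the central governance account, grant a consumer account SELECT
# access to a shared table. Account IDs and names are illustrative.
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "444455556666"},  # consumer (marketing) account
    Resource={
        "Table": {
            "CatalogId": "111122223333",  # central governance account
            "DatabaseName": "fulfilment_data_products",
            "Name": "order_history",
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=["SELECT"],
)
```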

Consumers can search for data in the central catalogue and request access, which is granted by the data producers. Once access is granted, consumers can use the datasets with their choice of analytics and ML services, such as Amazon Redshift, Amazon Athena, Amazon EMR for Apache Spark and Amazon QuickSight. They can also join data from multiple domains.
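
A consumer could then join data across domains with a single SQL query, for example through Athena. A sketch with hypothetical database, table and bucket names:

```python
# Sketch: a consumer (marketing) joins data from two domains using Athena.
# Database, table and output bucket names are illustrative.
import boto3

athena = boto3.client("athena")

query = """
SELECT c.campaign_id,
       COUNT(o.order_id) AS attributed_orders
FROM marketing_data_products.campaign_clicks AS c
JOIN fulfilment_data_products.order_history AS o
  ON o.customer_id = c.customer_id
 AND o.order_ts BETWEEN c.click_ts AND c.click_ts + INTERVAL '7' DAY
GROUP BY c.campaign_id
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "marketing_data_products"},
    ResultConfiguration={"OutputLocation": "s3://marketing-athena-results/"},
)
```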

What if I already have a data lake — how do I adopt a data mesh?

In this case, we can take a slicing approach similar to migrating a monolithic application to microservices. The original data lake account is retained as the central account. Each domain is then moved into a separate account, one at a time, and the linkage is built using AWS Lake Formation. This is repeated until no datasets remain in the central account.
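
For one slice, the linkage might involve steps like the following, run in the appropriate accounts (ARNs, account IDs and names are hypothetical):

```python
# Sketch of one migration slice: register the domain's S3 location with
# Lake Formation in the central account, then give the domain account a
# resource link to the shared database. ARNs and names are illustrative.
import boto3

# In the central governance account: put the domain's bucket under
# Lake Formation management.
lakeformation = boto3.client("lakeformation")
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::fulfilment-data-products",
    UseServiceLinkedRole=True,
)

# In the domain (or consumer) account: create a resource link that points
# at the shared database in the central catalog, so local tools can query it.
glue = boto3.client("glue")
glue.create_database(
    DatabaseInput={
        "Name": "fulfilment_data_products_link",
        "TargetDatabase": {
            "CatalogId": "111122223333",  # central governance account
            "DatabaseName": "fulfilment_data_products",
        },
    }
)
```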

How can Contino help?

The data mesh is an emerging concept, and the industry is still creating tools and technologies around it. At the time of writing (2022), there is no such thing as a data mesh vendor and there are no off-the-shelf products, although many vendors claim their products support the data mesh concept. It is also a culture shift more than a technology adoption: you build a data mesh, you don't buy it.

This is where Contino can add value. We are a leading transformation consultancy that helps large, heavily-regulated enterprises to become fast, agile and competitive. We have many years of experience in building secure, scalable, cost-effective cloud data platforms and capabilities. We help enterprises to adopt game-changing approaches to delivering high-quality software at speed and scale.

These are also the essential ingredients to build a data mesh architecture.

The book “Data Mesh: Delivering Data-Driven Value at Scale” by Zhamak Dehghani is a great resource if you want to learn more. The ideas and images used in this article are credited to it.
