Moving from Data Lakes to Data Mesh: Why Companies will continue to Decentralize their Data (Part II)

Prashant Khanal
Published in DataReply
12 min read · Dec 12, 2022

Recap & Outlook

In the first blog post we described how and why organizations seek new solutions for storing and accessing data. We then presented the concept of a distributed Data Mesh, which replaces monolithic data platforms and signals a paradigm shift from both a technical and an organizational point of view.

With this article (the second in our mini-series) we hope to provide our readers with insights into the practicalities of working with a Data Mesh. The topics this article aims to cover in depth are:

  • Cost
  • Data Quality
  • Risks
  • Security
  • Performance
  • Adoption
  • Data Governance

1. Cost

Costs are typically classified into two types: hard costs and soft costs.

  • Hard costs: the infrastructure’s hardware, circuits, and software.
  • Soft costs: running, updating, maintaining, and supporting the deployment of that infrastructure.

A Data Mesh embodies the following core concepts, which provide reductions in both hard and soft costs:

  • Centralized Orchestration and Maintenance

One of the key aspects of a Data Mesh is the ability to centralize the management of connections to data sources, wherever they reside. As with most technology, the infrastructure’s hard costs are often outweighed by its operating costs, and networking technologies are no different. These operating expenses can increase significantly when you cannot control both sides of a connection (as application providers often cannot).

Therefore, by centralizing the configuration and management of Data Mesh connections, application providers can streamline the support and maintenance of all connections.

The ability to centralize policy enforcement and other network security functions means fewer hours spent on routine tasks like compliance. This could also mean freeing up staff time to work on more important, value-added tasks, which can add up to hundreds of hours per month for application providers.

As a result, soft-cost savings grow rapidly with the number of connections being managed.

  • Eliminating Dedicated Connections

Data Mesh does not need expensive dedicated connections like Multiprotocol Label Switching (MPLS) or AWS Direct Connect. These connections have a bandwidth cap and can easily cost hundreds of USD per site each month.

In contrast, a Data Mesh runs over any internet connection. While uptime and availability have historically been used as arguments in favor of MPLS, a Data Mesh addresses these concerns by using already-existing internet connections configured in redundant pairs, with the standby path becoming active if the primary fails. This high-availability pairing ensures that, in the event of a connection failure, the other connection takes over automatically without manual intervention, supporting a 99.9% uptime SLA for key connections that must always be up.
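As a rough, hedged illustration of why redundant pairing can push availability past a 99.9% target, consider two independent links; the per-link availability figures below are assumptions chosen for the arithmetic, not measured or vendor-provided numbers.

```python
# Back-of-the-envelope availability of an active/passive pair of internet links.
# Assumed per-link availabilities; real figures depend on the ISP and region.
link_a = 0.99  # 99% uptime assumed for the primary connection
link_b = 0.99  # 99% uptime assumed for the secondary connection

# The pair is only down when both links are down at the same time
# (treating failures as independent, which is an optimistic simplification).
pair_availability = 1 - (1 - link_a) * (1 - link_b)

print(f"Pair availability: {pair_availability:.4%}")  # -> Pair availability: 99.9900%
```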

By eliminating dedicated circuits and connections and paying per MB transferred, hard costs decrease drastically while availability remains on par with legacy options.

For many application providers, combining the hard- and soft-cost savings of a Data Mesh can decrease overall connectivity costs by roughly 50%. These lower operational costs translate into more productivity and profitability for the application.

The overall effort of implementing a Data Mesh, either from scratch or from a data lake, depends on numerous factors and specific business needs (for instance, number of data sources being managed, number of products being offered, range of use cases, etc.).

2. Data Quality

A Data Mesh is a peer-to-peer data sharing scheme, which raises the question: how can the correctness of the data be validated?

The main goal of a Data Mesh is to decentralize data while keeping a robust, cross-functional platform team. Responsibility for data is therefore no longer limited to data engineers: the team that acts as the data owner has to verify its correctness, and owners must take responsibility for the data they own. We can think of a Data Mesh much like a micro-service architecture, exposing access to data sources via RESTful services.
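To make the micro-service analogy a little more concrete, here is a minimal sketch of a domain team exposing its data product through a REST endpoint. Flask is used purely for illustration; the route, dataset, and owner fields are hypothetical and not part of any prescribed Data Mesh API.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical data product served by the "orders" domain team,
# which also owns and vouches for the correctness of these records.
ORDERS_DATA_PRODUCT = {
    "owner": "orders-domain-team",
    "schema_version": "1.0.0",
    "records": [
        {"order_id": 1, "status": "shipped"},
        {"order_id": 2, "status": "pending"},
    ],
}

@app.route("/data-products/orders", methods=["GET"])
def get_orders_product():
    # Consumers discover and call this endpoint like any other micro-service.
    return jsonify(ORDERS_DATA_PRODUCT)

if __name__ == "__main__":
    app.run(port=8080)
```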

Data discovery and a data catalogue are critical in a Data Mesh architecture to ensure that data is always up to date. Immuta’s Data Security Platform, for example, automatically scans, discovers, and classifies sensitive data from diverse cloud data sources, automating a process that would otherwise have to be done manually and giving teams an informed understanding of the data in their ecosystem. With more and more unstructured data arriving, agreeing on a schema has become a tough challenge; files stored in blob storage or a NoSQL database make it even harder to settle on a particular schema. And with data volumes roughly doubling every two years, an efficient catalogue matters greatly. Immuta integrates with leading data catalogue providers to facilitate data access and governance policies that reference existing metadata in the data stack, covering three essential aspects: metadata curation, data intelligence, and data governance. Automation tools free teams from the burden of cataloguing data manually and allow them to concentrate on the real tasks at hand.
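As a simplified picture of what such automation replaces, the sketch below infers a crude schema from sample records and registers the data product in a toy in-memory catalogue. This is not Immuta’s API or any vendor tooling; every name here is made up for illustration.

```python
from datetime import datetime, timezone

# Toy in-memory data catalogue: data product name -> metadata entry.
catalogue = {}

def infer_schema(records):
    """Derive a crude field -> type mapping from sample records."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema[field] = type(value).__name__
    return schema

def register_data_product(name, owner, records):
    """Create or refresh a catalogue entry instead of curating it by hand."""
    catalogue[name] = {
        "owner": owner,
        "schema": infer_schema(records),
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }

register_data_product(
    name="orders",
    owner="orders-domain-team",
    records=[{"order_id": 1, "amount": 19.99, "status": "shipped"}],
)
print(catalogue["orders"]["schema"])
# -> {'order_id': 'int', 'amount': 'float', 'status': 'str'}
```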

3. Risks

Building a Data Mesh is not trivial. Adopting such a framework can be seen as a step towards the new era of platform engineering, and a platform usually brings complexity that is not easy to overcome.

Experience is always a plus when rolling out new software paradigms in production. It is not straightforward to summarize all the possible risks, but we will try to highlight some of the common pitfalls along the way.

The following paragraphs describe three common challenges. The first relates to setup and the very beginning of a Data Mesh life cycle; the second and third are more operational:

  1. Underestimating the setup time
  2. Failure to update data catalogues
  3. Adding unnecessary data domains

Underestimating the setup time

Estimating the time needed to develop complex software systems is probably one of the most difficult tasks. Despite the advances made in moving from waterfall approaches to agile development cycles, it is still impossible to predict the number of lines of code a team will write per day. Wouldn’t it be nice to have a precise recipe like:

  • 1 data engineer
  • 3 developers
  • 2 data domains
  • Mix all together and cook overnight with low flame for at least 12 hours
  • Put your Data Mesh into the fridge, wait one hour, and serve cold.

Unfortunately, this is not the case. The time needed depends on many factors, and giving a specific answer requires studying the specific environment. As a rule of thumb, we have always liked the 1.5 rule: once the estimation is done, multiply it by 1.5, or by 2 if it is your first implementation in this field of technology.

Failure to update data catalogues

As we described in the previous article, discoverability is one of the key concepts of a Data Mesh: data access and data governance should be clear and well documented. Yet to get features deployed as quickly as possible, documentation is often sacrificed. Out-of-date documentation causes confusion or blocks other teams that rely on the data, resulting in further delays in accessing it and costing both functionality and time. One team stops work because “it cannot access the data”, the central team takes a few days to understand where the problem originated, and only on the third day does someone notice that the endpoints given in the docs were out of date. One suggestion to avoid these kinds of problems is to adopt a “doc-as-code” strategy, so that with every change the documentation is adapted and kept in sync, as sketched below.
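A lightweight way to apply “doc-as-code” is to keep the documented endpoints under version control and let CI verify them on every change. The sketch below assumes a hypothetical list of documented endpoint URLs; it simply fails the build when any of them is unreachable.

```python
import urllib.request

# Endpoints as documented next to the code (e.g. parsed from a versioned
# docs file); the URLs here are hypothetical placeholders.
DOCUMENTED_ENDPOINTS = [
    "https://data-mesh.example.com/data-products/orders",
    "https://data-mesh.example.com/data-products/customers",
]

def find_stale_endpoints(endpoints):
    """Return every documented endpoint that is no longer reachable."""
    stale = []
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status >= 400:
                    stale.append(url)
        except OSError:
            stale.append(url)
    return stale

if __name__ == "__main__":
    broken = find_stale_endpoints(DOCUMENTED_ENDPOINTS)
    if broken:
        raise SystemExit(f"Documentation out of date, unreachable endpoints: {broken}")
```

Running such a check in the delivery pipeline turns stale endpoint documentation into a failing build rather than a three-day investigation.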

Adding unnecessary data domains

Data products can be considered the single unit of measure in a Data Mesh. Different domains address different data needs of the organization, which is why some overhead is needed both organizationally and operationally. Data products require data owners and specifically defined use cases; our data can potentially be used for any application, but each of these applications must be carefully conceptualized and designed. Too many data products can become an issue from two different points of view: one organizational, the other resource-related.

The organizational problem arises when the management overhead is multiplied across many data domains: a single person might need to administer two or more data domains, which can lead to a suboptimal setup. The resource problem can be understood by looking at one of the possible data architectures for a Data Mesh.

This architecture is only one of the possible implementations of a Data Mesh at a high conceptual level. If you are interested in reading more, check the article written by Piethein Strengholt.

The higher the number of domains, the more intensive the network usage and the more expensive the overall architecture can become. There is no single correct solution here; the number of domains and the resources they consume need to be balanced in a way that fits the use case.

Finally, please consider that multiple domains may have conflicting interests in how data is used and accessed. The shared resources might become a bottleneck, and a different kind of Data Mesh topology might be a better choice.

4. Security

The Data Mesh paradigm is a relatively new concept in data platform design. Different implementations require different approaches, and security cannot be left unconsidered. The great majority of security protocols are implemented in a centralized fashion, and consumers usually have to adapt to the decisions made by the data owner. The main idea of a Data Mesh, however, is to decouple consumers and producers through the creation of many different data domains.

How to handle security in this environment?

On the one hand, this is a great opportunity for customization; on the other hand, centralized architectures usually propose a single solution that rarely captures every consumer’s needs.

But where to apply security protocols?

It depends on your application, on your architecture implementation, and most importantly on your specific security requirements. One suggestion is to place security protocols in the data domain layer, but again this is just one of multiple ways to implement security in a Data Mesh. If your experience with security protocols in distributed data platforms is limited, using an existing data security framework like Immuta can streamline your policy management and ensure consistent applicability across varied cloud platforms.
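As one possible shape of “security protocols in the data domain layer”, the sketch below wraps a data-product read with a policy check before any rows are served. The policy, roles, and dataset are invented for illustration and are not tied to Immuta or any other specific framework.

```python
# Hypothetical domain-level policy: which roles may read which data product.
ACCESS_POLICY = {
    "orders": {"analyst", "orders-domain-engineer"},
    "customers": {"customers-domain-engineer"},
}

class AccessDenied(Exception):
    """Raised when none of the caller's roles grants access to the product."""

def read_data_product(product, user_roles):
    """Enforce the owning domain's access policy before serving any data."""
    allowed_roles = ACCESS_POLICY.get(product, set())
    if not allowed_roles & set(user_roles):
        raise AccessDenied(f"No role grants access to '{product}'")
    # In a real domain service this would query the underlying store.
    return [{"order_id": 1, "status": "shipped"}]

print(read_data_product("orders", ["analyst"]))      # served
# read_data_product("customers", ["analyst"])        # would raise AccessDenied
```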

5. Performance

This is one of the most interesting aspects, because performance benchmarking for distributed data platforms still needs a fair amount of research. Just as the weakest part of a system usually defines its strength, the same principle applies to performance: a bottleneck, whether it originates on the hardware, software, or human side, defines the performance of the overall system. A decentralized platform should, by the nature of its architecture, help overcome such bottlenecks. In the best case it should:

  1. Reduce the workload on the central infrastructure team
  2. Enable data customers to work more independently of other teams
  3. Relieve data domain teams of time-consuming "data understanding", because they own and know their data
  4. Increase flexibility, since systems are decoupled and operate in "plug-and-play" mode
  5. Improve scalability, since workloads can easily be scaled on suitable hardware

Yet the flexibility of the paradigm also brings responsibilities. As described in the risks section, too many data domains can create more problems than they solve.

6. Adoption

When it comes to adopting a Data Mesh, there is unfortunately no one-size-fits-all solution for companies. However, it is possible to come up with a high-level roadmap that can help provide some guidance through the process. The main idea is to start with a small end-to-end use case and gradually build upon it until, at some point, you can argue that you have built a Data Mesh.

First step

First, you need to get a strong commitment from the management side. Once you secure this, you can go ahead and find a specific use case, ideally small in scope and with the following ‘nice to have’ properties:

  • Few dependencies
  • Conceptually simple
  • Owned by a competent and open-minded team
  • Easily observable from a business point of view

Keep in mind that although you will be applying the Data Mesh concepts covered in the previous article, you still need to implement other important systems in conjunction with the Data Mesh (e.g. micro-services). The scope is going to be much wider than just applying Data Mesh concepts to a use case.

Below are some pointers from Tim Berglund that can be useful when implementing a Data Mesh.

  • Nominate data owners. You should have firm owners for the key datasets in your organization, and you want everyone to know who owns which dataset.
  • Publish data on demand. You can store events in Kafka indefinitely, or they can be republished by data products on demand.
  • Handle schema changes. Owners are going to publish schema information to the mesh (perhaps in the form of a wiki, or data extracted from the Confluent Cloud Schema Registry and transformed into an HTML document), and you need a process to deal with schema change approval (a small sketch of pulling such schema information appears after this list).
  • Secure event streams. You need a central authority to grant access to individual event streams. There are probably regulatory concerns here, perhaps even actual laws.
  • Make a central user interface for discovery and registration of new event streams. This can be an application you create, or even a wiki. Ultimately, you’re going to need to support searching for schemas for data of interest. You also need to support previewing event streams and requesting access to new event streams, and you need to support data lineage views.
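For the schema-handling point above, a small sketch of pulling the latest published schema for a topic from a Schema Registry’s REST API could feed such a discovery page or wiki. The registry URL and subject name are placeholders, and a managed registry such as Confluent Cloud would additionally require authentication.

```python
import json
import urllib.request

# Placeholders: point these at your own Schema Registry and subject.
REGISTRY_URL = "https://schema-registry.example.com"
SUBJECT = "orders-value"

def fetch_latest_schema(registry_url, subject):
    """Fetch the latest registered schema version for a subject."""
    url = f"{registry_url}/subjects/{subject}/versions/latest"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)

if __name__ == "__main__":
    latest = fetch_latest_schema(REGISTRY_URL, SUBJECT)
    # The response carries the subject, version, schema id and schema string,
    # which can be rendered into an HTML page for discovery.
    print(latest["version"], latest["schema"])
```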

7. Data Governance

In today’s fast-moving world, organizations can capture massive amounts of internal and external data. To leverage this information and maximize its value, a discipline is needed to unite processes, roles, policies, standards, and metrics under one umbrella: data governance.

A vast amount of literature about data governance has been published, yet we believe that decentralized data platforms create the need for adjusted governance thinking.

The core of a Data Mesh is its collection of independent data products, each with an independent lifecycle, built and maintained by independent teams. However, for most use cases business value is created by combining data sets from different domains, which means teams and technology must interoperate. For any of these operations, a Data Mesh implementation requires a governance model that embraces:

  • Decentralization
  • Domain self-sovereignty
  • Interoperability through global standardization
  • Dynamic topology
  • Automated execution of decisions by the platform

The question remains which decisions need to be localized to each domain and which should be made globally for all. Ultimately, global decisions have one purpose: creating interoperability and a compounding network effect through discovery and composition of data products.

One aspect of data governance in a Data Mesh plays an especially critical role: data access control. Many topics are subsumed under this umbrella, including:

  1. Universal cloud compatibility: Many enterprises operate in a multi-cloud environment, where data stacks from different providers are used. With each platform using its own native access controls, it is important to enable data access across these environments.
  2. Attribute-based access control: Defines an access control paradigm whereby a subject’s authorization to perform a set of operations is determined by evaluating attributes associated with the subject, e.g., geography, time and date, clearance level.
  3. Sensitive data discovery & classification: Vast amounts of data require an automated approach for identifying and classifying sensitive data. Scanning data and tagging it across multiple compute platforms eliminates manual, error-prone processes and allows for universal data access control.
  4. Dynamic data masking: Protecting data at query time by modifying or hiding sensitive values without changing the underlying data (illustrated, together with attribute-based access control, in the sketch after this list).
  5. Data policy enforcement & auditing: To remain data security compliant, data access and data manipulations must be monitored and logged continuously.
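A minimal sketch of how attribute-based access control, dynamic masking, and audit logging can work together at query time is shown below; the policy attributes, sensitive fields, and sample row are all invented for illustration and are not drawn from any particular product.

```python
from datetime import datetime, timezone

# Hypothetical ABAC policy: access requires a matching geography and a
# minimum clearance level, evaluated against the subject's attributes.
POLICY = {"required_geo": "EU", "min_clearance": 2}
SENSITIVE_FIELDS = {"email", "iban"}

def is_authorized(subject):
    return (
        subject.get("geo") == POLICY["required_geo"]
        and subject.get("clearance", 0) >= POLICY["min_clearance"]
    )

def mask_row(row):
    """Dynamic masking: hide sensitive values at query time; stored data is untouched."""
    return {k: ("***" if k in SENSITIVE_FIELDS else v) for k, v in row.items()}

def query(rows, subject, audit_log):
    """Enforce the policy, log the decision, and mask what gets returned."""
    decision = "granted" if is_authorized(subject) else "denied"
    audit_log.append((datetime.now(timezone.utc).isoformat(), decision, subject))
    if decision == "denied":
        raise PermissionError("Subject attributes do not satisfy the policy")
    return [mask_row(row) for row in rows]

audit = []
rows = [{"customer_id": 7, "email": "a@example.com", "iban": "DE00 0000", "country": "DE"}]
print(query(rows, {"geo": "EU", "clearance": 3}, audit))
# -> [{'customer_id': 7, 'email': '***', 'iban': '***', 'country': 'DE'}]
```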

Each of these topics brings its own pitfalls, which is why it is recommended to use pre-configured frameworks or tool sets unless you are highly skilled and know exactly what you are doing.

Immuta’s data security platform offers a one-stop shop for secure data access in mesh architectures and provides a robust control foundation when implementing a new Data Mesh.

We hope that this two-part blog series has given you some insight into the strengths and pitfalls of the Data Mesh paradigm for distributed data platforms. At Data Reply we support our customers in architecting and implementing large-scale data platforms of various kinds. Feel free to talk to us about any related topic.

Authors: Antonio Di Turi, Ayhun Tekat, Komal Lalwani, Jonas Pfefferle, Prashant Khanal


Prashant Khanal
DataReply

IT Consultant at Data Reply, with a focus on real-time software solutions. Loves to read and try new tech with a hands-on approach.