Data Mesh: overhyped, misunderstood, and useful!

Reaping the rewards of data mesh while avoiding the pitfalls of this trendy new framework.

Photo by Julius Drost on Unsplash

In a recent webinar, a question was put up for debate: “Are we bundling or unbundling the modern data stack?” Cloud data platforms like Snowflake, AWS, and GCP strive to meet their users’ data needs in a single, convenient, and cost-effective platform. But when it came time to build a case for “unbundling,” a less-common candidate took the spotlight: data mesh.

The fact that a heated debate promptly broke out about “the best way to do something” among the data community should surprise absolutely no one. What did seem strange was how little understanding of data mesh (whether in favor or against) there seemed to be among the participants.


If you’ve read the original material by data mesh’s creator, Zhamak Dehghani, Director of Emerging Technologies at ThoughtWorks, you might see why this is the case: it’s crammed with jargon to the point of obscurity. Adding to the confusion is the data mesh vendor hype touting familiar promises of “availability and accessibility at scale” and “faster time to value…without intervention from expert data teams.” Fool me once…

What is data mesh? It’s a framework of overlaying technologically-driven layers (hence, mesh) that, according to Zhamak, “shift from the centralized paradigm of a lake, or its predecessor data warehouse. [Drawing] from modern distributed architecture: considering domains as the first class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product.”

Statements like that are difficult to parse even though concepts like “decentralization” and “data domains” have been around since the dawn of data warehousing. So before taking apart what a data mesh is, it might be helpful to dismiss common criticisms of what it’s not.

Data Mesh — what it’s not

Inmon architecture

Because data mesh promises to deliver (and build upon) normalized, domain-centric data, many dismiss it offhand as a glorified Inmon architecture. While the two share similar components, the main difference is that Inmon architecture pulls data from the application/source (what data mesh refers to as a “data domain”) into a conformed model: the centralized, global data warehouse.

In a data mesh, the normalized data assets are stored and curated by the data domain teams. They are made available for consumption to the rest of the company via a publish-subscribe model. Unlike Inmon, data mesh creates a 1:n relationship from the source layer to its consumers.

Data lake

Data lakes, which were assumed to be an iteration over the data warehouse, are minimally governed storage areas for raw domain data. By providing unlimited access to data of any type, data lakes attempted to get around the bottleneck of a centralized, tightly-governed warehouse.

Unfortunately, most data lakes suffered from data quality and discoverability issues and provided no recourse to domain experts who could help guide the end-user. Data mesh sets out to solve this challenge through decentralized data stewardship at the source, thereby avoiding the centralized mess colloquially referred to as the “data swamp.”

As Zhamak points out, it’s a problem on both ends: “I personally don’t envy the life of a data platform engineer. They need to consume data from teams who have no incentive in providing meaningful, truthful, and correct data. They have very little understanding of the source domains that generate the data and lack the domain expertise in their teams.”

Truer words have rarely been spoken.

Data fabric

First defined by Forrester analyst Noel Yuhanna back in the mid-2000s, the data fabric is essentially a technology-driven, metadata-focused rethink of data lake’s failures. Instead of centralizing and moving data into a lake, data fabric aims to provide a tech layer over disparate data sources. Through this, data fabric attempts to provide data access, discovery, transformation, integration, security, governance, lineage, and orchestration directly from the data source.

This sounds a lot like data mesh’s “Self-serve data platform” principle (which we explore below). In truth, mesh and fabric are not competing but complementary concepts. While data mesh takes a more people-oriented approach and emphasizes data rigor, the tech approach to discovery and governance applies to both.

Monster Mash

I was working in the (computer) lab late one night.
When my eyes beheld an eerie sight.
Domain-oriented decentralized data streams arose.
With no impediments to discoverability imposed…

They formed the mesh!
They formed the data mesh.
They formed the mesh!
The APIs kept the data fresh.

If you spontaneously hum every time you hear the words “data mesh,” it’s probably thanks to Bobby Pickett’s perennial Halloween hit, “Monster Mash.”

True, a complete data strategy overhaul, such as what data mesh requires, may be fiendishly frightening to your CIO, so take due care in deciding if data mesh is a good fit for your organization. (Or you’ll find yourself humming a very different tune.)

Data mesh — what it is

Data mesh was created to overcome the ungovernability of data lakes and the bottlenecks of monolithic data warehouses. Using modern, distributed architecture and centralized governance best practices, data mesh enables end-users to easily access and query data where it lives without moving or transforming it beforehand.

Data mesh is a domain-oriented design framework based on the following four principles:

  • Domain-oriented decentralized data ownership and architecture.
    Like podcast creators, data domains (a.k.a. business entities) publish the data they want to share on the platform of their choice. Stream or download using your favorite podcast player (ELT or query engine), but the show is hosted on the provider’s server.
  • Data as a product. Every data domain should own, promote, guarantee, and distribute its data assets as if they were products in a marketplace. Business teams coordinate with domain product owners to help create the required data product(s) for specific business needs.
  • Self-serve data platform. Despite varying technology stacks, decentralized data domains must still be coordinated by a centralized self-service layer for orchestration, access control, and data discovery. Data mesh proposes a three-tier approach with a provisioning plane for querying, UAC, and orchestration, a developer experience plane (think provisioning-plane-as-code), and a supervision plane that provides discoverability and governance.
  • Federated Computational Governance. If data mesh were the United States, federated computational governance would be the federal government and would get to decide how state (data domain) resources got aggregated and combined for countrywide use. Domain-specific data needs would be handled at the state level. And a simple Constitution (framework of decision rules) would dictate when a data problem was significant enough to require federal attention and funding.
Federated Computational Governance has a dome, just like the Capitol.

In practice

The four principles of data mesh form a solid framework. But should they be taken at face value or merely serve as a north star for laying an effective enterprise data strategy?

In a recent interview, Kent Graziano, former Chief Technical Evangelist at Snowflake, said, “It’s a concept of how we think about organizing the data and how we are going to empower organizations with access to that data — that is the underpinning of all of this [data mesh].”

Before swallowing the data mesh pill, it’s worth exploring the implications of this framework in practical terms.

  • Domain-oriented decentralized data ownership and architecture. It’s well understood what it means to “own” data in an organization, but how many domains are big enough to own their architecture? In data mesh, data consumption shifts from a bottom-up, source-to-warehouse load to an ad-hoc, usage-based subscription pattern. Because most application-side databases are not equipped for an additional analytical load, domain teams would likely need to support their own parallel warehouse architecture.
  • Data as a product. This principle is unique among the four for having no technological underpinnings. It’s simply a wake-up call for organizations to rein in messy data. To take pride in their data assets as they do in their work because, guess what? They’re the same thing.
  • Self-serve data platform. A fabric of tech solutions that enable domain teams to create and consume decentralized data products autonomously using platform abstractions is no trivial feat. This is what governance tools and data catalogs have struggled with since the early 2000s, and there is nothing fundamentally unique in data mesh to make that job more manageable.
  • Federated Computational Governance. Read literally: if you thought you could live without a dedicated warehouse for centralizing conformed data resources and a skilled team to make it happen, then you’ve got another thing coming. According to Zhamak, unlike traditional DWHs (which are apparently run by a bunch of stubborn conservatives), “Data mesh’s federated computational governance, in contrast, embraces change and multiple interpretive contexts.” (Sounds a lot like a good old, dependable Kimball warehouse to me.)

Let’s itemize what it takes to make data mesh work.

Per data domain:

  • technology stack to support analytics workloads
  • product owner
  • developer resources to build and maintain data products
  • existing domain tech stack and subject matter experts to support the day-to-day operations of the department

At an enterprise level:

  • Executive buy-in for top-down data mesh enforcement
  • Data governance and cataloging tools to enable interoperability and data discovery between data domains.
  • Federated compute resources, a.k.a. centralized data warehouse for running aggregated analytics over multiple domain data sets.
  • Data warehousing and engineering team to support and develop the Federated Compute landscape.
  • Governance team to enforce standards and best practices and keep data domain product owners accountable for quality. This team also decides which projects to handle at the Federated Compute level and which to delegate to the data domains.

When put into perspective, the above scenario makes complete sense for data behemoths like Netflix or Microsoft, which operate in a multi-department, multi-cloud fashion. But, according to IDC’s State of the CDO research, 75% of organizations do not have a complete architecture in place to manage an end-to-end set of data activities, including integration, access, governance, and protection.

Most organizations are not Netflix or Microsoft. Data domains, most of which lack training in data management, continue to struggle with data integrity. For many domains, their data solution is Jeff, the one-man IT department who read a book once and spends his days uploading data to an FTP server with cron jobs from a SQL Server instance running on a Linux box under his desk.

Most organizations, even those that tick multiple V’s on the big data checklist, would do well to keep things simple and stick to the fundamentals.

While democratizing domain data access across an organization sounds excellent, most business users would gain little value from normalized data sets. “They want multidimensional views, the ability to find data in a format they can understand,” says Graziano. This brings us full circle to the boring old ELT approach: move once, reuse at scale.

[Business users] want multidimensional views, the ability to find data in a format they can understand. — Kent Graziano, data guru

Data mesh solutions — what to embrace

As technology evolves, big-name frameworks spring up as solutions to the problems of previous generations. Hadoop in answer to the limitations of physical hardware, cloud in answer to the limitations of Hadoop. NoSQL, YesSQL, centralized ungoverned (lake), decentralized ungoverned (fabric), decentralized yes-governed (mesh). The list goes on.

So what are the biggest challenges that data mesh is attempting to overcome? And what can we leverage right now without downing the entire bowl of data mesh-flavored Kool-Aid?

Data as a product

Two elements must exist for something to be considered a product: variety and markets. The latter without the former is a monopoly, and the opposite, well, that’s communism.

In most organizations, neither condition is met: no variety as each domain is the sole provider of their respective data and no marketplace as there is no external alternative. No wonder a whole industry has sprung up around coaching, cleansing, and curating data.

When data is not a product (i.e., almost always)

The good news is that you don’t need to re-invent your entire data strategy and supporting architecture to get better data. Simply make data a priority. Data mesh gets it right: it’s an important issue that deserves to be one of the four core principles.

As data mesh suggests, data strategy requires top-down buy-in and bottom-up ownership. It requires that a data culture emerges at an organizational level where data is treated as the asset it truly is. It requires that data domains take pride and ownership of their data because investing the 20% effort to get it right at the source will save 80% of the effort to fix it downstream.

Can “domain-oriented decentralized data ownership and architecture” be simplified? Snowflake seems to think so. Their virtual warehouses separate storage and compute and provide the mechanisms to make sharing data easy. With Snowflake, organizations can leverage “decentralized” ownership while standardizing “architecture” through tight account-based security, data sharing, and zero-copy cloning.

Snowflake’s approach facilitates interoperability within an organization and makes it possible to have a single set of templates and best practices for data ownership and distribution. Apart from Snowflake’s cloud-native architecture, the following features help make data access within an organization easy and secure (a short SQL sketch follows the list):

  • Hierarchical object-based access control sets precise limits to who can see, query, or modify the data.
  • Zero-copy cloning enables instantaneous copies of production data without replication or duplication in storage costs, saving time and reducing risk.
  • Snowflake’s row-level and column-level security features allow safe, governed, and compliant access to sensitive data.
  • Secure views protect underlying code, sources, and metadata, and stop snooping attempts.
  • External tables provide easy access to data located in cloud storage.
  • Snowflake allows direct secure data sharing between accounts.
  • Not to mention the Data Marketplace and Data Exchange, which belong to a different category altogether (below).
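
Here is a minimal Snowflake SQL sketch of a few of these features in action. The schema, table, policy, and role names (analytics, sales, marketing_reader, and so on) are hypothetical:

    -- Zero-copy clone: an instant copy of production data with no added storage cost
    CREATE TABLE analytics.sales_dev CLONE analytics.sales;

    -- Secure view: consumers can query results but cannot see the code, sources, or metadata
    CREATE SECURE VIEW analytics.v_sales_eu AS
        SELECT order_id, region, amount
        FROM analytics.sales
        WHERE region = 'EU';

    -- Row-level security: a row access policy limits which rows a given role can see
    CREATE ROW ACCESS POLICY analytics.region_policy AS (region STRING) RETURNS BOOLEAN ->
        CURRENT_ROLE() = 'GLOBAL_ANALYST' OR region = 'EU';
    ALTER TABLE analytics.sales ADD ROW ACCESS POLICY analytics.region_policy ON (region);

    -- Hierarchical, object-based access control: read-only access for a consumer role
    GRANT USAGE  ON SCHEMA analytics            TO ROLE marketing_reader;
    GRANT SELECT ON VIEW   analytics.v_sales_eu TO ROLE marketing_reader;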

No matter what tools you use, be sure to make data quality a priority at all levels, not just within the BI team. While data (for most) may not be a “product” in a strictly economic sense, it is still the lifeblood of organizational decision-making and should not be allowed to become a by-product.

When data is a product

What if it were possible to leverage cloud-native data sharing and create a reliable and trusted data marketplace that integrates instantly with your existing data? Look no further than Snowflake Data Marketplace (SDM). Like a true marketplace, SDM provides cost-effective access to competing third-party data sets across nearly twenty categories.

SDM is built right into the data warehouse, so there is no ELT overhead involved in bringing third-party data alongside that of your organization — no risk of stale insights! Using SDM, organizations can instantly enrich their in-house data sources with dynamic insights like the latest weather, COVID stats, or market data to allow them to quickly react to change.
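
Under the hood, this boils down to Snowflake data sharing: the consumer mounts the provider’s share as a read-only database and can query it immediately. A rough sketch, with hypothetical provider, share, and table names:

    -- Mount the shared data set as a read-only database; no copy, no pipeline
    CREATE DATABASE weather FROM SHARE weather_provider.weather_share;

    -- Join the third-party data directly against in-house tables
    SELECT s.order_date, s.region, w.avg_temp_c
    FROM analytics.sales AS s
    JOIN weather.public.daily_history AS w
      ON w.region = s.region
     AND w.observation_date = s.order_date;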

For those who are looking for something a little more intimate than a marketplace, Snowflake offers the Data Exchange. Using this feature, users can create their own data hub for securely collaborating around data with a selected group of invite-only members.

According to Snowflake, “it enables providers to publish data that can then be discovered by consumers.” Does that sound familiar?

Snowflake has already thought through modern data needs and provided a modern-era solution. Leveraging the Snowflake Data Cloud, users gain the complete autonomy of decentralized ownership while enjoying all the benefits of a shared architecture.

Beyond documentation pt. 1, Discoverability

Another important aspect of data consumption that data mesh gets right is the need for self-service: giving domain teams the ability to create and consume data products autonomously using “platform abstractions.” Luckily, there already exists a universal, well-understood, and flexible data abstraction: SQL.

The best documentation is seamless and tightly coupled to existing workflows. Here, data cataloging solutions generally stumble: for all the features they provide, they require manual intervention from outside the development cycle to supply insights and need to be accessed separately.

Ideally, object-level documentation should live at the object level, where it can be accessed directly within the database or by any third-party data catalog. A tool like SqlDBM can simplify this task by providing a dedicated data documentation screen, Excel upload/download, and the ability to push changes out as DDL.
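
In Snowflake, for example, such documentation can live directly in the DDL as object and column comments, where any user or catalog can read it back from the metadata. A minimal sketch with hypothetical names:

    -- Descriptions are part of the object definition itself
    CREATE OR REPLACE TABLE sales.orders (
        order_id    NUMBER COMMENT 'Surrogate key, generated at ingestion',
        customer_id NUMBER COMMENT 'References sales.customers',
        order_date  DATE   COMMENT 'Date the order was placed (UTC)'
    )
    COMMENT = 'One row per customer order; owned by the Sales domain';

    -- The same descriptions are queryable by users and third-party catalogs alike
    SELECT table_name, comment
    FROM information_schema.tables
    WHERE table_schema = 'SALES';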

Data dictionary coupled with object DDL, accessible anywhere.

Just as important as contextual object-level descriptions are the relational inter-object dependencies. To effectively achieve interoperability between domains, users need to know the relationships between conformed dimensions.

Again, SQL can be your guide. By leveraging foreign key constraints already present in domain schemas, abstractions can be generated in real-time, at any level of detail. Always up-to-date, with no manual intervention, it’s all in the DDL.
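
In Snowflake, for instance, foreign key constraints are informational rather than enforced, but they are retained as metadata that any tool can read back. A minimal sketch, again with hypothetical names:

    -- Declare the relationship in the DDL; Snowflake stores it even though it does not enforce it
    ALTER TABLE sales.orders
        ADD CONSTRAINT fk_orders_customer
        FOREIGN KEY (customer_id) REFERENCES sales.customers (customer_id);

    -- Relationships can then be recovered programmatically, with no manual documentation
    SHOW IMPORTED KEYS IN TABLE sales.orders;   -- foreign keys defined on this table
    SELECT GET_DDL('TABLE', 'sales.orders');    -- full DDL, constraints included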

One level of detail, multiple abstractions

Whatever tool or method you go with, make sure that a standardized data observability and discoverability layer exists in your organization, and you’re not left navigating an archipelago of poorly-documented databases.

Beyond documentation pt. 2, Governance

Whether or not you agree with data mesh’s federated governance approach, the importance of placing governance at the forefront of an enterprise data strategy cannot be overstated. Simply allowing a BI landscape to evolve “organically” (i.e., crossing fingers and hoping the warehouse team gets it right) is a tenuous proposal for two reasons:

  • The BI/DW team has no authority to impose any guidelines for the domain teams.
  • Due to delivery pressures and business needs, the BI/DW team is often forced into a conflict of interest, trading off quality and taking on technical debt to meet short-term deadlines.

Many governance teams suffer from similar handicaps. In many cases, they are not given any “teeth” to impose standards at the source or warehouse level and often lack the technical understanding required to implement an end-to-end data strategy.

Data mesh makes an important point: governance needs to align itself to the overarching corporate data strategy and have the authority to enforce that strategy at all levels. Whether this is achieved through a data mesh framework, Snowflake’s cloud-native architecture, by embedding governance rules into existing CICD workflows, or a combination of all three is a personal choice. The mistake is expecting good governance practices to evolve on their own or attempting to impose them after the fact.
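
As one concrete (and hypothetical) example of a rule embedded into a CICD workflow, a deployment pipeline could fail whenever a domain publishes undocumented objects, using nothing more than the metadata Snowflake already exposes:

    -- Governance check: any table in the domain schema without a description fails the build
    SELECT table_schema, table_name
    FROM information_schema.tables
    WHERE table_schema = 'SALES'
      AND (comment IS NULL OR comment = '');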

Federated computational governance

As we saw earlier, the Snowflake Data Cloud, with governed data access and separation of storage and compute, is federated computational governance by data mesh standards. If you’re looking for “standards that are baked computationally into the platform,” here are a few more high-impact Snowflake features that should be part of your arsenal.

Effortless streaming

Not every use case will require near-instant data refresh, but for those that do, Snowflake provides a simple answer for continuous data loading from a domain to a DWH.

With Snowpipe, Snowflake enables data streaming from a stage to a physical table. For organizations that require the most up-to-date information on market sentiment or user activity, Snowpipe is a managed solution that is easy to configure and implement.
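
A minimal sketch of a pipe, assuming a stage and landing table already exist (the names are hypothetical, and AUTO_INGEST additionally requires event notifications from the cloud storage provider):

    -- Files arriving in the stage are loaded automatically; nothing to schedule
    CREATE PIPE raw.sales_pipe
        AUTO_INGEST = TRUE
    AS
        COPY INTO raw.sales_landing
        FROM @raw.sales_stage
        FILE_FORMAT = (TYPE = 'JSON');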

For continuous loading from Apache Kafka queues, Snowflake also provides a native connector. No matter what option you go with, there will be no jobs to schedule or processes to orchestrate to get the latest data to your analytics pipelines.

For non-real-time data, Snowflake offers Streams and Tasks to automate change data capture. A stream captures DML changes (i.e., inserts, updates, deletes, and related metadata) and opens them up for querying (by multiple targets if desired) in a transactional fashion. Paired with tasks (scheduled DML operations), streams can form continuous ELT workflows that process recently changed table rows.
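
A rough sketch of the pattern, with hypothetical names and an arbitrary fifteen-minute schedule:

    -- The stream tracks DML changes on the landing table
    CREATE STREAM raw.sales_changes ON TABLE raw.sales_landing;

    -- The task periodically consumes the stream, forming a simple continuous ELT step
    CREATE TASK transform.load_sales
        WAREHOUSE = transform_wh
        SCHEDULE  = '15 MINUTE'
        WHEN SYSTEM$STREAM_HAS_DATA('raw.sales_changes')
    AS
        INSERT INTO transform.sales (order_id, customer_id, amount)
        SELECT order_id, customer_id, amount
        FROM raw.sales_changes
        WHERE metadata$action = 'INSERT';

    ALTER TASK transform.load_sales RESUME;  -- tasks are created in a suspended state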

Overall architecture

As it stands, data mesh provides very little guidance on what form a federated computational architecture should take. We’re only told that it should embrace “decentralization and domain self-sovereignty, interoperability through global standardization, a dynamic topology and most importantly automated execution of decisions by the platform.”

Why didn’t I think of that?

Despite much meditation on the koan above, my thoughts always return to a tried-and-true three-tier ELT methodology. This includes:

  • the raw layer where data is landed in its original state
  • the transformed, cleaned, normalized data layer where self-service thrives
  • and a presentation layer where the BI team develops more complex models and presents them as denormalized tables for easy querying by end-users and dashboards

It’s simple, and it works. As to whether it meets data mesh standards, your guess is as good as mine.
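
In Snowflake terms, the whole layout can be as simple as three schemas and a set of views built on top of one another. A minimal sketch with hypothetical names:

    -- One layer per schema: land it, clean it, present it
    CREATE SCHEMA raw;            -- data landed in its original state
    CREATE SCHEMA transform;      -- cleaned, normalized, self-service friendly
    CREATE SCHEMA presentation;   -- denormalized models for BI and dashboards

    -- Presentation objects are simply models built over the lower layers
    CREATE VIEW presentation.sales_by_region AS
        SELECT region,
               DATE_TRUNC('month', order_date) AS order_month,
               SUM(amount)                     AS total_sales
        FROM transform.sales
        GROUP BY region, order_month;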

Data mesh delusions — what to avoid

After covering the “best” of what data mesh offers, it is equally important to highlight the dangers inherent in such an ambitious framework. As with any guidelines, a bit of nuance is called for with data mesh.

Consider the following caveats before meshing up your data strategy:

  • “Data as a product” only when it is a product. If your data domain is run by Jeff, the one-man data team, data is likely to be a by-product at best.
  • “Data as product” implies a marketplace for competing solutions. Data domains tend to be unique in an organization, thereby creating natural monopolies, which, historically, have not favored the (data) consumer.
  • Decentralized data ownership provides the illusion of a single source of truth at the source but, in reality, creates multiple copies of data per subscriber (data sprawl).
  • In data mesh, domain-centric design aims to converge with self-service but, by definition, does not focus on multivariate, enterprise-wide uses of data. (Self-service users needing change history, I wish you luck building Type 2 SCDs.)
  • Data mesh’s “self-serve data platform” principle would either force application databases to support an additional analytical query load or force them to duplicate data to a dedicated OLAP solution.
  • “Federated Computational Governance” does not replace the job of the traditional DW team (correlating business entities for centralized analytics) — it only makes it harder due to decentralized data access.
  • Supporting diagrams for technical content should be drawn in crayon for maximal impact — said no one ever.
This is not going up on the fridge.

Conclusion

Are we bundling or unbundling the modern data stack? So long as we’re not bungling it and creating a data mess, the answer should be: it depends.

Is your organization large enough to warrant a domain-oriented approach to data ownership and struggling with data management? In that case, a data mesh may be the right architecture to take it to the next level, facilitating data access and discovery among teams who, until now, have not been tightly integrated.

However, burning your ships and reworking the entire data stack to fit the data mesh doctrine is not required to reap its benefits. Most of the data challenges that data mesh aims to fix can already be addressed by native functionality in cloud platforms such as Snowflake (and the rest are process-driven).

Now that you’ve understood what data mesh is, what it isn’t, its ambitions, and its shortcomings, what should you do?

When it comes to your data strategy, focus on the core principles of the data mesh and apply what makes sense to you and your organization.

When it comes to round-table discussions and online forums, feel free to take whatever side of the “revolutionary/ridiculous” divide you want, and have some fun.
