dbt Mesh: Powering Data Mesh — The Ultimate Guide

Published in Joon Solutions Global · 6 min read · Mar 19, 2024

While dbt is a powerful tool for data transformation, dbt Mesh unlocks its full potential within the Data Mesh architecture. This comprehensive guide delves into both concepts. We’ll explore how dbt Mesh builds upon the core principles of Data Mesh, empowering domain-specific data ownership and collaboration. This guide also identifies who can benefit the most from this approach, walks you through the implementation process, and provides solutions for potential challenges.

What is Data Mesh? What problems does it try to solve?

Pain points of data teams

In the data monolith approach, a single team often handles every stage, from ingestion through processing to serving.

Data Monolith Approach — Image by the author

This approach works well on a small scale but breaks down on a larger scale. Maintenance becomes painful for the central data team:

Hundreds of PRs wait for approval by the end of the day, creating a heavy workload for the central team
The more models there are, the longer the CI/CD run time
The chance of code conflicts is higher

Furthermore, monolithic systems rarely have clear contracts or boundaries. This means that data formatting changes upstream can break an untold number of downstream consumers.

Data Mesh principles

Data mesh was born to solve all the problems above. A data mesh is a decentralized data management architecture comprising domain-specific data.

Data Mesh Architecture — Source: dbt Labs

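A data mesh framework enacts the following principles:

Domain-driven ownership: each domain team owns its own data
Data as a product: data is built and maintained for its consumers
Self-service: teams can build on a shared data platform without relying on a central team
Federated governance: common standards are applied across all domains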

Why is dbt Mesh the ideal match to Data Mesh?

dbt Mesh helps you operationalize data mesh. dbt Mesh isn’t a single feature; it is a pattern enabled by a convergence of several features in dbt:

In line with the first principle (domain-driven ownership), dbt offers cross-project references, which help you separate your data into domain-driven projects

Example of dbt Multi-Projects — Source: dbt Labs

For the self-service principle, dbt offers the Semantic Layer (centralized metric definitions) and dbt Explorer (cross-project lineage)
For governance, dbt provides groups, access modifiers, model versions, and model contracts, which let you clarify ownership of data products and define data formats.
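To make the governance features more concrete, here is a minimal sketch of a properties file that combines a group, a contract, and versions. The file path, group name, model name, columns, and owner details are all hypothetical, and the `data_type` values depend on your warehouse.

```yaml
# models/finance/_finance.yml (hypothetical path and names)
groups:
  - name: finance                      # hypothetical group
    owner:
      name: Finance Data Team          # ownership of the data product
      email: finance-data@example.com

models:
  - name: fct_monthly_revenue          # hypothetical model
    group: finance                     # ties the model to the owning group
    latest_version: 2
    config:
      contract:
        enforced: true                 # dbt checks column names and types at build time
    columns:
      - name: month
        data_type: date                # warehouse-specific type, adjust as needed
        constraints:
          - type: not_null
      - name: revenue
        data_type: numeric
    versions:
      - v: 1                           # older version kept for existing consumers
      - v: 2
```

With an enforced contract, a change that alters the declared columns or types fails the build instead of silently breaking downstream consumers.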

Availability

A dbt Cloud Enterprise plan is required
Your account must be on at least dbt v1.6

Who can benefit the most from dbt Mesh?

Scenarios & How dbt Mesh can solve — Image by the author

From these observations, I think data mesh brings a bright future, particularly for:

Businesses that have complicated or fast-changing domains or business lines (e.g., supply chain, logistics, e-commerce, etc.)
Businesses that have sensitive or expensive data that needs to be isolated (e.g., banking, financial services, etc.)
Businesses with a decentralized structure of data teams

How dbt Mesh mechanism works

Since dbt Mesh is a new way of working, adopting it can pose a lot of challenges. Let’s dive into how the dbt Mesh mechanism works behind the scenes so that we can develop a better implementation plan.

Cross-project collaboration

Project dependencies: Dependencies between projects are acyclic. For example, if project B depends on project A, a new model in project A cannot import and use a public model from project B.
Upstream project maintenance: If the maintainers of the upstream project wish to remove a model (or change its access modifier), this is a breaking change for downstream consumers of that model. They should first mark the model for deprecation (using deprecation_date), which delivers a warning to all downstream consumers of that model.
Triggering upstream models in other projects: Running `dbt build --select +model` will not trigger a run of upstream models in other projects unless those upstream projects are installed as packages (source code).
Orchestrating job runs across multiple projects: dbt Cloud will soon offer the capability to trigger a job on completion of another job, including a job in a different project.
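The dependency and deprecation points above can be sketched in configuration as follows. This is a minimal illustration under assumed names: `project_a`, `dim_customers`, and the date are hypothetical, and the `---` separator is only used here to show two separate files in one block.

```yaml
# dependencies.yml in downstream project B (hypothetical project names)
projects:
  - name: project_a                 # upstream project; B may reference its public models
---
# models/_models.yml in upstream project A (hypothetical path)
models:
  - name: dim_customers             # hypothetical public model
    access: public
    deprecation_date: 2024-12-31    # consumers are warned ahead of removal
```

A model in project B could then reference the upstream model with the two-argument form `{{ ref('project_a', 'dim_customers') }}`. Because dependencies are acyclic, project A could not declare project B in its own dependencies.yml at the same time.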

Permissions & access

Role-based access control (RBAC): dbt Cloud Enterprise plans support RBAC, which manages granular permissions for users and user groups. You can control which users can see or edit each aspect of a dbt Cloud project.
Model access: Defines where models can be referenced. Models with public access can be referenced everywhere. Models with protected access can only be referenced within the same project. Model groups enable more granular control: models with private access can only be referenced by other models in the same group.
Maintaining visibility on the entire organizational DAG: A central data team member with permissions (at least read-only access) on all projects in a dbt Cloud account can navigate the organization’s entire DAG in dbt Explorer and see models at all levels of detail.
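As a rough illustration of the three access levels described above, the sketch below assigns one level to each of three models. The file path and the model and group names are hypothetical, and the `finance` group is assumed to be defined under a `groups:` key elsewhere in the project.

```yaml
# models/_access.yml (hypothetical path and names)
models:
  - name: fct_revenue
    access: public        # can be referenced from any project
  - name: int_revenue_joined
    access: protected     # dbt's default; can be referenced only within this project
  - name: stg_payments_raw
    group: finance        # assumes a group named finance is defined elsewhere
    access: private       # can be referenced only by models in the finance group
```

A common pattern is to expose only a small set of public models per project and keep staging and intermediate models protected or private.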

High-level decisions when implementing

To adopt dbt Mesh, you’ll need to consider these high-level areas:

Splitting projects: How do you determine where to split your DAG? Which models go in which project?
Git strategy: Mono-repo (multiple dbt Projects living in the same repository) or Multiple repos (one repo per project)?

Splitting projects

The following sources of information can help us decide where to split the project:

Examine your jobs — which sets of models are most often built together?
Look at your lineage graph — how are models connected?
Look at your selectors defined in selectors.yml - how do people already define resource groups?
Talk to teams about what sort of separation naturally exists right now.
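For reference, this is roughly what existing resource groups in selectors.yml can look like. The selector names, paths, and tags are hypothetical. If selectors like these already exist, they hint at where natural project boundaries lie.

```yaml
# selectors.yml (hypothetical selector names, paths, and tags)
selectors:
  - name: finance_nightly
    description: Models the finance team builds together every night
    definition:
      union:                    # models under models/finance OR tagged finance
        - method: path
          value: models/finance
        - method: tag
          value: finance
  - name: marketing_nightly
    description: Models the marketing team builds together
    definition:
      method: path
      value: models/marketing
```

Here, two selectors that rarely overlap would suggest that finance and marketing are candidates for separate projects.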

3 ways to split your projects — Image by the author

Git strategy

Small-to-medium-sized team: mono-repo setup
Large team: multi-repo setup

Solutions for potential challenges

While dbt Mesh offers a powerful approach to data management, there are some potential roadblocks to consider during implementation.

Shifting Mindset: Moving from a centralized data team to a decentralized model requires a cultural shift, with domain teams needing to embrace data ownership and collaboration.
Monitoring and Observability: With data spread across multiple projects, monitoring data pipelines and identifying potential issues can be difficult.
Standardization and Governance: Decentralization can lead to inconsistency in data quality, coding practices, and documentation.

The challenges mostly lie in the process, not the technical part. Good news! We’ve got a plan for you! It includes, but is not limited to: