dbt Models — Marts layer (i.e. Data Marts)
Business-facing entities, ready for reporting, ad-hoc analysis, machine learning, and Reverse ETL
Table of Contents
- What are Mart models in dbt?
- Types of Data Marts — Dimension, Fact, Summary, Snapshots, and Manual tables
- Data Modelling for Marts & Naming Conventions
- What is an Entity, and Entity Relationship (ER)?
- Where is the correct place for Mart models in dbt repo?
- How are the Mart models being run?
- DOs & DON’Ts
- Further considerations
- Appendix
What are Mart models in dbt?
From dbt Labs, there is a very informative description:
This is the layer where everything comes together and we start to arrange all of our atoms (staging models) and molecules (intermediate models) into full-fledged cells that have identity and purpose.
We sometimes like to call this the entity layer or concept layer, to emphasize that all our marts are meant to represent a specific entity or concept at its unique grain. For instance, an order, a customer, a territory, a click event, a payment — each of these would be represented with a distinct mart, and each row would represent a discrete instance of these concepts.
Unlike in a traditional Kimball star schema though, in modern data warehousing — where storage is cheap and compute is expensive — we’ll happily borrow and add any and all data from other concepts that are relevant to answering questions about the mart’s core entity.
Building the same data in multiple places, as we do with `orders` in our `customers` mart example below, is more efficient in this paradigm than repeatedly rejoining these concepts (this is a basic definition of denormalization in this context).
- The Kimball model (aka Dimensional Modelling) is geared toward traditional RDBMSs (Relational Database Management Systems), e.g. MySQL, MariaDB (which is just a fork of MySQL), PostgreSQL, AWS Redshift (a fork of PostgreSQL), etc.
- Modern OLAP (Online Analytical Processing) / Data Warehouse solutions like Snowflake and BigQuery are not traditional relational engines, hence the best practices and optimization directions are completely different
Types of Data Marts
Important Note: To maximize query performance on the stakeholders' side, and the user experience of the Data Products, building de-normalized Fact tables should always be prioritized over building Dimension tables.
In other words, it is always a better experience for analysts when they don't need many JOINs in their own queries because the needed columns are already in the Fact table, e.g. the Sales table should already carry ISO Week, Year Number, Country Codes, etc., so analysts don't need to JOIN with the Date & Country dimensions all the time.
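As a sketch of this de-normalization, a fact model can pull the commonly needed dimension attributes in once at build time (all table and column names below are hypothetical):

```sql
-- models/marts/fct_sales.sql (hypothetical example)
-- De-normalized fact: dimension attributes are joined in once at build
-- time, so analysts don't have to re-join dim_dates / dim_countries.
select
    s.sale_id,
    s.order_id,
    s.amount_eur,
    d.iso_week,
    d.year_number,
    c.country_code,
    c.country_name
from {{ ref('stg_sales') }} as s
left join {{ ref('dim_dates') }} as d
    on s.sale_date = d.date_day
left join {{ ref('dim_countries') }} as c
    on s.country_id = c.country_id
```

The joins are paid once per build instead of once per analyst query.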
The most common types of Data Marts are Dimension and Fact tables. As time evolves, more types of Data Marts have been introduced and may make sense to keep:
- Dimension (`dim`)
- Fact (`fct`)
- Summary (`sum`)
- Snapshots (`snp`)
- Manual (`mnl`)
Dimension tables
From Wiki:
A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. Commonly used dimensions are people, products, place and time.[1][2].
Good Examples would be:
- dim_countries
- dim_customers
- dim_shops
- dim_locations
Fact tables
From Wiki:
In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is located at the center of a star schema or a snowflake schema surrounded by dimension tables.
Good Examples would be:
- fct_orders
- fct_order_status_transitions
- fct_shipments
Summary tables
Summary tables are pre-computed & pre-aggregated Data Marts. They contain pre-defined dimensions (e.g. Dates, Country, etc.) and agreed KPIs (e.g. number of orders, GMV, etc.)
Good Examples would be:
- sum_performance_kpi
- sum_customer_metrics
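As an illustration, a summary model can pre-aggregate the agreed KPIs over the pre-defined dimensions (the model and column names here are assumptions):

```sql
-- models/marts/sum_performance_kpi.sql (hypothetical example)
-- Pre-aggregates agreed KPIs over pre-defined dimensions so dashboards
-- read a small table instead of scanning the full fact table.
select
    iso_week,
    country_code,
    count(distinct order_id) as number_of_orders,
    sum(amount_eur)          as gmv_eur
from {{ ref('fct_orders') }}
group by 1, 2
```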
Notes
- To build a new summary table, it is always a good idea to start from the reporting that has already been done
- Conceptually, Summary tables are similar to a Metrics Layer (aka Semantic Layer): the key benefits are centralized logic and much better query performance during consumption
Snapshots tables
A Snapshot table is a type of Data Mart that merely captures the state of an existing table at a point in time, and stores the data as a separate table. The best use case for it is SCD (Slowly Changing Dimensions).
dbt Snapshots are the recommended way to do so.
Examples in Vestiarie would be:
- snp_country
- snp_order_status
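A minimal dbt snapshot sketch, assuming a `shop.orders` source with an `order_id` key and an `updated_at` timestamp (all assumptions, not actual Vestiarie tables):

```sql
-- snapshots/snp_order_status.sql (hypothetical example)
{% snapshot snp_order_status %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at',
    )
}}

-- dbt compares each run against the previous snapshot and records
-- dbt_valid_from / dbt_valid_to for every changed row (SCD Type 2).
select * from {{ source('shop', 'orders') }}

{% endsnapshot %}
```

Snapshots are built with `dbt snapshot` rather than `dbt run`.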
Manual tables
A Manual table is a type of Data Mart that is manually generated by any means (e.g. a one-off copy of a CSV file from S3, a Google Sheet import, a static SQL query, dbt seeds, etc.)
Unless there is a strong business reason not to (e.g. the manual data is expected to change frequently), dbt Seeds are the recommended way to build Manual tables.
Examples in Vestiarie would be:
- mnl_shipment_status_mapping
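A dbt seed is just a CSV checked into the repo; after loading it with `dbt seed`, downstream models reference it like any other model. The file name and columns below are hypothetical:

```sql
-- models/marts/mnl_shipment_status_mapping.sql (hypothetical example)
-- Assumes a seed at seeds/shipment_status_mapping.csv, e.g.:
--   raw_status,clean_status
--   IN_TRANSIT,in_transit
--   DELIVERED,delivered
-- Load it with `dbt seed`, then reference it via ref():
select
    raw_status,
    clean_status
from {{ ref('shipment_status_mapping') }}
```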
Data Modelling for Marts
Different organizations structure their Data Marts in different ways. Some have only 2 schemas:
- `dimension` stores all Dimension tables
- `fact` stores all Fact tables
Following this pattern, there should also be dedicated schemas for Summary, Snapshot, and Manual tables.
There is also another way to structure them: by domain, e.g.
- `dwh_ops`: All data marts related to Operations
- `dwh_mkt`: All data marts related to Marketing
- `dwh_fin`: All data marts related to Finance
There is a catch here though: if there is a personnel change, re-org, or scope change, the ownership of the tables could become messy without a proper Data Catalog solution in place.
Data Marts naming conventions
For Data Marts, it is a great idea to follow this naming convention: `<dim/fct/sum/snp/mnl>_<BU>_<domain>_<entity>`
dim/fct/sum/mnl/snp
It indicates table type, where:
- dim: Dimension table
- fct: Fact table
- sum: A summary table that only reports aggregated & pre-computed metrics
- mnl: Manual table, usually generated from Google Sheet imports
- snp: Snapshot table
BU (i.e. Business Unit)
Mostly relevant for organizations with multiple brands; the naming convention for BUs varies a lot from organization to organization. The rule of thumb is to maintain consistency.
Domain
Use predefined 3-letter domain-name abbreviations, e.g. `ops` for Operations, `fin` for Finance, etc.
Entity
The Entity stands for the "Entity" in the ER diagram, as detailed below.
What is an Entity, and Entity Relationship (ER)?
According to Wikipedia, an entity is:
In computer science, an entity is an object that has an identity, which is independent of the changes of its attributes. It represents long-lived information relevant for the users and is usually stored in a database.
Even today, most popular OLAP platforms (e.g. Snowflake, BigQuery) are no longer relational in the traditional sense. Yet, the concept of the Entity Relationship (ER) Model remains vital to Data Mart design.
ER modeling is a top-down structure to database design that begins with identifying the important data called entities and relationships in combination with the data that must be characterized in the model. Then, database model designers can add more details such as the information they want to hold about the entities and relationships, which are the attributes, and any constraints on the entities, relationships, and attributes. ER modeling is an important technique for any database designer to master and forms the basis of the methodology.
A relationship type is a set of associations between one or more participating entity types. Each relationship type is given a name that describes its function.
Where is the correct place for Mart models in dbt repo?
In the Jaffle Shop example from dbt Labs, the Mart models are stored directly under the `models` folder:
It is not a bad idea to start with. But as time goes by, it may become hard to manage as the number of Mart models grows. Hence, it may be a better idea to use this structure instead:
models/marts/{{mart_schema_name}}
Or, if Data Mesh or an even more complex architecture is planned ahead, it may be something like:
models/marts/core/{{mart_schema_name}}
models/marts/ops/{{mart_schema_name}}
models/marts/fin/{{mart_schema_name}}
How are the Mart models being run?
As mentioned in another article, the command `dbt run` supports various Node Selection Methods. The recommended way remains the "tag" method (e.g. `dbt run --select tag:staging`).
The "path" method (e.g. all models in the folder `daily_01h00`) is also commonly used (e.g. `dbt run --select path:models/staging`).
The "tag" method is better because it frees the data team from locking all `.sql` files into certain directories, and allows the team to organize the files in a more logical way.
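To use the tag method, a model can declare its tags in its own config block (the model and tag names below are illustrative; tags can also be set per folder in `dbt_project.yml`):

```sql
-- models/marts/fct_orders.sql (hypothetical example)
-- Tags attach scheduling/grouping metadata to the model itself,
-- independent of which folder the file lives in.
{{ config(tags=['marts', 'daily_01h00']) }}

select * from {{ ref('stg_orders') }}

-- The scheduler can then run everything tagged for the 01:00 job:
--   dbt run --select tag:daily_01h00
```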
DOs & DON’Ts
The following comes from dbt Labs:
DOs
- ✅ Group by department or area of concern: If you have fewer than 10 or so marts you may not have much need for subfolders, so as with the intermediate layer, don’t over-optimize too early. If you do find yourself needing to insert more structure and grouping though, use useful business concepts here.
- ✅ Name by entity. Use plain English to name the file based on the concept that forms the grain of the mart: `customers`, `orders`. Note that for pure marts, there should not be a time dimension (`orders_per_day`) here, which is typically best captured via metrics.
- ✅ Materialized as tables or incremental models: Once we reach the marts layer, it's time to start building not just our logic into the warehouse, but the data itself. This gives end users much faster performance for these later models that are actually designed for their use, and saves us costs recomputing these entire chains of models every time somebody refreshes a dashboard or runs a regression in python. As always, start simple and only add complexity as necessary. The models with the most data and compute-intensive transformations should absolutely take advantage of dbt's excellent incremental materialization options, but rushing to make all your marts models incremental by default will introduce superfluous difficulty.
- ✅ Wide and denormalized. Unlike old school warehousing, in the modern data stack storage is cheap and it's compute that is expensive and must be prioritized as such; packing these concepts into very wide, denormalized tables that can provide everything somebody needs about an entity is the goal.
- ✅ Build on separate marts thoughtfully. While we strive to preserve a narrowing DAG up to the marts layer, once here things may start to get a little less strict. A common example is passing information between marts at different grains, as we saw above, where we bring our `orders` mart into our `customers` mart to aggregate critical order data into a `customer` grain. Now that we're really 'spending' compute and storage by actually building the data in our outputs, it's sensible to leverage previously built resources to speed up and save costs on outputs that require similar data, versus recomputing the same views and CTEs from scratch. The right approach here is heavily dependent on your unique DAG, models, and goals; it's just important to note that using a mart in building another, later mart is okay, but requires careful consideration to avoid wasted resources or circular dependencies.
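The incremental materialization recommended above can be sketched as follows (model and column names are assumptions):

```sql
-- models/marts/fct_orders.sql (hypothetical example)
{{
    config(
        materialized='incremental',
        unique_key='order_id',
    )
}}

select
    order_id,
    customer_id,
    ordered_at,
    amount_eur
from {{ ref('stg_orders') }}

{% if is_incremental() %}
-- On incremental runs, only process rows newer than what is already
-- in the target table; {{ this }} refers to the existing mart.
where ordered_at > (select max(ordered_at) from {{ this }})
{% endif %}
```

On the first run (or with `--full-refresh`) the `is_incremental()` block is skipped and the table is built from scratch.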
Don’ts
- ❌ Build the same concept differently for different teams: `finance_orders` and `marketing_orders` is typically considered an anti-pattern. There are, as always, exceptions: a common pattern we see is that finance may have specific needs, for example reporting revenue to the government in a way that diverges from how the company as a whole measures revenue day-to-day. Just make sure that these are clearly designed and understandable as separate concepts, not departmental views on the same concept: `tax_revenue` and `revenue`, not `finance_revenue` and `marketing_revenue`.
- ❌ Make the Mart model specific to a service, or a segment of a business unit / country (e.g. `dim_vc_ops_zendesk_users`, `DIM_VC_PRD_PIM_PRODUCT_COLOR_KR`)
- ❌ Too many joins in one table: the more complicated a table is, the higher the chance of issues (e.g. duplications)
Further considerations
- Troubleshoot via tables. While stacking views and ephemeral models up until our marts — only building data into the warehouse at the end of a chain when we have the models we really want end users to work with — is ideal in production, it can present some difficulties in development. Particularly, certain errors may seem to be surfacing in our later models that actually stem from much earlier dependencies in our model chain (ancestor models in our DAG that are built before the model throwing the errors). If you’re having trouble pinning down where or what a database error is telling you, it can be helpful to temporarily build a specific chain of models as tables so that the warehouse will throw the error where it’s actually occurring.
- After marts: the activation layer. In the same way that our staging models are building blocks for our marts, that also offer us direct views into specific source data, our marts are building blocks for our final outputs that also offer direct views into specific ideas. You can use marts directly, but they are equally important as components for building models in the activation layer after marts.
Appendix
- Marts: Business-defined entities | dbt Developer Hub
- Ralph Kimball — The father of data warehousing
- Dimensional modeling
- Entity Relationship(ER) Model
- dbt Tests
- dbt Models — Staging layer