dbt — An overview of the supernova in data & analytics engineering

Jimmy Pang
Published in Data Panda
4 min read · Sep 1, 2023

[Figure: the dbt logo]

What is dbt (Data Build Tool)?

From dbt Labs:

dbt is a transformation workflow that helps you get more work done while producing higher quality results. You can use dbt to modularize and centralize your analytics code, while also providing your data team with guardrails typically found in software engineering workflows. Collaborate on data models, version them, and test and document your queries before safely deploying them to production, with monitoring and visibility.

In the modern ELT (Extract, Load, Transform) approach to data engineering, dbt is the open-source tool that takes care of the T (data transformation). The team behind dbt is dbt Labs (formerly Fishtown Analytics).

Commonly used components

Models

Recommended architecture from dbt Labs

In dbt Labs’ official documentation, this structure is proposed as the best practice:

  1. Staging
  2. Intermediate
  3. Marts

Staging models are described as:

The staging layer is where our journey begins. This is the foundation of our project, where we bring all the individual components we’re going to use to build our more complex and useful models into the project.

Intermediate models are described as:

The intermediate layer is where these molecules live, creating varied forms with specific purposes on the way towards the more complex proteins and cells we’ll use to breathe life into our data products.

Last but not least, Marts Models are:

This is the layer where everything comes together and we start to arrange all of our atoms (staging models) and molecules (intermediate models) into full-fledged cells that have identity and purpose. We sometimes like to call this the entity layer or concept layer, to emphasize that all our marts are meant to represent a specific entity or concept at its unique grain.
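As an illustration, a staging model is typically a light rename-and-recast pass over a single source table, which the intermediate and marts layers then build on. The source, table, and column names below are hypothetical:

```sql
-- models/staging/stg_orders.sql (hypothetical source and column names)
with source as (

    -- pull from a raw source table declared in a sources .yml file
    select * from {{ source('jaffle_shop', 'orders') }}

),

renamed as (

    select
        id          as order_id,
        user_id     as customer_id,
        status      as order_status,
        ordered_at::date as order_date
    from source

)

select * from renamed
```

Downstream intermediate and marts models would then reference this model with `{{ ref('stg_orders') }}` rather than hitting the raw source directly.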

Tests

Tests in dbt are:

Tests are assertions you make about your models and other resources in your dbt project (e.g. sources, seeds and snapshots). When you run dbt test, dbt will tell you if each test in your project passes or fails.

There are 4 generic tests that come with dbt-core natively:

  • not_null
  • unique
  • accepted_values
  • relationships

The most typical use case is testing primary key columns, which are usually covered with both the not_null (https://docs.getdbt.com/reference/resource-properties/tests#not_null) and unique (https://docs.getdbt.com/reference/resource-properties/tests#unique) tests.

For business-critical columns, it is also suggested to start with the not_null test to get a sense of how good the data quality of those columns is. Once more confidence in a column’s data quality is gained, accepted_values (https://docs.getdbt.com/reference/resource-properties/tests#accepted_values, for categorical columns) and relationships (for referential integrity) tests can be added.
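Putting those together, the four generic tests are declared in a model’s .yml file. A minimal sketch, with hypothetical model, column, and status values:

```yaml
# models/staging/_stg_models.yml (hypothetical names and values)
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique        # primary key checks:
          - not_null      # both tests together
      - name: order_status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:          # referential integrity
              to: ref('stg_customers')
              field: customer_id
```

Running `dbt test` then reports a pass/fail for each declared test.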

For more sophisticated data quality tests, tests from the dbt-utils and dbt_expectations packages can also be considered.

Macros

Macros in dbt are:

Macros in Jinja are pieces of code that can be reused multiple times — they are analogous to “functions” in other programming languages, and are extremely useful if you find yourself repeating code across multiple models. Macros are defined in .sql files, typically in your macros directory (docs).
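A small example, adapted from the cents_to_dollars macro used in the dbt documentation (the exact numeric cast may vary by warehouse):

```sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, scale=2) %}
    ({{ column_name }} / 100)::numeric(16, {{ scale }})
{% endmacro %}
```

It can then be called inside any model, e.g. `select {{ cents_to_dollars('amount_cents') }} as amount_dollars from {{ ref('stg_payments') }}`, keeping the conversion logic in one place instead of repeating it across models.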

Packages

dbt packages are additional libraries that extend the functionality of dbt. From the dbt documentation:

Software engineers frequently modularize code into libraries. These libraries help programmers operate with leverage: they can spend more time focusing on their unique business logic, and less time implementing code that someone else has already spent the time perfecting.

dbt Labs also hosts a registry of all supported packages here: dbt — Package hub.

Popular packages like dbt-utils and Elementary are widely adopted. Additional packages like Snowplow are also in the process of being implemented as of this writing.
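Packages are declared in a packages.yml file at the project root and installed with `dbt deps`. A sketch — the version numbers below are illustrative, so check the Package hub for current releases:

```yaml
# packages.yml (versions are illustrative)
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: elementary-data/elementary
    version: 0.13.0
```

After `dbt deps`, the packages’ macros and tests are available throughout the project.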

Seeds

Seeds in dbt are:

Seeds are CSV files in your dbt project (typically in your seeds directory), that dbt can load into your data warehouse using the dbt seed command.

In general, if a business-defined table is requested (usually in the form of a Google Sheet import request) and that table is not expected to change frequently, it is better to cover the request with a seed instead of building another Google Sheet import job.

Here are a couple of benefits of using Seeds instead of Google Sheet import:

  1. Version Control
  2. Everything in 1 place: the CSV files are directly in the repo
  3. Seeds work with the ref function, which improves dbt lineage & dependency management

Note:

  • It is vital to make sure the relevant Airflow DAG has a task to run the dbt seed command (About dbt seed command) before dbt run; otherwise the seed tables won’t be generated
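To illustrate the ref benefit: assuming a hypothetical seed file seeds/country_codes.csv with columns country_code,country_name loaded via `dbt seed`, a model can join it like any other model:

```sql
-- models/staging/stg_locations.sql (hypothetical model and seed names)
-- country_codes is a seed (CSV in the repo), referenced via ref()
-- so it appears in the dbt lineage graph like any model.
select
    l.location_id,
    l.country_code,
    c.country_name
from {{ ref('stg_raw_locations') }} as l
left join {{ ref('country_codes') }} as c
    on l.country_code = c.country_code
```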

Snapshots

Snapshots in dbt are:

Snapshots implement type-2 Slowly Changing Dimensions over mutable source tables. These Slowly Changing Dimensions (or SCDs) identify how a row in a table changes over time. Imagine you have an orders table where the status field can be overwritten as the order is processed.
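Following that orders example, a snapshot is defined in its own .sql file inside a snapshot block; the sketch below is based on the example in the dbt documentation (source and column names assumed):

```sql
-- snapshots/orders_snapshot.sql
{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',     -- detect changes via an updated-at column
      updated_at='updated_at',
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}
```

Each `dbt snapshot` run compares the current source rows against the snapshot table and records changed rows with validity timestamps, producing a type-2 SCD history.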

Analyses

Analyses in dbt are:

dbt’s notion of models makes it easy for data teams to version control and collaborate on data transformations. Sometimes though, a certain sql statement doesn't quite fit into the mold of a dbt model. These more "analytical" sql files can be versioned inside of your dbt project using the analysis functionality of dbt.

Some of the best use cases for Analyses are:

  • Data auditing
  • Onboarding & training purposes
  • One-off analyses
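For example, a hypothetical audit query can live in the analyses directory, where `dbt compile` resolves its Jinja but `dbt run` never materializes it:

```sql
-- analyses/audit_order_status_counts.sql (hypothetical one-off audit)
-- Compiled by `dbt compile`, never built as a table or view by `dbt run`.
select
    order_status,
    count(*) as order_count
from {{ ref('stg_orders') }}
group by order_status
order by order_count desc
```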

Further Readings

More content to be added & enriched in the future here.


Jimmy Pang
Data Leader, Evangelist and Educator, dedicated to the data journey. Interested in tech and classy stuff: art, whiskey, coffee, tea, spirituality, history, etc.