dbt — An overview of the supernova in data & analytics engineering
What is dbt (Data Build Tool)?
From dbt Labs:
dbt is a transformation workflow that helps you get more work done while producing higher quality results. You can use dbt to modularize and centralize your analytics code, while also providing your data team with guardrails typically found in software engineering workflows. Collaborate on data models, version them, and test and document your queries before safely deploying them to production, with monitoring and visibility.
In the modern ELT (Extract, Load, Transform) approach to data engineering, dbt covers the T (data transformation) as an open-source transformation tool. The company behind dbt is dbt Labs (formerly Fishtown Analytics).
Commonly used components
Models
Recommended architecture from dbt Labs
In dbt Labs' official documentation, this structure is proposed as the best practice:
- Staging
- Intermediate
- Marts
Here, Staging models are:
The staging layer is where our journey begins. This is the foundation of our project, where we bring all the individual components we’re going to use to build our more complex and useful models into the project.
And, Intermediate Models are:
The intermediate layer is where these molecules live, creating varied forms with specific purposes on the way towards the more complex proteins and cells we’ll use to breathe life into our data products.
Last but not least, Marts Models are:
This is the layer where everything comes together and we start to arrange all of our atoms (staging models) and molecules (intermediate models) into full-fledged cells that have identity and purpose. We sometimes like to call this the entity layer or concept layer, to emphasize that all our marts are meant to represent a specific entity or concept at its unique grain.
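As an illustration of the staging layer described above, a minimal staging model might look like the following sketch. The `shop.orders` source and the column names are hypothetical, not taken from any real project:

```sql
-- models/staging/stg_orders.sql (hypothetical source and columns)
with source as (

    select * from {{ source('shop', 'orders') }}

),

renamed as (

    -- light-touch transformations only: renaming and type casting
    select
        id          as order_id,
        customer    as customer_id,
        status,
        created_at

    from source

)

select * from renamed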
Tests
dbt Tests are:
Tests are assertions you make about your models and other resources in your dbt project (e.g. sources, seeds, and snapshots). When you run `dbt test`, dbt will tell you if each test in your project passes or fails.
There are four types of tests that come with dbt-core natively:
- `not_null`
- `unique`
- `accepted_values`
- `relationships`
The most typical use case is testing primary keys, which are usually covered with both the `not_null` (https://docs.getdbt.com/reference/resource-properties/tests#not_null) and `unique` (https://docs.getdbt.com/reference/resource-properties/tests#unique) tests.
For business-critical columns, it is also suggested to start with a `not_null` test to get a sense of how good the data quality of those columns is. Once more confidence in the column's data quality is gained, `accepted_values` (https://docs.getdbt.com/reference/resource-properties/tests#accepted_values, for categorical columns) and `relationships` (for referential integrity) tests can be added.
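Putting the four built-in tests together, a schema file for a hypothetical orders model could look like this (model, column, and value names are illustrative only):

```yaml
# models/staging/stg_orders.yml (hypothetical model and columns)
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id           # primary key: both not_null and unique
        tests:
          - not_null
          - unique
      - name: status              # categorical column
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id         # referential integrity against another model
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```

Running `dbt test` will then compile and execute one query per test and report pass/fail for each.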
For more sophisticated data quality tests, tests from dbt-utils and dbt_expectations can also be considered.
Macros
dbt Macros are:
Macros in Jinja are pieces of code that can be reused multiple times — they are analogous to "functions" in other programming languages, and are extremely useful if you find yourself repeating code across multiple models. Macros are defined in `.sql` files, typically in your `macros` directory (docs).
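A small example, adapted from the pattern shown in dbt's documentation, of what a macro and its usage look like:

```sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

Any model can then call it instead of repeating the expression: `select {{ cents_to_dollars('amount_cents') }} as amount_dollars from {{ ref('stg_payments') }}` (the `amount_cents` column and `stg_payments` model are hypothetical).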
Packages
dbt Packages are add-on libraries that extend dbt's functionality.
Software engineers frequently modularize code into libraries. These libraries help programmers operate with leverage: they can spend more time focusing on their unique business logic, and less time implementing code that someone else has already spent the time perfecting.
dbt also hosts a hub of all supported packages here: dbt — Package hub.
Certain popular packages like dbt-utils and Elementary are widely used. Additional packages such as Snowplow are also being adopted at the time of writing.
Seeds
Seeds in dbt are:
Seeds are CSV files in your dbt project (typically in your `seeds` directory), that dbt can load into your data warehouse using the `dbt seed` command.
In general, when business-defined tables are requested (often in the form of a Google Sheet import request), it is better to cover them with Seeds rather than building another Google Sheet import job, provided the table is not expected to change very frequently.
Here are a couple of benefits of using Seeds instead of Google Sheet import:
- Version control
- Everything in one place: the CSV files live directly in the repo
- Compatibility with the `ref` function, which improves dbt lineage & dependency management
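For instance, a seed is just a plain CSV file checked into the project (this mapping table is a hypothetical example):

```csv
country_code,country_name
DE,Germany
FR,France
NL,Netherlands
```

After `dbt seed` loads the file (saved as, say, `seeds/country_codes.csv`) into the warehouse, any model can reference it like a regular model: `select * from {{ ref('country_codes') }}`.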
Note:
- It is vital to make sure the relevant Airflow DAG has a task that runs the `dbt seed` command (About dbt seed command) before `dbt run`, or else the table won't be generated.
Snapshots
Snapshots in dbt are:
Snapshots implement type-2 Slowly Changing Dimensions over mutable source tables. These Slowly Changing Dimensions (or SCDs) identify how a row in a table changes over time. Imagine you have an `orders` table where the `status` field can be overwritten as the order is processed.
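Following dbt's documented snapshot syntax, a snapshot of that kind of `orders` table could be sketched as below; the `shop.orders` source and the `updated_at` column are assumptions for the example:

```sql
-- snapshots/orders_snapshot.sql (hypothetical source; timestamp strategy)
{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('shop', 'orders') }}

{% endsnapshot %}
```

Each `dbt snapshot` run compares the current source rows to the snapshot table and records changed rows with validity timestamps, preserving the history of the `status` field.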
Analyses
Analyses in dbt are:
dbt's notion of `models` makes it easy for data teams to version control and collaborate on data transformations. Sometimes though, a certain SQL statement doesn't quite fit into the mold of a dbt model. These more "analytical" SQL files can be versioned inside of your dbt project using the `analysis` functionality of dbt.
Some best use cases of Analyses are:
- Data auditing
- Onboarding & training purposes
- One-off analyses
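For example, a one-off audit query can live in the `analyses` directory and still use `ref`; the `stg_orders` model here is hypothetical:

```sql
-- analyses/audit_order_status_counts.sql (hypothetical one-off audit)
select
    status,
    count(*) as order_count
from {{ ref('stg_orders') }}
group by status
order by order_count desc
```

Files in `analyses` are compiled by `dbt compile` but never materialized by `dbt run`, which makes them a safe place for exploratory SQL.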
Further Readings
More content to be added & enriched in the future here.