The Power of Data Build Tool (dbt)

A Guide for Analytics Engineers

Andy Sawyer
Jun 16, 2024

Data Analytics is an ever-evolving field, and there are many tools out there to help improve workflows and deliver outcomes. Since early 2018, I’ve been leveraging Data Build Tool (dbt) to rapidly build data models in various data warehouses across a number of companies and industries. Over the years, I’ve utilised both dbt-core and dbt-cloud, seeing first-hand their capabilities in driving business value. Additionally, my partner is a dbt-certified developer and a regular dbt trainer for the consulting company she works for.

In this article, we’ll delve into some of the standout features of dbt, and I’ll guide you through a technical solution using DuckDB, complete with screenshots, code snippets, and a supporting GitHub repository.

Features of dbt

dbt stands out as a robust tool for data transformation, offering a suite of features that streamline the development and management of data models. It’s important to note that dbt is not an ETL tool. It focuses specifically on the transformation step. But whether you’re ensuring data integrity through testing or generating comprehensive documentation, dbt provides the necessary capabilities to enhance your analytics workflow. Below, we look at some of the key features that make dbt an indispensable tool for Analytics Engineers.

Model Testing for Integrity

One of the most powerful features of dbt is its ability to test your data models for integrity. By integrating testing directly into the data transformation process, dbt ensures that data quality issues are identified and resolved early. This capability is crucial for maintaining the reliability of your analytics.

There are a number of tests that come ‘out of the box’, including:

  • unique: tests whether all values in a column are unique
  • not_null: tests whether there are any NULL values in a column
  • accepted_values: checks whether any values deviate from a defined list (good for enums)
  • relationships: checks that the values in a column exist in a column in a different model (good for foreign keys)
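
As a rough sketch of how these are wired up (the column names and the dim_department model here are illustrative, not taken from the project), the built-in tests are attached to columns in a schema.yml file:

version: 2

models:
  - name: dim_employee
    columns:
      - name: employee_id
        tests:
          - unique
          - not_null
      - name: employment_type
        tests:
          - accepted_values:
              values: ['permanent', 'contract', 'casual']
      - name: department_id
        tests:
          - relationships:
              to: ref('dim_department')
              field: department_id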

You can, however, define any test that you want. Once your tests are defined, running them is as easy as running dbt test and checking that everything passes:

dbt tests

Additionally, as of v1.8, dbt has introduced unit tests. This is a huge advancement: while unit testing has been standard practice in Python for a long time (the unittest module has shipped with the standard library since Python 2.1, released in 2001), the ability to unit test your SQL is very new. While there was limited scope for unit tests in my very simple example code, you can read more about unit tests here.
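
A unit test in v1.8 is declared in YAML against mocked inputs. A minimal, hypothetical sketch (the model, columns and values below are illustrative only):

unit_tests:
  - name: test_full_name_is_concatenated
    model: dim_employee
    given:
      - input: ref('stg_employee')
        rows:
          - {employee_id: 1, first_name: 'Ada', last_name: 'Lovelace'}
    expect:
      rows:
        - {employee_id: 1, full_name: 'Ada Lovelace'}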

Building Documentation Websites

Another noteworthy feature of dbt is its ability to automatically generate comprehensive documentation for your data models. This documentation is built from the yaml files in your project and can be enriched with model- and column-level descriptions and metadata, providing a clear understanding of your data pipeline.

Documentation can be generated by typing dbt docs generate && dbt docs serve from the command line:

Generating dbt docs

Your web browser will open at the docs site. The screenshot below shows the documentation for my dim_employee table, including a description of the table and the fields with their data types and descriptions, as well as any tests that have been defined for them.

dbt docs

Data Lineage

Within the documentation site, there is a link at the bottom right to show data lineage. This is hugely valuable, showing both upstream and downstream models, plus sources and exposures:

dbt lineage

Here we can see dim_employee (the selected model, shown in purple) references two staging models (shown in blue), which in turn reference two prior models (our snapshots), which themselves draw on external sources (shown in green). We can also see it is used in one downstream model and is exposed to the business in a dashboard (shown in orange).

Data Contracts

As part of the model definitions within schema.yml, it is possible to declare the data type for each field and enforce it as a contract. If the data types change without being updated here, the build will fail. This ensures that the created table schema remains consistent even as the code changes, and when changes do occur you’re aware of them and can inform downstream users before they hit production.

schema.yml with table and field names, data types and tests
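
As the screenshot isn’t reproduced here, this is roughly what an enforced contract looks like in schema.yml (the columns and types are illustrative):

models:
  - name: dim_employee
    config:
      contract:
        enforced: true
    columns:
      - name: employee_id
        data_type: integer
      - name: first_name
        data_type: varchar(50)
      - name: valid_from
        data_type: timestamp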

Types of Models in dbt

In dbt, you can create different types of models, each serving a specific purpose in your data pipeline. These include:

  • Snapshots: Capture the state of your data at a specific point in time, with valid from and valid to dates similar to a Type 2 SCD. I use snapshots on all mutable sources. It helps enormously when someone tells me the data looks different to when they looked at it last week and they want to know why.
  • Tables: Materialise the results of your queries as persistent tables in the database.
  • Views: Create virtual tables that provide a dynamic view of your data.
  • Incremental Models: Process and store only the new or changed data since the last run, optimising performance and storage.
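
To illustrate the last type in the list, an incremental model is just a SELECT with a config block and an is_incremental() guard; a minimal sketch with hypothetical source and column names:

{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    event_type,
    occurred_at
from {{ source('demo', 'events') }}

{% if is_incremental() %}
-- only process rows newer than what is already in the target table
where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}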

Technical Solution: Building a dbt Project with DuckDB

Now, let’s dive into a step-by-step guide to building a dbt project that leverages DuckDB. We’ll use a simple project that pulls sources from CSV files and builds out a simple fact and dimension model. I’ve created a couple of different types of dimension model, including a Type 2 Slowly Changing Dimension that leverages two underlying snapshots.

Step 1: Setting Up the Environment

To get started, you’ll need to install dbt and DuckDB. Detailed instructions can be found in the official dbt and DuckDB documentation, but it’s basically just a matter of creating a Python virtual environment and then running:

pip install dbt-core dbt-duckdb
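
On macOS or Linux, for example, the full sequence might look like this:

python -m venv .venv
source .venv/bin/activate
pip install dbt-core dbt-duckdb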

Step 2: Creating the dbt Project

Initialise your dbt project and configure it to use DuckDB as the data warehouse. You do this by telling dbt to initialise a new project:

dbt init

You will be guided to give your project a name and choose the database adapter that you would like to use.
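
The connection details end up in your profiles.yml. For DuckDB this is minimal; a sketch, assuming a project called dbt_duckdb_demo and a local database file:

dbt_duckdb_demo:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: dev.duckdb
      threads: 4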

Step 3: Defining Sources and Models

Create your source definitions in dbt to pull data from CSV files. Then create your snapshots within the snapshots folder if those sources are mutable.

A mutable source (as opposed to an immutable source) is one whose history can change. Take a CRM with an account table: if it’s possible to update the account’s details within the CRM so that the old version is no longer available, that data is mutable. Event logs, by contrast, are immutable; once they’ve been written, they can never be updated by the source system.

Once we have our snapshots, we can create staging views of these, where we implement our ‘hard rules’, and finally define our fact and dimension models within the models folder.

A hard rule (as opposed to a soft or business rule) is a rule that relates to the data rather than the business. If you look at the stage models I have defined, I’ve implemented hard rules around the strings to ensure they are fixed at a specific size. You can also cast a datetime that only ever contains a date to a date type, for example.

This is what my final structure looks like:

dbt project

One of the things I like about using dbt with DuckDB is that while dbt only focuses on the transformation step, the capabilities of DuckDB allow you to reference external tables. In the screenshot below, I’ve pointed to four CSV files in the demo_files folder. This is far easier than needing to load the data into the database before being able to reference it:
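
As a hedged sketch of that idea (the source and file names are placeholders; the external_location meta key comes from the dbt-duckdb adapter), a source can point straight at the CSV files:

version: 2

sources:
  - name: demo
    meta:
      external_location: "demo_files/{name}.csv"
    tables:
      - name: employees
      - name: departments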

Once the sources have been created, I can reference them in my model using Jinja. This is how dbt knows to create the lineage and relationships between the sources, models and exposures.
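
For example, snapshots and models reference declared sources with source(), and downstream models reference other models with ref(); those Jinja calls are what dbt uses to build the dependency graph (names illustrative):

-- reading from a declared source
select * from {{ source('demo', 'employees') }}

-- reading from another model
select * from {{ ref('stg_employee') }}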

In the screenshot below, you can see how I’ve built out one of the snapshot tables. There is no requirement to CREATE TABLE, or even to worry about any of the logic to create a MERGE statement; dbt does this for you. You simply write the SQL statement, with some configuration within the Jinja blocks.

dbt snapshot

The config at the top defines the schema that the snapshot will be created in, the unique key and the strategy for building the snapshot.
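
As the screenshot isn’t reproduced here, a generic snapshot has this shape (the name, key and strategy below are illustrative, not necessarily what my project uses):

{% snapshot snap_employee %}

{{
    config(
      target_schema='snapshots',
      unique_key='employee_id',
      strategy='check',
      check_cols='all'
    )
}}

-- dbt handles the merge and the dbt_valid_from / dbt_valid_to bookkeeping
select * from {{ source('demo', 'employees') }}

{% endsnapshot %}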

Similarly, we can build out our other models in the same fashion. Below is a more complicated Type 2 Slowly Changing Dimension table, sourced from two underlying snapshots that each have their own valid-from and valid-to dates. I’ve added my -1 and -2 values into my dimension table as a simple UNION at the end of the query (off screen). I’m joining two dbt snapshots together to show changes over time across both tables. I’ve not added the target schema into the config block as I did with the snapshots, so it will fall back to using the default schema:

Slowly Changing Dimension
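
The SQL itself isn’t reproduced here, but the general pattern for merging two snapshots into one Type 2 dimension is an overlapping-interval join. A simplified sketch with illustrative names:

with emp as (
    select * from {{ ref('snap_employee') }}
),

job as (
    select * from {{ ref('snap_job_history') }}
)

select
    emp.employee_id,
    emp.first_name,
    job.job_title,
    -- the combined validity window is the overlap of the two source windows
    greatest(emp.dbt_valid_from, job.dbt_valid_from) as valid_from,
    least(
        coalesce(emp.dbt_valid_to, cast('9999-12-31' as timestamp)),
        coalesce(job.dbt_valid_to, cast('9999-12-31' as timestamp))
    ) as valid_to
from emp
join job
    on emp.employee_id = job.employee_id
    -- keep only the rows whose validity windows overlap
    and emp.dbt_valid_from < coalesce(job.dbt_valid_to, cast('9999-12-31' as timestamp))
    and job.dbt_valid_from < coalesce(emp.dbt_valid_to, cast('9999-12-31' as timestamp))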

My fact table joins in the dimensions, allowing me to identify the correct mapping to -1 or -2 values, as well as helping the documentation show the relationships. This approach ensures an analyst can always join the tables using an INNER JOIN. Note the join to the dim_employee table uses date ranges as it’s linking to a Type 2 SCD:

Fact Table
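
Again as a sketch only (the real fact model will differ, and the column names here are made up), the date-range join to the Type 2 dimension and the fallback to the -1 member look roughly like this:

select
    s.sale_id,
    s.sale_date,
    s.amount,
    -- fall back to the -1 'unknown' member so analysts can always INNER JOIN
    coalesce(e.employee_key, -1) as employee_key
from {{ ref('stg_sales') }} s
left join {{ ref('dim_employee') }} e
    on s.employee_id = e.employee_id
    -- pick the dimension row that was valid on the date of the sale
    and s.sale_date >= e.valid_from
    and s.sale_date < coalesce(e.valid_to, cast('9999-12-31' as timestamp))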

Step 4: Running Tests and Building Documentation

Run your dbt tests to ensure the integrity of your models. Generate the documentation website to get a clear overview of your data pipeline.

(See further up the article for details)

Step 5: Reviewing the Results

Review the output of your dbt run, examining the built models and generated documentation to ensure everything is as expected.

Given I’m running this to create a DuckDB database, I can simply jump into the DuckDB CLI and run a SQL statement to see the results:

DuckDB SQL query on final dataset
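
For example (the database file and table name here are placeholders):

duckdb dev.duckdb
select * from fct_sales limit 10;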

While you may wonder why I would use DuckDB for my dbt project, it’s worth calling out that I’m running this in development. I could have my project run against DuckDB in dev and Snowflake in prod, which would reduce Snowflake development costs significantly. To the extent that the SQL dialects of your development and production environments differ, you can leverage Jinja macros in place of hard-coded SQL, having the macro return the correct SQL for whichever environment you’re in.
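
A hedged sketch of that idea: a macro that checks target.type and returns dialect-specific SQL (the macro and column names are made up for illustration):

{% macro add_days(column_name, number_of_days) %}
    {% if target.type == 'duckdb' %}
        {{ column_name }} + interval {{ number_of_days }} day
    {% else %}
        dateadd(day, {{ number_of_days }}, {{ column_name }})
    {% endif %}
{% endmacro %}

A model can then call {{ add_days('order_date', 7) }} and compile to the right SQL in each environment.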

Conclusion

dbt has revolutionised the way we build and manage data models, providing powerful features like model testing and automatic documentation. By following this guide, you can quickly set up a dbt project with DuckDB and start leveraging the benefits of this powerful tool. I encourage you to give dbt a try and explore its capabilities further. You can find the complete project and code snippets in the linked GitHub repository here.


Andy Sawyer

Bringing software engineering best practices and a product driven mindset to the world of data. Find me at https://www.linkedin.com/in/andrewdsawyer/