DBT — Best Practice

Chaim Turkel
Published in Israeli Tech Radar · Feb 18, 2024

Yes, you can ask ChatGPT for a list of best practices. GPT even gives a good list: Version Control, Documentation, Modularization, Naming Conventions, Testing, Environment-specific Configurations, and Scheduling.

The list is nice, but it does not capture the gestalt of what a good project needs. So I will try here to lay out what I see as the main building blocks.

Model naming and structure

This is the first and main block that needs to be tackled. When deciding on names and structure, you need to take into account both the dev environment and the production environment.

On the dev side, the question is how the folders and files are named and organized. For production, the question is which database each set of tables is added to.

For an in-depth analysis of the project structure see my previous blog DBT — Project Structure. For your production database structure see my blog on DBT — Database Schemas.
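To make the naming concrete, here is one common layout following the staging/intermediate/marts convention (the folder and model names are illustrative):

```
models/
├── staging/
│   └── salesforce/
│       ├── _salesforce__sources.yml
│       ├── stg_salesforce__accounts.sql
│       └── stg_salesforce__opportunities.sql
├── intermediate/
│   └── int_opportunities_enriched.sql
└── marts/
    └── finance/
        ├── fct_orders.sql
        └── fct_orders.yml
```

The prefixes (stg_, int_, fct_) tell a reader at a glance which layer a model belongs to, and in production each layer can be mapped to its own schema or database.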

Packages

Before reinventing the wheel, check what is already out there. The DBT team maintain a set of packages that you can use, hosted at hub.getdbt.com. In addition, a simple web search will surface the most-used packages. For example, if you are working with external sources such as Salesforce, check which packages support the Salesforce models, such as salesforce_formula_utils. So go have a look at the getdbt hub; there are hundreds of packages there.
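As a minimal sketch, pulling hub packages is just an entry in packages.yml followed by dbt deps (the version ranges here are illustrative — check the hub for current ones):

```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.1.0", "<2.0.0"]
  - package: dbt-labs/salesforce_formula_utils
    version: [">=0.8.0", "<1.0.0"]
```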

For an article about how to host your own packages and issues that might arise see The In and Out of the DBT Package.

Macros

In the end, not everything we want is out there, and we have to create our own. Like any POC, you start locally within the monolith and then break out. Here you can start within the project, but very quickly you should create a separate git project for your macros and serve them within your company as a package.

You want to limit copy-paste between projects and to version the macros you share.
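A sketch of how the sharing works, assuming a hypothetical shared-macros repository in your organization: the macros live in their own dbt project, and each consuming project pins a tagged revision in its packages.yml.

```yaml
# packages.yml in a consuming project — the repository URL and tag are hypothetical
packages:
  - git: "https://github.com/my-org/dbt-shared-macros.git"
    revision: "v1.2.0"   # pin a tag, not a branch, so upgrades are deliberate
```

Pinning a tag rather than a branch gives you the versioning mentioned above: each project decides when to pick up a new release of the shared macros.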

Environment-specific Configurations

DBT supports a very nice interface via the profiles file. Remember that security is very important: you do not want to add any secrets to this file. All secrets should be kept in environment variables on your machine and injected into the profiles file.

DBT makes it very easy to create two profiles, one for dev and one for prod. I think you should not be enticed to do this. You do not want your developers changing anything in prod by mistake. So you should only have dev and staging profiles, and no prod profile.

In my opinion, only the CD pipeline should generate the profile for prod with the correct connections and secrets. This direction also makes sure that you properly test your code before deployment and do not rely on running it locally.
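A minimal sketch of such a profiles file, assuming a Snowflake adapter (any adapter works the same way); the role, warehouse, and database names are placeholders, all secrets come from environment variables, and there is deliberately no prod target:

```yaml
# profiles.yml — dev and staging only; the prod profile is generated by the CD pipeline
my_project:                                   # must match the profile name in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: snowflake                         # adapter is an assumption
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"   # injected from the environment, never committed
      role: TRANSFORMER_DEV
      warehouse: DEV_WH
      database: DEV_DB
      schema: "dbt_{{ env_var('DBT_USER') }}"     # per-developer schema
    staging:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: TRANSFORMER_STG
      warehouse: STG_WH
      database: STAGING_DB
      schema: analytics
```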

Scheduling

Scheduling and orchestration is a very important area, and one that DBT has yet to conquer. DBT gives you a simple CLI utility to run models. While the options for selecting models are very rich, it does not meet the demands of a real production solution.

This means you need to think ahead of time about which schedules your models should run on. Make sure that you do not run models needlessly; for example, refreshing a model every hour when it is based on a model that is refreshed once a day is very wasteful.
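One lightweight way to encode this, assuming models carry cadence tags in their config (the tag names are your own convention), is to let each scheduled job select only its tag:

```bash
# hourly job — only the models tagged as hourly
dbt run --select tag:hourly

# daily job — the heavier refreshes that depend on once-a-day sources
dbt run --select tag:daily
```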

There are also other issues of data testing and transactions that you need to think about. For a more in-depth article on this subject have a look at DBT Orchestration.

CI/CD

Every project needs a test/CI/CD cycle, and a DBT project is no different. Your project should have unit tests for tricky business use cases. There are different frameworks out there for mocking your source data to test the modeling. An overview of the different types of testing can be found in A Simple (Yet Effective) Approach to Implementing Unit Tests for dbt Models. You should also have a look at the dbt_unit_testing package.

In my opinion, every branch that is opened should run all the models that have changed, to validate that they can be run, and all unit tests for these models should run as well. If their data is very big, you can start by materializing all models as views at this stage, and configure the heavier, persistent materializations for production only.
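A sketch of such a branch check, assuming the CI job can fetch the manifest from the last production run (the artifacts path is illustrative): build only the models that changed, plus their downstream dependents, and defer unchanged parents to production.

```bash
dbt deps
# compare against the production manifest and build only what changed (and its children)
dbt build --select state:modified+ --defer --state ./prod-artifacts
```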

Once you merge your code to main/master you should run all the models to validate that they all work well together and that no code or packages are missing.

Production

We now have a project that can be released to production, but how should we release it?

The best practice is to create a Docker image that contains all the relevant models + packages + the dbt core version. Even if you are using Airflow or some other Python-based orchestration that already has DBT installed, you should not use that installed version. This is a bad habit that brings a lot of trouble down the line. You want to be able to decide when and where to upgrade your dbt core; you might roll it out on some domains and not on others. The only way to support this is a self-contained Docker image for the models that includes DBT itself.

In addition, all packages should be downloaded and installed into the image at build time. This reduces the runtime of jobs that would otherwise need to download the packages before running.
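A minimal sketch of such an image; the base image, dbt version, and adapter are assumptions — pin whatever your platform actually uses:

```dockerfile
FROM python:3.11-slim

# Pin dbt core and the adapter explicitly — upgrades are a deliberate, per-domain decision
RUN pip install --no-cache-dir "dbt-core==1.7.*" "dbt-snowflake==1.7.*"

WORKDIR /dbt
COPY . /dbt

# Bake the packages into the image so jobs don't download them at run time
RUN dbt deps

# The prod profiles.yml is generated by the CD pipeline and injected at deploy time
ENTRYPOINT ["dbt"]
```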

Documentation

Documentation is one of the hard issues to tackle. Most people don't want to write docs. DBT makes it a bit easier by keeping the docs as part of the source code. We advise having a documentation file per model. To make life easier, we usually add documentation at the source level and also at the marts level.
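As a sketch, such a per-model file looks like this (the model and column names are illustrative); the {{ doc() }} reference lets you keep longer text in a separate markdown docs block:

```yaml
# models/marts/finance/fct_orders.yml
version: 2

models:
  - name: fct_orders
    description: "{{ doc('fct_orders') }}"     # longer prose lives in a .md docs block
    columns:
      - name: order_id
        description: "Primary key of the order."
        tests:
          - unique
          - not_null
      - name: amount
        description: "Order amount in USD, after discounts."
```

The docs block itself is defined once in a markdown file ({% docs fct_orders %} ... {% enddocs %}), which also makes it possible to reference the same text from more than one model.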

Why is documentation important? Since we want to create a Self Service Data Platform, it needs to be self-explanatory. When other people need to search for different datasets to base their queries on, documentation becomes very important.

There are a lot of enhancements that need to be done to make the world of documentation easier. For instance, we would like to see a pre-commit hook that mandates documentation coverage before you commit your code.

In addition, it would be nice if documentation could be inherited between models: if I define a description on a field of a base table, I would expect the same text to appear on all tables that then use this field.

Data Catalog

The data catalog is the new kid on the block. It looks very promising, but it seems hard to find one that is good and actually delivers value to the company. DBT helps a lot here, since most data catalogs can import the DBT schemas, including data lineage. Having said that, I will add that I have my eye on DataHub; it is open source and looks very promising.
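As a sketch of what that import looks like with DataHub (the field names follow my understanding of its dbt connector and may differ in your version — check the DataHub docs), you point an ingestion recipe at the artifacts dbt produces in target/:

```yaml
# datahub_dbt_recipe.yml — run with: datahub ingest -c datahub_dbt_recipe.yml
source:
  type: dbt
  config:
    manifest_path: ./target/manifest.json     # produced by dbt compile / run
    catalog_path: ./target/catalog.json       # produced by dbt docs generate
    target_platform: snowflake                # the warehouse behind the models (assumption)
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080             # your DataHub GMS endpoint
```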

CLI

We have gone over a lot of best practices, but how do we enforce them, and how do we make developers' lives easier?

The idea of a CLI is to add functionality to your project and also to constrain what is allowed to be done.

The basics are generating the yaml files and creating new models with proper naming and attributes. For a deep dive into the subject, see DBT Cli Design.
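To make this concrete, here is the kind of command such an internal CLI might expose; the tool name, flags, and generated files are entirely hypothetical, just to illustrate the scaffolding idea:

```bash
# hypothetical internal CLI — names and flags are made up for illustration
my-dbt new-model --layer staging --source salesforce --name accounts
# would create models/staging/salesforce/stg_salesforce__accounts.sql
# plus a matching stg_salesforce__accounts.yml with the agreed naming and attributes
```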

If you don’t want to go and create a full CLI application, then you should start with dbt power user.

Summary

Data Build Tool (dbt) has become a widely adopted practice in the field of data analytics, including in Israel. This article delves into the common use of dbt and outlines best practices for its implementation. As organizations increasingly recognize the importance of efficient data management and analysis, the article serves as a guide for maximizing the benefits of dbt, offering insights into key strategies and recommendations to enhance its application. Whether for streamlining data workflows or optimizing analytical processes, the best practices highlighted in the article aim to support practitioners in leveraging dbt effectively for improved data-driven decision-making.
