DBT / BEST PRACTICES

How to Structure Your DBT Project

Paul Fry
Published in Geek Culture
Apr 20, 2022 · 7 min read

This post describes different approaches to structuring your DBT project(s) and, most importantly, walks through the steps involved in implementing them.

Agenda

  1. Target audience
  2. Problems with the default DBT project structure
  3. Candidate DBT project structures
  4. Demos/How-to’s
  • How to implement the DBT ‘layers’ project structure
  • How to generate DBT documentation across multiple DBT projects
  • How to reference objects between DBT projects
  5. Summary

1. Target audience

This post is intended for those with some familiarity with DBT who wish to understand best practices regarding DBT project structure.

2. Problems with the default DBT project structure

Unnecessary/redundant query processing

When you install DBT, by default you’re using the ‘monolithic’ DBT project structure. This can have significant repercussions further down the line when you make changes across your data warehouse: a proposed change can trigger a series of unnecessary queries due to DBT model dependencies. Unless you consistently tag/target objects in your DBT project when running DBT commands, your changes will likely trigger a whole series of queries across your entire data model.
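As an illustration, DBT’s node-selection syntax lets you scope a run to a subset of models rather than the whole project (the tag and model names below are hypothetical):

    # Run only models tagged 'staging', rather than every model in the project
    dbt run --select tag:staging

    # Run one model plus everything downstream of it
    dbt run --select stg_customers+

Without this kind of scoping, a plain dbt run builds every model in the project.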

Multiple complex use cases within a single DBT project — poor housekeeping/developer experience

It’s more than likely that you have a reasonably mature data warehouse structure that has been built up over time, catering for typical use cases like:

  • Source data processing
  • Data quality processing
  • CDC processing
  • Data model transformations
  • Data Mart management/presentation layer objects

As a result, housing all of the transformation logic required to support the above in a single DBT project will likely lead to a poor end-user/developer experience; you’ll end up with a very congested git repo. An alternative approach is to use multiple DBT projects to manage your data warehouse transforms. Some of the available options are described below.

3. Candidate DBT project structures

How do you want to process and organise your DBT project files? Do you want to build an entire data model each time a change is made? Should you instead create multiple DBT projects for your warehouse transforms, or group similar logic together within smaller DBT projects?

The image (and project structures described) below come from Stefano Solimito’s Medium post. However, this post focuses on the steps involved in creating such a non-default DBT project structure, using the layered project structure as an example:

Source: Practical tips to get the best out of Data Build Tool (DBT) | Medium.com

Default DBT Project Structure: ‘Monolithic’ (Single DBT Project)

If you don’t make a design decision around how to structure your DBT project, you’ll default to the ‘monolithic’ structure, in which you use only a single DBT project. Shown below is what the file/folder structure would look like:

The default ‘monolithic’ project structure consists of a single DBT project
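For reference, the scaffold produced by dbt init looks along these lines (folder names can vary slightly between DBT versions):

    <dbt project name>/
    ├── dbt_project.yml
    ├── models/
    ├── macros/
    ├── seeds/
    ├── snapshots/
    ├── tests/
    └── analyses/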

DBT ‘Layers’ Project Structure (Multiple DBT Projects)

Tackling the various types of data processing within a single DBT project has its benefits & drawbacks. An alternative to having a single, monolithic DBT project represent your data landscape is to instead create multiple DBT projects. As an example of this approach, shown below is what the file/folder structure would look like for a layered DBT project:

Using the ‘layered’ project structure, the layer is itself a DBT project

In this project structure, each ‘layer’ is a child DBT project, installed as a DBT package against a parent project (indicated by <dbt project name> above). Described below is how the folder structure/contents of a layered DBT project structure differ from the default monolithic project structure:

In a layered structure, the DBT projects are installed as child projects underneath the dbt_modules folder
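Putting that together, the parent project’s folder structure would look something like this (the layer names are illustrative, reusing the examples from this post; newer DBT versions name the dbt_modules folder dbt_packages):

    <dbt project name>/                <- parent DBT project
    ├── dbt_project.yml
    ├── packages.yml                   <- lists each layer as a child package
    └── dbt_modules/
        ├── source_data_processing/    <- each layer is itself a DBT project
        ├── snapshot_cdc_processing/
        ├── data_quality/
        └── transformations/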

What about DBT documentation? Can I generate DBT documentation across multiple DBT projects?

You can! See ‘How to generate DBT documentation across multiple DBT projects’ below.

4. Demos/How-to’s

Described below are supporting demos/how-to guides to help you use either the ‘layers’, ‘verticals’ or ‘layers and verticals’ project structures.

4.1. How to implement the DBT ‘layers’ project structure

1. First, create a parent DBT project, in the way you’d usually create a DBT project: dbt init <dbt parent project name, e.g. ‘bike_shop’>

2. Install all child DBT projects by:

a) Listing all ‘child’ projects within the packages.yml file:

DBT ‘child’ projects are installed as DBT packages
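A minimal packages.yml along these lines would do the job (the relative paths and layer names are illustrative, and assume the child projects sit alongside the parent on disk):

    packages:
      - local: ../source_data_processing
      - local: ../snapshot_cdc_processing
      - local: ../data_quality
      - local: ../transformations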

b) And installing these child projects by using the DBT CLI command dbt deps

3. Doing this creates a copy of the DBT files (that you’ve referenced in packages.yml) into the dbt_modules folder, found at the project root:

Note: you can also install DBT packages (including child projects) from a Git repo. To do so, follow the instructions on DBT’s package management documentation page. The screenshot below highlights what’s required to install either a local or remote DBT package:
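As a sketch, a packages.yml mixing a local package with one pulled from a Git repo might look like this (the URL and revision are placeholders):

    packages:
      - local: ../data_quality
      - git: "https://github.com/your-org/source_data_processing.git"
        revision: main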

With this in mind, you could have individual git repos for each DBT project/processing layer.

4.2. How to generate DBT documentation across multiple DBT projects

Because of the DBT parent-child project relationship mentioned above, users can view DBT documentation from both:

  • A single DBT project perspective (i.e., a single ‘layer’)
  • And also across all DBT projects (layers)

The example above indicates how DBT projects (i.e. layers) can be selected to update the DBT DAG dynamically. Note the greyed out tables; these are from deselected data_quality and snapshot_cdc_processing DBT projects (screenshot below):

To design your DBT project to allow DBT documentation to be generated in this way, do the following:

  1. First, create a parent DBT project: this is what we’ll use to generate our DBT documentation
  2. Then create a child DBT project for each of your layers
  3. Then, against the parent DBT project, generate your DBT documentation by running: dbt docs generate && dbt docs serve
  4. And voilà! Your DBT DAG now shows lineage across all stages of your warehouse!
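Run from the parent project’s root, the full sequence would look roughly like this (assuming the child layers are listed in packages.yml):

    # Install the child layer projects as packages
    dbt deps

    # Build the combined catalog/manifest across parent + child projects
    dbt docs generate

    # Serve the documentation site locally
    dbt docs serve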

Following this approach means you can focus your efforts on the transforms required per DBT project layer, without having to worry about how this impacts the DBT documentation being generated.

4.3. How to reference objects between DBT projects

This is a simple one. You need to make use of the DBT ref() function. However, rather than only passing a single arg to the ref function as you usually do, you instead pass two:

  1. The first indicates the DBT project (package) you wish to reference
  2. The second specifies the object you want to select

For example:

SELECT *
FROM {{ ref('snapshot_cdc_processing', 'customers_snapshot') }}

Where snapshot_cdc_processing is another DBT project (package) and customers_snapshot is an object within that project.
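For contrast, referencing an object within the same project uses the familiar single-argument form:

    SELECT *
    FROM {{ ref('customers_snapshot') }}

DBT resolves the two-argument form against the installed package, so the child project must already be installed via dbt deps for the reference to compile.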

5. Summary

This post aimed to describe how to implement the project structures described in Stefano Solimito’s Medium post, as well as to provide supporting considerations that will help further down the line.

If you’re new to using DBT, it’s best to start with the default project structure. Once you’ve gotten to grips with the concepts, though, I recommend implementing either the ‘layers’, ‘verticals’ or ‘layers and verticals’ DBT project structure, as opposed to using the default monolithic structure, since:

  1. Making future changes to your data landscape is seamless
  • Compared to the monolithic project structure, making future changes to your data landscape is far easier when using the layered, vertical (or ‘layers and verticals’) approaches
  • It removes/unpicks a significant number of dependencies
  2. Unnecessary processing is avoided, since you can orchestrate processing to run only against specific layers/verticals
  3. CI/CD integration is much easier
  • Tests can be designed and managed within each layer’s or vertical’s own project
  • With the monolithic structure, all DBT tests would be bundled together in a single project, which is challenging to manage and maintain
  4. Data lineage is a lot more user-friendly: the ability to see lineage per project or at the parent-project level is only possible with these structures


Paul Fry
Geek Culture

Welsh data architect, based in Dublin. Certified in dbt, Airflow, Snowflake & AWS