Practical Guide to DBT: Organizing the project folders.

A 3-step journey to mastering the data transformation tool — DBT — Part 2 of 3.

Alice Thomaz
10 min read · Aug 28, 2023

This is the second in a series of articles aimed at guiding and providing more information about DBT, a widely used tool in the current market for data transformation. In this article, we will explain how we organize the main folders of our project and the role of each one. If you’d like to read the first part of the guide, simply access the link: Unraveling Architecture and Initial Configuration. And if you’re interested in delving even further into the organization of a DBT project, I recommend taking a look at the official documentation of the tool: About dbt projects.

>>> Repository organization

After setting up your local machine, it's time to open VS Code. From there, you enter the container created in part 1 of the guide and begin building the initial structure of the project. This structure will later be shared with other team members through a repository host of your choice, such as GitHub or GitLab. Below, we've listed the crucial folders and files that will be detailed throughout this article.

Data

Folder for storing Seeds.

Seeds are CSV files within the DBT project that are loaded into the Data Warehouse using the 'dbt seed' command. They can then be referenced in other models like any other table, using the ref function: {{ ref('table_name') }}.

Because the CSV files are located within the DBT repository, they are updated through version control. Due to this, they are best used for static data that doesn’t require constant changes.

Examples of good uses for seeds:

  • List of DDI codes and their corresponding country names.
  • List of test emails that need to be excluded from analyses.
  • List of APP IDs for event classification by business unit and event type.

Examples of bad uses for seeds:

  • Loading raw data that was exported to CSV.
  • Any production data containing sensitive information.

Below is an example of a file stored in this folder:
data/custom_dvce.csv

os_family,dvce_type
iOS,mobile
Android,mobile
BlackBerry Tablet OS,mobile
Windows,desktop
FreeBSD,desktop
Ubuntu,desktop
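
Once loaded with 'dbt seed', the seed can be joined in a model just like any other table. A minimal sketch, where the event model and its columns are hypothetical names used only for illustration:

SELECT
    e.event_id,
    e.os_family,
    d.dvce_type
FROM {{ ref('event_model') }} AS e      -- hypothetical event model
LEFT JOIN {{ ref('custom_dvce') }} AS d -- the seed, referenced like any model
    ON e.os_family = d.os_family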

Packages

Software engineers often modularize chunks of code into libraries that help the community focus more on the specific logic of their businesses, rather than wasting time implementing functions that others have already spent time perfecting.

In dbt, libraries like these are referred to as packages. DBT packages are highly useful because many of the analytical problems we encounter are shared across organizations. For example:

  • Transforming Snowplow data into categorized tables.
  • Macros for working with data loaded from Stitch.
  • Packages that log execution data of modeling processes.

DBT packages are essentially standalone dbt projects, with models and macros that address a specific problem area. As a dbt user, when you add a package to your project, the models and macros from the package become part of your own project.

To use them, you declare the name and version of each package in the packages.yml file and then run the 'dbt deps' command to install them locally. Upon installation, the package code is placed in the dbt_packages folder. This directory is ignored by Git to avoid committing duplicated source code, so anyone starting to work on the project for the first time needs to run 'dbt deps' as well.

Here’s an example of a packages.yml file with the main packages we currently use:

packages:
  - package: calogica/dbt_expectations
    version: [">=0.3.0", "<0.6.0"]
  - package: dbt-labs/dbt_external_tables
    version: 0.8.4
  - package: elementary-data/elementary
    version: 0.7.6

Macros

In DBT, you can combine SQL with Jinja, a templating language. Jinja effectively turns your dbt project into a programming environment for SQL, letting you do things that aren't normally possible in SQL alone. For example, using Jinja, you can:

  • Utilize control structures (such as if statements and for loops) within SQL.
  • Use environment variables in your dbt project for production deployments.
  • Abstract snippets of SQL into reusable macros, analogous to functions in most programming languages.

Macros in Jinja are pieces of code that can be reused multiple times. They are defined in a .sql file and stored in the macros folder. These files can contain one or more macros.

Example:

  • Macro:
{% macro parse_dt_string(column) %}
    CASE WHEN {{ column }} IS NOT NULL
        THEN TIMESTAMP 'epoch' + {{ column }} * INTERVAL '1 second'
        ELSE NULL
    END
{% endmacro %}
  • Model:
SELECT
    {{ parse_dt_string('time') }} AS t_time,
    id AS payment_id
FROM braze_data.notifications
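
At compile time, the macro call is expanded in place, so the model above renders roughly to the following SQL:

SELECT
    CASE WHEN time IS NOT NULL
        THEN TIMESTAMP 'epoch' + time * INTERVAL '1 second'
        ELSE NULL
    END AS t_time,
    id AS payment_id
FROM braze_data.notifications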

It's also possible to use macros from ready-made packages, as shown in the previous topic. When calling them in a model, remember to prefix the macro with the package name. An example from the very popular dbt_utils package:

{{ dbt_utils.group_by(5) }}
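
For context, dbt_utils.group_by(n) simply renders a GROUP BY over the first n columns by position. A minimal sketch with illustrative column and model names:

SELECT
    column1,
    column2,
    column3,
    column4,
    column5,
    COUNT(*) AS total_rows
FROM {{ ref('model_name') }}
{{ dbt_utils.group_by(5) }} -- compiles to: group by 1, 2, 3, 4, 5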

Models

Models are where developers spend most of their time in a dbt environment. Models are primarily written as SELECT statements and saved as .sql files. We categorize the models into sources, marts, and staging types.

>Sources: These allow you to name and describe the data loaded into your warehouse by your extract and load tools. By declaring these tables as sources in dbt, you can then reference them in your analytical models.

Each source folder contains a documentation file (schema.yml) and a source declaration file (sources.yml); the latter identifies the schema and tables in Redshift.

Example:

version: 2

sources:
  - name: source_name
    schema: schema_name
    loader: DMS

    freshness:
      warn_after: { count: 12, period: hour }
      error_after: { count: 24, period: hour }
    loaded_at_field: tstamp_column

    tables:
      - name: table_1
      - name: table_2
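
The freshness block works together with loaded_at_field: running the 'dbt source freshness' command compares the latest value of tstamp_column against the current time, warning after 12 hours and raising an error after 24 hours without new data. Once declared, each table can be queried through the source function, for example (using the names above):

SELECT *
FROM {{ source('source_name', 'table_1') }}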

Source data is typically the responsibility of the engineering team. Here at Pilar 2 of Afya, we also separate them into two subgroups:

  • RAW: These are raw tables that we simply bring into dbt, adjusting only the column and table names. To reference them, use the source function, as in the example below: {{ source('source_name', 'table_name') }}
{{
    config(
        materialized = 'incremental',
        full_refresh = false,
        unique_key = 'column1',
        dist = 'column1',
        sort = 'column2'
    )
}}

WITH source AS (

    SELECT
        column1 AS tn_column1,
        column2 AS tn_column2,
        column3 AS tn_column3,
        column4 AS tn_column4
    FROM {{ source('source_name', 'table_name') }}
    WHERE 1=1
    {% if target.name == 'dev' %}
        AND column2 >= DATEADD(DAY, -{{ var('testing_days') }}, CURRENT_DATE)
    {% elif is_incremental() %}
        AND column2::DATE >= DATEADD(DAY, -{{ var('filter_days_raw') }}, (SELECT MAX(tn_column2::DATE) FROM {{ this }}))
    {% endif %}

)

SELECT
    tn_column1,
    tn_column2,
    tn_column3,
    tn_column4
FROM source
  • Custom: Because our event data is heavy, custom models were created to segment it by business unit and event type, making it lighter to use in the mart models. This resulted in a significant performance improvement.

>Marts: In the marts folder, we have the analytical models written in .sql. They are the responsibility of the analytics and science team and are divided by update frequency and business area. They use the tables created in the source folder as inputs, referenced via the ref function: {{ ref('table_name') }}.

{{
    config(
        materialized = 'incremental',
        full_refresh = false,
        unique_key = 'column1',
        dist = 'column1',
        sort = 'column2'
    )
}}

WITH cte_1 AS (

    SELECT
        column1,
        column2,
        column3,
        column4
    FROM {{ ref('model_name') }}
    {% if is_incremental() %}
        WHERE column2::DATE >= DATEADD(DAY, -7, CURRENT_DATE)
    {% endif %}

)
...

All folders also have a documentation file (schema.yml). Here's an example of the file:

version: 2

models:
  - name: model_name
    description: table description.
    columns:
      - name: column1
        description: description of column 1.
        tests:
          - not_null
      - name: column2
        description: description of column 2.

Note: This file format can also be used in the data and macros folders, changing the top-level key from 'models:' to 'seeds:' or 'macros:', respectively.

>Staging: These are intermediate tables, typically used to slice data by time and improve the performance of incremental loads, as well as for general rules that serve only as helpers for other main models.
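
A minimal sketch of such a helper, using hypothetical model and column names, which pre-filters a heavy model to a recent window so that the incremental marts referencing it have less data to scan:

{{
    config(
        materialized = 'view'
    )
}}

-- hypothetical staging model: narrows the upstream data to a recent window
SELECT
    column1,
    column2,
    column3
FROM {{ ref('heavy_model_name') }}
WHERE column2::DATE >= DATEADD(DAY, -{{ var('filter_days') }}, CURRENT_DATE)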

Tests

Tests are assertions you make about your models and other resources in your dbt project (e.g., sources and seeds). You can use tests to improve the integrity of the SQL in each model by making assertions about the results it generates. For example, you can test whether a specified column in a model contains only non-null values, unique values, or values that have a corresponding value in another model.

When you run the dbt test command, dbt will inform you whether each test in your project has passed or failed. This function is crucial for ensuring the quality of stored data.

There are two types of tests: generic tests, declared in the models' schema.yml files, and singular tests, stored as .sql files in the tests folder. A singular test is simply a SELECT that should return no rows; any rows it returns are reported as failures.

Example:

{{
    config(
        tags=["data_test"],
        severity='error'
    )
}}

WITH dms_sync_diff AS (

    SELECT
        MAX(TO_DATE(day, 'YYYY-MM-DD')) AS synced_at,
        GETDATE() AS ts_now,
        DATEDIFF(min, synced_at, ts_now) AS min_diff,
        CASE
            WHEN DATE_PART(dow, GETDATE()) IN (0, 6)
                THEN 1440
            ELSE 4320
        END AS minutes_allowed
    FROM {{ source('source_name', 'table_name') }}

)

SELECT *
FROM dms_sync_diff
WHERE min_diff > minutes_allowed

Project

Every dbt project requires a dbt_project.yml file. This file informs dbt that a directory is a dbt project and contains crucial information that guides dbt on how to operate the project.

Breaking down the file:

>Part 1: In the initial section, we identify:

  • Project name, project version, required dbt version, and the profile name, which should match the one listed in the profiles.yml file exemplified in the first part of this guide.
name: "dbt_pebmed"
version: "1.0.0"
config-version: 2
require-dbt-version: [">=1.6.0", "<1.7.0"]

profile: "pebmed"
  • The names of the folders mapped to each dbt resource type, plus the directories removed by the 'dbt clean' command (clean-targets).
model-paths: ["models"]
test-paths: ["tests"]
seed-paths: ["data"]
macro-paths: ["macros"]

target-path: "target"

clean-targets:
- "target"
- "logs"
- "dbt_packages"
  • It's also possible to define commands to run at the end of every project execution. In our case, we call macros that grant Redshift groups permissions on every schema created by dbt; a sketch of such a macro follows below.
on-run-end:
  - "{{ grant_select_on_schemas(schemas, 'group_1') }}"
  - "{{ grant_all_on_schemas(schemas, 'group_2') }}"

>Part 2: In the middle of the file, we map the folders containing models/seeds to the suffix we want appended to the Redshift schemas (e.g., dbt_prod_source; this naming comes from dbt's default generate_schema_name macro, sketched after the example below). If a folder doesn't contain any models yet, it's best to leave its entry commented out to avoid a warning on every run.

The '+persist_docs' setting ensures that the descriptions in the models' schema.yml files are persisted as table and column comments in the data warehouse.

models:
  elementary:
    +schema: elementary

  dbt_pebmed:
    +persist_docs:
      relation: true
      columns: true

    staging:
      +schema: staging

    sources:
      +schema: source

    marts:
      +schema: marts

seeds:
  dbt_pebmed:
    +persist_docs:
      relation: true
      columns: true
    +enabled: true
    +schema: seed_data
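
For reference, the target-schema-plus-suffix naming (e.g., dbt_prod_source) comes from dbt's default generate_schema_name macro, which can be overridden in the macros folder. Its default logic looks roughly like this:

{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}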

>Part 3: At the end of the file, we define the variables used across the models. This way, when a value needs to change, the change propagates to all models at once.

vars:
  'dbt_date:time_zone': 'America/Sao_Paulo'
  testing_days: 1
  filter_days_raw: 2
  filter_days: 2
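
Inside the models, these values are read with the var function, as in the incremental filters shown earlier; a second argument can supply a fallback default for when the variable isn't defined in the project file:

AND column2::DATE >= DATEADD(DAY, -{{ var('filter_days', 2) }}, CURRENT_DATE)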

Additional Files

I will now address some files that aren’t essential but have proven useful in our daily practice with the project.

  • .gitignore: In this file, we specify which folders and files should be ignored by Git. This means that whenever we make a new commit, the files listed here will not be included. Below are the folders we currently exclude, as they contain package data or individual executions.
target/    
dbt_packages/
logs/
  • README.md: A standard file in Git repositories where you can describe the purpose and specifications of your project in depth; its content is displayed whenever someone opens the repository. The file is written in Markdown. As an example, here is the README created in our DBT repository:
    - Repository: [link]
    - Original README: [link]
  • .sqlfluff: The configuration file for SQLFluff, an automated SQL linting and formatting tool that played a key role in standardizing our models. For more details about this feature, I recommend reading the article: Automating SQL code formatting with SQLFluff.
  • selectors.yml: As the need to customize new execution flows arises, this file gains importance. With it, you can organize the execution plan of Jobs, all in a documented and version-controlled manner.
    By default, our main job executes the sequence of sources daily, but we can also customize jobs for specific situations. To execute one, we use the command: dbt run --selector job_models_daily
    Below, I provide an example configuration of the file, where we can define the Jobs, their descriptions, and which models should be included or excluded. Additionally, there is the option to determine whether, when executed, the parent and child models of the selected folder or file will also be triggered:
selectors:
  - name: job_models_daily
    description: Job that runs DBT models daily, excluding model X.
    definition:
      union:
        - method: path
          value: models/sources
          children: true
        - exclude:
            - method: path
              value: models/sources/folder_2
              children: true

  - name: job_models_2
    description: Job that runs model X and its parents.
    definition:
      union:
        - method: path
          value: models/marts/folder_2/model_name.sql
          children: false
          parents: true

In the first part of this guide, I provided an overview of the tool and explained how to set it up in “Unraveling Architecture and Initial Configuration.”

Now, in this second part, I’ve showcased the main folders and files for developing your DBT project. After organizing everything, simply commit the files to the repository of your choice to share with the rest of the team and maintain good version control. For the next article, I’ll introduce DBT Cloud, the orchestrator we’re currently using in our project, so you can schedule your initial modeling tasks.

I hope this guide is helpful for those interested in a governed data transformation solution or for those looking to enhance their knowledge of the tool. DBT is an incredible tool that has transformed the daily interaction within our data team and brought greater governance to the datasets we create. If you have any questions or suggestions, please feel free to reach out to me on LinkedIn.

Portuguese version: https://medium.com/@alice_thomaz/b4c7b5d84512

Part 3: Orchestrating the Project.
