8 features to enhance your DBT project routine.

How we improved our DBT project with 8 quick tips.

Alice Thomaz
Sep 25, 2023

DBT, or Data Build Tool, is a data transformation tool centered on SQL, enabling data engineers and analysts to work with their Data Warehouse more efficiently.

If you’re not familiar with DBT or want to deepen your knowledge, I recommend reading my Practical Guide to DBT. In the guide, I provide an overview of the tool and guidance to kickstart your own project.

In this article, I’ll share some simple and quick tips that have eased the development routine in DBT for the Afya engineering team.

Topics mentioned in this article:

  • on-run-start & on-run-end
  • pre-hook & post-hook
  • persist_docs
  • project variables
  • dbt docs
  • meta — yml
  • selectors.yml
  • elementary

>> on-run-start & on-run-end

These are SQL statements or macro calls executed at the beginning or end of the following commands: dbt run, dbt test, dbt seed, dbt snapshot, dbt build, dbt compile, and dbt docs generate.

Both should be defined in your project file, dbt_project.yml. Below is the implementation we currently use for each:

  • on-run-start: To kick off our runs, we use a macro that calls a Redshift procedure responsible for the daily cleanup of our test and temporary schemas. At the start of the macro, we add safety locks so it only executes in a specific DBT Cloud job. This precaution keeps the DROP statements from being triggered during office hours and disrupting the team’s workflow.

Call in the project:

on-run-start:
  # clean up the public schema
  - "{{ admin_call_procedure('admin.proc_admin_drop_public_tables()') }}"
  # clean up the scratch schema
  - "{{ admin_call_procedure('admin.proc_admin_drop_scratch_tables(0, 1)') }}"

Macro:

{% macro admin_call_procedure(procedure) %}
  {% if target.name == 'cloud' %}
    {% set dbt_cloud_job_id = env_var('DBT_CLOUD_JOB_ID', '') %}
    {% if dbt_cloud_job_id == '123456' %}
      CALL {{ procedure }};
    {% endif %}
  {% endif %}
{% endmacro %}
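The cleanup procedures themselves live in Redshift, outside the dbt project, so they are not reproduced in this article. Purely for illustration, a procedure like admin.proc_admin_drop_public_tables could be sketched along these lines (the names and logic here are assumptions, not our actual code):

-- Illustrative sketch: drop every table left behind in the public schema
CREATE OR REPLACE PROCEDURE admin.proc_admin_drop_public_tables()
AS $$
DECLARE
    rec RECORD;
BEGIN
    FOR rec IN SELECT tablename FROM pg_tables WHERE schemaname = 'public'
    LOOP
        EXECUTE 'DROP TABLE IF EXISTS public.' || quote_ident(rec.tablename) || ' CASCADE';
    END LOOP;
END;
$$ LANGUAGE plpgsql;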
  • on-run-end: At the end of each run, we execute macros that handle access grants for our Redshift schemas. Since superuser access is required to apply these grants, we also include the target lock, as exemplified in the Practical Guide to DBT.

Call in the project:

on-run-end:
  # grant access to test schemas - all
  - "{{ warehouse_cloud_group_grant_all_on_schemas(['group1', 'group2'], ['schema1', 'schema2']) }}"
  # grant access to prod schemas - select
  - "{{ warehouse_cloud_group_grant_select_on_schemas(['group1', 'group2'], ['schema1', 'schema2']) }}"
  # grant access to prod schemas - all - cloud
  - "{{ warehouse_cloud_user_grant_all_on_schemas(['user1'], ['schema1', 'schema2']) }}"

Macro:

{% macro warehouse_cloud_group_grant_select_on_schemas(groups, schemas) %}
  {% if target.name == 'cloud' %}
    {% for schema in schemas %}
      {% for group in groups %}
        grant usage on schema {{ schema }} to group {{ group }};
        grant select on all tables in schema {{ schema }} to group {{ group }};
        alter default privileges in schema {{ schema }} grant select on tables to group {{ group }};
      {% endfor %}
    {% endfor %}
  {% endif %}
{% endmacro %}
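Only the select version of the macro is reproduced above. The "all privileges" variants called in the on-run-end block follow the same pattern; below is a hedged sketch of the group version, which may differ from our real macro:

{% macro warehouse_cloud_group_grant_all_on_schemas(groups, schemas) %}
  {% if target.name == 'cloud' %}
    {% for schema in schemas %}
      {% for group in groups %}
        grant all on schema {{ schema }} to group {{ group }};
        grant all on all tables in schema {{ schema }} to group {{ group }};
        alter default privileges in schema {{ schema }} grant all on tables to group {{ group }};
      {% endfor %}
    {% endfor %}
  {% endif %}
{% endmacro %}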

>> pre-hook & post-hook

SQL statements or macro calls that run at the beginning or end of an individual model build. It’s worth noting that, unlike the “on-run” hooks, which run once per invocation before or after all models, a pre-hook or post-hook runs together with the specific model it is attached to.

Hooks can be applied to individual model configurations or groups of models through the project file. Below are some use cases currently employed in our project:

  • post-hook: After each model, we execute a macro that checks, through a Redshift procedure, if the encoding has been set correctly, either as zstd or az64. If it’s not set correctly, we switch it to zstd in cases like “character varying(%)” and “boolean”. For a deeper understanding of the importance of encoding usage in Redshift, I recommend reading the article “Best practices with Amazon Redshift: Architecture, organization and performance optimization”.
    This is an example that can be invoked in the project and used in all models within that folder.

Call in the project:

models:
  +post-hook: "{{ compress_zstd() }}"

Macro:

{%- macro compress_zstd() -%}
  {#-- Only execute at runtime, not when compiling etc #}
  {% if not execute %}
    {{ return('select 1') }}
  {% endif %}
  {#-- Only execute for table or incremental models #}
  {% if target.name == 'default' and (model.config.materialized == 'table' or model.config.materialized == 'incremental') %}
    call maint.redshift_dbt_enforce_encode_sp('{{ this.schema }}', '{{ this.table }}');
  {% else %}
    {{ log('MACRO: Not compressing ' ~ this.table ~ '.', False) }}
    {{ return('select 1') }}
  {% endif %}
{%- endmacro %}
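The procedure maint.redshift_dbt_enforce_encode_sp is internal to our warehouse and is not shown here. Conceptually, the kind of check it performs can be approximated with a query over pg_table_def followed by an ALTER for each offending column. An illustrative sketch, assuming Redshift (note that pg_table_def only lists schemas on your search_path, and the schema, table, and column names below are placeholders):

-- Find columns whose encoding is neither zstd nor az64
SELECT "schemaname", "tablename", "column", "type", "encoding"
FROM pg_table_def
WHERE schemaname = 'my_schema'
  AND encoding NOT IN ('zstd', 'az64');

-- Example fix for a single column
ALTER TABLE my_schema.my_table ALTER COLUMN my_col ENCODE zstd;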
  • pre-hook: In incremental models where it is not possible to define a unique key, we use a pre-hook to invoke a macro that deletes the data from the last few days, with the number of days specified by the user. Ideally, that number should match the lookback window applied in the model’s incremental filter (a sketch of that filter appears after the macro below).
    This is an example that can be invoked individually in each model when necessary.

Call in the project:

{{
    config(
        materialized = 'incremental',
        full_refresh = false,
        dist = 'even',
        sort = 'col_time',
        pre_hook = "{{ incremental_delete_day('col_time', '15') }}"
    )
}}

Macro:

{% macro incremental_delete_day(column, day) %}
  {% if is_incremental() %}
    DELETE FROM {{ this }} WHERE {{ column }}::DATE >= DATEADD(DAY, -{{ day }}, CURRENT_DATE)::DATE
  {% endif %}
{% endmacro %}
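For reference, here is a minimal sketch of the matching filter inside the model body, assuming a placeholder source name and the same 15-day window passed to the pre_hook in the config example above:

SELECT *
FROM {{ ref('stg_example_source') }}
{% if is_incremental() %}
WHERE col_time::DATE >= DATEADD(DAY, -15, CURRENT_DATE)::DATE
{% endif %}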

>> persist_docs

It is possible to enable an option that automatically generates the specified comments in the schema.yml file directly in the database, covering both tables and columns. This functionality is supported by various dbt adapters, including Redshift, Snowflake, BigQuery, and others. In the project, I have configured this option to be applied to all models, but it is also possible to define it individually for specific models.

In the project:

models:
  project-path:
    +persist_docs:
      relation: true
      columns: true

In the model:

{{
    config(
        persist_docs={"relation": true, "columns": true}
    )
}}
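To confirm that the descriptions actually landed in the database, you can query the Postgres catalog tables that Redshift exposes. A hedged example, where the schema name is a placeholder:

SELECT n.nspname   AS schema_name,
       c.relname   AS table_name,
       d.objsubid  AS column_position, -- 0 means the comment belongs to the table itself
       d.description
FROM pg_catalog.pg_description d
JOIN pg_catalog.pg_class c     ON c.oid = d.objoid
JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'my_schema';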

>> project variables

It is possible to set project-wide values through variables, such as the time zone and the time windows used in our incremental models. Variables are extremely useful for avoiding repeated values and for simplifying logic adjustments across multiple models at once.

In the .sql models, these variables are referenced through the expression {{ var('var_name') }}.

In the project:

vars:
  'dbt_date:time_zone': 'America/Sao_Paulo'
  testing_days: 1
  filter_days_raw: 2
  filter_days: 2

In the model:

WHERE 1=1
{% if target.name == 'dev' %}
AND col_tstamp >= DATEADD(DAY, -{{var('testing_days')}}, CURRENT_DATE)
{% elif is_incremental() %}
AND col_tstamp::DATE >= DATEADD(DAY, -{{var('filter_days_raw')}}, (SELECT MAX(p_col_tstamp::DATE) FROM {{ this }}))
{% endif %}
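One detail worth noting: var() also accepts a fallback value as a second argument, used when the variable is not defined in dbt_project.yml or passed via --vars. For example, the dev filter above could be written as:

AND col_tstamp >= DATEADD(DAY, -{{ var('testing_days', 1) }}, CURRENT_DATE)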

>> dbt docs

DBT automatically stores all of your project’s metadata in a file called manifest.json, where you can access details about every model, seed, and package used. You can view it in a web format by running the following commands:

dbt docs generate
dbt docs serve --port 8001

After executing the second command, a new tab will open in your browser, displaying a visualization that contains detailed information about the models, code, documentation, lineage, and much more.

>> meta — yml

It is possible to incorporate custom fields into the model documentation through schema.yml, going beyond the conventional name and description. This is enabled through the “meta” property, and the data associated with these fields is automatically included in the metadata generated by DBT.

In schema.yml:

models:
  - name: ---
    description: ---
    meta:
      developer: 'Alice Thomaz'
      domain: content
    columns:
      - name: (...)

In the generated documentation, these fields appear in the model’s details view.
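Because meta ends up in the project metadata, it can also be read programmatically. As a rough sketch (the macro name is ours for illustration, and the field name follows the example above), a run-operation could list the developer of each model:

{% macro list_model_owners() %}
  {% if execute %}
    {% for node in graph.nodes.values() if node.resource_type == 'model' %}
      {{ log(node.name ~ ' -> ' ~ node.meta.get('developer', 'undefined'), info=True) }}
    {% endfor %}
  {% endif %}
{% endmacro %}

To run it: dbt run-operation list_model_owners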

>> selectors.yml

As the need to customize new execution flows arises, this file gains importance. With it, you can organize the execution plan of Jobs, all in a documented and version-controlled manner.

By default, our main job runs the sequence of source models daily, but we can also customize jobs for specific situations. To execute one, we use the command: dbt run --selector job_models_daily

Below, I provide an example configuration of the file, where we can define the Jobs, their descriptions, and which models should be included or excluded. Additionally, there is the option to determine whether, when executed, the parent and child models of the selected folder or file will also be triggered:

selectors:
  - name: job_models_daily
    description: Job that runs DBT models daily, excluding model X.
    definition:
      union:
        - method: path
          value: models/sources
          children: true
        - exclude:
            - method: path
              value: models/sources/folder_2
              children: true

  - name: job_models_2
    description: Job that runs model X and its parents.
    definition:
      union:
        - method: path
          value: models/marts/folder_2/model_name.sql
          children: false
          parents: true
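Before wiring a selector into a job, it is worth previewing which models it resolves to. Using the selector names from the example above:

dbt ls --selector job_models_daily
dbt run --selector job_models_2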

>> elementary

This package is focused on data observability, storing DBT metadata and execution results in tables within the Data Warehouse. After following the steps below, whenever models are executed in DBT, the “elementary” schema will be automatically updated.

It has played a crucial role in tracking the execution time of our jobs and identifying potential failures. Additionally, it complements our metadata alongside the manifest.json, which is generated automatically by DBT.

For more details on using packages in DBT, I recommend reading the article “Practical Guide to DBT: Organizing the project folders”.

Step by step:

1. Add the field to your packages.yml file

packages:
  - package: elementary-data/elementary
    version: 0.7.6

2. Run the command in the terminal: dbt deps

3. Add the schema configuration to dbt_project.yml:

models:
  elementary:
    +schema: elementary

4. Run the command in the terminal:

dbt run --select elementary

Some tables it generates:

  • Execution results of the models:
    - dbt_run_results
    - model_run_results
    - snapshot_run_results
    - dbt_invocations
    - elementary_test_results
  • Metadata of the models:
    - dbt_models
    - dbt_tests
    - dbt_sources
    - dbt_exposures
    - dbt_metrics
    - dbt_snapshots
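Once these tables are populated, simple queries already answer the questions we care about most, such as which models are the slowest. An illustrative example against model_run_results (column names may vary between package versions):

SELECT name, status, execution_time
FROM elementary.model_run_results
ORDER BY execution_time DESC
LIMIT 10;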

DBT is a comprehensive and ever-evolving tool. In this article, we have explored some of the features that have enhanced our experience in the project.

If you aren’t familiar with DBT, I recommend reading the Practical Guide to DBT. For those already using the tool, I hope this article has been helpful in some way. If you have any suggestions or comments about the content, please don’t hesitate to reach out to me on LinkedIn.

Portuguese version: https://medium.com/@alice_thomaz/3a2eee9a6ee6
