Practical tips to get the best out of Data Build Tool (dbt) — Part 2

Stefano Solimito
Published in Unboxing Photobox
Apr 2, 2020 · 7 min read

In my previous post:

Practical tips to get the best out of Data Build Tool (dbt) — Part 1
https://medium.com/photobox-technology-product-and-design/practical-tips-to-get-the-best-out-of-data-building-tool-dbt-part-1-8cfa21ef97c5

I explained different approaches to splitting functionality across dbt projects while building a data platform, and how best to organize dbt models.

In this article, I will discuss the following topics:

  • Getting the best out of your dbt_project.yml file.
  • How dbt makes you rethink some aspects of traditional data warehousing.
  • dbt macros usage.
  • Wrapping dbt.

Get the best out of your dbt_project.yml file

The dbt_project.yml file contains your dbt project configuration. Depending on how you use dbt, it can easily become an unreadable monster or remain almost empty.

It is possible to define each model's configuration at the model level, but as the number of models grows it becomes cumbersome to jump from one model to another just to read these configurations.
For this reason, at Photobox we adopted dbt_project.yml as the main configuration file, complemented when needed by the metadata defined at the top of each dbt model.

Fig-0 shows an example of a dbt_project.yml file:

Fig-0: dbt_project.yml configuration example.

As you can see, the project folder structure is reflected in the file itself.

For each layer, a standard materialization policy is defined.

In the example above, every model in the STAGING_WAREHOUSE schema will be materialized as a view.

Tagging every folder gives us more flexibility when it's time to run our pipelines: it enables us to run only a subset of the models without having to rely on upstream/downstream dbt run options.
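Since the screenshot in Fig-0 is not reproduced in this text version, here is a minimal sketch of what such a configuration could look like; the project, folder, schema and tag names are illustrative assumptions, not Photobox's actual setup:

    # dbt_project.yml -- illustrative sketch, names are assumptions
    name: 'data_platform'
    version: '1.0.0'
    config-version: 2
    profile: 'data_platform'

    models:
      data_platform:
        staging_warehouse:
          # every model in this layer is materialized as a view
          +materialized: view
          +schema: STAGING_WAREHOUSE
          +tags: ['staging_warehouse']
        warehouse:
          # the warehouse layer defaults to tables
          +materialized: table
          +schema: WAREHOUSE
          +tags: ['warehouse']

With tags in place, a subset of models can be run with, for example, dbt run --models tag:warehouse.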

Every model inherits the general configuration defined in dbt_project.yml, which can be overridden by the metadata contained in the model itself.

For example:

Fig-1: dbt model configuration example

In this case, even though the standard materialization policy for the WAREHOUSE schema is “table”, the model will be built as an incremental model based on the unique key provided.
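As the screenshot in Fig-1 is not reproduced here, a minimal sketch of such a model-level override follows; the model and column names are illustrative assumptions:

    -- models/warehouse/dim_user.sql (illustrative sketch)
    {{ config(
        materialized='incremental',
        unique_key='user_id'
    ) }}

    select
        user_id,
        email,
        updated_at
    from {{ ref('stg_users') }}

    {% if is_incremental() %}
      -- on incremental runs, only pick up rows that changed since the last load
      where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}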

dbt makes you rethink some aspects of traditional data warehousing

In traditional data warehousing, there is a tendency to run your dimensions first and then your facts. One of the reasons is that joins are normally performed on surrogate keys instead of natural keys.

This is done both for performance reasons (joining on a number is faster) and to decouple the key from the business logic.

Traditionally, these keys were generated as incremental numbers and injected into the fact after the dimension run had finished.
In modern column-oriented data warehouses, this strategy doesn't work particularly well: update operations can be expensive, and auto-increment is not widely supported or requires workarounds.

With dbt, it is possible to generate surrogate keys as the hash of the concatenation of multiple fields, as shown in Fig-2.

Fig-2: Surrogate key generation using dbt
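Since the screenshot is not reproduced here, a minimal sketch of the idea using the surrogate_key macro from the dbt_utils package (renamed generate_surrogate_key in recent versions); model and column names are illustrative assumptions:

    -- models/warehouse/dim_customer.sql (illustrative sketch)
    select
        -- surrogate key = hash of the concatenated business-key fields
        {{ dbt_utils.surrogate_key(['customer_id', 'source_system']) }} as customer_sk,
        customer_id,
        source_system,
        email
    from {{ ref('stg_customers') }}

The same macro call, applied to the same business-key fields in a fact model, produces the same hash, which is what allows facts and dimensions to be built independently.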

This makes you rethink how facts and dimensions can be orchestrated.

Some orchestration constraints can be removed: the dependency between foreign keys and primary keys no longer relies on auto-incremented values but on hash functions.
Hashing a business key can be done independently in facts and dimensions, without the need to inject those keys from the dimension into the fact at a later stage.

For this reason, in our data platform, most facts and dimensions can safely run in parallel, reducing the processing time required to refresh the warehouse layer.

At the end of each execution, a set of tests runs to ensure that referential integrity among tables is maintained and there are no broken links in our star-schema.
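In dbt, this kind of referential-integrity check can be declared as a relationships test in a schema file; a minimal sketch follows, with illustrative model and column names:

    # models/warehouse/schema.yml (illustrative sketch)
    version: 2

    models:
      - name: fct_clickstream
        columns:
          - name: interaction_sk
            tests:
              - not_null
              # every key in the fact must exist in the dimension
              - relationships:
                  to: ref('dim_user_interaction')
                  field: interaction_sk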

dbt test output is stored in a tabular format in our warehouse and a dashboard runs on top of it for a quick inspection.

Fig-3: An example of how Looker can be used on top of Snowflake to display dbt test errors across the entire data platform. Numbers shown are purely illustrative.
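The article doesn't show how the results are exported, but as an illustration: recent dbt versions write a run_results.json artifact into the target folder after each invocation, which can be flattened and loaded into the warehouse. The snippet below is a rough sketch of that idea, not Photobox's actual code, and the exact field names depend on the dbt version:

    # illustrative sketch: flatten dbt's run_results.json for loading into the warehouse
    import json
    import pandas as pd

    with open("target/run_results.json") as f:
        results = json.load(f)

    rows = [
        {
            "unique_id": r.get("unique_id"),
            "status": r.get("status"),
            "execution_time": r.get("execution_time"),
        }
        for r in results["results"]
    ]

    df = pd.DataFrame(rows)
    # from here the dataframe can be appended to a results table in Snowflake
    # (for example via the Snowflake connector) and queried by a BI tool such as Looker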

dbt macros usage

We have seen how surrogate key generation can be handled by a macro that dbt provides out of the box.

But what happens when generating a surrogate key requires more logic than concatenating a set of fields and computing their hash?
In that case, you might have to copy the same logic to build the surrogate key in both your dimensions and your facts.
This might not seem like a big problem, but what happens if, after a few months, the logic to build that key changes?
How can you ensure that the logic is updated consistently in all your queries, so that your referential integrity tests don't fail?
This is a case where macros come in extremely handy, ensuring that every piece of shared logic is kept in sync.

Fig-4: A set of macros used to help compute the same set of fields from different data sources.

In Fig-4, the same three fields are calculated in different ways. The code is very simple, but every data source involved in the process requires slightly different logic.

These data sources will be unioned in a later stage and the fields computed in Fig-4 will be used to generate a surrogate key. The surrogate key in our example will be used to join a clickstream table (fact) with the user interaction type (dimension).

Because all the chunks of logic have been wrapped in macros, we can ensure that the same rules are applied consistently across all the tables that need to generate the same key from the “interaction_type”, “interaction_name” and “funnel_classification” fields.
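Since the screenshot in Fig-4 is not reproduced here, a rough sketch of the pattern follows; the macro names, source structure and parsing logic are illustrative assumptions, not Photobox's actual rules:

    {# macros/interaction_fields.sql (illustrative sketch) #}

    {% macro interaction_type_web(event_name) %}
        -- on the web source, the interaction type is the first token of the event name
        split_part({{ event_name }}, '_', 1)
    {% endmacro %}

    {% macro interaction_type_app(event_category) %}
        -- the app source exposes it directly, but it needs normalizing
        lower(trim({{ event_category }}))
    {% endmacro %}

Each staging model calls the macro that matches its source (e.g. {{ interaction_type_web('event_name') }} as interaction_type), and the resulting fields feed the shared surrogate-key macro. When the parsing rules change, updating the macro updates every model that uses it.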

Using macros in this way allows our data engineering team to react to changes more quickly and ensures that those changes are correctly propagated to all the dbt models that use that specific piece of code.

Wrap dbt

Using plain dbt, a project can be executed from the command line, passing the desired parameters.

This usage of dbt fits most use cases, but there are a few reasons that might make you consider wrapping your dbt project in an application.

In Photobox we decided to develop our own wrapper for the reasons below:

  • dbt only orchestrates SQL queries, but we might also want to call a REST API, perform an operation with Pandas, etc.
  • We want to ensure that every time the data pipeline runs, all the relevant tests are executed and the test logs are exported.
  • We want to dynamically switch configuration and Snowflake virtual warehouse based on the dbt models we are running.
  • We can provide a standard template generated through Giter8 that can be used as a starting point for all our data pipelines regardless of the task they have to perform.

Fig-5 shows how the Photobox dbt Python wrapper fits into a project:

Fig-5: Photobox dbt wrapper folder structure.

Every instance of the wrapper comes with:

  • A folder with CI pipelines ready to be deployed in multiple environments, plus ECS task definitions (since our data pipelines run as ECS tasks).
  • An application folder with the Python wrapper, which contains:
  1. The dbt project itself.
  2. A configuration folder, where secrets are downloaded and the dbt configuration is dynamically generated based on the environment and the models to run.
  3. A queries folder, for any query you might need to run outside dbt DAGs (e.g. local processing of the output through pandas).

In Photobox, Airflow is used as an orchestrator and the execution of our warehouse layer is split into multiple Airflow tasks.
Each Airflow task triggers an ECS task that runs the same Docker image, containing the warehouse dbt project inside our Python wrapper.

This means that every warehouse Airflow task actually runs the same Docker image, simply passing different parameters depending on which dbt models we want to run.
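As an illustration only (not Photobox's actual wrapper), the core of such an entry point could be as simple as selecting the models to run from a tag passed as a container parameter:

    # run_pipeline.py -- illustrative sketch of a thin dbt wrapper entry point
    import argparse
    import subprocess
    import sys

    def main() -> None:
        parser = argparse.ArgumentParser(description="Run a tagged subset of dbt models")
        parser.add_argument("--tag", required=True, help="dbt tag to run, e.g. warehouse")
        parser.add_argument("--target", default="dev", help="dbt target environment")
        args = parser.parse_args()

        # run only the models carrying the requested tag
        run = subprocess.run(
            ["dbt", "run", "--models", f"tag:{args.tag}", "--target", args.target]
        )
        if run.returncode != 0:
            sys.exit(run.returncode)

        # always run the relevant tests afterwards so failures surface in the logs
        test = subprocess.run(
            ["dbt", "test", "--models", f"tag:{args.tag}", "--target", args.target]
        )
        sys.exit(test.returncode)

    if __name__ == "__main__":
        main()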

How this orchestration works will be explained in more detail in the third article of this series.

In the next episode

I hope this second part gave you further material for discussion with your team while you are getting familiar with dbt.

In the last part of this series I will discuss:

  • How we orchestrate data pipelines and dbt models at Photobox.
  • How to generate and serve dbt documentation to your consumers.
  • A short summary of my experience with dbt.

Stefano Solimito is a Principal Data Engineer at Photobox. You can follow him on LinkedIn.
