Practical tips to get the best out of Data Build Tool (dbt) — Part 3

Stefano Solimito · Published in Unboxing Photobox · Apr 9, 2020

In my previous posts:

Practical tips to get the best out of Data Build Tool (dbt) — Part 1

Practical tips to get the best out of Data Build Tool (dbt) — Part 2

I covered different topics that you might want to discuss with your data engineering team at the beginning of your journey with dbt.

In this article, I will discuss at a high level:

  • How to orchestrate data pipelines and dbt models.
  • How to use dbt documentation and serve it to an entire organisation.

How to orchestrate data pipelines and dbt models

As you probably know, dbt organises models into DAGs (directed acyclic graphs).
The question is: is this enough to orchestrate your entire platform, or is some form of external orchestration required?

At Photobox, due to how we structured our repositories, the core of our data platform relies on three main dbt projects:

  • Raw-to-dataquality dbt project
  • Dataquality-to-warehouse dbt project
  • Warehouse-to-reporting dbt project

Each of them contains dozens of models, each with one or more dependencies, and has a single purpose: to move a set of tables from one layer of our warehouse to another, performing various data transformations along the way.
Let’s take as an example the data lineage of dataquality-to-warehouse in Fig-0, where every blue box is a dbt model and every green box is a data source.

Fig-0: Photobox dataquality-to-warehouse dbt project, entire data lineage as displayed in dbt documentation.

Is this understandable? The sheer size of the graph and the number of dependencies make it confusing.

We could use dbt to run the entire project, but we would have little control over retries and failures of individual tasks; we would just have to hope for the best.

Alternatively, we could run just a part of it.

For instance, I might want to run only the portion of the project that handles the events collected by the product configurator on Photobox’s website.

Fig-1: Photobox dataquality-to-warehouse dbt project, editor sub-process data lineage as displayed in dbt documentation.

Compared to the entire project’s data lineage, this seems easier to understand and small enough to rerun in case of failure.
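As a minimal sketch of what such a partial run looks like, dbt’s node selection syntax is what makes this possible: a `+` prefix selects a model together with all of its upstream dependencies. The model name below is hypothetical, and wrapping the CLI in Python is just one convenient way to invoke it from an orchestrator.

```python
import subprocess

# Hypothetical model name; the "+" prefix selects the model plus all of its
# upstream dependencies. Equivalent to typing:
#   dbt run --models +stg_editor_events
subprocess.run(
    ["dbt", "run", "--models", "+stg_editor_events"],
    check=True,  # raise on failure so the caller can retry just this sub-DAG
)
```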

In Photobox we decided to orchestrate dbt sub-DAGs using Airflow.

Using external orchestration gives us better control over what we run: we can rerun only the part of the process that didn’t succeed, without wasting computational resources.

Each pipeline we built is idempotent and can be re-executed anytime, assuming that all its dependencies are satisfied.

Also, breaking a dbt project down into multiple Airflow tasks allows more flexibility in how Snowflake virtual warehouses are used: different dbt sub-DAGs can run on different virtual warehouses depending on the computational resources they require.

Fig-2: Orchestration of dbt sub-DAGs using Airflow.

The dbt project shown in Fig-0 can be broken down into the Airflow tasks shown in Fig-2.

With this approach, we can clearly see and manage dependencies between different dbt sub-DAGs, and we gain an immediate understanding of what went wrong in case of failure.
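To make the pattern concrete, here is a minimal sketch of such an Airflow DAG, assuming Airflow 1.10.x with dbt available on the worker (our real tasks run as dockerised containers on ECS, so this is a simplification). The selectors, dbt target names, and project path are all hypothetical; each target would point to a differently sized Snowflake virtual warehouse in profiles.yml.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10.x path

# Hypothetical sub-DAG selectors and dbt targets; each target maps to a
# differently sized Snowflake virtual warehouse in profiles.yml.
SUB_DAGS = {
    "editor_events": {"models": "+wh_editor_events", "target": "transform_m"},
    "orders": {"models": "+wh_orders", "target": "transform_xl"},
}

default_args = {
    "owner": "data-engineering",
    "retries": 2,  # Airflow, not dbt, owns retries of each sub-DAG
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="dataquality_to_warehouse",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    tasks = {}
    for name, cfg in SUB_DAGS.items():
        tasks[name] = BashOperator(
            task_id="dbt_run_{}".format(name),
            bash_command=(
                "cd /opt/dbt/dataquality-to-warehouse && "
                "dbt run --models {models} --target {target}".format(**cfg)
            ),
        )

    # Dependencies between sub-DAGs are encoded explicitly (hypothetical order).
    tasks["editor_events"] >> tasks["orders"]
```

Because each pipeline is idempotent, a failed task can simply be cleared and retried in Airflow without touching the rest of the graph.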

Document and serve

dbt automatically generates documentation for your project in the form of a static website.
How good your documentation is, and how consumers can access it, is up to you.

At Photobox, our pipelines ranged from “pointless over-documented pipeline” to “I don’t have a clue what is happening here”. For these reasons, we started to define standards not only for ELT implementation but also for how it is documented.

As a first step, we standardised the dbt models folder structure and agreed on naming conventions; subsequently, we agreed on what has to be documented and how:

  • Every table that is materialised and accessible to our consumers has every field described inline. The table description can be inline or in a markdown document (depending on how complex it is).
  • Every intermediate model that runs in our staging schemas contains an inline description of the step. If the step is non-trivial, a markdown document is used to describe it in detail.
  • When the architecture team provides a design for a table, the Confluence page that contains the design is linked in the dbt documentation.

Fig-3: dbt documentation of an intermediate step.

dbt generates a static website that can be used to expose the documentation.

This feature can be integrated with your CI pipeline to ensure your dbt documentation is refreshed after every release.

At Photobox we use AWS CodePipeline as a serverless CI, and all our dbt projects are dockerised to run on ECS.

All our CI pipelines perform the following steps:

  • Install dependencies
  • Build the dbt Docker image
  • Test the dbt Docker image
  • Upload the dbt Docker image to ECR
  • Create/update the ECS task definition
  • Generate the dbt documentation
  • Upload the dbt documentation to S3

For each environment, we created an S3 bucket to host our documentation (using S3 static website hosting).

Each data pipeline’s CI uploads the generated documentation to S3, automatically refreshing the documentation website.
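The last two CI steps can be as simple as the following sketch, assuming boto3 is available in the build image; the bucket and project names are hypothetical. `dbt docs generate` writes the static site into the project’s target/ folder, and the site is served directly from S3.

```python
import os
import subprocess

import boto3

DOCS_BUCKET = "photobox-dbt-docs-prod"  # hypothetical; one bucket per environment
PROJECT = "dataquality-to-warehouse"    # hypothetical documentation prefix

# Generate the static documentation site into target/.
subprocess.run(["dbt", "docs", "generate"], check=True)

# Upload the files that make up the dbt docs site.
s3 = boto3.client("s3")
for filename in ("index.html", "manifest.json", "catalog.json"):
    extra = {"ContentType": "text/html"} if filename.endswith(".html") else None
    s3.upload_file(
        os.path.join("target", filename),
        DOCS_BUCKET,
        "{}/{}".format(PROJECT, filename),
        ExtraArgs=extra,
    )
```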

A landing page (Fig-4) is used as an index of the different dbt documentation websites.

Fig-4: Photobox dbt documentation index page.
Fig-5: Photobox dbt documentation.

Summary

dbt is an extremely powerful and flexible tool that quickly enables your team to build data pipelines. This flexibility, however, comes at a price.

Without establishing a set of shared principles and agreeing on a common structure to guide your dbt projects, you might soon find yourself with plenty of exotic ELT processes.

At Photobox it took multiple iterations to set our standards and ways of working. We aren’t finished yet, but we are in a much better place than a few months ago.

Whatever standards you decide to adopt for your dbt projects, it is important to invest enough time in them before rushing into building what could become a massive technical-debt monster.

I hope this article helps you discuss with your team some aspects of adopting dbt, and spares you some of the challenges that we faced during our journey at Photobox.

Stefano Solimito is a Principal Data Engineer at Photobox. You can follow him on LinkedIn.
