Practical tips to get the best out of Data Build Tool (dbt) — Part 1

Stefano Solimito
Published in Unboxing Photobox · Mar 27, 2020

This series of articles is for those who already have some basic experience with dbt (Data Build Tool) and want to get the best out of it when building a data platform. If you want an introduction to building data pipelines with dbt, I suggest having a look at this article that I wrote a few months ago, or at the official dbt documentation.

dbt is undoubtedly great for performing ELT but it is sometimes presented as a tool designed mostly for analysts. I personally disagree: I think dbt can find its place in data engineering as well.

During the last year, I’ve used dbt to build two different data platforms, based on different cloud providers and warehouse technologies. Used with the right approach, dbt proved to be extremely flexible for all the data engineering teams that adopted it.

The potential cost of a mistake is much higher when you are building a data platform that has to grow organically over many years, compared to a small self-contained reporting project. For this reason, agreeing on usage principles for dbt with your team at an early stage of the project is paramount.

At Photobox we went through multiple iterations early in our dbt journey to make sure the tool was both a good fit for the technology stack of our data platform and able to support the different processes we have to build in data engineering.

It’s worth noting that the data engineering team at Photobox also takes care of the warehouse layer of the data platform. This is not necessarily the approach other companies take (for instance, the warehouse and reporting layers could be split between data engineering and BI, or between data engineers, ETL developers and BI).

At Photobox the ETL developer function is part of the data engineering team, so we wanted to come up with a tech stack that would allow:

  • Data engineers to write code rather than use a drag-and-drop tool
  • Analysts to use the same tech stack, so that data engineers could easily deploy their models in the reporting data layer and, above all, so that data lineage and documentation live in the same place.

In this series of articles, I will discuss eight things I learned from my experience with dbt that I think can be valuable for other data engineering teams. You might have different opinions about how to use dbt but, hopefully, these eight points are a good starting point for a discussion with your engineers about how you are using, or plan to use, the tool when building a data platform.

For part 1 of this series of articles, here are my practical tips to get the best out of dbt:

  • One data platform, how many dbt projects?
  • dbt Models organisation.

One data platform, how many dbt projects?

Let’s say you are building your company’s data platform. The data ingestion process has been taken care of and the RAW data is now accessible through one of the dbt-supported databases. Now what?

You will probably have all your data in a RAW data layer, ready to be cleaned. People will start to talk about building reports, modelling the data, and how the data can be accessed. Everyone who can benefit from the new data platform will start to get excited. But you know you still have a lot to do, and whatever transformation process you are going to build, you want to get the approach right as early as possible.

In my experience there are four different approaches that you might consider when you organise your dbt projects:

  • Monolithic: A single dbt project for your entire platform. It cleans your data, builds facts and dimensions, and materialises any table that might be needed to support your reporting and analytics.
  • “Micro-services”: OK, it’s not really microservices, but you get the idea: multiple small projects, each with a single purpose, for instance “clean data source X and store it in the data-quality layer” or “build dimension Y”.
  • Layers: Your data platform will normally be structured in logical layers. A common layer structure is:

RAW: No transformation or cleansing is applied to the data. It is available as soon as it is ingested, but usability and performance are likely to be poor.

DATA-QUALITY: Quality rules are applied. Data is in a tabular format, cleaned and ready to be consumed, but no data modelling is applied. Data is not immediately available after ingestion, since some processing is required.

WAREHOUSE: Where facts and dimensions are built. Due to their complexity, tables in this layer are refreshed less frequently than those in the previous layers.

REPORTING: Contains optional further aggregations on top of the warehouse. These tables are optimised to serve specific reports faster than the tables in WAREHOUSE, but they are less flexible.

The approach can be summarised as “build one dbt project for each layer”, i.e. raw-to-dataquality, dataquality-to-warehouse, warehouse-to-reporting (a minimal project sketch follows Fig-0 below).

  • Layers and verticals: This approach is similar to the Layers approach, but after the data-quality layer one dbt project is built with all the facts and dimensions relevant to the entire organisation, and additional vertical-specific projects are created to serve specific areas of the organisation.
Fig-0: Graphical representation of the different dbt project organisation approaches.
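To make the Layers (and Layers and Verticals) approach more concrete, here is a minimal sketch of what one of the layer-specific projects could look like. The project, profile and schema names are invented, and the configuration is illustrative only, written in current dbt project syntax rather than being Photobox’s actual setup.

```yaml
# Hypothetical dbt_project.yml for a raw-to-dataquality project.
# Names and materialisations are illustrative, not Photobox's real configuration.
name: raw_to_dataquality
version: '1.0.0'
config-version: 2

profile: dataquality             # connection details live in profiles.yml

models:
  raw_to_dataquality:
    +materialized: view          # sensible default for the whole project
    intermediate:
      +materialized: ephemeral   # cleansing steps, not exposed to consumers
    dataquality:
      +materialized: table       # the layer other projects and teams read from
      +schema: data_quality
```

Each layer project stays small, owns its own documentation, and exposes only the tables of its target layer to the projects downstream.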

The table below summarises the pros and cons of the listed approaches.

At Photobox we first tried the monolithic approach, until it became clear that, given the growth rate of the platform, a different solution was required.

During my experience at The Telegraph, I tested a “Microservices” approach but we immediately understood that what might work in the API world doesn’t really work with data pipelines. Data lineage was scattered across multiple repositories, and keeping our macros in sync and ensuring dependencies were respected was just a massive headache.

In the end, at Photobox, we decided to adopt the Layers and Verticals strategy.
What made us choose this option was having clear dbt documentation for each layer while, at the same time, allowing flexibility around the core area of our data platform, enabling other teams to play an active role in building data pipelines.

dbt Models organisation

At Photobox our dbt projects contain dozens of different models, and sometimes the transformations applied to the data are not trivial to understand. For this reason, a proper organisation of the project structure is key to allowing our engineers to jump from one project to another and implement a new feature or a fix without getting tangled up in the complexity of the project itself.

Let’s consider a hypothetical data source X that has to be cleansed and moved from the RAW data layer of our platform to the DATA-QUALITY layer. This is how dbt models are organised for this case:

Fig-1: dbt Models organisation example

Fig-1 shows one folder for each layer of our data warehouse. One extra folder is created to store intermediate transformation and cleansing steps.

Every folder contains a schema.yml file that looks like the example in Fig-2; it is used both for documentation and for schema tests.

Fig-2: Schema.yml example. Shows schema documentation and tests.
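Since Fig-2 is an image, the sketch below shows, purely as an illustration, what a schema.yml of this kind could look like. The model and column names are invented; the tests (not_null, unique, accepted_values) are standard dbt schema tests.

```yaml
# models/dataquality/source_x/schema.yml -- illustrative example only
version: 2

models:
  - name: dq_source_x
    description: "Cleansed version of source X, ready to be consumed."
    columns:
      - name: order_id
        description: "Natural key of the order coming from source X."
        tests:
          - not_null
          - unique
      - name: order_status
        description: "Lifecycle status of the order."
        tests:
          - accepted_values:
              values: ['created', 'paid', 'shipped', 'cancelled']
      - name: ingested_at
        description: "Timestamp at which the record landed in the RAW layer."
        tests:
          - not_null
```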

For complex steps where inline documentation might not be enough, a markdown document is created and linked in schema.yml to allow a more detailed explanation of what that specific step does.
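In dbt this is done with docs blocks: the detailed explanation lives in a markdown file (say, models/dataquality/source_x/source_x_docs.md, a hypothetical name) and is referenced from schema.yml through the doc() function. The block name below is invented.

```
{% docs dq_source_x_cleaning %}
Explains in detail how source X is deduplicated, how late-arriving records
are handled and which quality rules are applied before the data is exposed
in the DATA-QUALITY layer.
{% enddocs %}
```

The corresponding model description in schema.yml then becomes `description: '{{ doc("dq_source_x_cleaning") }}'`, and dbt renders the full markdown in the generated documentation site.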

In Fig-1, intermediate steps are organised in subfolders based on the source and numbered so that the execution order is clear.
Processes like cleaning a data source have a standard set of steps (defined upfront) that are always executed in the same order; deviations from this path are allowed but have to be justified.
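As a purely illustrative example of such a chain (the model and column names are invented), an intermediate step can be a small model that builds on the previous numbered step via ref(), which is how dbt keeps the dependency, and therefore the execution order, explicit:

```sql
-- models/intermediate/source_x/source_x_2_deduplicate.sql (illustrative)
-- Step 2 of the standard cleansing chain: keep only the most recent version
-- of each record produced by step 1.

with typed as (

    select * from {{ ref('source_x_1_cast_types') }}

),

ranked as (

    select
        *,
        row_number() over (
            partition by order_id
            order by ingested_at desc
        ) as row_num
    from typed

)

select *
from ranked
where row_num = 1
```

Because each step only ever references the previous one, dbt resolves the order from the DAG; the numbering in the file names is there for the humans reading the project, not for dbt.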

Organising our models in this way and agreeing on general transformation principles made our projects much more understandable for our engineers.

This might seem like a small change, but it actually increased the productivity and confidence of the entire team.

There is always some overhead in defining standards, but it’s time and effort well invested: once the standards are in place, development moves much faster.

In the next episode

In the next post (https://medium.com/photobox-technology-product-and-design/practical-tips-to-get-the-best-out-of-data-build-tool-dbt-part-2-a3581c76723c) I will discuss:

  • How to get the best out of your dbt_project.yml file
  • dbt Macros usage
  • How dbt makes you rethink some aspects of traditional data warehousing
  • Wrap dbt in a Python application

I hope this post helps you discuss some aspects of adopting dbt with your team and spares you some of the challenges we faced during our journey at Photobox.

Stefano Solimito is a Principal Data Engineer at Photobox. You can follow him on LinkedIn.
