To dbt or not to dbt

Lessons learnt from a year of implementing dbt at Intercom

Pragun Bhutani
Oct 10, 2023

If you work in data, you’re probably all too familiar with the problems that come with managing an analytics or data science code base: the verbose nature of SQL, spaghetti code, legacy transformations, confusing data lineage, temporary analyses sitting comfortably in the middle of production reports, non-uniform data types and a lack of consistent naming conventions, to name a few. At Intercom, we’re no strangers to these problems either.

It was in this context that a little over a year ago we decided to implement dbt to manage our data transformations. Today we reflect on what we like about dbt, what we don’t like as much and some of the lessons that we’ve learnt from implementing dbt in an existing code base which already contains a ton of transformations.

What’s cool

Let’s begin with the good parts.

Organised code base

We love how dbt is opinionated in terms of the folder structure required to organise your code base. While it took a couple of iterations for us to get it just right at the beginning, we now have a consistent way of storing our core facts and dimensions, our aggregate reporting models and exposures. All the intermediate stages required to build any of these models are co-housed in a subfolder with the model they’re used in.

As a result we have a code base that’s consistent, easy to read and easy to onboard people onto: if you can figure out how one model works, you can figure them all out.
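As an illustration (this is a hypothetical layout, not our exact tree), such a structure might look like:

    models/
      staging/                      # source-conformed staging models
        stg_accounts.sql
      core/                         # core facts and dimensions
        dim_accounts/
          dim_accounts.sql
          int_account_regions.sql   # intermediate stage co-housed with the model it feeds
      marts/                        # aggregate reporting models and exposures
        finance/
          agg_arr_by_region.sql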

Staging models

Staging models are a great way to ensure data consistency across your code base. We use our staging models to combine data from partial sources into a common “source” table, rename columns to follow consistent patterns, unpack JSON columns and typecast columns to predictable data types.

This turns out to be really useful because production tables often store data inconsistently for legacy reasons, e.g. boolean columns that hold TRUE/FALSE in some places and 1/0 in others, and these minor differences can cause unexpected errors further downstream. The staging layer gives us a chance to intercept these inconsistencies and fix them close to the source.
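As an illustration, a staging model along those lines might look like this hypothetical sketch (source, model and column names are made up, and the JSON and casting syntax will depend on your warehouse):

    -- models/staging/stg_accounts.sql (hypothetical example)
    -- Rename, typecast and unpack the raw accounts table close to source.
    select
        id                                        as account_id,
        lower(email_domain)                       as email_domain,
        -- normalise TRUE/FALSE and 1/0 variants into a proper boolean
        cast(is_active as boolean)                as is_active,
        -- unpack a JSON column (the exact function depends on your warehouse)
        json_extract_path_text(metadata, 'plan')  as plan_name,
        cast(created_at as timestamp)             as created_at
    from {{ source('app', 'accounts') }}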

Snapshots

If you’ve tried to create slowly changing dimension (SCD) tables using only SQL and Airflow (or your orchestration tool of choice), you know how laborious this task can be. You can either take a complete archive of your table every day and accumulate data volume really quickly, or you can go the pure SCD route: take the current state of a table, compare it to the older state, handle the rows that have changed, handle new rows that were inserted, handle the rows that were deleted at source and so on.

dbt abstracts all of that logic into the snapshot command, which makes it super easy to create these dimension tables. You feed the snapshot a reference to the model you want to turn into a dimension and dbt takes care of the rest for you.
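For illustration, a minimal snapshot might look like this (names are hypothetical; the timestamp strategy assumes the model has a reliable updated_at column):

    -- snapshots/accounts_snapshot.sql (hypothetical example)
    {% snapshot accounts_snapshot %}
    {{
        config(
            target_schema='snapshots',
            unique_key='account_id',
            strategy='timestamp',
            updated_at='updated_at'
        )
    }}
    select * from {{ ref('stg_accounts') }}
    {% endsnapshot %}

Each dbt snapshot run compares the current query result with the previous one and maintains the dbt_valid_from and dbt_valid_to columns for you.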

Lineage and orchestration — in other words “ref” and “source”

The “ref” and “source” methods of defining model references in dbt are simple and effective. Instead of pointing to the name of a production table, you point to a “source”. You can change the underlying tables behind that source and all the downstream models adapt to the change automatically.

Similarly, the “ref” method constructs a dependency graph for your models. Before dbt, if you had to propagate a change through some intermediate stages, you would have to manually execute the queries to build each of those models all the way through to your final model. For example, let’s say you have an aggregate table that reports on your organisation’s ARR through the years. You’ve discovered that one of the accounts has been tagged to the wrong geographic region, and hence some of your revenue looks like it’s coming from EMEA when it should have been attributed to APAC. To correct this, you’d first have to fix the dimension table that holds account information and then rebuild the aggregate table that reports on ARR.

However with dbt, you can simply run dbt run -s dim_accounts+ and dbt will update all the models downstream of the accounts dimension table. This was a rather simplified example involving a two-step transformation but real world reports are hardly ever that simple and usually involve multiple intermediate stages that need to be updated before your changes are reflected in your final report.
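To make the ARR example concrete, the aggregate model might look something like this hypothetical sketch (model and column names are ours for illustration, not Intercom’s actual models):

    -- models/marts/agg_arr_by_region.sql (hypothetical example)
    select
        accounts.region,
        date_trunc('year', arr.arr_date) as arr_year,
        sum(arr.arr_amount)              as arr
    from {{ ref('fct_arr') }} as arr
    join {{ ref('dim_accounts') }} as accounts
        on arr.account_id = accounts.account_id
    group by 1, 2

Because the model declares its upstream dependencies through ref, dbt knows it sits downstream of dim_accounts and rebuilds it as part of dbt run -s dim_accounts+.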

Macros

One major shortcoming of SQL has always been the lack of reusable functions or methods. Yes, you can create UDFs, but you have to “install” them onto your warehouse or SQL engine, and their definitions often live in an external code base. Every time you make a change, you have to reinstall them onto your warehouse, and the whole process lacks the simplicity of ordinary functions.

User-defined dbt macros give you exactly that: function-like, reusable snippets that are version-controlled right alongside your models.

One of the ways we’ve employed macros that we’re quite proud of is how we deal with sensitive data in slowly changing dimensions. Slowly changing dimension tables keep track of the entire history of changes for a particular entity, and that history contains sensitive information too. When a record is deleted at source, say when an account is closed, we need to make sure that we remove every trace of data that could be considered sensitive. To solve this problem, we built a macro that runs as a post-hook on every snapshot and scrubs sensitive data from the historical records of deleted entities in our dimension tables.

Another particularly useful macro works out which schema a model should be materialised in, depending on whether it’s running locally or in production, and automatically creates a separate developer schema for each of our dbt builders. You can turn your frequently used query snippets into macros, run a slightly different variation of a query based on your environment, or limit the data you’re working with while you develop and iterate; the sky’s the limit.
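To give a flavour, here’s a minimal, hypothetical environment-aware macro (not our actual implementation); target.name comes from the active target in your profiles.yml:

    -- macros/limit_in_dev.sql (hypothetical example)
    {% macro limit_in_dev(row_limit=1000) %}
        {# Only limit rows when we are not running against production #}
        {% if target.name != 'prod' %}
            limit {{ row_limit }}
        {% endif %}
    {% endmacro %}

    -- models/staging/stg_events.sql
    select * from {{ source('app', 'events') }}
    {{ limit_in_dev(5000) }}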

To make things better, the dbt_utils package (installed with a couple of lines in packages.yml and a dbt deps) gives you a set of ready-made macros that capture some broad but fairly common use cases we all encounter. Ever done a group by 1, 2, 3, 4, 5, 6, 7, 8, 9? That’s dbt_utils.group_by(9) for you. Similarly, dbt_utils.union_relations will perform a safe union of two tables, even when they don’t contain the same set of columns.
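For instance, two separate model snippets using these macros might look like this (model names are hypothetical), assuming the package has been added to packages.yml and installed with dbt deps:

    -- group by the first two columns without typing "group by 1, 2"
    select
        account_id,
        plan_name,
        count(*) as events
    from {{ ref('stg_events') }}
    {{ dbt_utils.group_by(2) }}

    -- safely union two relations that do not share identical columns;
    -- any missing columns are filled with nulls
    with unioned as (
        {{ dbt_utils.union_relations(relations=[ref('stg_invoices_eu'), ref('stg_invoices_us')]) }}
    )
    select * from unioned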

Tests

While the in-built dbt tests are fairly basic, they’re so simple to implement that it’s hard not to like them. Even simple checks, like whether certain columns are always unique, never null or only contain values from a predefined set, can help you build broad test coverage of your models very quickly.

Out of the box, dbt ships with four generic tests: not_null, unique, accepted_values and relationships. In addition to the in-built tests, dbt also allows you to define your own tests and the process is pretty simple: write a SQL query that returns the “unacceptable” rows and it can be turned into a test.
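For example, a hypothetical singular test could look like the following; any rows the query returns are reported as failures when you run dbt test:

    -- tests/assert_no_negative_arr.sql (hypothetical example)
    select *
    from {{ ref('fct_arr') }}
    where arr_amount < 0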

We added generic tests everywhere and were frequently surprised by how many issues they helped us catch that could have potentially flown under the radar for a few weeks.

Documentation

dbt takes all of your model declaration yml files and turns them into beautiful documentation out of the box. It even produces beautiful visual representations of the dependency graph for your models, making it super easy to see how each one was built. You don’t have to do a single extra thing to avail of this feature.
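For reference, building and browsing the site takes just two commands:

    dbt docs generate    # build the catalog and manifest behind the docs site
    dbt docs serve       # browse the generated site locally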

The documentation that comes in-built with dbt is better than most organisations will build over their lifetime. ‘nuff said.

What’s less cool

Now for some things that we didn’t really enjoy as much.

Barriers to entry

There is a bit of a barrier to entry for first time dbt users. While 90% of your dbt code base is SQL, the other 10% definitely takes some getting used to and there’s a steep learning curve at the beginning. Even the process of breaking down your code into staging, intermediate and core layers is not immediately obvious and it can take time for people to internalise that mental model.

Further, unless you’ve opted to buy dbt’s cloud offering, you do need to have a fair amount of comfort and familiarity with the command line to execute dbt models and make the most of its features.

As a result, it can be a bit of a challenge to implement dbt in a team where people only know how to work with SQL. This is an important thing to consider when deciding if dbt is the right tool for your organisation.

Ease of Iteration

The same “ref”, “source” and macros that make dbt so useful in production can sometimes be hard to work with when you’re iterating and developing locally. With plain SQL, you can simply copy your CTEs or subqueries and execute them in your SQL client to see what sort of result they produce. You can’t do that with your dbt models, because they’re written in Jinja-templated SQL that only becomes executable SQL once it’s compiled.

To get around this, you either need to subscribe to dbt Cloud or work through a lot of plugins and configuration to get your IDE working just right. Alternatively, you can compile your code every time you make a change and then execute pieces of the compiled SQL, which is often harder to read than your dbt models. This lack of ease of iteration can add some frustration to the development process, and we’ve felt that too.
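In practice, that workaround looks something like this (model name hypothetical); dbt writes the rendered SQL under target/compiled/ inside your project folder:

    dbt compile --select agg_arr_by_region
    # then open the rendered SQL under target/compiled/<project_name>/models/ and run it in your client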

Too much boilerplate

Sometimes it can feel like it takes too much boilerplate to build simple things with dbt. One can argue that all of it, the YAMLs, the refs and the sources, plays an integral role in the dbt universe, but that doesn’t negate the overhead of the extra process needed to get anything done.

Combined with the two points above, all this boilerplate also hampers the dbt development experience a fair amount.

How to implement dbt when you’ve already got a lot of transformations

So you’ve gone through your list of considerations and decided that dbt is indeed the right tool for you. Congratulations! The next challenge is figuring out how to implement a completely new dbt-based analytics project at an organisation that already has a ton of transformations. We certainly faced this situation at Intercom, and here are a few things we’ve learnt from the experience.

Know your win conditions

Implementing dbt is a fairly daunting task to get started with, and there is a lot of front-loaded work to be done before you start seeing any real results. At times like these, you can find yourself wondering why you’re doing this at all. That brings us to the point: define your win conditions from the get-go.

What are you trying to achieve when you implement dbt?

  • Reduce the number of errors in your models?
  • Bring more consistency to your data products?
  • Improve the development and iteration speed of your data team?
  • Decrease the amount of time lost in doing duplicate work?
  • Have clear lineages and defensible definitions of your metrics?
  • Make it easier to onboard new people on your codebase?

All of these are good reasons to implement dbt — but take the time to understand which of those make the most sense for you.

Land and expand

When implementing a new system, there’s always a temptation to chuck out all of your existing models and replace them with the new shiny thing. This approach is risky because you don’t yet know just how difficult or easy the process of transformation will be, or even if you’ll like the end result. You can quickly find yourself juggling more things than you can manage.

Instead, try to pick a report that’s used frequently, that you often have problems with, and whose problems could plausibly be resolved with dbt. With this report in mind, build a conceptual tree of the facts and dimensions you need to arrive at the final result and start building those models out in dbt, all the way from your source and staging layers to your final exposure.

Once this process is complete, you can evaluate if you’re still happy with your decision to go with dbt. If you are, you’ll often find more use cases that are adjacent to the models you’ve just built. Those become good candidates for transformation next.

Separate your schemas

When you’re building a dbt project in the midst of other transformations, it helps if you can separate your project into a new schema. Not only does this give data users a clear indication as to the role of a given model, it also gives you additional control as you can have different security protocols for each of these layers and even track usage independently to see which kinds of models get used the most.

We suggest separating your schemas as follows:

  • base — materialised staging models
  • prep — intermediate models that don’t serve a standalone purpose and only exist to help you create other models
  • core — facts and dimensions
  • mart — aggregate metric or reporting tables

However this can be quite subjective and you should see what works best for you.
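If it helps, a model can be routed into one of these schemas from inside its own file (model name hypothetical). Keep in mind that, by default, dbt appends the custom schema to your target schema (e.g. analytics_mart) unless you override the generate_schema_name macro:

    -- models/marts/agg_arr_by_region.sql (hypothetical example)
    {{ config(schema='mart', materialized='table') }}

    select * from {{ ref('fct_arr') }}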

Conclusion

One year in and we’re still in the process of migrating our enormous analytics code base to dbt. Over this period we’ve iterated on our designs quite a few times, faced some frustrations and enjoyed many of the benefits that dbt has to offer. We hope this article helps you make an informed decision as to whether dbt is right for you!
