A simple implementation of Continuous Integration for Data Warehouse development with DBT

Landbot Data Team
Landbot Engineering
6 min read · Aug 2, 2022

Patricio Villanueva, Luis Martínez and Arnau Tibau Puig — Landbot Data Team

Introduction

At Landbot we use dbt to develop the code that defines the databases and tables in our Data Warehouse (DWH from now on), as it allows us to quickly develop data models and ensure their quality.

One of the main concerns in our development process is to ensure that the data is accurate and that it remains accurate as we introduce new models or update existing ones. To this end, we decided to implement the gold standard in software development, which is to (1) continuously run a suite of tests on our data models with every Pull Request (Continuous Integration) and (2) automatically trigger a production build with every merged Pull Request (Continuous Delivery). In this post, we will describe a simple way to implement a Continuous Integration development workflow (CI from now on) with DBT Cloud and Gitlab/Github.

With the implementation that we describe below, we were able to speed up the development cycle of our models, while maintaining data quality and saving costs. So, if you are looking to implement CI in a similar way, this blog post is for you.

Our first approach to CI

When we started to implement our DWH, we decided to adopt DBT and DBT Cloud. The former (DBT) allows us to transform our ingested data into usable models and test them to prevent inconsistencies, while the latter (DBT Cloud) allows us to schedule the jobs to build our DWH, and generate documentation so that our users can understand what data we have and where.

Our initial setup was such that:

  • Each developer had a dedicated dbt schema that acted as an isolated development environment,
  • Every Pull Request triggered a build and test job in a shared CI dbt schema,
  • Merging to master triggered a build and test job in our production environment (a rough sketch of these three environments follows below).
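
To make this concrete, here is a minimal sketch of what each stage boiled down to in dbt CLI terms; the target names (dev, ci, prod) are illustrative assumptions mapping to separate schemas defined in profiles.yml, not our exact configuration:

    # Developer environment: each developer builds into their own isolated schema,
    # e.g. a "dev" target in profiles.yml pointing at a per-developer schema.
    dbt run  --target dev
    dbt test --target dev

    # Shared CI environment: every Pull Request rebuilt and tested all models here.
    dbt run  --target ci
    dbt test --target ci

    # Production: triggered by merging to master.
    dbt run  --target prod
    dbt test --target prod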

To put a new model or a change into production, we required that:

  • All changes went through a peer-reviewed Pull Request,
  • The CI pipeline succeeded.

The entire flow is illustrated in the diagram below:

Our initial approach to developing and deploying data models, with two developers working in parallel.

However, as our team and the importance of the data warehouse grew, we started updating it more often, adding more and more tests, oftentimes resulting in multiple Pull Requests open at the same time. In the above setup, all the pipelines triggered by our Pull Requests were run and tested in the same shared dbt schema, which acted as our CI “environment”. Within it, each CI job would build all the models from scratch, resulting in long execution times and unnecessary data processing costs.

As a next step, we transitioned to a Slim CI execution mode. This execution mode allows dbt to build and test only those models that have changed relative to a previous reference run. As a result, each Pull Request only triggers a build for the models that have been modified (and, if desired, their children). Let's visualize this with an example, using the following toy lineage graph:

Example of a simple lineage graph. We have three tables: A, B, and C, where C depends on A and B.

By using Slim CI we can reduce CI runtimes and data processing costs, since each CI run only builds and tests the subset of the lineage graph that needs to be re-run:

Pull Request #1 modifies table A; Pull Request #2 modifies tables B and C. Slim CI triggers a job that builds only the modified tables and their children, so PR #1 builds only the tables in red and PR #2 only the tables in green.
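
In dbt CLI terms, a Slim CI run boils down to the state:modified+ selector plus a deferred state to compare against. Here is a minimal sketch, assuming the manifest.json of the reference run has been downloaded to ./reference-artifacts (a hypothetical path):

    # Build and test only the models modified in the PR (and their children).
    # Unchanged parents are resolved from the deferred state instead of being rebuilt.
    dbt run  --select state:modified+ --defer --state ./reference-artifacts
    dbt test --select state:modified+ --defer --state ./reference-artifacts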

Unfortunately, the Slim CI execution mode introduced a different challenge: our CI would now fail at times because of conflicting changes across concurrent Pull Requests. To see why, consider the case where two concurrent Pull Requests modify the same table C (following the example above):

We open PR #1, which triggers the CI; the run succeeds and updates the shared CI dbt schema. We then open PR #2, which runs on top of the changes left behind by PR #1 and triggers a new CI run. This last run fails because the changes introduced by PR #2 are incompatible with the state resulting from building PR #1.

DBT blog posts to the rescue… or not!

Upon realizing this issue, we decided to scan the dbt literature to look for existing solutions. Below is a list of the approaches we found:

  • Deferring to the previous CI run in DBT Cloud. The problem with this solution is that dbt can only defer to the single most recent successful run; moreover, if the dbt run command succeeded but dbt test did not, the tables were still updated in our CI schema, causing the very error we were trying to avoid.
  • Running dbt compile every X minutes. This creates a fresh manifest.json with the schemas and data types of the tables currently present in the CI schema, allowing us to compare its latest version against the schemas and data types of the PR's tables (see Vimeo's implementation for an example). However, if two Pull Requests are opened between one dbt compile execution and the next, the run order becomes PR1, PR2, dbt compile, which means PR2 cannot benefit from the latest manifest.json as a reference.
  • Running the CI outside of DBT Cloud (for example, in your own execution environment). We could in principle run the command dbt run --select state:modified+ --defer --state path/to/prod/artifacts locally to use any previous run we want as a reference (see the sketch after this list). Unfortunately, this approach requires us to provide our own CI execution environment, which was too costly to implement.
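
For illustration, a self-hosted CI job (for example, the script section of a GitLab CI job) might run something like the sketch below; the adapter, artifact location and paths are all assumptions:

    # Install dbt with the adapter for your warehouse (the adapter choice is an assumption).
    pip install dbt-core dbt-bigquery

    # Fetch the artifacts of the last production run from wherever you store them
    # (the bucket and path are hypothetical).
    aws s3 cp s3://your-dbt-artifacts/prod/manifest.json ./prod-artifacts/manifest.json

    # Build and test only what changed, deferring unchanged parents to production.
    dbt deps
    dbt run  --select state:modified+ --defer --state ./prod-artifacts
    dbt test --select state:modified+ --defer --state ./prod-artifacts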

To each their own (schema!)

None of the above solutions were suitable for our needs, since they would either still require sharing the CI schema (with corner cases that would cause our CI to fail) or require spending resources to run the CI jobs outside of DBT Cloud.

As usual, the solution was obvious in hindsight: stop sharing the schema for all CI runs and isolate each one into its own. Fortunately, it turns out that DBT Cloud already has a feature to enable just that.

This feature (which we show how to configure in the section below) gives you the option to generate a temporary schema for each Pull Request, which is automatically deleted when the Pull Request is either merged or closed. Furthermore, it allows us to use Slim CI (dbt run --select state:modified+) and build only the models that have changed and their children, while the parent tables are queried from the deferred state of our choosing (typically our last production run). See the diagram below for the summarized flow, following the same example situation as above:

Two developers work in parallel, and each opens a PR to merge their changes. These PRs run in two isolated schemas, so there are no errors due to overlapping changes across concurrent Pull Requests. Each PR can then be approved or closed. When approved, the CI schema is automatically deleted by DBT and the changes are merged into production, triggering the CD pipeline, which builds and tests all the models. When closed, the CI schema is simply deleted automatically by DBT.

Implementation in DBT Cloud with Github/Gitlab

Now, if you are interested in implementing a CI/CD flow like ours, it is pretty simple and straightforward to set up in DBT. As mentioned above, all you will need is a DBT Cloud account and a GitLab or GitHub account. The following table shows which type of DBT Cloud plan you need depending on where you host your VCS:

Once you know which plan you need, follow these steps:

  1. Go into your DBT Cloud account -> Profile (top right part of the screen)
  2. Go into Integrations.
  3. Connect to your GitHub or GitLab project.
  4. Go to your Jobs section.
  5. In your CI job (if you don’t have one, create one): go to Settings > Webhooks and check RUN ON PULL REQUESTS? (a sketch of the resulting job commands is shown below)
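
As a reference, a CI job configured this way typically only needs the selector in its Commands section, since (to the best of our understanding) dbt Cloud supplies the deferral flags itself when the job is set to defer to a previous run, such as your production job:

    # Commands entered (one per line) in the dbt Cloud CI job.
    # dbt Cloud compares against the deferred run's manifest.json and builds the
    # PR's models into a temporary schema that is dropped when the PR is merged or closed.
    dbt run  --select state:modified+
    dbt test --select state:modified+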

Conclusion

We have proposed and implemented a solution to run isolated CI jobs using DBT Cloud and GitLab. With this solution, we have been able to ensure the accuracy of our models and the scalability of our development flow, while speeding up the pace at which we develop and keeping our costs as low as possible.
