You’re running dbt in CI, now what?

CI provides a ‘data-airlock’ to check your data before merging

Dave Flynn
In the Pipeline
4 min read · Jun 7, 2024


Have you seen the excellent series of LinkedIn posts by prominent dbt community member Bruno Souza de Lima about running dbt Core in CI/CD? It’s been great to follow Bruno’s sample repo as it evolves in response to community feedback.

The workflows that Bruno created will help with continuous deployment (CD) and continuous integration (CI) for your dbt project.

  • The continuous deployment workflow builds your dbt project into production each time changes are merged into main.
  • The continuous integration workflow builds your modeling changes into a temporary, PR-specific schema (see the sketch after this list).
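To make that concrete, here’s a minimal sketch of what the CI side could look like with GitHub Actions and dbt Core. The adapter, the target name, and the DBT_PR_SCHEMA_ID variable are my own illustrative assumptions, not Bruno’s exact setup:

```yaml
# .github/workflows/dbt-ci.yml (a minimal sketch, not a drop-in workflow)
name: dbt CI

on:
  pull_request:

jobs:
  build-pr-schema:
    runs-on: ubuntu-latest
    env:
      # Hypothetical variable read by profiles.yml to name the PR schema,
      # e.g. schema: "pr_{{ env_var('DBT_PR_SCHEMA_ID') }}"
      DBT_PR_SCHEMA_ID: ${{ github.event.number }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake   # swap in your adapter
      - run: dbt deps
      # Build the branch's models into the temporary, PR-specific schema
      - run: dbt build --target ci
```

The CD twin is the same idea, triggered by a push to main and pointed at a prod target.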

So, why do you need these workflows for your dbt project? Can’t you just make a change and then merge it to main?

Well, yes, but (and it’s a big but 👀) if you do that, you’re living dangerously, and it’s only a matter of time before you let some bad transformations through and screw up your prod data.

An airlock between dev and prod

By having these two workflows, you can segment the process of applying changes to your dbt project, which gives you a chance to check things before merging. Like an airlock between dev and prod.

Don’t let them in!

The PR schema generated in CI provides a temporary home for your newly modified data: a space where you can perform a proper review of the data before you merge to main. The review is done by:

  • comparing historical data values between prod and the PR schema to check that no unexpected impact occurred;
  • and validating the data generated by your modeling/code changes (sketched below).
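As a sketch of that first check, assuming Snowflake-flavored SQL and placeholder relation names (analytics.orders for prod, pr_123.orders for the PR schema), a set-difference query in an extra CI step will surface rows that changed:

```yaml
      # Hypothetical extra CI step; relation names are placeholders.
      - name: Spot-check prod vs. PR schema
        run: |
          dbt show --inline "
            select * from analytics.orders
            except
            select * from pr_123.orders
          " --limit 10
```

Ten mismatched rows won’t prove anything on their own, but zero rows coming back on models you didn’t mean to touch is a cheap early signal.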

This is a simple yet effective configuration that allows for data checks, but for certain projects you will need a more robust process.

Trying to hit a moving target

The success of using separate environments to validate data model changes relies on your ability to compare those environments. Production data can be a moving target, so it’s not a reliable base to compare against.

Comparing dev against prod is like trying to hit a moving target

You’ll also need a more reliable configuration of this workflow if your dbt project:

  • is very large;
  • is frequently modified;
  • ingests new data frequently;
  • has multiple PRs open for longer periods;
  • or feeds business-critical applications.

To successfully compare environments, the data needs to be the same: there can’t be any inherent inconsistency in the data. That’s where some deft use of staging environments comes into play.

Immutable base environment

A more robust solution involves creating an immutable base environment that will act as the base for your comparison, and the source of data for your PR-branch models. In the simplest form, the steps are:

  1. Create a regular snapshot of production
  2. Clone that production snapshot to a PR-specific schema to be the base environment
  3. Generate your branch data models from this base
  4. Compare your branch to the base for impact assessment

It’s a more complicated process, but the effort is worth it, especially if you simply cannot afford a bad merge.

Create an immutable base env for data checks

Each time CD runs and deploys to production, a staging environment is created from a subset of the data: a day, a month, whatever is suitable for your project.
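As an illustration, assuming Snowflake and the snowsql CLI, with the schema names and the 30-day window as placeholders, the tail end of the CD job might look like:

```yaml
      # Hypothetical final CD step: rebuild the staging snapshot from a
      # recent slice of prod. Names and the time window are placeholders.
      - name: Refresh staging snapshot
        run: |
          snowsql -q "
            create or replace table staging.orders as
            select * from analytics.orders
            where order_date >= dateadd(day, -30, current_date);
          "
```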

Here’s the neat trick: when a PR is opened, the staging env from the CD job is cloned into a PR-specific base env. It’s from this base env that the branch models are built.
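Sketching that under the same assumptions (Snowflake-style zero-copy cloning, placeholder schema names), the CI job that runs when the PR opens could do:

```yaml
      # Hypothetical CI steps. The clone is zero-copy on Snowflake, so the
      # PR-specific base schema is cheap to create and is never mutated.
      - name: Clone staging into an immutable PR base
        run: |
          snowsql -q "create schema pr_${{ github.event.number }}_base
                      clone staging;"
      # Branch models read from the base and build into the PR schema
      - name: Build branch models from the base
        run: dbt build --target ci
```

If you’d rather stay inside dbt, dbt Core 1.6+ also ships a dbt clone command that creates the same kind of copies from a previous run’s artifacts.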

This ensures that you have a static base environment to compare your branch models against. The data will be exactly the same, so the results of any data validations will be accurate: you’ll be able to verify your work and spot any unintended impact.

Conclusion

This is a high-level overview of the process, but keep an eye out for more hands-on technical examples of how this would work in practice. If you have any feedback, adjustments, or better methods, please post a comment or send me a message on LinkedIn.

In Summary:

  • Comparing dev against production can be like trying to hit a moving target
  • Create regular staging snapshots that are a subset of production data
  • Clone the snapshot for each PR
  • Build your branch models from the cloned snapshot
  • Use the cloned snapshot as the base for your data checks


Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects