The Ultimate Guide to Continuous Integration with dbt (CI/CD, Data) [IN PROGRESS]

Understand how to leverage Git actions to merge your data from Dev into Main at the speed of light

Hugo Lu
Apr 9, 2024

Introduction

I’m conscious that there isn’t really a set-in-stone way to run Continuous Integration for dbt projects. I’m also aware it varies significantly from warehouse to warehouse, and providers such as Y42 actually ask customers to pay for this as a standalone service.

There are some basic principles that we need to be aware of here, and these are detailed to the best of my knowledge in this article. I very much welcome feedback and intend for this article to be an ongoing resource that is continually updated.

Please reach out to me on LinkedIn or the dbt Slack channel if you’re interested in speaking about this and contributing, and I will mention you at the end.

Cheers!

Hugo

Interested in running dbt-core on ECS but still want to keep end-to-end lineage and monitoring? Check out Orchestra.

Dev Environment

A Dev Environment should allow the user to easily create ephemeral branches in the form of dev_{user}_{id}.
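
As a minimal sketch of that naming convention, the helper below builds a name of the form dev_{user}_{id}. The function name and the sanitising rule are assumptions made for illustration; adapt them to whatever identifiers your warehouse and Git provider accept.

```python
import re


def dev_env_name(user: str, pr_id: int) -> str:
    """Build an ephemeral environment name such as dev_hugo_42 (hypothetical helper)."""
    # Lower-case the user and replace anything that is not a letter,
    # digit or underscore, so the result is safe as an unquoted identifier.
    safe_user = re.sub(r"[^a-z0-9_]", "_", user.lower())
    return f"dev_{safe_user}_{pr_id}"


print(dev_env_name("Hugo.Lu", 42))
```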

When selecting from source tables, models should be created via a clone command. This applies to all tables affected by the Pull Request. These models should be created in a dev environment.
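
One way to sketch the clone step, assuming dbt-core 1.6 or later (which ships a `dbt clone` command) and a folder of production artifacts to compare against. The function name, the `dev` target and the model names are hypothetical; which models to pass in is left to the caller.

```python
def dbt_clone_command(models: list[str], state_dir: str, target: str = "dev") -> list[str]:
    """Assemble a `dbt clone` invocation for the given models (illustrative only)."""
    # --state points at the artifacts of the environment to clone from;
    # --target is the profile target that receives the clones.
    return [
        "dbt", "clone",
        "--select", *models,
        "--state", state_dir,
        "--target", target,
    ]


# The resulting list could be handed to subprocess.run(...) in a CI job.
print(" ".join(dbt_clone_command(["orders", "customers"], "prod-artifacts")))
```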

[TBA]

Merge to Main

On Merge to Main, the goal of the Git Action should be to determine which models have been affected by a Pull Request. This should be done by parsing the SQL submitted in some way, and possibly by examining dbt artifacts.
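
To illustrate the "examining dbt artifacts" route, here is a simplified sketch that compares the file checksums in two manifest.json artifacts, one from main and one from the Pull Request. It is an assumption-laden approximation: dbt's own `state:modified` selector performs a more thorough comparison (configs, macros and so on), so treat this as a way to see the idea rather than a replacement for it.

```python
def modified_models(base_manifest: dict, pr_manifest: dict) -> list[str]:
    """Return unique_ids of models that are new or whose file checksum changed."""
    base_nodes = base_manifest.get("nodes", {})
    changed = []
    for unique_id, node in pr_manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, etc.
        base_node = base_nodes.get(unique_id)
        new_sum = node.get("checksum", {}).get("checksum")
        old_sum = base_node.get("checksum", {}).get("checksum") if base_node else None
        if base_node is None or new_sum != old_sum:
            changed.append(unique_id)
    return sorted(changed)


# Tiny hand-written manifests standing in for the parsed manifest.json files.
base = {"nodes": {"model.shop.orders": {"resource_type": "model", "checksum": {"checksum": "a"}}}}
pr = {"nodes": {
    "model.shop.orders": {"resource_type": "model", "checksum": {"checksum": "b"}},
    "model.shop.customers": {"resource_type": "model", "checksum": {"checksum": "c"}},
}}
print(modified_models(base, pr))
```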

Firstly, the dev models should be cloned into a new staging environment. This allows the tables in the Dev Stage to remain in their current state, should they need to be subsequently edited.

The Git Provider should then rematerialise and test upstream models accordingly. The depth of this should be parameterisable (how many models down do we go?), and we should be able to add an “exclude” flag (which excludes a specified model and anything downstream of it).
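
The depth and exclude parameters map fairly naturally onto dbt's node selection syntax, where `+N` after a selector limits how many levels down to go and `model+` means a model plus everything downstream. The sketch below assembles such a command; the function name, the `prod-artifacts` folder and the `legacy_orders` model are hypothetical, and the choice of `dbt build` with `state:modified` is an assumption about how the job would be wired up.

```python
from typing import Optional


def dbt_build_command(depth: Optional[int], exclude: list[str], state_dir: str) -> list[str]:
    """Assemble a `dbt build` call for modified models and their descendants."""
    if depth is None:
        suffix = "+"          # all descendants
    elif depth == 0:
        suffix = ""           # only the modified models themselves
    else:
        suffix = f"+{depth}"  # descendants up to `depth` levels down
    args = ["dbt", "build", "--select", f"state:modified{suffix}", "--state", state_dir]
    if exclude:
        # Exclude each named model together with anything downstream of it.
        args += ["--exclude", *[f"{name}+" for name in exclude]]
    return args


print(" ".join(dbt_build_command(2, ["legacy_orders"], "prod-artifacts")))
```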

The Git Action should be intelligent enough to know when a full refresh is needed.
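
What "intelligent enough" means will differ between teams. One possible heuristic, offered purely as an assumption, is that any modified model materialised as incremental should be rebuilt with dbt's `--full-refresh` flag. The sketch below reads the materialisation from a parsed manifest.json; the function name and sample data are hypothetical.

```python
def models_needing_full_refresh(manifest: dict, modified_ids: list[str]) -> list[str]:
    """Return the modified models that are materialised as incremental."""
    nodes = manifest.get("nodes", {})
    return [
        unique_id
        for unique_id in modified_ids
        if nodes.get(unique_id, {}).get("config", {}).get("materialized") == "incremental"
    ]


manifest = {"nodes": {
    "model.shop.events": {"config": {"materialized": "incremental"}},
    "model.shop.orders": {"config": {"materialized": "table"}},
}}
print(models_needing_full_refresh(manifest, ["model.shop.events", "model.shop.orders"]))
```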

On success, the Git Action should essentially replace the production tables with the staging tables by leveraging zero copy cloning.
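
As a rough sketch of that promotion step, assuming a Snowflake warehouse (where `CREATE OR REPLACE TABLE ... CLONE ...` performs a zero copy clone), the helper below generates one statement per table. The schema and table names are hypothetical, and the identifiers are assumed to come from trusted CI configuration rather than user input.

```python
def promotion_statements(tables: list[str], staging_schema: str, prod_schema: str) -> list[str]:
    """Generate Snowflake statements that replace prod tables with staging clones."""
    return [
        f"CREATE OR REPLACE TABLE {prod_schema}.{table} CLONE {staging_schema}.{table};"
        for table in tables
    ]


for statement in promotion_statements(["orders", "customers"], "staging", "analytics"):
    print(statement)
```

Other warehouses expose cloning differently, so the generated SQL would need adjusting outside Snowflake.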

After this is completed, the Git Action may trigger downstream processes, e.g. dashboard refreshes.

[TBA]

Multi-person team approach

With multiple people, it becomes important to understand PRs in the context of a queue. A queue is leveraged to handle multiple pull requests, all essentially requests to merge to main, in an orderly fashion.
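
The queue idea can be sketched in a few lines. This is a toy first-in, first-out model, not how any particular Git provider implements merge queues; `run_ci` is a hypothetical callback standing in for "build and test this Pull Request against main as it currently stands".

```python
from collections import deque
from typing import Callable, Iterable


def process_merge_queue(
    pr_ids: Iterable[int], run_ci: Callable[[int], bool]
) -> tuple[list[int], list[int]]:
    """Process PRs one at a time, in arrival order; return (merged, rejected)."""
    queue = deque(pr_ids)
    merged: list[int] = []
    rejected: list[int] = []
    while queue:
        pr = queue.popleft()  # oldest request first
        if run_ci(pr):
            merged.append(pr)    # main now includes this PR
        else:
            rejected.append(pr)  # sent back to its author
    return merged, rejected


# Pretend that PR 2 fails its checks.
print(process_merge_queue([1, 2, 3], lambda pr: pr != 2))
```

Processing one request at a time means each Pull Request is validated against a main branch that already contains every earlier merge, at the cost of throughput.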

[TBA]

Large team approach

[TBA]

Multi-Repositories

[TBA]

Cost considerations

[TBA]

Resources

  • Orchestrating dbt CI/CD workflows with Gitlab — link
  • Gitlab dbt handbook — link
  • dbt Cloud Slim CI — link


Hugo Lu - I write about how to be good at Data engineering and do the coolest data stuff. I am the CEO @ Orchestra, a data release pipeline management platform