Setup a Slim CI for dbt with BigQuery and Docker

Published in

Teads Engineering

7 min readDec 14, 2021

dbt is getting increasingly adopted as it is a handy tool to manage data transformations. It really makes sense when teams and projects are growing to industrialize SQL workflows. The project is really taking off and is also ramping up at Teads as a replacement for one of our in-house tools used to run queries on BigQuery.

As dbt projects grow and multiple contributors add and modify queries on a large codebase, continuous integration is a must-have to avoid regressions. However, running every model and test can quickly take up to half an hour and even beyond. Slim CI, is the lightweight version of CI where we’re aiming to only run and test what is relevant to ensure any change doesn’t break anything.

If you’re using the dbt Cloud offer, dbt Labs have already sorted that out for you. However, if you’re using dbt core directly, then it’s worth looking into a CI integration.

Our setup

If you’re working with dbt, it’s likely that you already bootstrapped your dbt project with your different queries following the documentation. We’ll assume in this guide that:

We are using a git platform that supports webhooks (Github, Gitlab, …)
We are connected to the repository with a Continuous Integration platform (Jenkins, GH Actions, Circle CI, …)
We have set up a Docker registry
We have a GCP service account set up with BigQuery rights

That way, we’ll be able to run a test script that will execute the dbt project queries and push the appropriate Docker image to our docker registry.

The big picture is the following:

Retrieve the state of the previous CI run
Run the models & tests for those that changed in the past
Store the updated state for further runs

Database credentials (BigQuery)

In this example, we’ll take BigQuery as the backend for the dbt project. Doing so, the first step is to retrieve the credentials. It’s hard to provide a generic solution to do so, there are multiple secure options to store credentials.

One convenient way we use at Teads is to have a GCP service account dedicated to our dbt project and use AWS secret manager to hold our credentials for our Jenkins (running on AWS). Then, thanks to the instance role and AWS CLI, we can retrieve the credentials directly from the CI script to use them with dbt.

Note: If you’re using a file as a buffer and your CI isn’t running with an ephemeral workspace, don’t forget to clean up at the end of the script.

To run dbt properly on the database, you’ll need to add enough GCP access rights to that service account on the project used for the CI. we’ll need to ensure we have:

roles/bigquery.dataEditor
roles/bigquery.jobUser

Then we can retrieve the credentials through shell commands using AWS CLI Docker image:

We usually work with shell scripts at Teads to run commands but you can adapt it to your runtime environment. I’ll reference that script as the ci.sh in the rest of the article.

Profiles.yml for dbt

As dbt is working with profiles, we need to define the CI profile so that dbt writes in the correct GCP project and appropriate dataset. In our example, we’ll use a keyfile as it’s a convenient way to authenticate a service account. For other authentication methods, we can have a look at the dbt-bigquery documentation. I suggest using an environment variable to determine the key file path as it avoids duplicating the value.

Here how our profiles.yml look like:

It can be stored directly in the dbt project along with the test script.

Dockerignore file

If you’re familiar with docker, you probably know copying the whole folder is convenient but it tends to retrieve way more than necessary. That’s where the .dockerignore file comes into play.

Let’s assume we have a default project and then our .dockerignore look like:

Compiling and running dbt

Once our credentials are ready, we can compile our models and then run & test them. I personally like to use a docker container to store the models as it allows to precompile them for production usage.

Let’s assume the docker image containing the queries is named dbt-queries and we’re going to create a Dockerfile to generate it. We’ll use Docker BuildKit features to avoid storing credentials in the docker image.

We’ll prepare the CI process in 2 steps:

A full run version docker
A slim CI version docker bootstrapped by the full run version

Full Run version

With the full version, we’re going to test every model on each test script execution. Doing so will generate a manifest.json file that will contain our state. It will be useful to build the slim CI version.

Here is the Dockerfile with commands to guide you:

With that Dockerfile, if all steps are successful, we’ll be able to push the resulting image to our docker registry and use it to run the branch in another environment. To do so, we’ll add the following code to our ci.sh script:

As you can see, we usually push both the commit and the branch. Pushing a branch is convenient if you need to test a configuration in a different environment. On the other hand, hashes versions are great if you need to freeze a version while you’re continuing the development in the same branch.

Now you’re done with the full CI approach!

Slim CI version

It really affects the developer’s workflow if we don’t run/test models on pull requests or run/test everything which can take a lot of time once our project reaches a certain size. dbt Core provides selectors which can be combined with graph operators, set operators, and methods. Using all of them together allows us to build a slim CI. Joel Labes wrote a great article as well on slim CI that helped me to improve the operators we use.

Full CI approach is a good starting point for the slim CI version. We’ll update the script to add a few steps:

Copy the state file from the reference manifest
Update the selectors to use the reference state and run CI only on modifications

Altogether these changes look like this:

🎉 And voila!

Now with a proper Github hook, you should be able to trigger a slim CI build. This article doesn’t cover all use cases though. For instance, you might need to use seeds or snapshots. Then you’ll have to do similar commands for those cases. Feel free to share your use cases in the comments.

Few tips to go further

You can adapt the thread numbers to fit the environment running the CI to improve the performance.
Storing the state and the models compiled for production in the same docker image is not optimal, though it’s convenient for small to medium dbt projects. As a project grows or if you would like to take another approach like embedding credentials in the image, then it would justify either splitting the image in two (CI & production) or storing the state on cloud storage (S3, GCS, …)
State comparison is working very well but it’s not perfect and documentation is a great resource to understand the related limitations.
State might not be forward compatible, so if you’re moving from dbt 0.21.0 to 1.0.0, it will fail as is. You are likely to save yourself some time to just run and test everything to start from a fresh up to date state.
To run in production, you’ll just need to prepare the environment (production profiles.yml, GCP credentials, docker image tag to be run, …) so that your docker command will look like the following:

docker run --rm --name my-job-$BUILD_NUMBER \
-v “$(pwd)/profiles.yml”:/usr/app/profiles/profiles.yml \
-v “$GCP_SERVICE_ACCOUNT_JSON_PATH:/auth/gcp-service-account.json” \
-e GOOGLE_APPLICATION_CREDENTIALS=”/auth/gcp-service-account.json” \
-e DBT_PROFILES_DIR=”/usr/app/profiles/” \
-u root \
$REG_URL/dbt-queries:$TAG \
run --models my_model

If you don’t use data-related tests for some models, you can add a macro to limit your inputs for the CI target. Going for a LIMIT 0 will still run the query but at no cost and, obviously, much faster:

{% macro limit_ci() -%}
{% if target.name == 'ci' -%}
LIMIT 0
{%- endif %}
{%- endmacro %}

With that macro added at the end of your Select command, you’ll be able to validate that the query has a valid syntax for SQL & UDFs.

As is, if you run concurrent CI jobs in the same project, you’ll end up with collisions between runs because two changes that are not compatible might be done on a similar model. In that case, a solution is to isolate those runs. One way to do so is to use timestamped runs. dbt macros are a great way to leverage that approach:

On your next runs using ci target, this macro will concatenate the timestamp from the run start time to your model's output aliases. One caveat to that approach is that you will create a lot of tables, so we apply a table expiration setting to the ci destination dataset. Another possible way would be to use a post-run hook.

That’s all for now, don’t hesitate to give us some feedback in the comments.

🎁 If this article was of interest, you might want to have a look at BQ Booster, a platform I’m building to help BigQuery users improve their day-to-day.

Home

You never know where to start to analyze your queries? The query analyzer tool allows you to quickly browse your…

bqbooster.com

Also I’m building a dbt package dbt-bigquery-monitoring to help tracking compute & storage costs across your GCP projects and identify your biggest consumers and opportunity for cost reductions. Feel free to give it a go!

GitHub — bqbooster/dbt-bigquery-monitoring: dbt package to monitor BigQuery assets (tables &…

dbt package to monitor BigQuery assets (tables & queries) — bqbooster/dbt-bigquery-monitoring

github.com

Many thanks to Benjamin Davy and my colleagues for the review.