Building dbt CI/CD at scale

Damian Siemieniak
Checkout.com Tech Blog
10 min read · Apr 25, 2023

At Checkout.com we rely heavily on Data Build Tool (dbt) for data transformation: a SQL-based open-source tool that enables data analysts and engineers to transform, test and document data in the cloud data warehouse, which in our case is Snowflake. With over 27 dbt projects focusing on different aspects of the business, 100+ dbt developers contributing to these projects every week, 1000+ models running and a large volume of data flowing through each day (around 7bn API interactions per year), we are often faced with unique data challenges. dbt provides a hosted solution, dbt Cloud. While this is a powerful tool for managing data pipelines, it offers less customisation than running dbt on your own infrastructure. Therefore, to meet the requirements of each team, while ensuring only the highest quality data reaches the production environment and the data standards our stakeholders rightfully demand are met, we decided to develop our own dbt CI/CD pipeline architecture. We found that by leveraging GitHub Actions with Slim CI and customised workflow scripts we can enrich our dbt pipelines to meet the needs of all of the teams across Checkout.com.

This article will cover what CI/CD is, our linter of choice, how we use an S3 bucket to take advantage of Slim CI, and the overall design of the dbt pipeline, within which we pause Airflow jobs and minimise our Snowflake costs in various ways.

dbt CI/CD overview

Continuous integration (CI) and continuous deployment (CD) within the dbt sphere refer to the data modelling development lifecycle. In short, it is an endless loop of steps aimed at improving the data in production. Each time a dbt developer writes code, it is first built and compiled to test their changes and look for potential bugs. If correct, the changes are deployed to production so that end users benefit from the adjusted data. The results are then monitored and feedback is collected to plan further improvements and repeat the cycle. You may have come across the graph below, which visually explains the process:

Figure 1: CI/CD process

Pipeline overall design

The dbt CI/CD pipeline is centrally managed within the company by the Data Platform team, which focuses on maximising the time business stakeholders spend on solving business problems and minimising the time spent on technical details around the implementation, deployment and monitoring of their solutions. We store the workflow scripts in a shared CI/CD library using GitHub Actions reusable workflows, so that teams do not need to copy and paste workflows across their repositories. Within their dbt repositories, teams only need to specify a handful of workflow variables, which gives them flexibility around properties of the workflow configuration such as the dbt version or dbt seeds deployment.

The centrally managed library additionally allows everyone to contribute to the workflow configuration, and teams automatically benefit from any additional features.
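
To illustrate how this looks from a team's perspective, below is a minimal sketch of a caller workflow invoking a shared reusable workflow. The repository path, workflow file name and the deploy-seeds input are illustrative placeholders; only the dbt-version input appears later in this article.

name: dbt CI

on:
  pull_request:

jobs:
  dbt-ci:
    # Path and input names below are illustrative placeholders
    uses: your-org/shared-ci-cd-library/.github/workflows/dbt-ci.yml@main
    with:
      dbt-version: '1.3.1'
      deploy-seeds: false
    secrets: inherit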

Before we dive deeper into the dbt workflow configuration, there are a few important things to highlight. The first is that our dbt commands run inside the officially published dbt container, which you can easily set up by adding the code below to your workflow job.

container:
  image: ghcr.io/dbt-labs/dbt-snowflake:${{ inputs.dbt-version }}
  credentials:
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}

Second, we use Slim CI because it enables faster CI feedback and reduces warehouse costs. It does so by running and testing only the set of new and modified resources within a raised pull request (PR), instead of everything.
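
At its core, Slim CI relies on dbt's state-based selection. A minimal sketch of the idea, assuming the previous manifest has been placed in a ./state directory (the artifact handling is described below), might look like this:

- name: Build new and modified models only
  # state:modified+ selects changed resources plus everything downstream of them
  run: dbt build --select state:modified+ --state ./state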

Third, within our CI workflow we make use of GitHub Actions concurrency to ensure that only a single job or workflow in the same concurrency group runs at a time. This additionally limits data warehouse costs if a developer pushes extra changes while the previous workflow is still running.

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

Finally, Slim CI needs to compare the dbt manifest file from the last successful run with the one for the current run; this is how dbt determines what has changed. We therefore publish a new manifest file to an AWS S3 bucket each time a PR is merged, as well as each time we deploy our models to the production environment. Publishing two separate manifests allows teams to raise multiple PRs and ensures modified models are always correctly picked up, independently between PRs and deployments. The only issue with this design is that, in a containerised environment, GitHub runners lose the AWS Identity and Access Management (IAM) policy needed to access the S3 bucket. Therefore, we run three jobs that make use of the upload-artifact GitHub Action:

- uses: actions/upload-artifact@v3
  with:
    name: dbt-manifest
    path: ${{ github.workspace }}/artifact/manifest.json

And the download-artifact action to pass manifest files between the workflow jobs:

- uses: actions/download-artifact@v3
  with:
    name: dbt-manifest
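
As an example, the first of those jobs runs outside the dbt container, where the runner still has its IAM permissions, pulls the latest manifest from S3 and republishes it as a workflow artifact for the containerised dbt job to consume. The bucket name and paths below are placeholders:

fetch-manifest:
  runs-on: ubuntu-latest
  steps:
    # Runs outside the dbt container, so the runner's IAM access to S3 still works
    - name: Download latest manifest from S3
      run: aws s3 cp s3://your-dbt-artifacts-bucket/your-project/manifest.json ./artifact/manifest.json
    - uses: actions/upload-artifact@v3
      with:
        name: dbt-manifest
        path: ./artifact/manifest.json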

When it comes to the CI build, the process happens in a few steps, and depending on whether a team provides a deployment script or not, the workflow performs slightly different actions, which I will explain later.

Continuous Integration

Figure 2: Continuous integration PR

SQL linter

As a first step, we recognise the need for a consistent SQL / dbt code style within our organisation, to limit the time our dbt developers spend understanding code and to stay consistent across projects. The linting tool we chose is SQLFluff, the most starred SQL linter on GitHub, which lets us customise SQL rules. To minimise the Snowflake cost of running models on each push to the feature branch, the dbt build is not triggered while the PR is in draft mode. However, the linter job does check the correctness of the code, giving developers feedback before they raise their PRs for review.

Figure 3: Linter failure

The output of the job clearly highlights what needs to be changed to adhere to the applied rules.

Figure 4: Linter output

Once everything has been adjusted and the linting job runs successfully, our developers raise their PRs to start the dbt CI process and get a review from their fellow project teammates.
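
For reference, a stripped-down version of such a linting job could look like the sketch below; our real configuration additionally pins versions and defines project-specific rules in a SQLFluff config file:

lint:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v3
    - uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    - name: Install SQLFluff
      run: pip install sqlfluff
    - name: Lint dbt models
      run: sqlfluff lint models --dialect snowflake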

Validating model changes

As mentioned earlier, our model validation job uses Slim CI to run checks on modified models. However, to give our dbt developers flexibility, we also give them the option to run custom dbt scripts in which they can specify the models they would like to test and deploy. This also allows them to run dbt run-operations, such as adding a new column without performing a full refresh, or refreshing a table from a specific date onwards. They may do so by providing deployment script(s) within the deploy/scripts directory, which the workflow run checks to see whether anything has been added or modified. We retrieve this information by performing a more complex version of a git diff command:

Figure 5: Deployment script(s) check
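
A simplified version of that check (the real command shown in Figure 5 is more involved) could look like the following:

- uses: actions/checkout@v3
  with:
    fetch-depth: 0    # full history so we can diff against the base branch
- name: Check for deployment scripts
  id: deploy-scripts
  run: |
    # List files added or modified under deploy/scripts compared to the PR base branch
    CHANGED=$(git diff --name-only origin/${{ github.base_ref }}...HEAD -- deploy/scripts/)
    if [ -n "$CHANGED" ]; then
      echo "found=true" >> "$GITHUB_OUTPUT"
    else
      echo "found=false" >> "$GITHUB_OUTPUT"
    fi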

Based on this, the CI workflow run performs slightly different actions to validate model changes.

Figure 6: Continuous integration build

The feature database is created to safely validate modifications against the production data while ensuring that the production data is not touched prior to the PR approval and deployment. Once all checks have been performed, this database is dropped. To limit the time spent on database creation, our developers use a retention period. You may find out more about it in this article written by one of my colleagues.

As the diagram above shows, the validation process first runs models to create tables/views, then runs incremental checks and finally runs dbt tests. Once all checks have passed and the PR has been merged, the workflow adds a few more steps on the main branch.

Figure 7: Continuous integration push

The first visible one is that, on a push to the main branch, a new manifest file is uploaded to the S3 bucket. The second, happening in the background, is the creation of a new Git tag.

- name: Git Tag
  if: ${{ github.ref == 'refs/heads/main' }}
  run: |
    RELEASE_VERSION="v3.${{ github.run_number }}"
    git tag $RELEASE_VERSION
    git push -f origin $RELEASE_VERSION
  shell: bash

Tagging allows us to manage and handle releases. We use it as a version control mechanism to track and manage changes made to the code over time. In case of any bugs or issues with the latest deployment, we can easily roll back to the previous release. This helps to minimise downtime and reduces the risk of losing valuable work.

However, the Git tag is not created if the model validation process doesn't pass, which highlights the third difference on the main branch: on a push, all modified models are checked even if a deployment script is provided. We do so because, while developers might want to deploy only some of their models, we want to ensure that only pushes with working models are tagged. If that is not the case, the workflow run fails and the Git tag step is skipped, blocking developers from deploying faulty changes.

Deployment process

Figure 8: Continuous deployment release logs

With GitHub Actions inputs you can pass an almost endless list of variables into your workflow to enrich your deployment configuration. Within Checkout.com we only pass a few during the deployment process, which we find useful to limit the complexity.

Figure 9: Continuous deployment variables

The first one reflects the environment to which our developers would like to deploy their changes. While GitHub Actions allows selecting a branch for your workflow, we have checks in place that allow deploying to production only from the main branch.

Refreshing dbt seeds is also optional, since projects within Checkout.com rarely use them. Finally, the large deployment option gives dbt projects the possibility to scale up their warehouse during the deployment. We offer this option because Snowflake charges based on the number of virtual warehouses used, how long they run, and their size. Consequently, we found that in some cases, when dbt developers know their deployments will take an extensive amount of time, it is cheaper to scale up the virtual warehouse size and limit the time spent. For that reason, if set to True, the workflow reads the pre-deployment virtual warehouse size and saves it as an environment variable, scales the warehouse up for the deployment, and then, in a separate job, always scales it back to the pre-deployment size, even if a deployment error occurred:

Figure 10: Scaling up of the virtual warehouse size
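
As an illustration, the deployment inputs and the always-run scale-back job could be sketched roughly as below. The input names, the scale_warehouse macro and the deploy job output are hypothetical placeholders; the key points are the workflow_dispatch inputs and the if: always() condition that guarantees the warehouse is scaled back even when the deployment fails.

on:
  workflow_dispatch:
    inputs:
      environment:
        description: Target environment
        required: true
        type: choice
        options: [dev, prod]
      refresh-seeds:
        description: Refresh dbt seeds
        type: boolean
        default: false
      large-deployment:
        description: Temporarily scale up the warehouse for this deployment
        type: boolean
        default: false

jobs:
  # ... deploy job saves the pre-deployment warehouse size as a job output ...
  scale-down:
    needs: deploy
    if: always()    # scale back even if the deployment job failed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # In practice this step would run inside the dbt container shown earlier
      - name: Restore pre-deployment warehouse size
        run: dbt run-operation scale_warehouse --args '{size: ${{ needs.deploy.outputs.original-size }}}'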

If you followed the CD release logs carefully, you might have also noticed information regarding the latest release branch within Airflow. As part of the deployment, not only are the tables in our data warehouse refreshed, but the latest Git tag created during CI is also updated within Airflow. It is saved as an Airflow variable that is later consumed by an Airflow DAG to refresh tables on a schedule. Each dbt project has a designated branch variable to distinguish between projects.

Figure 11: GitHub tag saved as an Airflow variable
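
There are several ways to update an Airflow variable from a workflow; one option is the Airflow stable REST API, sketched below with placeholder credentials and variable name:

- name: Update the release branch variable in Airflow
  run: |
    # RELEASE_TAG holds the latest Git tag created during CI; variable name is a placeholder
    curl --fail -sS -X PATCH \
      "${AIRFLOW_URL}/api/v1/variables/your_project_release_branch" \
      --user "${AIRFLOW_USER}:${AIRFLOW_PASSWORD}" \
      -H "Content-Type: application/json" \
      -d "{\"key\": \"your_project_release_branch\", \"value\": \"${RELEASE_TAG}\"}"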

To prevent two versions of the code from running at the same time, we first pause all relevant DAGs based on the project name value and resume them after the tag update.

Figure 12: Workflow step waiting for Airflow DAGs to finish their runs

By default, the process waits for DAGs to finish their runs if any are in progress. Nevertheless, teams have the option to specify within their workflows that DAGs should be killed regardless. They might prefer this option when their DAGs start to run immediately after a previous run, making deployments very difficult. As you can see in the picture above, it took nearly 1 hour and 50 minutes to pause the DAGs, while below it took only 11 seconds.

Figure 13: Workflow step killing Airflow DAGs
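
Again using the Airflow REST API purely as an illustration, pausing a DAG and waiting for its in-flight runs to finish could be sketched like this (a single hypothetical DAG id is shown for brevity; the real step iterates over all DAGs matching the project name):

- name: Pause DAG and wait for running DAG runs
  run: |
    # Pause the DAG so no new runs are scheduled during the deployment
    curl --fail -sS -X PATCH \
      "${AIRFLOW_URL}/api/v1/dags/${DAG_ID}?update_mask=is_paused" \
      --user "${AIRFLOW_USER}:${AIRFLOW_PASSWORD}" \
      -H "Content-Type: application/json" \
      -d '{"is_paused": true}'
    # Default behaviour: poll until no run of this DAG is still in progress
    while [ "$(curl -sS --user "${AIRFLOW_USER}:${AIRFLOW_PASSWORD}" \
        "${AIRFLOW_URL}/api/v1/dags/${DAG_ID}/dagRuns?state=running" \
        | jq '.total_entries')" -gt 0 ]; do
      sleep 60
    done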

As a final step, the new deployment manifest file is uploaded to the designated S3 bucket, from which it is consumed during future deployments, and voilà! The latest changes have been checked and successfully deployed to benefit the stakeholders.
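
The upload mirrors the download shown earlier: a job outside the dbt container fetches the manifest artifact produced by the deployment and copies it to S3 (the bucket name and path are placeholders):

- uses: actions/download-artifact@v3
  with:
    name: dbt-manifest
- name: Publish deployment manifest to S3
  run: aws s3 cp manifest.json s3://your-dbt-artifacts-bucket/your-project/deploy/manifest.json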

Final thoughts

Overall, dbt Cloud is a powerful tool for managing data pipelines, but like any platform it has its limitations. It is important to evaluate your specific needs and use cases to determine the best solution for your team. We found that a customised dbt workflow leveraging GitHub Actions brings the most value to our organisation. It allows us to combine Slim CI with enriched options such as customised scripts, data warehouse alter options and support for the different technologies that we use. On top of the above-mentioned Snowflake, Airflow and S3 integrations, you may incorporate AWS Secrets Manager to store and rotate your passwords. We additionally take advantage of the dbt artifacts uploaded to the S3 bucket as metadata for other external tools. In our case, we extended their usage to DataHub as our data cataloguing service (look out for the upcoming article around this!) and Monte Carlo, which we adopted as our data observability platform to further ensure data quality standards across our organisation are met.
