dbt at Super Part 2: Continuous Integration

Jonathan Talmi
Super.com
10 min read · Oct 12, 2022


Introduction

The first post in this series covered orchestration for dbt projects, including the concepts of proactive and reactive orchestration. I explained our use cases for both orchestration modes and how we deploy both as efficiently as possible. It’s been exciting to see orchestration become a conversation topic in the data community over the past few months. Just to highlight a few examples:

  • Dagster introduced Software-Defined Assets and reconciliation-based orchestration, which facilitate the rematerialization of assets like dbt models based on observed changes in the asset’s metadata and upstream dependencies
  • dbt introduced the source_status selection method to dbt Core, which enables users to trigger proactive dbt pipelines based on the status of upstream data
  • Benn Stancil wrote a blog post about how data assets, and their desired freshness, should drive orchestration — and not the other way around

It’s increasingly looking like many of the complexities and redundancies I covered in the first post are being actively worked on and may not be long for this world — which I welcome with open arms!

This post will “shift-left” to the time before the data asset — in this case, a dbt model — gets materialized in production. It will cover Continuous Integration (CI), code linting, and dbt project organization, showcasing the system built at Super (FKA Snapcommerce) to manage the complexity of our dbt project as we scaled to hundreds of models and dozens of users.

What is continuous integration?

If you’re reading this blog post, there’s a good chance you’ve encountered Continuous Integration (CI) before, even if you don’t know it by name. CI encapsulates a set of practices engineering teams use to facilitate collaboration and rapid iteration for software projects. These practices promote frequent contributions to a shared code repository with automated testing in place to detect errors before they enter production.

In the context of dbt projects, many standard elements of CI have already been widely adopted by the dbt community, like maintaining a single source code repository, and automated builds using virtual environments and Docker. CI can also be used to automatically run tests and linting checks whenever dbt code is pushed to the repository and before it gets merged into main.
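Concretely, a pipeline like this can be expressed in any CI system. Here is a hypothetical GitHub Actions sketch, not our exact setup; the adapter (dbt-snowflake), artifact path, and credential handling are all illustrative:

name: dbt-ci
on: pull_request

jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      # warehouse credentials would be injected via secrets here
      - run: pip install dbt-snowflake sqlfluff diff-cover
      - run: dbt deps
      # fail fast on compilation and config issues
      - run: dbt --warn-error parse
      # build and test only new or modified models, deferring the rest
      - run: dbt build -s state:modified+ --defer --state prod-artifacts/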

For example, let’s say your company wants to start incorporating revenue from commissions into its net revenue calculation. You make a change in the mart_financial_metrics dbt model, which shows up on GitHub like this:

- sale_price - supplier_cost                        AS net_revenue,
+ sale_price - supplier_cost + (0.05 * sales_price) AS net_revenue,

Your code change is reviewed but you don’t have a CI process set up, so you merge it in after approval. The new codebase gets picked up automatically by an orchestrator like dbt Cloud, which starts running your updated model. Suddenly, you see alerts firing in the #dbt-alerts channel. You check the logs, and see the mart_financial_metrics model failing due to a compilation error. You double-check the code and discover that you accidentally wrote sales_price instead of sale_price when adding commission revenue, leading to the error. You quickly submit a fix.

This scenario, and the associated stress, can be prevented with a relatively simple CI process that runs and/or tests the changed dbt model. Other small misconfigurations, like a YAML input error, can be detected even earlier by simply compiling the dbt project. Finally, more advanced tests, tools, and techniques can be used to assess regressions which are harder to detect, like the downstream impacts of a relatively innocuous change in business logic.

Continuous Integration for dbt projects

What are we trying to do in CI for dbt projects? The answer is not as obvious as you might think. I’ve seen a multitude of approaches across the industry to running and testing dbt code before it enters production. Here are a few examples:

  • Some simply compile their dbt project, and if successful, deploy it to a staging environment for manual testing and QA.
  • Some build and test modified models in a temporary environment using an approach called Slim CI.
  • Some (like the Shopify data team) don’t test data in CI, but test the model logic alone through unit testing.

For this post, I will focus on the dbt jobs, Python scripts, and data warehouse configurations we use to accomplish our version of a robust CI process. I will first talk about testing, and cover the different types of tests we use in our dbt project (data tests and unit tests) and the ones we enforce on every model. I’ll then discuss linting, and how to enforce a SQL style guide in your repo. Finally, I’ll discuss project-level checks which ensure our data model and project structure remain intact with every new addition.

Compile

Even the most simplistic CI process must fulfill this basic requirement: Ensure the dbt project compiles! Projects fail to compile when there is misconfigured YAML or Jinja, like references to a model, source, or macro that doesn’t exist. If your project fails to compile, it won’t run in your orchestrator, so it’s imperative you ensure compilation before any new code gets merged in. This can be checked using:

dbt ls > /dev/null      # option 1
dbt --warn-error parse # option 2

The --warn-error flag is optional, and will raise an error any time dbt would normally warn. Warnings are raised when there are YAML configs for nodes that don’t exist, so including this flag helps keep your dbt project nice and tidy.

Compilation errors are surfaced during run, test, or build, but we run the above commands early in our CI pipelines to fail fast when there is a simple compilation issue. Specifically, we run dbt --warn-error parse in our lint stage, and the following in our test stage:

dbt --warn-error ls --select state:modified+ --resource-type model

This helps developers quickly identify the models that will be built in the CI pipeline, and catches cases where a developer forgot to include new models in their commits.

Run and test dbt models

A central tenet of continuous integration is that tests should run in a clone of the production environment. This ensures that bugs which would occur in production are uncovered and fixed in the test environment. Luckily for us, cloud data warehouses make it easy to replicate production into an infinite number of CI pipelines, and dbt makes it even easier with Slim CI.

The most important part of your CI process after checking compilation is building and testing your new and modified models in a test environment in your warehouse. Building models ensures that whatever SQL you’ve added or augmented can compile and is free of syntax errors. Testing these models ensures that your new and modified models all satisfy the data quality or business logic constraints you’ve set for them. We run the following dbt job in CI:

dbt build -s state:modified+ --full-refresh --defer

This command builds and tests all modified seeds, models, and snapshots in order of their position in the dbt DAG, which has the nice feature of failing the pipeline early when there is a test failure. Note that the state:modified selector and the --defer flag both require a --state path pointing at manifest artifacts from a recent production run.

When a model has a lot of children, CI pipelines can take a long time. We use a few techniques to speed up development pipelines, but some organizations limit the scope of the state:modified+ selector to only include immediate children.
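For example, recent versions of dbt support depth-limited graph operators, so a build restricted to first-degree children would look something like:

dbt build -s state:modified+1 --full-refresh --defer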

Test coverage

Testing for dbt projects can be roughly bucketed into two categories:

  • Data testing: Tests on top of dbt models built using production data, typically investigating data quality like uniqueness, nullability, and consistency.
  • Unit testing: Tests evaluating model logic by running mock input data through the model and comparing it against expected output.

Both of these types of tests leverage dbt’s built-in testing framework, but only data testing is natively supported by dbt-core, for now.

Over the last year or so, there has been an explosion in unit testing frameworks for dbt, signalling the desire of the community to adopt a more conventional and robust approach to testing dbt models. At Super, we adopted unit testing using the dbt-datamocktool package by Michael Irvine when our models became complex enough to render manual testing tedious and error-prone.
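To illustrate, dbt-datamocktool unit tests are declared in YAML: each input ref or source is swapped for a seed of mock data, and the model’s output is compared against an expected seed. The model and seed names below are hypothetical:

models:
  - name: mart_financial_metrics
    tests:
      - dbt_datamocktool.unit_test:
          input_mapping:
            ref('stg_orders'): ref('dmt__stg_orders_mock')
          expected_output: ref('dmt__mart_financial_metrics_expected')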

To ensure the models in our dbt project are well-tested, we require each model to have at least one unique test and one not_null test, but developers are encouraged to add more tests, including unit tests, to fit their needs. We enforce this using a custom Python script, but this can also be accomplished using the dbt-project-evaluator package, which will be covered later.
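A simplified sketch of that kind of coverage check, reading the manifest dbt writes to target/ (our actual script differs in the details):

import json
import sys

# Load the manifest produced by a dbt compile/parse step
with open("target/manifest.json") as f:
    manifest = json.load(f)

# Collect the generic test names attached to each node
tests_by_node = {}
for node in manifest["nodes"].values():
    if node["resource_type"] == "test":
        test_name = node.get("test_metadata", {}).get("name")
        for parent_id in node["depends_on"]["nodes"]:
            tests_by_node.setdefault(parent_id, set()).add(test_name)

# Every model must carry at least one unique and one not_null test
missing = [
    node["name"]
    for node_id, node in manifest["nodes"].items()
    if node["resource_type"] == "model"
    and not {"unique", "not_null"} <= tests_by_node.get(node_id, set())
]

if missing:
    print("Models missing unique/not_null tests:", ", ".join(sorted(missing)))
    sys.exit(1)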

Model configs

When making updates to lots of models, or sometimes even full directories, it can be difficult to ensure that all configs remain the same. We run a small Python script to print out the diffs in configs for changed models, which makes it easy to identify when materializations, tags, or other configs have been changed errantly.
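A minimal sketch of the idea, comparing each model’s resolved config against a production manifest (the artifact path is illustrative):

import json

def model_configs(manifest_path):
    """Return {unique_id: config} for every model in a manifest."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return {
        node_id: node.get("config", {})
        for node_id, node in manifest["nodes"].items()
        if node["resource_type"] == "model"
    }

prod = model_configs("prod-artifacts/manifest.json")
branch = model_configs("target/manifest.json")

# Print config drift for models that exist in both environments
for node_id, cfg in branch.items():
    old = prod.get(node_id)
    if old is None:
        continue  # brand new model, nothing to compare
    for key in ("materialized", "tags", "schema", "enabled"):
        if cfg.get(key) != old.get(key):
            print(f"{node_id}: {key} changed {old.get(key)!r} -> {cfg.get(key)!r}")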

Code linting

Linting for software projects is a process that uses static program analysis to flag stylistic errors, improper conventions, and bugs. For dbt projects, it’s used to ensure your SQL (and now Python) code follows a consistent style guide. We use SQLFluff and a custom ruleset to enforce common conventions across our SQL codebase (and in the future, we may adopt something like black for Python models). We execute SQLFluff on new models only, using the following command:

diff-quality --violations sqlfluff --compare-branch=main --fail-under=100
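The ruleset itself lives in a .sqlfluff config file at the repo root; a minimal example from the SQLFluff 1.x era might look like this (the dialect and rule choices are illustrative, not our exact configuration):

[sqlfluff]
dialect = snowflake
templater = dbt
exclude_rules = L034,L044

[sqlfluff:rules]
max_line_length = 120

[sqlfluff:rules:L010]
capitalisation_policy = lower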

Documentation and YAML

CI processes can also be used to apply dbt project standards as defined by the project owners. At Super, we value robust documentation, so we use CI to ensure that all new models, snapshots, sources, and columns are documented.

All new models, sources, and snapshots must have the following:

  • Node description
  • Column descriptions
  • Unique and not null tests (models and snapshots only)
  • Freshness condition (sources only)
  • Owner (Slack handle)
  • Channel (Slack channel)

The last two are added as meta fields and are used by our observability system to distribute alerts:

meta:
  owner: "@jonathan"
  channel: "ops-alerts"

Project and DAG linting

Many organizations that have implemented dbt without strongly enforcing their project structure have come to know the pain of the “spaghetti DAG”:

[Image: a spaghetti DAG, from dbt Labs’ “Data modeling techniques for more modularity”]

In the spaghetti DAG, models are layered on top of each other without consideration or consolidation, leading to extremely wide DAGs with incoherent model relationships. Not only is this bad for analytics engineers’ sanity, it’s bad for performance and consistency, as the same logic is duplicated across the project, leading to unnecessary compute and business logic drift. A great primer on how to adopt a modular approach to your dbt project can be found here and here.

CI can play an important role in preventing the spaghetti DAG from forming in the first place. Rules can be defined and applied in CI to block violations from making their way into the codebase, keeping the project structure intact. For example, it’s a best practice to create a single staging model for every source; a script can check whether this relationship holds by parsing the manifest associated with the proposed code change in CI.

At Super, we run a few relationship checks on every commit to make sure the pull request conforms to our project standards (a sketch of one such check appears after the list). These include:

  • 1:1 relationship of source to stg model
  • fct/dim models only depend on stg models
  • mart models only depend on fct/dim models
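As a sketch, the first check can be implemented by walking the manifest: every source should feed exactly one staging model (this assumes staging models follow a stg_ naming convention):

import json
import sys
from collections import defaultdict

with open("target/manifest.json") as f:
    manifest = json.load(f)

# Map each source to the staging models that select from it
staging_children = defaultdict(list)
for node in manifest["nodes"].values():
    if node["resource_type"] == "model" and node["name"].startswith("stg_"):
        for parent_id in node["depends_on"]["nodes"]:
            if parent_id.startswith("source."):
                staging_children[parent_id].append(node["name"])

# Flag any source without exactly one staging model
failed = False
for source_id in manifest["sources"]:
    children = staging_children.get(source_id, [])
    if len(children) != 1:
        print(f"{source_id} feeds {len(children)} staging models: {children}")
        failed = True

if failed:
    sys.exit(1)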

While a few early projects attempted to create tools for this, dbt Labs’ recently released dbt-project-evaluator is the most advanced open source package yet, with a plethora of rules to cover even complex use cases like rejoins and model fanouts.
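The package installs like any other dbt package, and its checks run as ordinary models and tests, so it slots directly into a CI job (the version pin below is illustrative):

# packages.yml
packages:
  - package: dbt-labs/dbt_project_evaluator
    version: 0.8.0

# then, in CI:
dbt deps
dbt build --select package:dbt_project_evaluator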

Auditing and validation

The above checks help ensure that code is clean, readable, well-tested and well-documented before it gets merged in, but the developer still needs to validate the raw business logic they are adding or modifying. QA can be a tedious process especially when an existing model is being changed. Unit tests can be added to affected models, but they are fairly cumbersome to write and don’t capture downstream implications!

A thread recently cropped up on Twitter detailing this exact problem, and it drew a number of interesting responses.

Many respondents talked about auditing the results with varying levels of sophistication. Some check primary keys and row counts, while Sarah Krasnik wrote:

Any column that’s changed, I do before/after stats if it’s numeric like mean/median/max/min and make sure the difference is warranted. If it’s a string but has <10 unique values, do counts for all the values and again make sure the change is warranted. Document all in the PR!

At Super, we use a paid tool called Datafold to automate much of the QA process. Datafold runs a detailed diff for every dbt model (and child model) touched in a pull request, and appends a comment to the PR with a preview of the full results.

Datafold makes it easy for developers to investigate the final output of their models and metrics, and it can be a powerful tool for change management since it compares new versions of models with the existing production ones. However, if you are looking for free or open-source options, the dbt-audit-helper package is a great alternative. This package contains macros that perform diffs between models and tables, and it can be run manually or hacked into CI with some creativity.
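For instance, audit_helper’s compare_relations macro can be dropped into an analysis file to summarize row-level agreement between the production relation and your development build (the relation names and primary key are illustrative):

-- analyses/compare_mart_financial_metrics.sql
{% set prod_relation = adapter.get_relation(
      database="analytics",
      schema="marts",
      identifier="mart_financial_metrics"
) %}

{{ audit_helper.compare_relations(
      a_relation=prod_relation,
      b_relation=ref("mart_financial_metrics"),
      primary_key="order_id"
) }}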

Lineage

One of the most important aspects of change management for dbt is managing all the downstream implications outside of dbt. When someone makes a pull request to our dbt repo, they should be able to easily identify all the downstream assets that might be affected, like dashboards, visualizations, and reverse ETL syncs. Exposures were introduced into dbt for this exact reason, but creating and maintaining them requires a significant amount of manual work.
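For reference, an exposure is just a YAML entry declaring a downstream asset and the models it depends on (the names below are illustrative):

exposures:
  - name: finance_kpi_dashboard
    type: dashboard
    maturity: high
    owner:
      name: Jonathan Talmi
      email: data-team@example.com
    depends_on:
      - ref('mart_financial_metrics')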

At Super, we don’t have an automated solution for this. We maintain a list of repos and services developers should check so they don’t break anything with a dbt model update. There are signs vendors are beginning to take note, as shown by the recent partnership between Datafold and Hightouch, which surfaces downstream Hightouch syncs in Datafold’s pull request comments — but we would welcome an open source solution to this problem too.

Conclusion

In this post, we broke down our CI process for dbt projects. In order to keep our codebase clean and bug-free, we run and test all new or modified models, enforce test and documentation coverage, lint the SQL code, check that the new code is consistent with our project structure, and finally surface data quality checks and diffs to facilitate QA.

If you want to join the Super team — check out our careers page.
