dbt best practices in action at Cal-ITP’s data-infra project

Dave Flynn · Published in In the Pipeline · 8 min read · Apr 3, 2024

Research by Even Wei, article by Dave Flynn.

dbt adoption is growing, projects are getting bigger, and more fingers are now in your data pie than ever. Your job of maintaining data quality and prod stability in such circumstances isn’t easy, which is bringing the topic of dbt/data project best practices to the forefront.

dbt Best Practices In-Action at Cal-ITP

The best way to QA updates to your dbt project is to standardize the process of reviewing and integrating code, and therefore data, changes into your project. The pull/merge request is at the center of the review process.

What does a pull request look like in a dbt project when a data team uses best practices to manage that process? What do they do, and what CI automation do they run to assist in the process?

Cal-ITP Data-Infra dbt project

In the following article, we’ll look at the Cal-ITP data-infra project, the project behind the Caltrans initiative to improve “trip planning and payment across (public transportation) modes and across services in California” (ref).

The Cal-ITP project has almost 400 dbt models covering many areas, such as payments, schedules, stops, and even translations (and, I’m sure, much more than I can grasp from the model names alone). We can’t see the volume of data because the data warehouse isn’t public, but given the complexity of the project, it’s surely substantial.

Having access to a production dbt project like this provides a wealth of information, which is particularly useful to teams tackling the growing pains of a scaling dbt project. It gives us insight that would usually only be available behind closed doors and lets us see how a professional data team actually operates. (Note: Mattermost is another high-quality public dbt project, if you’re looking for more.)

A template for success

The first thing you’ll notice about the Cal-ITP pull requests is that they all use a structured dbt pull request template. The initial PR comment has the following sections:

  • Description — The aim of the PR, including supporting information.
  • Type of change — Feature, refactor, bugfix etc.
  • Testing proof — How the team knows it’s safe to merge.
  • Post-merge follow-up — The actions that are required after merging.
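As a sketch, a template along these lines can live in `.github/PULL_REQUEST_TEMPLATE.md` so GitHub pre-fills every new PR. The headings mirror the sections above; the exact Cal-ITP wording and checkboxes may differ:

```markdown
<!-- Sketch of a dbt PR template; the exact Cal-ITP wording may differ. -->
## Description
What does this PR change, and why? Link related issues.

## Type of change
- [ ] Bug fix
- [ ] Refactor
- [ ] New feature

## Testing proof
How do we know this is safe to merge? (e.g. a query diff of prod vs dev)

## Post-merge follow-up
- [ ] e.g. full refresh of affected incremental models
```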

Here’s an example PR comment:

Example PR comment from the Cal-ITP data-infra dbt project

This type of description makes the reviewer’s job so much easier. All of the context for the change is provided, minimizing unnecessary delays through requests for clarification.

Automated warehouse report

In addition to the above information, Cal-ITP also has an automated ‘Warehouse Report’ that runs on every PR. The warehouse report shows:

  • A list of new models and recommendations to check for
  • A lineage DAG of modified models, which includes a color-coded legend identifying the resource type and how to materialize models of certain scale or with multiple children
The automated ‘Warehouse Report’ that runs on pull requests in the Cal-ITP dbt project

Each PR follows this structure, so we can see that the project maintainers have implemented standard practices that all contributors must follow. This ensures that PR authors perform due diligence on every PR, minimizing disruption to prod data.

The PR comment template, coupled with the automated report, means reviewers get high-context information from the author together with the automated data points the maintainers consider relevant for every PR.
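The shape of such automation can be sketched as a GitHub Actions workflow. The script and path names below are hypothetical, not the actual Cal-ITP implementation:

```yaml
# Hypothetical sketch of a per-PR warehouse report workflow.
name: warehouse-report
on:
  pull_request:
    paths: ["warehouse/**"]

jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Find models changed relative to the production manifest
      - run: dbt ls --select state:modified+ --state prod-artifacts/
      # Build the report (new-model list, lineage image) -- hypothetical script
      - run: python scripts/generate_warehouse_report.py > report.md
      # Post the report back to the PR as a comment
      - run: gh pr comment "${{ github.event.number }}" --body-file report.md
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```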

Real-world dbt pull request examples

To show how this process works in action, let’s focus on a couple of PRs, each showing a different use-case.

  1. Refactoring job: Refactor payments rides for code readability and modularity: device transactions portion #3159
  2. Bugfix: Littlepay Table Deduplication Adjustments #2993

1. Validating a refactoring job

The PR: Refactor payments rides for code readability and modularity: device transactions portion #3159

Refactoring is a common job in data projects. Maybe you’re optimizing or simplifying models, cleaning data, changing sources, or normalizing data. It could be any number of things, but one thing that shouldn’t be impacted is the data: the resulting data should remain the same after a refactor, and demonstrating that can be a unique challenge.

dbt data due diligence

The comment for this PR is a textbook example of dbt data due diligence.

  • Previous and resolved issues are linked.
  • The work is clearly described with reasoning.
  • The reviewer is tagged with questions about changes.
  • Post-merge actions are defined.
A textbook PR comment from the Cal-ITP dbt data project

One of the most notable sections in this comment is the use of a query for both production and development schemas as proof that this refactoring job did not impact the data:

Validating a refactoring job by comparing production data with that of a development branch

This kind of query diff is ideal for validating a refactoring job, when the data should not change; it can even be used when impact has occurred and you want to inspect it (more on this below).
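The same comparison can be sketched in Python with pandas: run one aggregate query against each schema, outer-join the results, and keep only the rows that differ. The frames below are toy stand-ins for the query results; an empty diff means the refactor preserved the data.

```python
import pandas as pd

def query_diff(prod: pd.DataFrame, dev: pd.DataFrame, keys: list[str]) -> pd.DataFrame:
    """Outer-join two result sets on their keys and keep only rows that differ."""
    merged = prod.merge(dev, on=keys, how="outer", suffixes=("_prod", "_dev"), indicator=True)
    value_cols = [c for c in prod.columns if c not in keys]
    changed = merged["_merge"] != "both"  # rows present in only one environment
    for col in value_cols:
        changed |= merged[f"{col}_prod"] != merged[f"{col}_dev"]
    return merged[changed]

# Toy stand-ins; in practice each frame comes from the same query run
# against a different schema, e.g. pd.read_sql(sql, conn) per environment.
prod = pd.DataFrame({"day": ["2024-01-01", "2024-01-02"], "rides": [100, 120]})
dev = pd.DataFrame({"day": ["2024-01-01", "2024-01-02"], "rides": [100, 120]})
assert query_diff(prod, dev, keys=["day"]).empty  # no impact: refactor is safe
```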

2. Bugfix for an accidentally dropped column

The PR: Littlepay Table Deduplication Adjustments #2993

This is an interesting case: what started as a partial solution for cleaning duplicate data resulted in a dropped column. The PR was merged with the column missing, prompting a bugfix PR later the same day.

The issue occurred because the geography column was omitted from an explicit list of columns in the following model:

The geography column was added back in PR #3013.
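This class of bug is easy to catch mechanically by comparing the model’s column list in each environment. A minimal sketch (the column names are illustrative, not the real Littlepay schema):

```python
def schema_diff(prod_cols: list[str], dev_cols: list[str]) -> dict[str, list[str]]:
    """Report columns removed from or added to the model relative to prod."""
    return {
        "removed": [c for c in prod_cols if c not in dev_cols],
        "added": [c for c in dev_cols if c not in prod_cols],
    }

# Illustrative columns only -- not the actual stg_littlepay__device_transactions schema.
prod_cols = ["transaction_id", "device_id", "geography", "amount"]
dev_cols = ["transaction_id", "device_id", "amount"]  # explicit list missed geography
print(schema_diff(prod_cols, dev_cols))  # {'removed': ['geography'], 'added': []}
```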

Follow the issue trail

Following the trail of GitHub issues in the cal-itp repo shows the benefit of the detailed pull request comment by each contributor, and how efficiently the team can react when an issue does occur: 2993 was created to address 2945 which accidentally caused 3012 that was resolved in 3013.

Column issue aside, there’s a ton of discussion in this thread and a lot of in-depth review about the cause of the duplication and the shape of the data. It goes to show that even with such a detailed review, a small bug can still sneak into prod. It really underscores the importance of comprehensive pre-merge checks in dbt data projects.

How to validate PRs properly

The Cal-ITP data-infra project shows how a mix of high-context validations by the PR submitter together with supplementary checks by CI workflows provides the most comprehensive coverage for impact assessment in data project PRs.

This is the concept behind Recce, the data validation toolkit for dbt projects. Recce provides a way for your data team to check their work using a suite of validation tools. The key concept is diffing data between two dbt environments, such as dev and prod, so you can see exactly how model changes affect the data.

Here are a few ways Cal-ITP could use the tools in Recce to make their already extensive process even better.

Understand the scope of change with Lineage Diff

The Cal-ITP automation already includes a lineage snapshot; however, newly added models don’t seem to be included in the screenshot, so it isn’t a true Lineage Diff of how the DAG has changed. Being a large static image also limits its usefulness: when checking the PRs, I found it difficult to zoom around the image and cross-reference it with the code and the PR comment.

Using the exact pull request from the refactoring job above, here’s what the Lineage Diff looks like in Recce:

Lineage Diff of Cal-ITP PR #3159

The Recce Lineage Diff shows exactly the parts of the DAG that have changed in the PR when compared to prod (or any other env).
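A lineage diff like this can be derived from the dbt `manifest.json` of each environment by comparing node checksums. A rough sketch (not Recce’s actual implementation), using toy manifests with only the fields the sketch needs:

```python
def modified_models(base_manifest: dict, target_manifest: dict) -> set[str]:
    """Nodes added, or whose file checksum changed, between two dbt manifests."""
    base_nodes = base_manifest["nodes"]
    return {
        name
        for name, node in target_manifest["nodes"].items()
        if name not in base_nodes or node["checksum"] != base_nodes[name]["checksum"]
    }

# Toy manifests; real dbt manifests carry many more fields per node.
base = {"nodes": {
    "model.proj.stg_rides": {"checksum": {"checksum": "aaa"}},
    "model.proj.fct_rides": {"checksum": {"checksum": "bbb"}},
}}
target = {"nodes": {
    "model.proj.stg_rides": {"checksum": {"checksum": "aaa"}},
    "model.proj.fct_rides": {"checksum": {"checksum": "ccc"}},  # edited in the PR
}}
print(modified_models(base, target))  # {'model.proj.fct_rides'}
```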

Here’s the full DAG zoomed out (only half shown) with the changes highlighted:

The full lineage of the Cal-ITP dbt project

Validating a refactoring job

Lineage Diff also acts as the main interface in Recce for performing other data checks. Click any model to view a schema diff or perform a row count diff: two important checks that help confirm the data has not changed.

Schema and Row count Diffs in Recce
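A row count diff is the simplest of these checks to sketch yourself: run `COUNT(*)` on the same model in both schemas and compare. The demo below uses an in-memory SQLite database standing in for the warehouse; in practice the base and target would be the prod and dev dbt schemas.

```python
import sqlite3

def row_count_diff(conn, model: str, base_schema: str, target_schema: str) -> int:
    """Difference in row counts for one model across two schemas (0 = unchanged)."""
    counts = [
        conn.execute(f"SELECT COUNT(*) FROM {schema}.{model}").fetchone()[0]
        for schema in (base_schema, target_schema)
    ]
    return counts[1] - counts[0]

# SQLite stand-in: 'main' plays prod, an attached database plays dev.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH ':memory:' AS dev")
conn.execute("CREATE TABLE main.payments (id INTEGER)")
conn.execute("CREATE TABLE dev.payments (id INTEGER)")
conn.executemany("INSERT INTO main.payments VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO dev.payments VALUES (?)", [(1,), (2,)])
assert row_count_diff(conn, "payments", "main", "dev") == 0  # no rows lost
```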

One query, two envs

When it comes to validating a refactoring job, we saw that Cal-ITP data engineers would perform a query on models across two schemas, and then compare the results to confirm no data had changed. With Query Diff in Recce, you only need to write the query once and the results will automatically be compared for you.

Query Diff in Recce

Write one query that runs on both dbt environments and see the comparison. In the screenshot above, base is prod and target is the dev schema. There’s no need to manually query two schemas yourself and eyeball the results side by side; it’s all handled for you in Recce.

If there was impact, you’d be able to see that, too:

Recce Query Diff compares the results of a query across two dbt environments

Detecting a dropped column

For the PR above with the missing column, if the lineage DAG had looked like the screenshot below, with a red icon on the stg_littlepay__device_transactions.sql model, the impact would have been difficult to miss.

Cal-ITP lineage DAG showing schema change detected in stg_littlepay__device_transactions.sql

High level checks

There’s also a set of high-level checks to run data profile diffs and column value diffs on specific models. These are useful for getting an overview of how your PR branch differs from your baseline. You would use these before jumping into any in-depth data spot-checks.

Histogram Diff and Value Diff are great ways to spot data impact, including edge cases that could result in silent errors.

High level data profiling checks in Recce
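A value diff can be sketched as joining the two environments on a primary key and reporting, per column, the percentage of matching values. This is a simplification of what Recce reports, and the frames below are toy data:

```python
import pandas as pd

def value_diff(prod: pd.DataFrame, dev: pd.DataFrame, key: str) -> pd.Series:
    """Percentage of matching values per non-key column, joined on a primary key."""
    merged = prod.merge(dev, on=key, suffixes=("_prod", "_dev"))
    return pd.Series({
        col: (merged[f"{col}_prod"] == merged[f"{col}_dev"]).mean() * 100
        for col in prod.columns if col != key
    })

# Toy data: one fare value drifted in dev, so 'fare' matches in 2 of 3 rows.
prod = pd.DataFrame({"id": [1, 2, 3], "fare": [2.00, 2.50, 3.00]})
dev = pd.DataFrame({"id": [1, 2, 3], "fare": [2.00, 2.50, 9.90]})
print(value_diff(prod, dev, key="id").round(1))  # fare: 66.7% matched
```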

Check the Recce Docs for the full list of data impact assessment validations that are available.

Do you want to implement a comprehensive PR review process?

If you’re looking to improve the PR review process in your dbt data project, we want to talk to you. Recce is open-source and available now.

Book a meeting to chat with us, or send us a message in #tools-recce in the dbt Slack (Here’s where to join).


Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects