dbt best practices in action at Cal-ITP’s data-infra project

Dave Flynn · Published in In the Pipeline · 8 min read · Apr 3, 2024

Research by Even Wei, article by Dave Flynn.

dbt adoption is growing, projects are getting bigger, and more fingers are now in your data pie than ever. Your job of maintaining data quality and prod stability in such circumstances isn’t easy, which is bringing the topic of dbt/data project best practices to the forefront.

dbt Best Practices In-Action at Cal-ITP

The best way to QA updates to your dbt project is to standardize the process of reviewing and integrating code, and therefore data, changes into your project. The pull/merge request is at the center of the review process.

What does a pull request look like in a dbt project when a data team uses best practices to manage that process? What do they do, and what CI automation do they run to assist in the process?

Cal-ITP Data-Infra dbt project

In the following article, we’ll look at the Cal-ITP data-infra project, the project behind the Caltrans initiative to improve “trip planning and payment across (public transportation) modes and across services in California” (ref).

The Cal-ITP project has almost 400 dbt models covering many areas, such as payments, schedules, stops, and even translations (and, I’m sure, much more than I can grasp from the model names alone). We can’t see the volume of data because the data warehouse isn’t public, but given the complexity of the project, it’s surely substantial.

Having access to a production dbt project like this provides a wealth of information, which is particularly useful to teams tackling the growing pains of a scaling dbt project. It gives us insight that would usually only be available behind closed doors and lets us see how a professional data team actually operates. (Note: Mattermost is another high-quality public dbt project, if you’re looking for more.)

A template for success

The first thing you’ll notice about the Cal-ITP pull requests is that they all use a structured dbt pull request template. The initial PR comment has the following sections:

  • Description — The aim of the PR, including supporting information.
  • Type of change — Feature, refactor, bugfix etc.
  • Testing proof — How the team knows it’s safe to merge.
  • Post-merge follow-up — The actions that are required after merging.
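As a sketch, a template along these lines can live in `.github/PULL_REQUEST_TEMPLATE.md` so GitHub pre-fills every new PR. The headings mirror the sections above; the exact Cal-ITP wording and checkboxes may differ:

```markdown
<!-- Sketch of a dbt PR template; the exact Cal-ITP wording may differ. -->
## Description
What does this PR change, and why? Link related issues.

## Type of change
- [ ] Bug fix
- [ ] Refactor
- [ ] New feature

## Testing proof
How do we know this is safe to merge? (e.g. a query diff of prod vs dev)

## Post-merge follow-up
- [ ] e.g. full refresh of affected incremental models
```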

Here’s an example PR comment:

Example PR comment from the Cal-ITP data-infra dbt project

This type of description makes the reviewer’s job so much easier. All of the context for the change is provided, minimizing unnecessary delays through requests for clarification.

Automated warehouse report

In addition to the above information, Cal-ITP also has an automated ‘Warehouse Report’ that runs on every PR. The warehouse report shows:

  • A list of new models and recommendations to check for
  • A lineage DAG of modified models, which includes a color-coded legend identifying the resource type and how to materialize models of certain scale or with multiple children
The automated ‘Warehouse Report’ that runs on pull requests in the Cal-ITP dbt project

Each PR follows this structure, so we can see that the project maintainers have implemented standard practices that all contributors must follow. This ensures that PR authors perform due diligence on every PR, minimizing disruption to prod data.

The PR comment template, coupled with the automated report, means reviewers get high-context information from the author together with the automated data points the maintainers consider relevant for every PR.
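The shape of such automation can be sketched as a GitHub Actions workflow. The script and path names below are hypothetical, not the actual Cal-ITP implementation:

```yaml
# Hypothetical sketch of a per-PR warehouse report workflow.
name: warehouse-report
on:
  pull_request:
    paths: ["warehouse/**"]

jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Find models changed relative to the production manifest
      - run: dbt ls --select state:modified+ --state prod-artifacts/
      # Build the report (new-model list, lineage image) -- hypothetical script
      - run: python scripts/generate_warehouse_report.py > report.md
      # Post the report back to the PR as a comment
      - run: gh pr comment "${{ github.event.number }}" --body-file report.md
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```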

Real-world dbt pull request examples

To show how this process works in action, let’s focus on a couple of PRs, each showing a different use-case.

  1. Refactoring job: Refactor payments rides for code readability and modularity: device transactions portion #3159
  2. Bugfix: Littlepay Table Deduplication Adjustments #2993

1. Validating a refactoring job

The PR: Refactor payments rides for code readability and modularity: device transactions portion #3159

Refactoring is a common job in data projects. Maybe you’re optimizing or simplifying models, cleaning data, changing sources, or normalizing data. It could be any number of things, but one thing that shouldn’t be impacted is the data: the resulting data should remain the same after a refactor, and demonstrating that can be a unique challenge.

dbt data due diligence

The comment for this PR is a textbook example of dbt data due diligence.

  • Previous and resolved issues are linked.
  • The work is clearly described with reasoning.
  • The reviewer is tagged with questions about changes.
  • Post-merge actions are defined.
A textbook PR comment from the Cal-ITP dbt data project

One of the most notable sections in this comment is the use of a query for both production and development schemas as proof that this refactoring job did not impact the data:

Validating a refactoring job by comparing production data with that of a development branch

This kind of query diff is ideal for validating a refactoring job, when the data should not change; it can even be used when impact has occurred and you want to inspect it (more on this below).
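The same comparison can be sketched in Python with pandas: run one aggregate query against each schema, outer-join the results, and keep only the rows that differ. The frames below are toy stand-ins for the query results; an empty diff means the refactor preserved the data.

```python
import pandas as pd

def query_diff(prod: pd.DataFrame, dev: pd.DataFrame, keys: list[str]) -> pd.DataFrame:
    """Outer-join two result sets on their keys and keep only rows that differ."""
    merged = prod.merge(dev, on=keys, how="outer", suffixes=("_prod", "_dev"), indicator=True)
    value_cols = [c for c in prod.columns if c not in keys]
    changed = merged["_merge"] != "both"  # rows present in only one environment
    for col in value_cols:
        changed |= merged[f"{col}_prod"] != merged[f"{col}_dev"]
    return merged[changed]

# Toy stand-ins; in practice each frame comes from the same query run
# against a different schema, e.g. pd.read_sql(sql, conn) per environment.
prod = pd.DataFrame({"day": ["2024-01-01", "2024-01-02"], "rides": [100, 120]})
dev = pd.DataFrame({"day": ["2024-01-01", "2024-01-02"], "rides": [100, 120]})
assert query_diff(prod, dev, keys=["day"]).empty  # no impact: refactor is safe
```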

2. Bugfix for an accidentally dropped column

The PR: Littlepay Table Deduplication Adjustments #2993

This is an interesting case: what started as a partial solution for cleaning duplicate data resulted in a dropped column. The PR was merged with the column missing, prompting a bugfix PR later the same day.

The issue occurred because the geography column was omitted from an explicit list of columns in the following model:

The geography column was added back in PR #3013.
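This class of bug is easy to catch mechanically by comparing the model’s column list in each environment. A minimal sketch (the column names are illustrative, not the real Littlepay schema):

```python
def schema_diff(prod_cols: list[str], dev_cols: list[str]) -> dict[str, list[str]]:
    """Report columns removed from or added to the model relative to prod."""
    return {
        "removed": [c for c in prod_cols if c not in dev_cols],
        "added": [c for c in dev_cols if c not in prod_cols],
    }

# Illustrative columns only -- not the actual stg_littlepay__device_transactions schema.
prod_cols = ["transaction_id", "device_id", "geography", "amount"]
dev_cols = ["transaction_id", "device_id", "amount"]  # explicit list missed geography
print(schema_diff(prod_cols, dev_cols))  # {'removed': ['geography'], 'added': []}
```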

Follow the issue trail

Following the trail of GitHub issues in the cal-itp repo shows the benefit of the detailed pull request comment by each contributor, and how efficiently the team can react when an issue does occur: 2993 was created to address 2945 which accidentally caused 3012 that was resolved in 3013.

Column issue aside, there’s a ton of discussion in this thread and a lot of in-depth review about the cause of the duplication and the shape of the data. It goes to show that even with such a detailed review, a small bug can still sneak into prod. It really underscores the importance of comprehensive pre-merge checks in dbt data projects.

How to validate PRs properly

The Cal-ITP data-infra project shows how a mix of high-context validations by the PR submitter together with supplementary checks by CI workflows provides the most comprehensive coverage for impact assessment in data project PRs.

This is the concept behind Recce, the data validation toolkit for dbt projects. Recce provides a way for your data team to check their work using a suite of validation tools. The key concept is diffing data between two dbt environments, such as dev and prod, so you can see exactly how model changes affect the data.

Here are a few ways Cal-ITP could use the tools in Recce to make their already extensive process even better.

Understand the scope of change with Lineage Diff

The Cal-ITP automation already includes a lineage snapshot; however, newly added models don’t seem to be included in the screenshot, so it isn’t a true Lineage Diff of how the DAG has changed. Being a large static image also limits its usefulness: when checking the PRs, I found it difficult to zoom around the image and cross-reference it with the code and the PR comment.

Using the exact pull request from the refactoring job above, here’s what the Lineage Diff looks like in Recce:

Lineage Diff of Cal-ITP PR #3159

The Recce Lineage Diff shows exactly the parts of the DAG that have changed in the PR when compared to prod (or any other env).
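A lineage diff like this can be derived from the dbt `manifest.json` of each environment by comparing node checksums. A rough sketch (not Recce’s actual implementation), using toy manifests with only the fields the sketch needs:

```python
def modified_models(base_manifest: dict, target_manifest: dict) -> set[str]:
    """Nodes added, or whose file checksum changed, between two dbt manifests."""
    base_nodes = base_manifest["nodes"]
    return {
        name
        for name, node in target_manifest["nodes"].items()
        if name not in base_nodes or node["checksum"] != base_nodes[name]["checksum"]
    }

# Toy manifests; real dbt manifests carry many more fields per node.
base = {"nodes": {
    "model.proj.stg_rides": {"checksum": {"checksum": "aaa"}},
    "model.proj.fct_rides": {"checksum": {"checksum": "bbb"}},
}}
target = {"nodes": {
    "model.proj.stg_rides": {"checksum": {"checksum": "aaa"}},
    "model.proj.fct_rides": {"checksum": {"checksum": "ccc"}},  # edited in the PR
}}
print(modified_models(base, target))  # {'model.proj.fct_rides'}
```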

Here’s the full DAG zoomed out (only half shown) with the changes highlighted:

The full lineage of the Cal-ITP dbt project

Validating a refactoring job

Lineage Diff also acts as the main interface in Recce for performing other data checks. Click any model to view a schema diff or perform a row count diff: two important checks that help confirm the data has not changed.

Schema and Row count Diffs in Recce
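A row count diff is the simplest of these checks to sketch yourself: run `COUNT(*)` on the same model in both schemas and compare. The demo below uses an in-memory SQLite database standing in for the warehouse; in practice the base and target would be the prod and dev dbt schemas.

```python
import sqlite3

def row_count_diff(conn, model: str, base_schema: str, target_schema: str) -> int:
    """Difference in row counts for one model across two schemas (0 = unchanged)."""
    counts = [
        conn.execute(f"SELECT COUNT(*) FROM {schema}.{model}").fetchone()[0]
        for schema in (base_schema, target_schema)
    ]
    return counts[1] - counts[0]

# SQLite stand-in: 'main' plays prod, an attached database plays dev.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH ':memory:' AS dev")
conn.execute("CREATE TABLE main.payments (id INTEGER)")
conn.execute("CREATE TABLE dev.payments (id INTEGER)")
conn.executemany("INSERT INTO main.payments VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO dev.payments VALUES (?)", [(1,), (2,)])
assert row_count_diff(conn, "payments", "main", "dev") == 0  # no rows lost
```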

One query, two envs

When it comes to validating a refactoring job, we saw that Cal-ITP data engineers would perform a query on models across two schemas, and then compare the results to confirm no data had changed. With Query Diff in Recce, you only need to write the query once and the results will automatically be compared for you.

Query Diff in Recce

Write one query that runs on both dbt environments and see the comparison. In the screenshot above, base is prod and target is the dev schema. There’s no need to manually query two schemas yourself and eyeball the results side by side; it’s all handled for you in Recce.

If there was impact, you’d be able to see that, too:

Recce Query Diff compares the results of a query across two dbt environments

Detecting a dropped column

For the PR above with the missing column, if the lineage DAG had looked like the screenshot below, with a red icon on the stg_littlepay__device_transactions.sql model, the impact would have been difficult to miss.

Cal-ITP lineage DAG showing schema change detected in stg_littlepay__device_transactions.sql

High level checks

There’s also a set of high-level checks to run data profile diffs and column value diffs on specific models. These are useful for getting an overview of how your PR branch differs from your baseline. You would use these before jumping into any in-depth data spot-checks.

Histogram Diff and Value Diff are great ways to spot data impact, including edge cases that could result in silent errors.

High level data profiling checks in Recce
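A value diff can be sketched as joining the two environments on a primary key and reporting, per column, the percentage of matching values. This is a simplification of what Recce reports, and the frames below are toy data:

```python
import pandas as pd

def value_diff(prod: pd.DataFrame, dev: pd.DataFrame, key: str) -> pd.Series:
    """Percentage of matching values per non-key column, joined on a primary key."""
    merged = prod.merge(dev, on=key, suffixes=("_prod", "_dev"))
    return pd.Series({
        col: (merged[f"{col}_prod"] == merged[f"{col}_dev"]).mean() * 100
        for col in prod.columns if col != key
    })

# Toy data: one fare value drifted in dev, so 'fare' matches in 2 of 3 rows.
prod = pd.DataFrame({"id": [1, 2, 3], "fare": [2.00, 2.50, 3.00]})
dev = pd.DataFrame({"id": [1, 2, 3], "fare": [2.00, 2.50, 9.90]})
print(value_diff(prod, dev, key="id").round(1))  # fare: 66.7% matched
```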

Check the Recce Docs for the full list of data impact assessment validations that are available.

Do you want to implement a comprehensive PR review process?

If you’re looking to improve the PR review process in your dbt data project, we want to talk to you. Recce is open-source and available now.

Book a meeting to chat with us, or send us a message in #tools-recce in the dbt Slack (Here’s where to join).


Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects