dbt Pull-Request Data Audit in GitHub Codespaces for Free

Dave Flynn
Published in In the Pipeline
7 min read · May 30, 2024

GitHub Codespaces was made publicly available in 2022 and brought reproducible development environments right to your code. Codespaces is great for developers and, as software engineering best practices are embraced and adapted for data engineering, data teams can also take advantage of the benefits of codespaces.

“We’re seeing more software engineering best practices emerging into the data space, (such as) composability, repeatability, ease of set-up. This is where Codespaces helps really well.”
- Joe Karlsson (Tinybird)

dbt pull-request data audit in GitHub Codespaces

Code (and data) with context

The genius part of codespaces is that they are launched from a repository and branch, which inherently connects them to your code. It’s a great way to think about the Codespace environment — “I’m launching an environment to work on a specific state of my code”, which, in our case, includes data.

Versatile VMs

As Codespaces are essentially virtual machines in the cloud, you can do anything that you could do locally, but with the convenience of not managing the environment.

Review the data from any PR in your dbt project

What you’ll learn below is how to take the convenience of Codespaces and use it to review a data pull request (PR) on a dbt project. You’ll be able to take any PR in your data project, and audit that PR by checking the data that is generated from the data-modeling changes in that PR.

Why do you need to review the data in dbt PRs?

Reviewing data prior to merging a PR is essential for stopping silent errors and breaking changes from making their way into production. In the wider scope of data quality, this review, or QA, is the Audit phase of Write-Audit-Publish (WAP), a process popularized in a 2017 DataWorks Summit talk by Michelle Ufford:

“(Write-Audit-Publish) is a pattern to ensure that data is of a high quality before making it available to consumers… the only point people would see the data is after we verify that it’s actually good.”
- Michelle Ufford (Netflix, at the time of this quote)

Whoops, the Numbers Are Wrong! Scaling Data Quality @ Netflix

Do you live in an ideal world?

In an ideal world you’d have a full CI/CD workflow in place that would set up the WAP environment for you automatically. The process would go something like:

  • Write a subset of data to a staging table
  • Audit your PRs automatically
  • Publish your changes after reviewing the audit
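The three stages above can be sketched in code. This is a minimal, hypothetical illustration of the Write-Audit-Publish flow, using Python’s built-in sqlite3 as a stand-in for a real warehouse; the table, columns, and audit checks are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Write: build the new data into a staging table, not production
conn.execute("CREATE TABLE orders_staging (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders_staging VALUES (?, ?)",
                 [(1, 9.99), (2, 24.50)])

# Audit: run checks against staging before anyone can read it
count = conn.execute("SELECT COUNT(*) FROM orders_staging").fetchone()[0]
nulls = conn.execute(
    "SELECT COUNT(*) FROM orders_staging WHERE amount IS NULL"
).fetchone()[0]
assert count > 0 and nulls == 0, "audit failed - do not publish"

# Publish: only after the audit passes, expose the table to consumers
conn.execute("ALTER TABLE orders_staging RENAME TO orders")
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])
```

The key property is that consumers only ever see `orders`, which doesn’t exist until the audit has passed.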

Michelle’s talk explains how WAP works and, for something more hands-on with dbt, check out best practices to prepare data environments for testing.

A data audit for the rest of us

We don’t all have the luxury of living in an ‘ideal world’, so you might not have CI/CD prepping your environments, but you can still audit your data.

I’m sure when you see this in action, you’ll start planning a better implementation specific to your workflow and be on your way to full-on WAP, but this will give you a good taster.

Here’s how to audit your data for a PR using GitHub Codespaces:

Follow-along demo

This demo uses DuckDB and dbt’s Jaffle Shop, so you can follow along and reproduce these steps exactly. Head to the following PR in the demo repo.

Demo pull request:
https://github.com/DataRecce/jaffle_shop_duckdb/pull/1

I won’t go into too much detail about this specific demo PR, I covered that in another article. The main point is that there is a data modeling change, and you want to check (audit) the data before merging the change.

1. Create a Codespace

Click the Code button, then the green “Create codespace” button on this PR:

Create a Codespace from a Pull Request

A new tab will pop open to launch the Codespace, and when it’s finished you’ll see the VS Code interface.

You should feel at home if you’ve used this editor before. Don’t worry if not, we won’t be in here for long, and we’ll be running mostly dbt commands.

A watched pot never boils

2. Prep base and target environments

You’re going to do two things here:

  1. Build the models from the production code (base)
  2. Build the models from the branch code (target)

This is essentially the ‘write’ stage of WAP. For the purposes of this demo, instead of building a subset of data into a staging schema, we’ll just use prod as the base for the comparison.

Close the dbt Task and click on bash to open the terminal

Close the dbt Task and click on the bash link to get to the terminal.

Checkout the main branch:

git checkout main

Install dbt dependencies:

dbt deps

Build the production models:

dbt build --target prod --target-path target-base/

Generate the dbt docs:

dbt docs generate --target prod --target-path target-base/

Important: Note that we’re specifying a --target-path of target-base for both of the above commands. These dbt artifacts will act as the base for our audit comparison, so we store them in a special folder. Also, docs are required for this demo, so don’t skip it ;)

Switch back to the branch

git switch -

Build the branch models and generate dbt docs:

dbt build && dbt docs generate

Note: There’s no need to specify a --target-path here. The default target is dev, and the default artifacts folder is target, which will be our audit comparison target.

3. Install and launch Recce

Recce is the tool we’ll use to compare the prod data with the branch data and perform our data audit checks.

Recce automatically looks for a target-base folder to use as the known-good baseline, and compares it with the artifacts that were built into target. That’s why the previous environment set-up was so important.

Install Recce:

pip install recce

Start the Recce server:

recce server

The Recce server will start and you’ll notice a pop-up in the bottom right asking you to open in browser.

Click the Ports tab, and then open the Forwarded Address URL

You can also access the server from the Ports tab by clicking the little globe icon under Forwarded address.

4. Data Audit Environment

The Recce UI is essentially your audit environment. Any data check you run in here will run against both production and development data, and show you the difference.

When you run this on your own data, the things you check will depend on the specific type of change you made — refactor, bug fix, new models, columns etc. For this PR it’s a logic change, and we’re comparing directly to prod, so we don’t expect to see any difference in row count or schema.

Row count check

Perform row count checks on multiple models to ensure no data was lost from the data modeling changes:

Row count audit on multiple models
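To make the check concrete, here’s a hedged sketch of what a row-count comparison boils down to, using sqlite3 with invented base (prod) and target (dev) tables; Recce runs the equivalent against your two dbt environments.

```python
import sqlite3

# Stand-in tables for the same model built in two environments
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod_orders (id INTEGER);
    CREATE TABLE dev_orders  (id INTEGER);
    INSERT INTO prod_orders VALUES (1), (2), (3);
    INSERT INTO dev_orders  VALUES (1), (2), (3);
""")

def row_count(table: str) -> int:
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

base, target = row_count("prod_orders"), row_count("dev_orders")
# A logic-only change should leave the diff at zero
print(f"orders: base={base} target={target} diff={target - base}")
```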

Data profile diff

Other checks, like the data profile diff, give a high-level impact assessment of the data change.

Recce Data Profile Diff

Query Diff

Query diff allows you to audit data at the row level; it’s a fine-grained comparison for spot-checking specific rows and seeing the impact.

Recce Query Diff
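Conceptually, a row-level diff can be expressed with a set-difference query. This is an illustrative sketch only (invented tables and values, sqlite3 as the stand-in warehouse), not Recce’s internal implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE base_customers   (id INTEGER, lifetime_value REAL);
    CREATE TABLE target_customers (id INTEGER, lifetime_value REAL);
    INSERT INTO base_customers   VALUES (1, 10.0), (2, 20.0);
    INSERT INTO target_customers VALUES (1, 10.0), (2, 25.0);
""")

# Rows present in target but not in base: changed or newly added rows
changed = conn.execute("""
    SELECT * FROM target_customers
    EXCEPT
    SELECT * FROM base_customers
""").fetchall()
print(changed)
```

Here only customer 2 surfaces, because its lifetime_value differs between the two environments.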

Data drill-down

For a more in-depth look at using Recce to audit code and data change for dbt PRs, check out this article:

Improving the process

There are steps you can take to streamline this process.

Connect to your data warehouse

The demo above uses DuckDB for simplicity but, of course, you’d want to connect to your specific data warehouse (BigQuery, Snowflake, Redshift, etc.), so you’ll need to handle that connection.

Your warehouse credentials can be added as environment variables using GitHub Codespaces secrets. If your organization configures the secrets for the repo, team members won’t need to set them up individually.
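dbt’s env_var() function reads environment variables at runtime, so Codespaces secrets can feed your connection profile directly. A sketch of what that might look like in profiles.yml, assuming Snowflake; the account, database, and variable names are placeholders:

```yaml
# profiles.yml - credentials pulled from environment variables,
# which Codespaces populates from repository secrets.
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: analytics
      schema: dev
```

This keeps credentials out of the repo while letting every Codespace connect with no manual set-up.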

Create an initialization script

If you have a lot of dependencies to install each time you launch a Codespace, then you could create a script that installs everything for you. Commit it to the repo, and then run it each time you launch a Codespace.
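One place to hook such a script in is the dev container’s postCreateCommand, which runs automatically after the Codespace is created. A minimal sketch, assuming your Python dependencies live in a requirements.txt at the repo root:

```json
{
  "name": "dbt-audit",
  "postCreateCommand": "pip install -r requirements.txt && dbt deps"
}
```

Commit this as .devcontainer/devcontainer.json and every new Codespace launched from the repo will run the install step for you.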

Use a Codespace Pre-build

Alternatively, if you have a lot to install, you could use a Codespace prebuild. Prebuilds are ready-made images with all of your dependencies pre-installed. Just launch the Codespace and you’re ready to go.

Data best practices

If you’re getting started with PR review for dbt data projects, there are some best practices you can follow to prepare your environments, such as:

  • Use schemas to manage your environments
  • Prep a staging environment instead of comparing directly to production
  • Limit the data range used when preparing environments
  • and more
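On that last point, dbt exposes the active target in Jinja as target.name, a documented way to limit the data range in non-prod builds. A hypothetical model snippet (the source, column, and date syntax are illustrative and warehouse-dependent):

```sql
-- Only scan recent rows when building review environments
select *
from {{ source('shop', 'raw_orders') }}
{% if target.name == 'dev' %}
  where ordered_at >= current_date - interval '30' day
{% endif %}
```

Production builds (target prod) still get the full history, while dev and PR-review builds stay fast and cheap.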

Is GitHub Codespaces free?

Codespaces has a really generous free tier. At the time of writing, individuals get up to 60 hours per month free. Even if you use it in a team, you can still take advantage of the free tier by setting your Codespaces ownership to User. It’s a great way to get started and try it out.

Let me know if you try out the above steps and have any feedback!



Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects