Speed up pull request review for SQLMesh data projects

Dave Flynn
In the Pipeline
Published in
5 min readMay 9, 2024

SQLMesh is a powerful ELT platform that brings a lot of features to help you build and iterate on data pipelines. One of the most interesting features is virtual data environments, which speeds up pushing changes to production by reducing unnecessary model rebuilds. Wouldn’t it be great if you could also speed up other areas of your project, like PR review?

Yoda from Star Wars: “Review the data, you must”
Without checking the data, a proper PR review it is not

Best practices for data projects

If you’re using SQLMesh, then you already know the benefits of adapting software engineering best practices to data projects — Your code is version controlled, developers have their own schemas, you have testing in place. Even so, without proper PR review, bad merges can still occur leading to silent data errors, and firefighting after merging.

Reviewing PRs for your SQLMesh project can be a challenge, so here are some things that you can do to bolster your QA process to catch issues before merging.

Get a PR comment template

First, all contributors to the project should use a standardized PR comment template. A good template will cover all the things you expect each contributor to do when submitting a PR:

  1. The purpose of the PR: What they are doing and why.
  2. The type of change: New model, bug fix, refactor etc.
  3. Data checks and validations: The impact assessment to verify the new SQL has the intended impact.

The first two points set up the review for what they should expect to see in this PR. Then the data checks/validations are the proof that this was achieved — this is why performing a data review is essential. Without it, a proper review has not been done.

Review the data

When you come to review the data from a development branch, either as the author or the PR reviewer you must have some expectation of the results you would expect to see. Each PR has an intention and an expected impact.

  • Intention: What you want to achieve
  • Impact: The impact you expected to see

This pretty much maps exactly to the PR comment sections above, so you can see why it’s so important to follow a structured PR process.

Compare prod to dev

The best way to assess the impact of your data modeling updates is to compare how the data compiled from your branch compares to production data (your known-good baseline). There’s a feature in SQLMesh called table_diff that helps you to do this. It’s a great way to get an insight into the structural difference of two tables, or even --show-sample to see some of those differences.

Table diff in SQLMesh with samples
Table diff in SQLMesh with samples

Actionable insights

What you need to do is operationalize this kind of insight, so it can be actioned during review. The ideal solution would be to turn it into a data check that you can:

  • Save for future reference
  • Repeat to re-validate after data changes
  • Share with the team for review or discussion

This is exactly what Data Recce was designed for.

Operationalized Data Validations

Recce is a data modeling validation tool for enhanced PR review.

With Recce you can:

  • Save your data checks as you work in a checklist
  • Re-run checks
  • Annotate checks to add context to your findings
  • Export checks for use in your PR comment
  • Share checks as a reproducible environment

Lineage Diff

Lineage Diff in Recce is a way to visualize the parts of the DAG that have changed when compared to prod. When you’re checking data, Lineage Diff helps to focus your area of inspection to just the modified models or those downstream.

Recce Lineage DAG Diff for SQLMesh
Recce Lineage DAG Diff for SQLMesh

Lineage Diff provides a gateway to understanding data change and the jumping-off point to check models which you deem critical.

Query Diff

To compare the actual data, you can use Query Diff, which runs the same query on both dev and prod and diff the results.

Recce Query Diff for SQLMesh
Recce Query Diff for SQLMesh

You can run any query here that you like to help you investigate data change better, including joins, and exploring by dimension, and Recce will automatically diff the query results for you.

Checklist

Save all your checks to the checklist, where you can rerun or export them to share as part of you pull request review. Added annotations to give context to the data check results.

Recce Checklist for SQLMesh
Recce Checklist for SQLMesh

Create awesome pull request comments

The markdown generated in Recce is ready to be pasted into your PR comment as proof-of-correctness of your work to help the PR reviewer understand the impact of your changes and sign-off on your work:

PR comment with Recce checks
PR comment with Recce checks

How to try out Recce

To get started with Recce in SQLMesh head over to the documentation where’s the a quick start for running Recce in the SQLMesh Sushi example project.

It only takes a few minutes to try out, but if you’re short for time, check the online demo that allows you to take Recce for a test drive using a PR on dbt’s Jaffle Shop project.

--

--

Dave Flynn
In the Pipeline

Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects