Speed up pull request review for SQLMesh data projects
SQLMesh is a powerful ELT platform that brings a lot of features to help you build and iterate on data pipelines. One of the most interesting features is virtual data environments, which speeds up pushing changes to production by reducing unnecessary model rebuilds. Wouldn’t it be great if you could also speed up other areas of your project, like PR review?
Best practices for data projects
If you’re using SQLMesh, then you already know the benefits of adapting software engineering best practices to data projects — Your code is version controlled, developers have their own schemas, you have testing in place. Even so, without proper PR review, bad merges can still occur leading to silent data errors, and firefighting after merging.
Reviewing PRs for your SQLMesh project can be a challenge, so here are some things that you can do to bolster your QA process to catch issues before merging.
Get a PR comment template
First, all contributors to the project should use a standardized PR comment template. A good template will cover all the things you expect each contributor to do when submitting a PR:
- The purpose of the PR: What they are doing and why.
- The type of change: New model, bug fix, refactor etc.
- Data checks and validations: The impact assessment to verify the new SQL has the intended impact.
The first two points set up the review for what they should expect to see in this PR. Then the data checks/validations are the proof that this was achieved — this is why performing a data review is essential. Without it, a proper review has not been done.
Review the data
When you come to review the data from a development branch, either as the author or the PR reviewer you must have some expectation of the results you would expect to see. Each PR has an intention and an expected impact.
- Intention: What you want to achieve
- Impact: The impact you expected to see
This pretty much maps exactly to the PR comment sections above, so you can see why it’s so important to follow a structured PR process.
Compare prod to dev
The best way to assess the impact of your data modeling updates is to compare how the data compiled from your branch compares to production data (your known-good baseline). There’s a feature in SQLMesh called table_diff that helps you to do this. It’s a great way to get an insight into the structural difference of two tables, or even --show-sample
to see some of those differences.
Actionable insights
What you need to do is operationalize this kind of insight, so it can be actioned during review. The ideal solution would be to turn it into a data check that you can:
- Save for future reference
- Repeat to re-validate after data changes
- Share with the team for review or discussion
This is exactly what Data Recce was designed for.
Operationalized Data Validations
Recce is a data modeling validation tool for enhanced PR review.
With Recce you can:
- Save your data checks as you work in a checklist
- Re-run checks
- Annotate checks to add context to your findings
- Export checks for use in your PR comment
- Share checks as a reproducible environment
Lineage Diff
Lineage Diff in Recce is a way to visualize the parts of the DAG that have changed when compared to prod. When you’re checking data, Lineage Diff helps to focus your area of inspection to just the modified models or those downstream.
Lineage Diff provides a gateway to understanding data change and the jumping-off point to check models which you deem critical.
Query Diff
To compare the actual data, you can use Query Diff, which runs the same query on both dev and prod and diff the results.
You can run any query here that you like to help you investigate data change better, including joins, and exploring by dimension, and Recce will automatically diff the query results for you.
Checklist
Save all your checks to the checklist, where you can rerun or export them to share as part of you pull request review. Added annotations to give context to the data check results.
Create awesome pull request comments
The markdown generated in Recce is ready to be pasted into your PR comment as proof-of-correctness of your work to help the PR reviewer understand the impact of your changes and sign-off on your work:
How to try out Recce
To get started with Recce in SQLMesh head over to the documentation where’s the a quick start for running Recce in the SQLMesh Sushi example project.
It only takes a few minutes to try out, but if you’re short for time, check the online demo that allows you to take Recce for a test drive using a PR on dbt’s Jaffle Shop project.