Histogram overlay charts for data impact assessment in dbt just got a whole lot easier

Dave Flynn
In the Pipeline
Published in
4 min readMar 20, 2024

Data profiling stats are a really efficient way to get an understanding of the distribution of data in a dbt model. You can immediately see skewed data and spot data outliers, something which is difficult to do when checking data at the row level.

Using histogram and top-k charts to understand data change

Profiling stats become even more useful when applied to data change validation. Let’s say you’ve updated a data model in dbt and changed the calculation logic for a column — how can you get an overview of how the data was changed or impacted? This is where checking the top-k values, or the histogram, of before-and-after you made the changes, comes in handy — But there’s one major issue…

Featured image showing side by side histogram charts compared to a histogram overlay chart
The best way to visualize data change in a histogram chart

Something’s not right

If you generate a histogram graph from prod data, then do the same for your dev branch, you’ve got two distinct graphs. The axes don’t match, and it’s difficult to compare:

Side by side histogram charts as an example of why they are difficult to compare
Side-by-side histograms are difficult to compare

You might be able to spot some differences, such as at the top end of the graph, but the overall impact is mostly hidden.

The same is true for top-k. Cross-referencing categories might be doable when there are only a handful, but it’s still not a meaningful way to visualize the differences.

Top-k is also difficult to compare visually
Cross referencing top-k categories can also be difficult

There’s a real possibility that you’ll miss some edge cases when you can’t compare precisely and accurately. That means silent errors, or even pipeline-breaking errors, will make it into prod. Errors that you won’t find out about until the client or stakeholder calls to asks what’s up with the data.

How to meaningfully compare histograms and top-k

The best way to diff profile stats like histogram and top-k stats, is to plot them on a single chart, overlaid on top of each other using shared axes.

Here’s the same histogram charts as the image above, but with the histograms plotted on a single chart:

A histogram overlay chart created in Recce (datarecce.io)
An overlaid histogram chart of columns from dev and prod models in dbt

In this overlay histogram chart, you can quickly see the distribution change in a meaningful way. At each value range the scale of the data change is clear. The same is true for top-k. When the values are compared directly next to each other the impact is immediately understandable.

A top-k diff created in Recce (datarecce.io)
Top-k categories compared side-by-side are more meaningful

Doing this manually isn’t straight forward. You’d need to code it in a Jupyter notebook with python and would take a lot of configuration, especially given that for dbt data projects you’ll be pulling data from two different schemas if you’re diffing dev and prod. Checking this kind of data profile diff for each PR in your project would require an unreasonable amount of work.

Easily diff histogram, top-k and other profiling stats in Recce

You can get histogram and top-k diffs, especially designed for dbt data projects, as part of the suite of data modeling validation tools in Recce.

Recce compares your development and production datasets (or any two dbt envs) and enables a visual representation of data change through multiple diffing tools and query comparisons.

A gif animation showing how to create a histogram overlay from dev and prod data in data using Recce (datarecce.io)
Generate a histogram overlay comparing two models in dbt dev and prod schemas

Improved visibility of data change for dbt

As a data or analytics engineer, the improved visibility of data change makes validating your work quicker, easier, and more accurate.

As a PR reviewer you can do your job more efficiently and confidently sign off on data changes knowing that an edge case won’t come back to bite you.

Recce is open-source and available now, so you can start properly validating your dev branch right away.

Get Recce

--

--

Dave Flynn
In the Pipeline

Dave is a developer advocate for DataRecce.io — the data modeling validation and PR review toolkit for dbt data projects