In the Pipeline

Articles for data and analytics engineers

It’s not a data issue, it’s you(r SQL)


As a Product Manager at Recce, a tool for validating data, I’ve conducted hundreds of interviews with data practitioners to understand their daily workflows.

When discussing data quality, I specifically ask: “How about data? How can you be sure the data will be correct?”

The most common answer is: “We check the logic, it should be fine. If anything has an error, someone will tell us.”

But are you sure someone will actually report the issue?

  • A data team lead at a social media platform recently discovered that the engagement metric had errors from the beginning, yet no one had reported it.
  • Remember, 60% of dbt projects have no tests at all.

The second most common answer is: “We do spot checks. I write queries to validate.”

My follow-up question is always: “How do you decide what query you should write?”

The answer: “It depends. It’s case by case.”

In other words, validation is entirely manual and relies on a teammate’s institutional knowledge. There is no systematic approach.

At Recce, we care about data development

Recce is deeply committed to preventing bad merges into production by bridging the gap between technical changes and business impact. We recognize that data development differs from traditional software development, and that code logic testing alone is insufficient. Often, correctness isn’t well-defined until we fully understand the context and the data itself.

As a best practice, comparing your data changes against a known-good baseline is the most reliable way to gain visibility into the data impact of logic changes. Once you’re able to see that impact, you’re equipped to verify whether it’s desirable. It’s this process of verifying and recording impact that gives reviewers the proof they need to integrate new code into production.

Ad-hoc data validation takes how many steps?

Imagine this: you get a request to find out whether a customer has ever placed a promotional order. To answer it, you roll out a change: add a new column that identifies whether a promotional code was used when placing an order, and use it to identify customers who have used promotional codes in their orders (you can view the example in PR#44).

You find that the two models you need to change are stg_payments (stg_payments.sql) and customers (customers.sql).
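
The full SQL for both changes lives in the PR; the snippet below is only a minimal sketch of the kind of change described, assuming a jaffle-shop-style schema. The is_promotional flag and the coupon payment method are my assumptions; only the has_promoted_orders column actually appears later in this article.

-- stg_payments.sql (sketch): flag payments made with a promotional code
select
    id as payment_id,
    order_id,
    payment_method,
    payment_method = 'coupon' as is_promotional,  -- hypothetical new column
    amount / 100 as amount
from {{ ref('raw_payments') }}

-- customers.sql (sketch): roll the flag up to one row per customer
select
    orders.customer_id,
    max(case when payments.is_promotional then 1 else 0 end) = 1 as has_promoted_orders
from {{ ref('orders') }} as orders
left join {{ ref('stg_payments') }} as payments
    on orders.order_id = payments.order_id
group by orders.customer_id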

By the nature of the change, you know that the count and amount of orders, and the count of customers, should not be impacted. And, following the change, you’ll be able to segment the customers model into two groups: those who have used a promotional code, and those who have not.

To validate the data following the above SQL changes, you would have to run several ad-hoc queries to compare models in your data warehouse. You know that in some tables the numbers should remain the same, and that the answer to the original request should come from the customers table.

Here’s what actions you might take to validate your work:

  1. Review the code changes in two files
  2. Run row count diffs for the two changed models (a sketch of one such query follows this list)
  3. View the project lineage in dbt Cloud Explorer to find out which downstream models you should check: the downstream models are orders, customer_segments and customer_order_pattern.
  4. Run row count diffs on the three identified downstream models
  5. Run a custom query to check that customers who have used a promotional code can be identified:
select has_promoted_orders, count(customer_id) from {{ ref("customers") }} group by 1
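
Each of those row count diffs (steps 2 and 4 above) is itself a pair of hand-written queries against both environments. A minimal sketch, assuming prod builds into an analytics schema and dev into dbt_dev (both schema names are hypothetical):

-- compare row counts for one model across environments
select 'prod' as env, count(*) as row_count from analytics.customers
union all
select 'dev' as env, count(*) as row_count from dbt_dev.customers

-- ...then repeat for stg_payments, orders, customer_segments,
-- and customer_order_pattern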

The above validation work constitutes nine total actions, and that’s before you factor in the following work:

  • Running queries on all downstream tables, which takes significant time.
  • Paying for every query: are all downstream tables really necessary?
  • Manually comparing dev and prod query results, relying on eagle-eyed precision. 🦅 👁️ (See the sketch after this list.)
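
That manual comparison usually means even more hand-written SQL. One common pattern is a two-way EXCEPT over the columns both versions share (again a sketch with the same hypothetical schema names; customer_id and number_of_orders are assumed columns):

-- rows in prod that are missing or changed in dev
select customer_id, number_of_orders from analytics.customers
except
select customer_id, number_of_orders from dbt_dev.customers;

-- and the reverse: rows in dev that are missing or changed in prod
select customer_id, number_of_orders from dbt_dev.customers
except
select customer_id, number_of_orders from analytics.customers;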

This process wastes time, money and mental energy.

Validate with Recce in just three actions

Let’s consider the same code changes, but this time use Recce to validate them. The first view in Recce is the lineage diff, which provides an at-a-glance view of all modified and downstream models.

The first view when you launch Recce.

With just one click, you’re able to perform a row count diff of the modified models. The results already reveal a potential issue, since the row counts should have remained the same.

See the row counts diff results directly on the lineage graph.

Toggling Breaking Change Analysis enables you to rule out two models: stg_payments, which is a non-breaking change, and its downstream model, orders.

Toggling Breaking Change Analysis to see the analysis result directly on the lineage graph.

In just three actions, you’ve discovered a problem: you’ve lost 3.7% of the rows in the customers model and its downstream models. Now that you've identified the problem, let’s dive deeper.

Debug data faster, smarter, and without the pain

The issue starts with the customers model.

Since the row count has changed, you may suspect a filtering issue. To investigate, let’s run a profile diff of the customers model.

The results show a distinct count of ‘1’ for the has_promoted_orders column. However, you know that there should be two distinct values here, true and false, to represent the customers who have, and have not, used promotional codes.

The profile diff result of the customers model.
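
The profile diff automates the per-column summary you would otherwise write by hand. For the new column alone, a hand-rolled equivalent might look like this (dbt_dev is the same hypothetical dev schema as above):

select
    count(*) as row_count,
    count(distinct has_promoted_orders) as distinct_has_promoted_orders
from dbt_dev.customers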

To find out what that single value in has_promoted_orders is, you can use the query feature:

select distinct has_promoted_orders from {{ ref("customers") }}

The results show that there is only one value, “true”, which means you’re missing all the customers who have not used a promotional code and would have “false” in this column. There must be an error in the changed model code.

The query results.

After checking the code diff in Recce, you can see that you’ve accidentally filtered customers with a where clause added at line 68.

The code diff in Recce.
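
The real diff is in the screenshot above; the kind of mistake it describes typically looks like the sketch below, where a filter meant for exploration quietly turns the model into promo-only customers (customer_promotions is a hypothetical CTE name):

-- customers.sql (sketch of the accidental filter)
select
    customer_id,
    has_promoted_orders
from customer_promotions
where has_promoted_orders  -- silently drops every customer without a promo order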

It only took three actions to find the root cause. 🎉

Start Recce Today

I hope this example demonstrates how you can use Recce to validate your changes efficiently. Instead of relying on spot checks and manually scanning results, use Recce to optimize your time and work on what matters the most: delivering accurate answers.


Written by Karen Hsieh

Data📊 Empower 🙌 Product 💜
