Data Validation from the Ground Up

Published in

Treehouse Engineering

5 min readMar 25, 2016

Gathering insights from data can lead to unexpected questions. No data model can anticipate everything that will be asked of the data, but complex queries can coax the data into a shape better suited to answer the question at hand. Sometimes, you end up with several joins, a few subqueries and case statements, and maybe a window function or two for good measure. Everything looks great. Then you compare it to another report and see that the total is off by a few thousand and change. The clever twists and turns of your query become a labyrinth that you must navigate in order to find what went wrong. It can take hours to tease out the clause that caused the problem and rerun the report to fix the mistake. And it’s even worse if you sent out the analysis before noticing the difference.

The good news is that you never need to face that again. Instead of leaving data validation to the end of the reporting process, check in continually along the way. Integrating data validation into the report creation process is the key to avoiding backtracking and producing reports that you can send out with absolute confidence.

Start with the Data

The first step in any analysis is understanding the underlying data. This is where data validation begins, too. Do some checks to make sure that the data is what you expect. When evaluating an experiment, check that participants were placed into segments according to the proper criteria and that the proportions of the different segments are correct. Find out what values can be in fields that you plan on building conditional statements around. Most importantly, verify that you understand all of the fields and how they are generated.

Building the Query

Once you have a sense of the structure and origin of the data, it’s time to start writing your query. Instead of jumping in and writing complicated blocks of clauses, querying with data validation starts simple and builds up one layer at a time. For analysts new to building queries, this approach will cement your familiarity with the way that SQL works. Even if you are seasoned in SQL, this is a good practice to follow to keep mistakes from happening and to alert you to any quirks that you may not have encountered before.

Make an Outline

It is helpful to map out the query before you start. This will help you to identify the basic structure that you’ll work towards as you build up the clauses. I often write outlines directly in the SQL editor and comment them out so that I can keep the clauses organized in the outline. A welcome side effect of this approach is that you can leave in your outline and have a well-commented query at the end.

Find Your Anchor

The center of the query depends on the basis from which it’s built. It should be a very basic summary of the most important metric in the report. For example, if I am working on a report showing which sources Treehouse students come from before they sign up for a trial, my center is the total number of trial signups over a certain time period. If you have a complex report in mind, you may have multiple anchor points — a count of users, a sum of visits, and so on. If you expect the query to return one row per user, select both a count of the user_id and a distinct count of the user_id to verify that there is only one row for each user.

The query that you are building will be a subquery from which you pull the anchor metric.

Here’s a basic example of the structure:

select count(distinct user_id),count(visit_id) from(-— This is the main query that you are creatingselectuser_id,visit_id -— ,-— Plan the other fields that you need herefrom users ujoin visits v on u.user_id = v.user_id
-— Plan your joins here-— Plan your filtering clauses here)

Build the Layers

Next, you can add one clause at a time to fill out your outline. Check your anchor metric(s) after you add each clause to make sure nothing has changed unexpectedly. Sometimes, you want the total to change — for example, if you’ve added a where clause. In this case, you can check total counts to make sure the clause is filtering correctly, and use the adjusted anchor point from then on.

Finding a Difference

What happens if your anchor metric changes when you didn’t expect it to? The good news is that you have the list of suspects narrowed down to one clause. Look at the clause carefully and see if you can rework it to avoid the problem. If you’re not sure why the clause is causing trouble, go back to the basics. Select the actual rows that are missing from the version of the query that returns fewer rows by using MINUS in between the two versions or by using a subquery to filter out the rows in the smaller query. You can then identify a common thread between these rows that is causing them to be mistakenly filtered out.

Checking Back In

All your clauses are in place and, using your anchor metrics, you’ve verified that it’s correct. Your report is ready to go, but it’s not time to say goodbye. If it is a recurring report, check it periodically against its anchor in case changes in the underlying data require an update.

A Solid Foundation

Once you have created a report in this way, you can be confident of its accuracy. You’ve also made a new anchor that you can use in future reporting, either by adjusting the query that you have already validated or by using the report to double check the values that you are generating as you build up other reports. Your reports will give different insights, but because they are all built on the same fundamentals and because you have checked them against each other, you will be able to use them together to show a cohesive picture of your data.

Danielle is a Data Scientist on the Engineering team at Treehouse. We’re on a mission to design, build, and maintain the best online learning service in the world. Sound awesome? Join us!

If you enjoyed this post, we’d be thrilled if you clicked the ♡ below! And be sure to give the Treehouse Engineering publication a follow. Until next time 👋