Check Yo’ Data Before You Wreck Yo’ Results

Jung-Hee Oh
Jan 24, 2020 · 5 min read

The first few assertions and gut-checks we always perform, lest we undermine all of our credibility

As a team that has existed in its current state for less than 2% of the ACLU’s 100-year history, ACLU Analytics continues to spend a lot of time building data pipelines and tables for analysis- and report-ready data. The engagement analytics group helps the rest of the organization evaluate whether our fundraising and outreach efforts are working as we had hoped, and that requires a lot of transformation of the raw data that come in from our internal systems. A 100-year history undoubtedly means that our data systems have undergone a few migrations and currently operate with several source systems to unify in reporting tables. It therefore becomes imperative to do sanity checks before we start working with the data and, more importantly, before we trust the results.

In a prior post, Brooke walked through how the legal analytics group works with manually-entered data sources that come from publicly available or FOIA’ed data. Working with machine-collected data (e.g. payment systems with our fundraising team) on engagement analytics presents a slightly different set of challenges, and sometimes follows a different workflow. How do we approach our internal data in a way that ensures we can trust our results?

1) Build assertions to check assumptions on data quality

Brooke talked briefly about using assertthat and assertr; assertions should apply to all data workflows, everywhere, all the time. For example, when looking at donations, we expect to have information about how each donation came in. Did the donor encounter a canvasser on the street? Were they moved to donate after visiting our website? Because this information is collected through several different source systems, it may get lost somewhere in the integration process.

With the sample data frame defined below, a single line of code like assertthat::noNA(df$source) will return a helpful FALSE. To take it one step further, the verify function in the assertr package lets you see the individual observations that fail your condition, as demonstrated in the code below. Building small assertions like this into your code can help quickly check whether your assumptions about the data hold, and helpfully break your code if they don’t.
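The original gist isn’t reproduced here, but a minimal sketch of the idea might look like this (the data frame, its column names, and its values are all invented for illustration):

```r
library(assertthat)
library(assertr)

# Hypothetical sample data frame: one donation arrived with no source recorded
df <- data.frame(
  donation_id = 1:4,
  amount      = c(25, 100, 10, 50),
  source      = c("canvass", "web", NA, "mail")
)

# One-line check: returns FALSE because source contains a missing value
assertthat::noNA(df$source)

# verify() goes further: it raises an error and reports the failing rows.
# Wrapped in tryCatch here only so the demo script keeps running.
tryCatch(
  verify(df, !is.na(source)),
  error = function(e) message("verify failed: ", conditionMessage(e))
)
```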

2) Check your join results

Working with our internal data may also mean that we want to look at several different dimensions in conjunction with each other, requiring complicated table joins and window functions. Like any real-world data source, a new (or new-to-me) table may have assumptions built into the data model that aren’t immediately obvious in initial single-table explorations, despite our best efforts at building in assertions like the one above. As tedious as it can be, we check that a join behaves as expected whenever we perform one. This is a first defense against wacky or overblown results.

Let’s consider a trivial example with the tables donation_history and donation_source, defined in the gist below, and the joined table as joined_df.

As we work with our trusty left join, the first few checks we perform include:

a) Does the join have the same number of rows as the original left-hand table or did the data structure of the right-hand table create new rows? Depending on how you use your data, having multiple values in your right-hand table for a unique key in your left-hand table will result in over-counting, and possibly significant inflation in your results.

b) How much of the right-hand table of the join falls away in the left join? You may be losing more information than you think from the right-hand side, and this may signal false data assumptions that need to be addressed before you proceed.

c) How many observations from the left-hand table have a corresponding value on the right-hand side? The right-hand table may not be the full source of information you thought it was (too little history in the right-hand table, for example). It may also suggest that you should take another approach if the information from your right-hand table isn’t complete enough.

d) Do your assumptions about a well-known categorical variable hold? Are there new values you were not aware of, or is something being coded differently in the back end?
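As a sketch of what those four checks can look like in code, here are invented stand-ins for the gist’s donation_history, donation_source, and joined_df tables (every name and value below is made up, with a duplicate key and an orphan row planted to trip the checks):

```r
library(dplyr)

donation_history <- data.frame(
  donation_id = c(1, 2, 3, 4),
  amount      = c(25, 100, 10, 50)
)
donation_source <- data.frame(
  donation_id = c(1, 2, 2, 99),   # donation 2 appears twice; 99 has no history row
  source      = c("canvass", "web", "sms", "mail")
)

joined_df <- left_join(donation_history, donation_source, by = "donation_id")

# a) Did duplicate keys on the right inflate the row count?
nrow(joined_df) == nrow(donation_history)   # FALSE: 5 rows vs. 4

# b) How much of the right-hand table fell away in the left join?
nrow(anti_join(donation_source, donation_history, by = "donation_id"))   # 1 row

# c) How many left-hand observations found a match?
nrow(semi_join(donation_history, donation_source, by = "donation_id"))   # 2 of 4

# d) Any category values we didn't expect?
setdiff(unique(joined_df$source), c("canvass", "web", "mail", NA))   # "sms"
```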

There are perhaps less laborious ways to do the above checks, but we’re still building our data pipelines and definitions at the ACLU. Accordingly, our reporting tables get updated from time to time, and the expected data values or structure might change with these updates. Writing assertions into our production code ensures that we’re not baking assumptions into our data processing when change is a given. Building such checks into an assertr pipe will break your process and give a full report of wherever the assumptions no longer hold.
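A sketch of what such a pipe might look like, using assertr’s chain_start()/chain_end() to run every assertion and report all failures at once rather than stopping at the first one (report_df and its planted errors are invented for illustration):

```r
library(dplyr)
library(assertr)

# Hypothetical reporting table; the negative amount and missing source
# are planted so the assertions fail
report_df <- data.frame(
  donation_id = 1:4,
  amount      = c(25, 100, -5, 50),
  source      = c("canvass", "web", "mail", NA)
)

# chain_end() raises one error summarizing every violated assumption.
# tryCatch is here only so the demo script keeps running.
tryCatch(
  report_df %>%
    chain_start() %>%
    verify(amount > 0) %>%
    assert(not_na, source) %>%
    chain_end(),
  error = function(e) message("data assumptions violated:\n", conditionMessage(e))
)
```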

3) Gut-Checks in Context

In addition to the above engineering tasks, it’s often necessary to validate against historical knowledge in order to gain credibility in building your source of truth. The most important first data validations we perform ask whether the data in our final table match aggregations of well-known metrics reported in a commonly referenced source.

For example, after joins and other transformation logic are applied to our table, does a monthly revenue aggregation return the same results as a productionized monthly KPI dashboard? If there’s a difference, how bad is the problem? You should always investigate deltas, but whether a percentage difference of 0.1% or 10% forces you to reconsider your approach depends on your use case.
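A monthly revenue aggregation of that sort might look something like this sketch (final_df, its dates, and its amounts are invented; the real comparison would be against the dashboard’s numbers):

```r
library(dplyr)

# Hypothetical slice of the final table after joins and transformations
final_df <- data.frame(
  donation_date = as.Date(c("2019-11-03", "2019-11-21", "2019-12-02", "2019-12-15")),
  amount        = c(25, 100, 10, 50)
)

# Aggregate to the same grain as the KPI dashboard, then compare
monthly_revenue <- final_df %>%
  mutate(month = format(donation_date, "%Y-%m")) %>%
  group_by(month) %>%
  summarise(revenue = sum(amount))

monthly_revenue
# Any difference from the dashboard's figures is a delta to investigate
```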

To return to our donation source example, another natural gut-check is to look at the distributions of well-known categories. Our development department is well-versed in how much of our revenue comes in through which channels, so a quick check of the channel breakdown will help double-check whether something strange is happening in your data transformations.
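A sketch of such a channel-distribution check (the donations table and its values are invented for illustration):

```r
library(dplyr)

# Hypothetical donations with a channel attached
donations <- data.frame(
  source = c("canvass", "web", "web", "mail", "web"),
  amount = c(25, 100, 10, 50, 15)
)

# Revenue share by channel, to compare against what development expects
channel_mix <- donations %>%
  group_by(source) %>%
  summarise(revenue = sum(amount)) %>%
  mutate(share = revenue / sum(revenue))

channel_mix
```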

And finally, to return to our missingness example, missing entries in a field are an inevitability that we live with in real-world data. A helpful way to assess whether missingness grossly misrepresents our results is to quantify the severity of the problem. What level of missingness are we willing to live with? That question can only be resolved in context.
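Quantifying the severity can be as simple as this sketch (the data and the 5% threshold are invented; the right threshold is a judgment call for each analysis):

```r
# Hypothetical field with some missing sources
donations <- data.frame(
  source = c("canvass", NA, "web", NA, "mail")
)

# Severity of the problem: here 2 of 5 records (40%) are missing a source
missing_rate <- mean(is.na(donations$source))
missing_rate

# Whether that rate is tolerable depends entirely on context
if (missing_rate > 0.05) {
  warning("source is missing for ", 100 * missing_rate, "% of records")
}
```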

Our data engineering team has worked tirelessly to build sanity checks like missingness and uniqueness tests into our ETL processes using DBT in production. We’ve seen firsthand that folding a QC process informed by organizational knowledge into the analysis itself enhances our data integrity as we strive to build a source of truth for internal reporting, analysis, and modeling.

ACLU Tech & Analytics