Going All-In On Data Quality

Matt Weingarten
5 min read · Oct 26, 2023


Make sure your data is worth five stars

Introduction

It’s been way too long since I’ve written about data quality (I found a break in between composing terrible takeaway posts on major data conferences). Part of the reason it’s been so long since my last post on the subject is that, frustratingly, there wasn’t a whole lot of progress on that front.

We’ve recently been working a lot more closely with a team that’s focused on putting together a centralized data quality platform within our organization. With this in place, we can really kickstart our efforts on this topic. In this post, I’ll give an overview of the major features of this platform and how we plan to take advantage of it.

Data Quality Platform: The Fundamentals

For starters, I am a big proponent of centralized platform tools when it comes to areas like data quality. Having a plethora of different implementations of the same concept leads to a lack of best practices and hacky solutions that are specific to a certain area. Instead, centralized tools should be built with the considerations of these different data teams in mind, making sure they’re always involved in the discussions of what this platform should become.

The data quality platform that we are starting to work with offers two key features:

  • A data profiler, which looks at key stats (count, number of nulls, etc.) for critical fields on a given table and alerts if a value deviates significantly from the trend (based on statistical Z-scores)
  • An Airflow operator that can run any SQL query and alert, at a configured criticality, when thresholds are exceeded
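To make the profiler's Z-score approach concrete, here's a minimal sketch of how such an anomaly check might work, assuming daily stats (e.g. row counts) are already collected as a list of numbers. The function name and threshold are illustrative, not the platform's actual API.

```python
import statistics

def z_score_alert(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """Return True if today's value deviates significantly from the historical trend."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # No variation in history: any change at all is anomalous.
        return today != mean
    z = (today - mean) / stdev
    return abs(z) > threshold

# Example: steady row counts over the past week, then a sudden drop.
history = [1000, 1020, 980, 1010, 990, 1005, 1015]
print(z_score_alert(history, 400))   # large drop relative to the trend: alerts
print(z_score_alert(history, 1000))  # in line with the trend: no alert
```

The threshold of 3 standard deviations is a common default, but a real platform would likely make it configurable per field.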

The data profiler is an often-overlooked feature when it comes to data quality, but I think it’s quite useful. Being able to see over time how your core fields are changing is a good basis on which to evaluate the health of your data. You can see early on if there are issues that should be discussed with upstream teams, before their effects are more impactful.

How We’ll Integrate Into The Data Quality Platform

As someone who is a big proponent of data quality, my eyes lit up when I saw this tool was coming to fruition, and I had to start brainstorming accordingly. Here’s how I see our team integrating with this platform:

  • Track key fields on our tables via the data profiler. We’ll create a dashboard with all this information as it’s tracked each day and set up low-severity alerts for those that we think are worth following up on.
  • Have threshold DQ checks in place for fields that we think are worth higher-severity alerts. What I mean by that is making sure primary keys have no duplicates, or that fields with a limited set of valid values are confined to just those values. If these checks break, we definitely need to look more closely and perform some RCA.
  • Have threshold DQ checks in place for trend checks on KPI metrics of our tables. For example, track total visits and alert if there’s a significant drop (or rise if that’s worth alerting on) in the overall count.
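The higher-severity checks above could be expressed as SQL queries whose result should always be zero. Here's a hedged sketch using a plain DB-API connection (sqlite3 keeps the demo self-contained); the table name, column names, and allowed values are all illustrative:

```python
import sqlite3

CHECKS = {
    # A primary key should never repeat: this count must be 0.
    "duplicate_primary_keys": """
        SELECT COUNT(*) FROM (
            SELECT visit_id FROM visits GROUP BY visit_id HAVING COUNT(*) > 1
        ) AS dupes
    """,
    # A constrained field should only hold known values: this count must be 0.
    "unexpected_device_types": """
        SELECT COUNT(*) FROM visits
        WHERE device_type NOT IN ('web', 'ios', 'android')
    """,
}

def run_checks(conn: sqlite3.Connection) -> dict[str, bool]:
    """Return check name -> passed, where passed means the offending count is 0."""
    return {name: conn.execute(sql).fetchone()[0] == 0 for name, sql in CHECKS.items()}

# Demo data: one duplicated visit_id and one unknown device type, so both checks fail.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visit_id INT, device_type TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?)",
                 [(1, "web"), (1, "ios"), (2, "toaster")])
print(run_checks(conn))
```

In practice, each query would be handed to the platform's Airflow operator along with a severity, rather than run by a hand-rolled loop like this.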

As the last part requires a more complex SQL query, it’s a better fit for the Airflow operator than the profiler. One thing worth noting with those KPI checks is that you really need to have a solid understanding of your data and the overall business context before diving in head-first (which is exactly what we need to be doing more of as data engineers).

As an example, we work with a lot of sports data, and it should be no surprise that compared to other days, Sundays (and specifically NFL Sundays) are when we see the most traffic. So, does a day-over-day (DoD) check really make sense when Sunday is going to have much more data than other days? Probably not, and something like a week-over-week (WoW) or even a month-over-month (MoM) check would give us a more stable signal. Figure out what makes sense and then go about making it a reality.
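To illustrate why WoW handles this seasonality better, here's a small sketch that compares each day to the same weekday last week, so a naturally huge NFL Sunday doesn't trip the alert. The drop threshold and the traffic numbers are assumptions for the example:

```python
def wow_alert(this_week: list[int], last_week: list[int], max_drop: float = 0.3) -> list[int]:
    """Return indices of days whose count dropped more than max_drop
    versus the same weekday last week."""
    alerts = []
    for day, (now, then) in enumerate(zip(this_week, last_week)):
        if then > 0 and (then - now) / then > max_drop:
            alerts.append(day)
    return alerts

# Sunday (index 0) is naturally huge both weeks, so it doesn't alert;
# Wednesday (index 3) dropped ~60% versus last Wednesday and does.
last_week = [50000, 8000, 7500, 8200, 7900, 8100, 9000]
this_week = [52000, 8100, 7400, 3300, 8000, 8200, 8800]
print(wow_alert(this_week, last_week))  # [3]
```

A naive DoD version of the same check would flag every Monday as a massive drop from Sunday, which is exactly the noise we want to avoid.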

Data Quality: Next Steps

Now that we have an idea of what exactly we want to achieve with data quality, it’s time to go about doing just that. Some features we want to consider include:

  • What’s worth flagging as a low-level alert vs. what’s worth stopping our process, forcing us to reconcile before proceeding further?
  • How can we integrate this with our data reliability engineering (DRE) team so stakeholders are automatically a part of the conversation when something significant fails? (For more information on what exactly DRE is, I wrote about Bigeye’s DRE Con a few months ago)

A principle that I think is useful to follow when it comes to data quality is the idea of staging tables (or signal tables as I called them in a previous post). Essentially, there’s no value in writing data to its final location and then running DQ checks, only to find that there’s an issue with the data. The damage is already done in that case.

Instead, it’s better to write the data to a staging table and then move that data to the final table after all critical DQ checks (since checks can have different severity) have passed. Yes, it’s more setup to create additional tables and have a step that actually moves data from one table to another, but the benefits are worth it. By following this approach, you make sure your final data, the data that’s used by stakeholders, is free of DQ issues.
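The staging-table pattern above can be sketched as a simple promote-if-clean step. sqlite3 keeps the demo self-contained; the table names and the single critical check are illustrative stand-ins for a real pipeline's warehouse tables and check suite:

```python
import sqlite3

def promote_if_clean(conn: sqlite3.Connection) -> bool:
    """Move staged rows to the final table only if critical DQ checks pass."""
    # Critical check: no NULL primary keys in staging.
    nulls = conn.execute(
        "SELECT COUNT(*) FROM visits_staging WHERE visit_id IS NULL"
    ).fetchone()[0]
    if nulls > 0:
        return False  # leave the final table untouched; stakeholders only ever see clean data
    with conn:  # atomic promote: insert + clear staging in one transaction
        conn.execute("INSERT INTO visits SELECT * FROM visits_staging")
        conn.execute("DELETE FROM visits_staging")
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visit_id INT, total INT)")
conn.execute("CREATE TABLE visits_staging (visit_id INT, total INT)")
conn.executemany("INSERT INTO visits_staging VALUES (?, ?)", [(1, 100), (2, 250)])
print(promote_if_clean(conn))  # True: data lands in visits
```

The key property is that a failed check leaves the final table exactly as it was, so downstream consumers never see the bad batch.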

Conclusion

On top of everything I’ve written to this point, I also recommend involving everyone in data quality conversations. This isn’t just something that engineers should tackle. Make sure product managers and stakeholders are in the loop as well, as they’ll often have their take on what should be tracked. Take advantage of their business knowledge and shape the future of your data quality accordingly.

Back when this page wasn’t a cesspool of overrated posts, I wrote A LOT about data quality. I’d recommend checking out my previous posts on DQ platforms (a series I did with Validio) as a starting point. Always happy to discuss a topic like data quality with anyone who wants to investigate this area further!

Matt Weingarten

Currently a Data Engineer at Samsara. Previously at Disney, Meta and Nielsen. Bridge player and sports fan. Thoughts are my own.