Data Quality In Batch Processing

Matt Weingarten
3 min read · Sep 13, 2022


Quality for days

Introduction

I came across this blog the other day on how Soda can be used within Airflow to implement data quality checks. While I’ve written about data quality plenty of times before, I never fully covered the implementation we had in place when I worked at Nielsen, which is quite similar to Airflow’s Soda integration.

Data Quality Implementation

There wasn’t an overarching framework or platform for data quality when I was at Nielsen, so what we “developed” was just an internal solution for our own usage. We decided to take an approach of having checks at each critical part of our process. We had a library of checks that we’d run, assigning each a level of severity based on the criticality of the issue it caught. These results would be stored in a table and then flagged as necessary.
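To make that concrete, here’s a rough sketch (not our actual code; all names are made up) of what a severity-tagged check result could look like:

```python
# A minimal sketch of severity-tagged check results. Names and fields are
# hypothetical, not the actual Nielsen implementation.
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFO = "info"          # log only
    WARNING = "warning"    # flag for review, keep processing
    CRITICAL = "critical"  # fail the step and halt the pipeline


@dataclass
class CheckResult:
    check_name: str
    passed: bool
    severity: Severity
    details: str
    run_id: str  # ties the result back to the pipeline run that produced it
```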

At the time, we didn’t use a library like Deequ or Great Expectations to run these checks. We just implemented them as Spark applications where we’d read in the produced data from the previous step and run the appropriate checks on the data. If any critical errors were found (duplicates when there shouldn’t be, for example), we’d fail the step and our pipeline would halt as a result.
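A stripped-down version of one of those check steps might look something like this (the path and key column are purely illustrative):

```python
# Hypothetical sketch of one check step: read the previous step's output,
# run a duplicate check, and fail hard if a critical issue is found.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

df = spark.read.parquet("s3://bucket/pipeline/step-output/")  # illustrative path

# Critical check: the natural key should be unique.
total = df.count()
distinct = df.dropDuplicates(["id"]).count()  # "id" is a placeholder key column

if total != distinct:
    # Raising here fails the Spark application, which in turn fails the
    # pipeline step and halts downstream processing (the circuit breaker).
    raise RuntimeError(f"Duplicate keys found: {total - distinct} extra rows")
```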

Implementation Pros

Being able to stop your application when a critical error is found is definitely an appropriate action to take. After all, why continue processing if there’s an issue with the data? It will only get compounded if the pipeline proceeds. Having this circuit breaker logic in place is the best approach in the long run, so long as critical failures aren’t common enough to consistently delay data delivery (although if they were, that would point to a more severe issue).

Storing results in a table for historical purposes is always good. This allows the team to build visualizations of the history of DQ failures, which can help point to bigger issues that need more investigation. Additionally, the design was rather flexible: adding more rules or changing a severity level was just a small code change.
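For illustration, persisting results could be as simple as appending a small DataFrame partitioned by run date (the layout and names here are hypothetical, not what we actually used):

```python
# Sketch of persisting check results for history; a BI tool can sit on top
# of this location to chart DQ failures over time.
from pyspark.sql import SparkSession, Row
import datetime

spark = SparkSession.builder.appName("dq-results").getOrCreate()

today = datetime.date.today().isoformat()
results = [
    Row(check_name="id_uniqueness", passed=True, severity="critical", run_date=today),
    Row(check_name="row_count_vs_prior_day", passed=False, severity="warning", run_date=today),
]

(spark.createDataFrame(results)
    .write
    .mode("append")           # keep the full history of runs
    .partitionBy("run_date")
    .parquet("s3://bucket/dq-results/"))  # illustrative results location
```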

Implementation Cons

The biggest drawback of this implementation was the checks themselves. They were all separate methods that would run the appropriate logic and then collect the results, storing them in a DataFrame for future reference. Using a library that essentially does this (Deequ would have made sense since we were Spark-centric) would definitely have simplified things, and we would likely have taken that approach with some more research. At the time, there wasn’t really any thought of using an existing framework, especially since these check methods already existed in the code and ran as unit tests during builds.
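For comparison, here’s roughly what the same kind of checks could look like with PyDeequ, Deequ’s Python wrapper. We never actually used it, so treat this as a hedged sketch rather than our implementation (the path, column, and version details are assumptions):

```python
import os
os.environ.setdefault("SPARK_VERSION", "3.3")  # newer pydeequ versions read this at import time

from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# The Deequ jar also needs to be on the Spark classpath (e.g. via spark.jars.packages).
spark = SparkSession.builder.appName("dq-deequ").getOrCreate()

df = spark.read.parquet("s3://bucket/pipeline/step-output/")  # illustrative path

check = (Check(spark, CheckLevel.Error, "critical checks")
         .isComplete("id")   # key column should have no nulls
         .isUnique("id"))    # key column should have no duplicates

result = VerificationSuite(spark).onData(df).addCheck(check).run()
result_df = VerificationResult.checkResultsAsDataFrame(spark, result)

# Fail the step if any constraint did not pass (same circuit-breaker idea as before).
if result_df.filter(result_df.constraint_status != "Success").count() > 0:
    raise RuntimeError("Critical data quality checks failed")
```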

Conclusion

Given that we developed an internal data quality solution at the time, I feel our approach was a solid one, and that’s reflected in what Airflow can do with Soda (run checks as their own step and fail them if necessary for circuit breaking). The bigger question becomes what to do when your application isn’t a straightforward pipeline.
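For reference, the Soda-in-Airflow pattern boils down to something like the sketch below. The data source name, YAML files, and DAG details are all assumptions on my part, based on my understanding of soda-core’s Python API:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator


def run_soda_scan():
    from soda.scan import Scan  # soda-core

    scan = Scan()
    scan.set_data_source_name("warehouse")                  # assumed data source name
    scan.add_configuration_yaml_file("configuration.yml")   # connection details
    scan.add_sodacl_yaml_file("checks.yml")                 # SodaCL check definitions
    scan.execute()
    scan.assert_no_checks_fail()  # raising here fails the task and halts the DAG


with DAG(
    dag_id="pipeline_with_dq",
    start_date=datetime(2022, 9, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    dq_checks = PythonOperator(task_id="dq_checks", python_callable=run_soda_scan)
```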

When it comes to a microservice design, there’s no clear-cut distinction of when one step ends and the next one begins. That’s where approaches like data reconciliation make more sense: data can be compared between layers and flagged as necessary, running on a cadence we determine since there’s no pipeline tying everything together.
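A reconciliation job could be as simple as comparing counts between layers on a schedule. The table names and tolerance below are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-reconciliation").getOrCreate()

raw = spark.table("raw_layer.events")          # hypothetical upstream layer
curated = spark.table("curated_layer.events")  # hypothetical downstream layer

raw_rows = raw.agg(F.count("*").alias("rows")).first()["rows"]
cur_rows = curated.agg(F.count("*").alias("rows")).first()["rows"]

# With no single pipeline to halt, a mismatch gets flagged for investigation
# instead of failing a step. The 1% tolerance is purely illustrative.
if abs(raw_rows - cur_rows) > 0.01 * raw_rows:
    print(f"Row count mismatch between layers: raw={raw_rows}, curated={cur_rows}")
```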

Of course, a properly implemented data quality platform can potentially take care of all of this. After all, being able to connect to data beyond the warehouse layer will certainly simplify things.
