How to Check Data Quality in PySpark

Sarah Floris
Published in CodeX
8 min read · May 8, 2022


Using deequ to calculate metrics and set constraints on your big datasets

Photo by Prateek Katyal on Unsplash

We have all heard it from our coworkers, our stakeholders, and sometimes even our customers — what is going on with the data?

What if, instead of hearing it from others, we could set up checks and constraints that identify problems before our data consumers see them? What if we could do that on even our largest datasets?

That’s what I have come to share with you today.

deequ runs checks and constraints to identify problems in a dataset. Today, I will showcase its power using Google Colab and an Integrated Public Use Microdata Series (IPUMS) dataset. I have provided a sample dataset and the code here.
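If you want to follow along in Colab, the environment setup is roughly the following (a sketch, assuming a recent PySpark; the exact `SPARK_VERSION` value should match the Spark release you install):

```shell
pip install pyspark pydeequ

# PyDeequ reads SPARK_VERSION to pick the matching Deequ jar when Spark starts
export SPARK_VERSION=3.3
```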

What is deequ?

Deequ is an open-source data quality library that originated at AWS and is still used there.¹ It lets us define data quality tests that flag unexpected values in our data; we can run these tests on a Spark DataFrame (or on a pandas data frame after converting it to Spark).¹

Deequ is composed of four main components:

  • Metrics computation
  • Constraint suggestion
  • Constraint verification
  • Metrics repository
