Published in CodeX
How to Check Data Quality in PySpark

Using deequ to calculate metrics and set constraints on your big datasets

Photo by Prateek Katyal on Unsplash

We have all heard it from our coworkers, our stakeholders, and sometimes even our customers: "What is going on with the data?"

What if, instead of hearing it from others, we could set up some checks and constraints and identify the problems before our data consumers see them? What if we could do that on…
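As a rough illustration of the idea, here is a minimal PyDeequ sketch: it computes a couple of metrics on a DataFrame and then verifies a set of constraints, so failures surface before consumers notice them. The input path and the column names (customer_id, order_total, order_id) are hypothetical placeholders; swap in your own dataset and rules.

```python
from pyspark.sql import SparkSession
import pydeequ
from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Size, Completeness
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Spark session with the deequ jar on the classpath
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

# Hypothetical dataset; replace with your own source
df = spark.read.parquet("s3://my-bucket/orders/")

# Metrics: row count and completeness of a key column
analysis = (AnalysisRunner(spark)
            .onData(df)
            .addAnalyzer(Size())
            .addAnalyzer(Completeness("customer_id"))
            .run())
AnalyzerContext.successMetricsAsDataFrame(spark, analysis).show()

# Constraints: fail loudly when expectations are not met
check = (Check(spark, CheckLevel.Error, "order quality checks")
         .isComplete("customer_id")      # no nulls allowed
         .isUnique("order_id")           # primary-key style uniqueness
         .isNonNegative("order_total"))  # no negative amounts

result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check)
          .run())
VerificationResult.checkResultsAsDataFrame(spark, result).show()
```

Running the verification returns one row per constraint with a pass/fail status, which you can log, alert on, or use to gate a pipeline step.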



Sarah Floris

A little bit of everything, focusing on data science and engineering.
