Getting Started with Great Expectations: A Guide to Data Validation in Python
Learn how to prevent data quality issues with a few lines of code in Python
When you work on a data science project for a company, you usually don’t have a single fixed test set, as you would at university or in research. Instead, you keep receiving new, updated samples from the client.
Before applying the machine learning model to a new sample, you need to verify its data quality: the column names, the column types, and the distributions of the fields should all match those of the training set and the old test set.
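To make these checks concrete, here is a minimal sketch of what verifying them by hand might look like with plain pandas. It is not Great Expectations code: the function name `check_new_sample`, the `tol` threshold, and the toy DataFrames are all assumptions made for illustration, and the mean-shift comparison is only a crude stand-in for a proper distribution test.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def check_new_sample(train: pd.DataFrame, new: pd.DataFrame, tol: float = 0.5) -> list[str]:
    """Return human-readable problems found in `new` relative to `train`.

    Hypothetical helper: `tol` is the allowed shift of a column mean,
    expressed in training standard deviations (an arbitrary choice).
    """
    problems = []

    # 1. Column names: anything missing or unexpected?
    missing = set(train.columns) - set(new.columns)
    extra = set(new.columns) - set(train.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")

    shared = [col for col in train.columns if col in new.columns]

    # 2. Column types: compare dtypes of the shared columns.
    for col in shared:
        if train[col].dtype != new[col].dtype:
            problems.append(
                f"{col}: dtype is {new[col].dtype}, expected {train[col].dtype}"
            )

    # 3. Distribution (crude): flag numeric columns whose mean moved too far.
    for col in shared:
        numeric = is_numeric_dtype(train[col]) and is_numeric_dtype(new[col])
        if not numeric or is_bool_dtype(train[col]) or is_bool_dtype(new[col]):
            continue
        std = train[col].std()
        shift = abs(new[col].mean() - train[col].mean())
        if std > 0 and shift > tol * std:
            problems.append(f"{col}: mean shifted by {shift:.2f}")

    return problems


# Toy data, invented for this example.
train = pd.DataFrame({"age": [20, 30, 40, 50], "city": ["a", "b", "c", "d"]})
new = pd.DataFrame({"age": ["20", "30"], "country": ["x", "y"]})

problems = check_new_sample(train, new)
for problem in problems:
    print(problem)
```

Even this small helper covers only three kinds of checks, and every new rule means more hand-written code to maintain.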
Manually analyzing the data can be time-consuming when the data is dirty and has more than 100 features. Luckily, there is a life-saving Python library called Great Expectations. Did I intrigue you? Let’s get started!
What is Great Expectations?
Great Expectations is an open-source Python library that specializes in three important aspects of data management:
- validating data by checking whether it meets a set of conditions, called expectations