Don’t skip data validation before training an ML model! Try ‘Great Expectations’, a Python library.
Data quality is so essential in data science projects!
Oops! The model didn’t work out well.
Ohhhhh! There is something wrong with my dataset.
This is an everyday routine for most data scientists.
Data processing and data quality are unavoidable topics when putting models into production. When we make mistakes in preparing the dataset, the rest of the pipeline of course WON’T work properly. Can we find a tool to ensure the quality of our dataset, so that our data science pipeline turns out valuable information?
Great Expectations! This tool specializes in data validation, including basic data understanding, information discovery, seamless fit and essential security.
We created a Guided Project on CognitiveClass (a free online tech learning platform used by millions of learners) to introduce the basic usage of Great Expectations. The project is designed for beginners, so even my grandmother can do it! In the project, we provide line-by-line code for working with real-world bank churn data.
You can run the code immediately on the platform without any setup.
We focus on basic usage with Python and Jupyter Notebook here. For more advanced usage of Great Expectations, you can check its documentation website.
Great Expectations covers the following four basic categories of functionality:
- Data import and DataFrame
- Content format and existence
- Values and statistical information
- Save the pipeline for future data validation
Great Expectations can handle each of the tasks above with a single line of code. In the guided project, we show how to call and use its functions.
For example:
Importing data:
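# Import the dataset
# Note: gx.read_csv comes from the older pandas-style API of Great Expectations; newer releases load data differently
import great_expectations as gx
# Replace the placeholder with the path or link of your stored CSV file
data_path = "$storedfile_or_datasetlink"
df = gx.read_csv(data_path)
# Plot the count of each NumOfProducts value as a bar chart
df['NumOfProducts'].value_counts().plot(kind='bar')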
The format and usage of this DataFrame are similar to those of the famous pandas DataFrame (don’t know pandas? Its documentation is a good starting point).
For the value range:
df.expect_column_values_to_be_between('Age', min_value=18, max_value=85, mostly=0.99)
Pretty easy, right?
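One detail in the range check above is worth a closer look: the mostly=0.99 argument. As we understand it, mostly sets the fraction of non-missing values that must fall inside the range for the check to pass, so a few outliers do not fail the whole column. Below is a minimal plain-Python sketch of that idea. It is not Great Expectations' own code: the helper name values_mostly_between, the made-up list of ages, and the choice to pass an empty column are all assumptions for illustration.

```python
from typing import Optional, Sequence


def values_mostly_between(
    values: Sequence[Optional[float]],
    min_value: float,
    max_value: float,
    mostly: float = 1.0,
) -> bool:
    """Hypothetical helper: True if at least `mostly` of the non-missing
    values lie in the closed range [min_value, max_value]."""
    # Missing values (None) are ignored, as in the idea described above
    present = [v for v in values if v is not None]
    if not present:
        # Assumption: with nothing to check, the expectation passes
        return True
    in_range = sum(min_value <= v <= max_value for v in present)
    return in_range / len(present) >= mostly


# Made-up sample: 10 ages, one of them (92) outside the 18-85 range
ages = [25, 40, 31, 67, 52, 19, 44, 38, 29, 92]

print(values_mostly_between(ages, 18, 85, mostly=0.99))  # False: only 90% are in range
print(values_mostly_between(ages, 18, 85, mostly=0.8))   # True: 90% is at least 80%
```

With mostly=0.99, a single bad value out of ten is enough to fail the check, while a looser threshold such as 0.8 tolerates it. Picking the threshold is a judgment call about how much noise you expect in the column.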
If you want to know every detail about how to use Great Expectations, check out the guided project. You can also take many FREE courses and projects on data science and other technology topics on CognitiveClass.