Checking the sanity of your data using automated testing

An introduction to Great Expectations

Matias Aravena Gamboa

Published in

spikelab

6 min readApr 30, 2019

What is Great Expectations?

Great Expectations is a great tool which we can use to apply automated testing to our data. It can be used at the pre-processing stage of our data or at any stage if you feel the necessity of checking the sanity of it.

Quoting to the Great Expectations team:

Great Expectations helps teams save time and promote analytic integrity by offering a unique approach to automated testing: pipeline tests. Pipeline tests are applied to data (instead of code) and at batch time (instead of compile or deploy time). Pipeline tests are like unit tests for datasets: they help you guard against upstream data changes and monitor data quality.

We should use Great Expectations because we can:

Save time during data cleaning and munging.
Accelerate ETL and data normalization.
Streamline analyst-to-engineer handoffs.
Monitor data quality in production data pipelines and data products.
Simplify debugging data pipelines if (when) they break.
Codify assumptions used to build models when sharing with distributed teams or other analysts.

At Spike we use Great Expectations because:

We have to handle many .csv files per day.
We spend a lot of time validating those files.
We needed to validate if the files fulfill our expectations in terms of format, distribution, etc.

Applying great expectations in a dataset.

Get the data

For this post, we are going to use the Medical Cost Personal Dataset from Kaggle. The data contains information about users and how much money they spend in medical care, some attributes of the data are: sex, age, number of children, bmi, etc.

The dataset looks like this:

The first step consist of loading the data, then we are going to split it into train and test datasets. We are going to use the test dataset as a simulation of new data. So let’s keep the 30% of the data as test.

import pandas as pdfrom sklearn.model_selection import train_test_split
data = pd.read_csv("../data/insurance.csv")
train, test = train_test_split(data, test_size=0.3, random_state=0)

Now let’s plot the histogram of the numerical features to check if the distribution of the data is similar.

We are going to store the train and test data into csv files.

train.to_csv("../data/train.csv", index=False)
test.to_csv("../data/test.csv", index=False)

Generating expectations using GE

Now it’s time to set up our expectations. Each expectation is a test that we can apply to our data. There are different expectations, those can be basic stuff like:

Checking the existence of some columns.
Checking the data type of each column.
Check if some datetime has the correct format (for example %Y-%m-%d).

Or we can use more advanced expectations like checking if the distribution of some variable changed or the divergence of the data.

You can install great_expectations using:

pip install great_expectations

Now import great_expectationsand load the data:

import great_expectations as ge
data = ge.read_csv("../data/train.csv")

The first expectation will check if the categorical columns have the desired values:

data.expect_column_values_to_be_in_set("sex",['female','male'])
data.expect_column_values_to_be_in_set("region",['northeast', 'southwest', 'southeast', 'northwest'])
data.expect_column_values_to_be_in_set("smoker",['no','yes'])

Using expect_column_values_to_be_in_set allows to test if those column contains only the values specified. For example, if we load new data and the column sexcontains something different to femaleor maleour test will fail.

Now we are going to check the data type of each column, for this we are going to define our expectations using expect_column_values_to_be_of_type

data.expect_column_values_to_be_of_type("age","int")
data.expect_column_values_to_be_of_type("sex","string")
...
data.expect_column_values_to_be_of_type("charges","float")

Another basic expectation is checking the existence of some columns:

data.expect_column_to_exist("age")
data.expect_column_to_exist("sex")
...
data.expect_column_to_exist("charges")

For the numerical variables, we are going to add some distributional expectations. For this the first step is create a partition of the data and then use the expect_column_bootstrapped_ks_test_p_value_to_be_greater_than

age_partition = ge.dataset.util.continuous_partition_data(data['age'],)
data.expect_column_bootstrapped_ks_test_p_value_to_be_greater_than('age', age_partition)
charges_partition = ge.dataset.util.continuous_partition_data(data['charges'])
data.expect_column_bootstrapped_ks_test_p_value_to_be_greater_than('charges', charges_partition)

Now we can export our expectations, the configuration of the expectations are stored into a jsonfile that we can use to test the new data.

data.save_expectations_config("expectations.json")

If everything works fine, we should get a message like this:

WARNING: get_expectations_config discarded
0 failing expectations
17 result_format kwargs
0 include_configs kwargs
0 catch_exceptions kwargs

Apply the expectations to new data

Running the expectations on new data is quite easy, we only need to load the data and the output jsonfile from the previous step and then run the test:

test = ge.read_csv("../data/test.csv")import json
with open("expectations.json") as file:
    saved_expectations = json.load(file)
test.validate(expectations_config=saved_expectations, result_format='BOOLEAN_ONLY')

The output will be a dictwith the test results, where is specified the results of each test. In the bottom of the dictwe can get a summary, where we can check the percent of success of our test:

{'results': [{'success': True,
'exception_info': {'raised_exception': False,
'exception_message': None,
'exception_traceback': None},
'expectation_config': {'expectation_type': 'expect_column_values_to_be_in_set',
'kwargs': {'column': 'sex',
'value_set': ['female', 'male'],
'result_format': 'BOOLEAN_ONLY'}}},
{'success': True,
'exception_info': {'raised_exception': False,
'exception_message': None,
'exception_traceback': None},
'expectation_config': {'expectation_type': 'expect_column_values_to_be_in_set',
'kwargs': {'column': 'region',
'value_set': ['northeast', 'southwest', 'southeast', 'northwest'],
'result_format': 'BOOLEAN_ONLY'}}},
... A lot of stuff here...
'success': True,
'statistics': {'evaluated_expectations': 17,
'successful_expectations': 17,
'unsuccessful_expectations': 0,
'success_percent': 100.0}}

If everything is ok, then the success_percent should be 100.0.

What happens if the distribution of the data changes?

I changed the distribution of the data adding new samples with people with high values in age and charges, the original data and the modified data looks like this:

We can see that the distribution of the age and charges changed, we can apply our expectations and get the results for this new data:

When the distribution of the data changed, we can see that our test fails.

If we add a new record into our data that doesn’t has the expected column values, we’ll get a fail in that expectation when we apply the test.

fail_data = {
    "age":0,
    "sex":"other",
    "bmi":3,
    "children":0,
    "smoker":"other",
    "region":"other",
    "charges":0
}
test = test.append(pd.DataFrame([fail_data]),ignore_index=True, sort=True)
test.validate(expectations_config=saved_expectations, result_format='BOOLEAN_ONLY')

After applying our test, we can get the results:

If we integrate Great Expectations into a pipeline, we can use these results to trigger a warning or raise an exception when our data doesn’t fulfill our expectations.

Conclusions

Great Expectations is a great tool for check the sanity and validness of our data with automated testing. We can use it at different stages of our machine learning project, or our pre-processing pipelines. It can be used to validate simple stuff like the structure of our dataset our more complex like the distribution of the data.