Data Validation — Measuring Completeness, Consistency, and Accuracy Using Great Expectations with PySpark
By Christopher Getts, Data Scientist
Motivation and Defining Metrics
"Big Data" - As much as a buzzword as this is, modern companies are continuously collecting and relying on data to guide process and decision. Missing or incorrect data can compromise any downstream decision. As a team that supports a platform which offers access to various "big data sets", we took on the task of verifying the quality of our data.
The tricky part is that data validation can take many forms depending on:
- Data set specific context
- Data set specific use cases
- The size of the data and number of data sources
- The need for automation vs ad hoc testing
Fundamentally, we want a robust solution for data validation that's applicable to any data set. Additionally, we require solutions that are automated and can be inserted into our existing data pipelines. For this reason, our team decided to implement rule-based checks against key metrics in our pipeline, configurable to fit a specific data context and applicable to any data set.
To start, we identified three key metrics to evaluate our data against: completeness, consistency, and accuracy.
We define these as:
- completeness: The degree to which an entity includes the data required to describe a real-world object. In relational tables, completeness can be measured by the presence of null values, where a null is usually interpreted as a missing value (a small PySpark sketch of this follows the list).
- consistency: The degree to which the data violates a set of semantic rules, such as a required data type, an allowed interval for a numerical column, or a set of allowed values for a categorical column.
- accuracy: The correctness of the data, which can be measured along two dimensions: syntactic and semantic. Syntactic accuracy compares the representation of a value with a corresponding definition domain, whereas semantic accuracy compares a value with its real-world representation.
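To make completeness concrete, here is a minimal PySpark sketch (not part of the pipeline described later) that computes the fraction of null values per column; `df` stands in for whichever Spark DataFrame you want to profile:

import pyspark.sql.functions as F

# Fraction of null values per column; 0.0 means the column is fully complete.
null_fractions = df.select([
    (F.sum(F.col(c).isNull().cast("int")) / F.count(F.lit(1))).alias(c)
    for c in df.columns
])
null_fractions.show()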
Great Expectations
Great! We now have our definitions, but how do we measure our data against these principles? This is where Great Expectations comes in. From their website, "Great Expectations is a Python-based open-source library for validating, documenting, and profiling your data. It helps you to maintain data quality and improve communication about data between teams."
That's a nice tagline for Great Expectations, but how does it work?
With Great Expectations, we create assertions on the data (these are the expectations), like "I expect the primary key column not to be null" or "I expect the variable `shirt size` to be in the set {small, medium, large}". For each data set, we collect expectations into a suite. From there, we can validate the data against the suite of expectations, either ad hoc or as part of an ETL. Great Expectations tells us whether each expectation passes or fails, and returns any unexpected values it encounters. The best part is that Great Expectations renders the expectations and validations in an HTML site (Data Docs) as a continuously updated data quality report.
Now that we have our key metrics (completeness, consistency, and accuracy) and our tool, Great Expectations, we might ask ourselves: how do they relate?
- To measure completeness, we will create expectations that check the sparsity of columns, i.e. what percentage of each column's values are null. We will also create expectations that assert that certain columns, for example primary key columns, contain no null values.
- To measure consistency, we will rely on the validations to tell us how often an expectation fails and the frequency of unexpected values by column.
- To measure accuracy, we will build expectations based on:
  - Valid domain values and ranges. For example, a latitude column should only contain values between -90.0 and 90.0.
  - Cross-referencing our expectations with a domain expert who can provide knowledge and sanity checks. For example, if a column contains a VIN, a domain expert may tell us to check that the VIN is 17 characters long. (A short sketch of such expectations follows this list.)
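As a rough illustration of how accuracy rules like these translate into expectations (the column names `latitude` and `vin` are hypothetical, not part of the tutorial data set below):

# Hypothetical columns, shown only to illustrate range and length checks.
batch.expect_column_values_to_be_between(column='latitude', min_value=-90.0, max_value=90.0)
batch.expect_column_value_lengths_to_equal(column='vin', value=17)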
Tutorial
This tutorial will go over how to set up Great Expectations to work with a PySpark DataFrame and S3 to host the Data Docs.
1. Project Config
Here we're telling Great Expectations where the expectation suites, validations, and Data Docs will live. We're also specifying a PySpark DataFrame as our data set and data source.
Further steps (not covered here) should be taken to enable the S3 bucket to host a static HTML site.
from great_expectations.data_context.types.base import DataContextConfig
from great_expectations.data_context import BaseDataContext

project_config = DataContextConfig(
    config_version=2,
    plugins_directory=None,
    config_variables_file_path=None,
    datasources={
        "my_spark_datasource": {
            "data_asset_type": {
                "class_name": "SparkDFDataset",
                "module_name": "great_expectations.dataset",
            },
            "class_name": "SparkDFDatasource",
            "module_name": "great_expectations.datasource",
            "batch_kwargs_generators": {},
        }
    },
    stores={
        "expectations_S3_store": {
            "class_name": "ExpectationsStore",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "bucket",  # enter bucket here
                "prefix": "ge/expectations",
            },
        },
        "validations_S3_store": {
            "class_name": "ValidationsStore",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "bucket",  # enter bucket here
                "prefix": "ge/validation",
            },
        },
        "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
    },
    expectations_store_name="expectations_S3_store",
    validations_store_name="validations_S3_store",
    evaluation_parameter_store_name="evaluation_parameter_store",
    data_docs_sites={
        "s3_site": {
            "class_name": "SiteBuilder",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "bucket",  # enter bucket here
                "prefix": "ge/data_docs",
            },
            "site_index_builder": {
                "class_name": "DefaultSiteIndexBuilder",
                "show_cta_footer": True,
            },
        }
    },
    validation_operators={
        "action_list_operator": {
            "class_name": "ActionListValidationOperator",
            "action_list": [
                {
                    "name": "store_validation_result",
                    "action": {"class_name": "StoreValidationResultAction"},
                },
                {
                    "name": "store_evaluation_params",
                    "action": {"class_name": "StoreEvaluationParametersAction"},
                },
                {
                    "name": "update_data_docs",
                    "action": {"class_name": "UpdateDataDocsAction"},
                },
            ],
        }
    },
    anonymous_usage_statistics={"enabled": True},
)

context = BaseDataContext(project_config=project_config)
2. Initialize Expectation Suite
I usually create one expectation suite per data source that houses all the expectations for that data source.
suite = context.create_expectation_suite('tutorial')
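One small, optional tweak: if this setup script is re-run, the suite may already exist. To my knowledge, the v2 API accepts an overwrite flag for exactly this case:

# Assumption: the overwrite_existing flag is available in this version of the v2 API.
# It lets the setup script be re-run without failing because the suite already exists.
suite = context.create_expectation_suite('tutorial', overwrite_existing=True)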
3. Initialize sample batch data set from PySpark DataFrame
Great Expectations uses a batch object to generate expectations. This initial batch can be a sample of the full data set, since we only use it to create the expectations. Here, I'm creating a small sample data set.
import pandas as pd
import pyspark.sql.types as T

# Create sample PySpark DataFrame
data = [(1, "Small"), (2, "Medium"), (3, "XL")]
pandas_df = pd.DataFrame(data, columns=["order_id", "shirt_size"])
schema = T.StructType([
    T.StructField('order_id', T.IntegerType(), nullable=True),
    T.StructField('shirt_size', T.StringType(), nullable=True),
])
df = spark.createDataFrame(pandas_df, schema=schema)

batch_kwargs = {
    "datasource": "my_spark_datasource",
    "dataset": df,
}
batch = context.get_batch(
    batch_kwargs=batch_kwargs,
    expectation_suite_name=suite,
)
4. Create Expectations and Save Suite.
A full list of the standard expectations can be found here.
batch.expect_column_values_to_not_be_null('order_id')

# The sample data contains 'XL', which is not in this value set, so this
# expectation will purposely fail to illustrate a failed expectation.
batch.expect_column_values_to_be_in_set(
    column='shirt_size',
    value_set=['Small', 'Medium', 'Large'],
)

# Use PySpark data types as found in pyspark.sql.types
batch.expect_column_values_to_be_of_type(column='shirt_size', type_='StringType')

batch.save_expectation_suite(discard_failed_expectations=False)
5. Validate
Once the suite has been built, we can now use that suite to validate our batch.
results = context.run_validation_operator(
    "action_list_operator",
    assets_to_validate=[batch],
)
validation_result_identifier = results.list_validation_result_identifiers()[0]
context.build_data_docs()
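When this runs inside a pipeline rather than ad hoc, we can gate the next ETL step on the outcome. A minimal sketch, assuming the returned validation operator result exposes a success flag (as it does in the v2 API used here):

# Stop the pipeline if any expectation in the suite failed.
if not results.success:
    raise ValueError("Data validation failed - see the Data Docs for details.")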
6. Check out the data docs!
Here’s the validation we just ran. We can see the expectations we created, organized by column, along with whether each passed and whether any unexpected values were encountered.
Final Thoughts
Great Expectations has “great” utility for measuring our three key metrics: completeness, consistency, and accuracy. While it does require some overhead to initialize and create a suite of expectations, the validation step can easily be written into a custom module and inserted into an ETL pipeline or run as a standalone script (a sketch follows below). To reduce some of the overhead or uncertainty around selecting which expectations to use, check out the automatic data profiler flow here.
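For reference, a hypothetical wrapper along those lines (the function name and structure are ours, not part of the Great Expectations API) that takes a Spark DataFrame and a suite name and reports whether validation passed:

# `context` is built from the project config shown earlier in the tutorial.
def validate_dataframe(context, df, suite_name):
    # Wrap the DataFrame in a batch tied to the existing expectation suite.
    batch = context.get_batch(
        batch_kwargs={"datasource": "my_spark_datasource", "dataset": df},
        expectation_suite_name=suite_name,
    )
    # Run the validation operator, refresh the Data Docs, and report the outcome.
    results = context.run_validation_operator(
        "action_list_operator", assets_to_validate=[batch]
    )
    context.build_data_docs()
    return results.success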
Our team plans to add additional data quality checks, particularly to identify outlier data. As always, we are open to ideas and feedback from our readers. If you would like to read the first part of our data quality series, click here. If this is a topic you have worked on, or you have ideas and are willing to brainstorm with us, please reach out here.