Data Quality — Data testing and documentation using Great Expectations’ Spark Engine

Vaisakh D
Litmus7 Systems Consulting
4 min read · Sep 19, 2022

This post continues our earlier blog on data quality and Great Expectations: https://medium.com/litmus7/data-quality-relevance-in-modern-day-data-engineering-pipelines-765c0ce4c4e5

In this blog we look at how Great Expectations can be used in a Spark environment. This becomes particularly handy when we have a huge dataset to evaluate and have to depend on PySpark for our analysis.

As we mentioned in the earlier post, within Great Expectations the term Expectation refers to an assertion about data, covering all kinds of common data issues.

The Great Expectations library is widely used for data quality checks, and with the Spark integration we can now execute Expectations natively against pandas DataFrames, SQL, and Spark DataFrames. Apache Spark is a lightning-fast cluster computing engine designed for large-scale processing. Integrating Spark with Great Expectations brings several benefits, as outlined by the Great Expectations team [1]:

· Spark Expectations run considerably faster than they would with redundant queries.

· Some Expectations also now run considerably faster for pandas and SQLAlchemy.

· Even less boilerplate is required for adding new Expectations, and Expectation argument validation logic is more centralized than in the past.

For a quick POC, we used a Databricks Spark cluster and integrated Great Expectations.

Let’s take an example of Great Expectations with Spark using a Databricks Community Edition subscription.

To install the Great Expectations library on the Databricks cluster, run:

dbutils.library.installPyPI("great_expectations")
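Note: on newer Databricks runtimes, dbutils.library.installPyPI has been deprecated and removed, and a notebook-scoped pip install is the usual alternative. A minimal sketch, assuming your runtime supports %pip magic commands:

%pip install great_expectations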

Then, using the following code snippet, we can create a Spark DataFrame from the data uploaded to DBFS. We use a sample supermarket sales dataset to exercise the Expectations library.

import great_expectations as ge

# Read the sample supermarket sales CSV from DBFS into a Spark DataFrame
sales_df = spark.read.option("header", True).csv("dbfs:/FileStore/tables/supermarket_sales.csv")
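Because the CSV is read with every column as a string by default, numeric Expectations (for example on Quantity or Unit price) may not behave as intended. A hedged variation of the same read that asks Spark to infer column types:

sales_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)  # let Spark infer numeric/date types instead of treating everything as strings
    .csv("dbfs:/FileStore/tables/supermarket_sales.csv")
)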

SparkDFDataset wraps a PySpark DataFrame and allows you to validate Expectations against it.

from great_expectations.dataset import SparkDFDataset

sales_ge = SparkDFDataset(sales_df)

We can use the following function to check that the columns of the data match the expected ordered list.

sales_ge.expect_table_columns_to_match_ordered_list(
    column_list=['Invoice ID', 'Branch', 'City', 'Customer type', 'Gender',
                 'Product line', 'Unit price', 'Quantity', 'Tax 5%', 'Total',
                 'Date', 'Time', 'Payment', 'cogs', 'gross margin percentage',
                 'gross income', 'Rating']
)

We get the following result:

Out[7]:
{
  "result": {
    "observed_value": [
      "Invoice ID", "Branch", "City", "Customer type", "Gender",
      "Product line", "Unit price", "Quantity", "Tax 5%", "Total",
      "Date", "Time", "Payment", "cogs", "gross margin percentage",
      "gross income", "Rating"
    ]
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "success": true
}

When we execute the Expectation for a null check, it runs as Spark jobs and produces the following results.

sales_ge.expect_column_values_to_not_be_null(column='Invoice ID', result_format="COMPLETE")
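The returned validation result can also be inspected programmatically. A minimal sketch, assuming the standard result fields produced with result_format="COMPLETE" (field availability can vary by library version):

null_check = sales_ge.expect_column_values_to_not_be_null(column='Invoice ID', result_format="COMPLETE")

# Overall pass/fail of the Expectation
print(null_check["success"])

# Counts of evaluated values and unexpected (null) values
print(null_check["result"]["element_count"])
print(null_check["result"]["unexpected_count"])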

Great Expectations results can be viewed from a SparkDFDataset as well, similar to pandas.

Using the code snippet below, we can render the result as HTML, just like a normal Great Expectations result.

from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler
from great_expectations.render.renderer import *
from great_expectations.render.view import DefaultJinjaPageView

# Run all Expectations registered on the dataset and collect the results
validation_result = sales_ge.validate()

# Render the validation results into a document model and display it as HTML in the notebook
document_model = ValidationResultsPageRenderer().render(validation_result)
displayHTML(DefaultJinjaPageView().render(document_model))
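If you want to keep the rendered report rather than only displaying it in the notebook, a minimal sketch that writes the HTML to DBFS (the output path here is purely illustrative, not from the original post):

html = DefaultJinjaPageView().render(document_model)

# Hypothetical output location; adjust to your workspace conventions
dbutils.fs.put("dbfs:/FileStore/ge_reports/supermarket_sales_validation.html", html, overwrite=True)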

Great Expectations with Spark supports most of the functions that the regular Great Expectations library supports.

The following are among the main functions supported in Spark, illustrated in the sketch below.
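A hedged illustration of some commonly used Expectations that run against a SparkDFDataset; the exact supported set depends on the Great Expectations version, and the value sets, thresholds, and regex here are examples rather than figures from the original post:

# Column values should be unique
sales_ge.expect_column_values_to_be_unique(column='Invoice ID')

# Column values should come from a known set
sales_ge.expect_column_values_to_be_in_set(column='Branch', value_set=['A', 'B', 'C'])

# Column values should fall within a numeric range
sales_ge.expect_column_values_to_be_between(column='Rating', min_value=0, max_value=10)

# Column values should match a regular expression
sales_ge.expect_column_values_to_match_regex(column='Invoice ID', regex=r'^\d{3}-\d{2}-\d{4}$')

# Table-level check on row count
sales_ge.expect_table_row_count_to_be_between(min_value=1, max_value=100000)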

References:

[1] https://greatexpectations.io/blog/great-expectations-now-supports-executions-in-spark-a-blog-with-much-clapping/

[2] https://towardsdatascience.com/data-quality-unit-tests-in-pyspark-using-great-expectations-e2e2c0a2c102
