Secure Data Quality with Great Expectations in Databricks

Ziang Jia
Aug 23, 2022


Validating data quality has become one of the top priorities for both data engineers and data scientists in their day-to-day work. Poor data quality can produce inaccurate reports or predictions, which in turn lead to decisions that achieve the opposite of the intended outcome.

Although plenty of tools can facilitate data quality validation, it is ultimately the users’ responsibility to deliver the best outcome by using them as intended. A tool will not deliver its promised efficiency if developers are reluctant to use it, and when that happens, it is often because the tool is not well integrated with the platform the developers already work on. Therefore, when shopping for the “best” data quality tool, we want to make sure the tool is integrated with, and fully accessible to, the end users.

Great Expectations is a leading open-source tool for validating, documenting, and profiling data. The design pattern under the surface is well known to developers: automated unit tests. Some developers dislike unit tests and are reluctant to write them until they are well integrated with CI/CD pipelines. Similarly, Great Expectations can be well integrated with data analytics platforms such as Databricks, AWS EMR, Airflow, and BigQuery.

In this article, I will demonstrate Great Expectations with Azure Databricks. Note that different platforms may have limitations, so it is better to refer to the official documentation.

Infrastructure

First, follow the official tutorial to create a Databricks workspace, then create an Azure ADLS Gen2 storage account and enable static website hosting on it. Mount a container from this account to Databricks DBFS so it can serve as the storage layer for the lakehouse, then create a database at the mount location; all Delta tables that belong to this database will be stored in this ADLS container.

## Mount the ADLS container to DBFS so it can back the lakehouse database
dbutils.fs.mount(
    source="wasbs://<CONTAINER_NAME>@<ACCOUNT_NAME>.blob.core.windows.net",
    mount_point="/mnt/lakehouse",
    extra_configs={
        "fs.azure.account.key.<ACCOUNT_NAME>.blob.core.windows.net":
            dbutils.secrets.get(scope="<SCOPE_NAME>", key="<SECRET_KEY>")
    },
)

%sql
-- Create the lakehouse database on the mounted storage location
create database if not exists tutorial
comment "Proof of Concept database for tutorial use case"
location "/mnt/lakehouse/warehouse"

Next, upload some sample data to the lake storage. In my example, I created two tables in the lakehouse: tutorial.bronze_claim_detail and tutorial.bronze_patient.
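For reference, the bronze tables can be registered as Delta tables from the uploaded files. A minimal sketch, assuming the sample data landed as CSV files under hypothetical paths in the mounted container:

## Register the uploaded sample files as Delta tables in the tutorial database
## (the CSV paths below are hypothetical placeholders)
for table_name, path in [
    ("bronze_claim_detail", "/mnt/lakehouse/raw/claim_detail.csv"),
    ("bronze_patient", "/mnt/lakehouse/raw/patient.csv"),
]:
    (
        spark.read.option("header", "true").csv(path)
        .write.format("delta")
        .mode("overwrite")
        .saveAsTable(f"tutorial.{table_name}")
    )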

Create a cluster and install the necessary libraries, starting with great-expectations itself. Since Azure Blob Storage will host the output of the data validators, we also need azure-storage-file-datalake and azure-identity.
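One option is to install them as notebook-scoped libraries; a minimal sketch:

%pip install great-expectations azure-storage-file-datalake azure-identity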

Great Expectations Validator

Great Expectations is built around a handful of components: Data Context, Datasource, Expectations, Validation Results, and Data Docs. The first two control most inputs and configuration, Expectations are defined by developers, and the last two deliver the validation output and the documentation, respectively.

Data engineers can configure Validation Results or Data Docs to use external storage to persist output and share it with others. In a Databricks notebook, the Data Context can be initialized as follows.

from great_expectations.data_context.types.base import (
    DataContextConfig,
    FilesystemStoreBackendDefaults,
)

data_context_config = DataContextConfig(
    ## Local storage backend for expectation suites and validation results
    store_backend_defaults=FilesystemStoreBackendDefaults(
        root_directory="/dbfs/great_expectations/"
    ),
    ## Data Docs site stored in the account's static website container
    data_docs_sites={
        "az_site": {
            "class_name": "SiteBuilder",
            "store_backend": {
                "class_name": "TupleAzureBlobStoreBackend",
                "container": "$web",
                "connection_string": (
                    "DefaultEndpointsProtocol=https;"
                    "EndpointSuffix=core.windows.net;"
                    "AccountName=<ACCOUNT_NAME>;"
                    "AccountKey=" + dbutils.secrets.get(scope="<SCOPE_NAME>", key="<SECRET_KEY>")
                ),
            },
            "site_index_builder": {
                "class_name": "DefaultSiteIndexBuilder",
                "show_cta_footer": True,
            },
        }
    },
)
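With the configuration defined, the context itself can be instantiated in the same notebook. A minimal sketch, following the in-code BaseDataContext pattern from the Great Expectations guides:

from great_expectations.data_context import BaseDataContext

## Build an in-memory Data Context from the configuration above
context = BaseDataContext(project_config=data_context_config)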

This tells Great Expectations to write the Data Docs static HTML files to the $web container in the Azure storage account. The $web container is a designated container that serves static web content once static website hosting is enabled. It exposes an endpoint such as https://<ACCOUNT_NAME>.z13.web.core.windows.net that serves the Data Docs index.html over HTTP.

One can follow the official tutorial to build an example in a Databricks workspace. In my example, the validation job creates two expectation suites: one checking the primary key and one checking the date range. The static website can then be used to navigate through the validation output.
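For reference, the core of such a validator notebook might look like the sketch below. This assumes the 2022-era (0.15.x) Great Expectations APIs and the context created earlier; the column names claim_id and service_date, the date range, and the suite name are hypothetical placeholders, and both checks are condensed into a single suite here for brevity.

from great_expectations.core.batch import RuntimeBatchRequest

## Register a Spark datasource with a runtime data connector
context.add_datasource(**{
    "name": "tutorial_datasource",
    "class_name": "Datasource",
    "execution_engine": {"class_name": "SparkDFExecutionEngine"},
    "data_connectors": {
        "default_runtime_data_connector": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["run_id"],
        }
    },
})

## Wrap the bronze table in a runtime batch request
df = spark.table("tutorial.bronze_claim_detail")
batch_request = RuntimeBatchRequest(
    datasource_name="tutorial_datasource",
    data_connector_name="default_runtime_data_connector",
    data_asset_name="bronze_claim_detail",
    runtime_parameters={"batch_data": df},
    batch_identifiers={"run_id": "tutorial_run"},
)

## Define expectations for the primary key and the date range
context.create_expectation_suite("bronze_claim_detail_suite", overwrite_existing=True)
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="bronze_claim_detail_suite",
)
validator.expect_column_values_to_be_unique("claim_id")
validator.expect_column_values_to_be_between(
    "service_date", min_value="2015-01-01", max_value="2022-12-31"
)
validator.save_expectation_suite(discard_failed_expectations=False)

## Run the validation and publish the Data Docs site to the $web container
results = validator.validate()
context.build_data_docs()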

Validator in Databricks Workflow

To improve efficiency, data engineers can integrate their validators with Databricks Workflows. In my example, I add the validator notebook as a downstream task immediately after the incremental loading task.
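The validator notebook can also report its result back to the workflow so that a failed run is visible immediately. A minimal sketch, reusing the results object returned by validator.validate() in the sketch above:

## Fail this Databricks task (and therefore the workflow run) if any expectation failed
if not results.success:
    raise Exception("Data quality validation failed; see the Data Docs site for details.")
dbutils.notebook.exit("Data quality validation passed")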

Conclusion

Accessibility is always one of the key features of a popular tool. By integrating well with a variety of platforms, Great Expectations has earned its reputation among developers.

