Pythonic data (pipeline) testing on Azure Databricks

Ever wondered how to test data and data pipelines in an effective way without setting up a comprehensive enterprise-grade data quality solution?

Andreas Hopfgartner
CodeX


Python provides a lot of great packages for unit testing code, so you can check that your pipeline code runs. You certainly use fixtures to isolate unit tests and to keep them independent of changing data. But how can you check whether changes to your data transformations affect the pipeline result your machine learning model expects?
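As a quick illustration of that first part, a pytest fixture keeps the test data fixed and in memory. A minimal sketch, where the transformation add_revenue_column is just a made-up example:

```python
import pandas as pd
import pytest

# Hypothetical transformation standing in for a piece of pipeline code.
def add_revenue_column(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["revenue"] = out["price"] * out["quantity"]
    return out

@pytest.fixture
def sample_orders() -> pd.DataFrame:
    # Fixed, in-memory test data: the unit test never depends on changing source data.
    return pd.DataFrame({"price": [9.99, 5.00], "quantity": [2, 3]})

def test_add_revenue_column(sample_orders):
    result = add_revenue_column(sample_orders)
    assert result["revenue"].tolist() == pytest.approx([19.98, 15.00])
```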

Some time ago, I wrote a blog article here on Medium about the value of the open source Python package Great Expectations.

At the time, I was a consultant supporting my client on a machine learning project. We were using a lot of internal data sources. When we started the project, the goal was to quickly prototype the product and develop the intelligent part as an interface. We replicated all the data from its original sources into a data lake. Of course, at this early stage of exploration, we had neither proper pipelines nor contracts for the interfaces.

For the real-time ML model we consumed data from a Kafka stream that was fed by a mobile application our customers use. One day our model went crazy, and it took us some time to figure out what had happened: the app dev team had changed the source code of the app, which resulted in messy data. As we were the only team that really consumed this data, we were the only folks affected.

We needed a quick fix to prevent this issue from happening again. As we were Python guys, we wanted to give Great Expectations a chance. The setup of a minimal environment was done in a few days, and we were not disappointed.

Now I experience similar settings, but these days I work with customers setting up their ML architectures on Azure. I knew from my earlier project that Great Expectations is set up and configured through a YAML file, and that the resulting artifacts (expectation suites, validation runs, DataDocs) are stored on a filesystem, too. To me this seemed a bit clunky to set up on cloud services like Azure Databricks and Azure Machine Learning, so I took some time to research how it could be achieved in hosted environments.

A wonderful feature of Great Expectations is DataDocs and the opportunities it opens up for teams.

Image taken from Great Expectations YouTube channel.

DataDocs not only document the data used and display the results of data validation, they can also be used as data contracts. Subject matter experts, data scientists and data engineers can work on the same page to make sure the data is fully understood and data quality is ensured.

In my opinion, this is particularly interesting in an MLOps pipeline, where, apart from testing your code and your pipeline code, you also need to check for data drift and validate the data you consume from other data sources or vendors.

Alright, but before we get down to the nitty-gritty, I’d like to tell you how I use the package! I see Great Expectations as a framework and pick out the things I find useful. So when I’m working with Spark, I only configure the DataSources to provide metadata that I can use later on. I mostly load a batch of data manually and put the Spark DataFrame into the batch_kwargs (or … in the new API).

💻 Find the code attached at the end of the blog article.

The setup of a Great Expectations project can be done in two ways. Choose the first one if you’re running on a system where you have a filesystem ready: start with great_expectations init to create the configuration scaffold and configure your DataContext accordingly.

Resulting folder structure when using Great Expectations with a filesystem.

When you’re running on a hosted system like a PaaS in the cloud, you might not always have a CLI available or a filesystem where you can easily store the folder persistently. On Azure Databricks or Azure Machine Learning you theoretically have access to a CLI or a filesystem (like DBFS on ADB or Linux on AML’s compute instances), but that does not seem to be the best option: in the cloud you are better off with a more central place to store your configuration and artifacts.

I started by watching the webinar on YouTube, and found a lot of helpful material in the docs (how to start without a YML file, how to run on Databricks).

As said before, I like to specify the location of the data so that it is available as metadata, but I do not want the ExecutionEngine to generate a batch of data for me. I will do that manually later.

Minimal configuration for DataSource.
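A sketch of what this minimal entry can look like with the legacy (batch_kwargs-style) API; the datasource name my_spark_datasource is just a placeholder:

```python
# Minimal Spark DataSource entry. It mainly carries metadata,
# since batches are created manually later on.
datasources = {
    "my_spark_datasource": {
        "class_name": "SparkDFDatasource",
        "module_name": "great_expectations.datasource",
        "data_asset_type": {
            "class_name": "SparkDFDataset",
            "module_name": "great_expectations.dataset",
        },
        "batch_kwargs_generators": {},  # none needed, the DataFrame is passed in explicitly
    }
}
```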

I used an Azure Key Vault-backed secret scope to manage my secrets, which is pretty easy in combination with Azure Databricks or Azure ML. That’s one advantage that comes in handy when using configuration as code.

Using Key Vault backed scope for managing secrets.
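On Databricks this boils down to a single call; the scope and secret names below are placeholders for your own setup:

```python
# Read the storage connection string from an Azure Key Vault-backed secret scope.
storage_connection_string = dbutils.secrets.get(
    scope="keyvault-backed-scope",
    key="ge-storage-connection-string",
)
```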

If you use the YAML file to configure the DataContext, you can either use a secrets config file that is not checked into the git repository or take the same approach with Azure Key Vault.

Let’s continue with the configuration of the DataContext, starting with the DataSource. I’d like to use that config as metadata later in the DataDocs, but I do not necessarily want to load a batch through the ExecutionEngine; I’ll do that manually later.

Configuration of DataContext.
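Roughly, the in-code configuration can be assembled like this. This is a sketch based on the 0.13-era API; the datasources, stores and data_docs_sites dictionaries are the ones shown in the neighbouring snippets, and the store names are placeholders that simply have to match:

```python
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import DataContextConfig

# Build the whole DataContext in code instead of a great_expectations.yml file.
project_config = DataContextConfig(
    config_version=2,
    datasources=datasources,
    stores=stores,
    expectations_store_name="expectations_AZ_store",
    validations_store_name="validations_AZ_store",
    evaluation_parameter_store_name="evaluation_parameter_store",
    data_docs_sites=data_docs_sites,
    validation_operators={
        "action_list_operator": {
            "class_name": "ActionListValidationOperator",
            "action_list": [
                {
                    "name": "store_validation_result",
                    "action": {"class_name": "StoreValidationResultAction"},
                },
                {
                    "name": "update_data_docs",
                    "action": {"class_name": "UpdateDataDocsAction"},
                },
            ],
        }
    },
)

context = BaseDataContext(project_config=project_config)
```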

I’d like to store all my artifacts (expectations, validations) in Azure Blob Storage.

Set the configuration to store expectations on Azure Blob storage.
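A sketch of the two stores backed by the TupleAzureBlobStoreBackend; container and prefix names are placeholders, and the connection string is the secret read above:

```python
# Keep expectation suites and validation results in an Azure Blob container.
stores = {
    "expectations_AZ_store": {
        "class_name": "ExpectationsStore",
        "store_backend": {
            "class_name": "TupleAzureBlobStoreBackend",
            "container": "great-expectations",
            "prefix": "expectations",
            "connection_string": storage_connection_string,
        },
    },
    "validations_AZ_store": {
        "class_name": "ValidationsStore",
        "store_backend": {
            "class_name": "TupleAzureBlobStoreBackend",
            "container": "great-expectations",
            "prefix": "validations",
            "connection_string": storage_connection_string,
        },
    },
    # Evaluation parameters can stay in memory.
    "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
}
```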

The DataDocs will also be pushed to Azure Blob Storage, but we’ll use the $web container to host the static website.

Configuration of the DataDocs
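A sketch of the corresponding site definition; static website hosting must be enabled on the storage account, and the escaping of $web may differ depending on your Great Expectations version:

```python
# Publish the DataDocs as a static website from the $web container.
data_docs_sites = {
    "az_site": {
        "class_name": "SiteBuilder",
        "store_backend": {
            "class_name": "TupleAzureBlobStoreBackend",
            "container": "\\$web",  # escaped so the $ is not treated as a config variable
            "connection_string": storage_connection_string,
        },
        "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
    }
}
```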

Creating a batch of data (Spark DataFrame) manually…

Loading a batch of data.
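Something along these lines, assuming a Delta table in the lake; the path, suite name and column names are placeholders:

```python
# Load the data with Spark; the path is just an example.
df = spark.read.format("delta").load("/mnt/datalake/app_events")

# Create (or overwrite) an expectation suite to attach the batch to.
suite_name = "app_events.warning"
context.create_expectation_suite(suite_name, overwrite_existing=True)

# Hand the DataFrame to Great Expectations manually via batch_kwargs.
batch_kwargs = {"dataset": df, "datasource": "my_spark_datasource"}
batch = context.get_batch(
    batch_kwargs=batch_kwargs,
    expectation_suite_name=suite_name,
)
```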

… and here we go. Let’s create some expectations. You might want to check the documentation or just give it a try; it’s pretty much self-explanatory, so we’re jumping straight to the next stage. After saving the expectation suite we can run a validation or create a checkpoint. We will use run_validation_operator to validate a batch of data against the expectation suite. It will do a validation run, push the resulting artifacts to the stores and update the DataDocs.
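Sketched with a couple of example expectations (the column names are again placeholders):

```python
from datetime import datetime

# Define a few expectations on the batch.
batch.expect_column_values_to_not_be_null("user_id")
batch.expect_column_values_to_be_between("event_duration", min_value=0, max_value=86400)

# Persist the suite to the expectations store on Blob storage.
batch.save_expectation_suite(discard_failed_expectations=False)

# Validate the batch against the suite, store the results and update the DataDocs.
results = context.run_validation_operator(
    "action_list_operator",
    assets_to_validate=[batch],
    run_id=datetime.utcnow().strftime("%Y%m%dT%H%M%SZ"),
)
if not results["success"]:
    raise ValueError("Data validation failed, check the DataDocs for details.")
```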

Do not forget code versioning.

Finally, browse the DataDocs and see the validation results.
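The UpdateDataDocsAction already refreshes the site after every validation run; rebuilding it manually and finding the URL looks roughly like this (the exact hostname depends on your storage account):

```python
# Rebuild all configured DataDocs sites (only needed for a manual refresh).
context.build_data_docs()

# With static website hosting enabled, the docs are served at a URL like:
#   https://<storage-account>.z<zone>.web.core.windows.net/index.html
```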

One final note: Cookiecutter is a great help when you have lots of data sources and an elaborate data pipeline where you need to check every single transformation step.

In the next blog article we will have a look at Delta Live Tables and see how this new (preview) Databricks feature can be useful for pipeline testing.

Complete code to run Great Expectations pipeline test on Azure Databricks.

Andreas Hopfgartner
Writer for CodeX

Working as Cloud Solution Architect for Data & AI and also in the realm of Internet of Things for Microsoft in Germany.