How to automate data quality checks in machine learning pipelines?

Sunitha Radhakrishnan
Machine Learning Reply DACH
8 min read · Jan 24, 2023

Great Expectations does the magic for you!

Introduction

In the world of MLOps, the performance of a machine learning model is often a reflection of the underlying data quality. In DevOps, software developers write unit tests for existing features and for every new feature they integrate. Similarly, ML engineers believe that running tests and adding new tests should be part of the ML model development and deployment process.

It is good practice to implement data quality checks and code tests at every stage of the MLOps pipeline. In the near future, one of the main focuses of MLOps will be to test data and document model output using open-source tools or frameworks that fulfill both goals. One such framework is Great Expectations: an open-source framework that performs data validation and documentation and can also send notifications, while letting the user express their expectations of the data in plain, interpretable statements. Its main advantage is that it automates data quality tests, which makes it a great fit for MLOps workflows built on continuous integration and continuous deployment.

Different methods of deploying Great Expectations in ML pipelines.

  1. Leverage the ‘Great Expectations’ library to declare assertions that test data in a Pythonic code environment or an interactive environment. Expectations are nothing but data assertions: they are expressed as Python methods and can be used directly on a pandas or PySpark DataFrame (see the short sketch after this list).
  2. Deploy a complete Great Expectations Data Context, in which an expectation suite is created and then validated against the data as two separate processes. A set of expectations (data quality checks) is written together in a suite, which can be stored in any remote location (e.g. an S3 bucket, the project folder, etc.). The expectations are then validated at runtime in an automated pipeline against an incoming batch of data; one advantage is that more than one batch of data can be validated along the pipeline. In this workflow, the framework lets you write the validation results to ‘Data Docs’, a data quality report that contains the assertions about your data and the results of each validation run, which can then be shared with stakeholders.
  3. Great Expectations can also be integrated with any orchestration tool, such as Airflow, in a production environment. When a machine learning pipeline is triggered by an Airflow DAG run, a node loads the expectation suite to validate the data either after every transformation step or at the end of a larger DAG subset.
  4. Last but not least, Great Expectations integrates well into CI/CD pipelines to test the ingestion step and the model code as the code changes. Similar to the unit tests that software engineers write, data scientists and data engineers use Great Expectations for various purposes: (i) to test the data they ingest from other teams or vendors and (ii) to validate the data they transform as a step in the data pipeline, in order to ensure the correctness of those transformations. A further advantage is that in production pipelines the same assertions/tests (Expectations) work for both fixed test input and “real” data input, since Great Expectations refers to the “characteristics” of the expected output.
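
As a minimal sketch of the first method, the snippet below wraps a pandas DataFrame with the Great Expectations library and declares two assertions directly on it. The file path and column names are placeholders chosen for illustration.

import great_expectations as ge
import pandas as pd

# Wrap a plain pandas DataFrame so that the expect_* methods become available on it
df = ge.from_pandas(pd.read_csv("data/train.csv"))

df.expect_column_values_to_not_be_null("customer_id")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)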

Leveraging Great Expectations in an automated retraining pipeline.

It is time to focus on our implementation of this framework for performing data quality checks in an automated model retraining pipeline. By automated, I mean that the machine learning model is retrained on incoming data automatically on a schedule, with no manual triggers. As we all know, the accuracy of the model output is a result of the quality of the data fed into it, hence the need to run data quality checks automatically before applying the rules for model retraining.

Steps:

  1. Deploy Great Expectations in your project folder using the command:

great_expectations init

This will initialize the framework under the project directory with the following structure:
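
The layout typically looks roughly like this (the exact folders may vary slightly between Great Expectations versions):

great_expectations/
├── great_expectations.yml      # main configuration file
├── expectations/               # expectation suites stored as JSON
├── checkpoints/                # checkpoint configurations
├── plugins/                    # custom expectations and extensions
└── uncommitted/                # config_variables.yml, data_docs/, validations/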

NOTE: When productionizing, if some files from the initialization are not needed for your use case, you can exclude them using .gitignore. Be aware that there might be challenges when deploying in a dockerized container.

2. First, you need to configure the great_expectations.yml file. In this file, you specify the data sources, stores, Data Docs, etc. according to your use case. This is the starting point from which the Great Expectations Data Context is built: in simple words, it takes the datasets specified by the user and makes them ready to be processed within the GE context.

An example of a great_expectations.yml file configured with a ‘filesystem’ datasource is shown below:
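
(The following is an illustrative V3-style sketch rather than a complete file; the datasource name, data connector name, and base directory are placeholders to adapt to your project.)

datasources:
  my_filesystem_datasource:
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      class_name: PandasExecutionEngine
    data_connectors:
      default_inferred_data_connector:
        class_name: InferredAssetFilesystemDataConnector
        base_directory: data/
        default_regex:
          group_names:
            - data_asset_name
          pattern: (.*)\.csv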

3. Configure a Python method to perform the data quality checks by importing DataContext from Great Expectations. Additionally, BatchRequest can be imported to run quality checks on a single batch of data or on multiple batches.
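
A minimal sketch of such a method could look as follows (the function name and context path are illustrative):

from great_expectations.data_context import DataContext
from great_expectations.core.batch import BatchRequest

def run_data_quality_checks():
    # Load the Data Context configured via great_expectations.yml
    context = DataContext(context_root_dir="great_expectations")
    return context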

4. Create an expectation suite in which you declare the set of expectations, in this case the data quality checks that you want to perform on your dataset.
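
For example, assuming a context loaded as in the previous step and an illustrative suite name:

suite_name = "retraining_data_quality_suite"
# Create (or overwrite) the suite inside the Data Context
context.create_expectation_suite(
    expectation_suite_name=suite_name,
    overwrite_existing=True,
)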

5. Create a batch request and specify the datasource, data connector, and data asset name, especially when you have multiple datasources configured. Here is an example of creating a BatchRequest for all the batches associated with a data asset.
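
A sketch using the placeholder names from the configuration above:

batch_request = BatchRequest(
    datasource_name="my_filesystem_datasource",
    data_connector_name="default_inferred_data_connector",
    data_asset_name="train_data",  # all batches belonging to this data asset
)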

6. Using a Validator, we can validate the batch of data against the expectation suite created above; essentially, the validator gives us access to the batches so that expectations can be run on them.
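
For instance:

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=suite_name,
)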

There are other parameters that can be included according to your requirements, e.g. datasource_name, data_connector_name, data_asset_name, etc.

7. The final step is to run the expectations by calling the Python methods provided by GE to validate the data and store the results in Data Docs format. For example:

validator.expect_table_row_count_to_be_between()

validator.expect_column_values_to_not_be_null()

validator.expect_column_values_to_be_in_set()

validator.expect_column_to_exist()

validator.expect_column_values_to_be_unique()

NOTE: The above methods can be called either without arguments or with arguments specifying the columns or values of your dataset.
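
As an illustration (the column names and thresholds are placeholders), the calls with arguments could look like this, followed by saving the suite, validating, and rebuilding the Data Docs:

validator.expect_table_row_count_to_be_between(min_value=1000, max_value=100000)
validator.expect_column_values_to_not_be_null(column="customer_id")
validator.expect_column_values_to_be_in_set(column="country", value_set=["DE", "AT", "CH"])

# Persist the suite, run the validation, and refresh the Data Docs report
validator.save_expectation_suite(discard_failed_expectations=False)
results = validator.validate()
context.build_data_docs()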

8. In this implementation, the results are stored in a dictionary so that they can easily be integrated into Lambda payloads. The results contain both successful and failed expectations: data that validates successfully against the expectation suite is marked with a ‘success’ result, while data that does not meet the expectation criteria is flagged as ‘failure’, together with a detailed breakdown of why the quality check failed. Below is a sample of a failed expectation:
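
(An illustrative, abridged entry; the expectation type, column, and counts are made up for demonstration.)

{
  "success": false,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_in_set",
    "kwargs": {"column": "country", "value_set": ["DE", "AT", "CH"]}
  },
  "result": {
    "element_count": 10000,
    "unexpected_count": 42,
    "unexpected_percent": 0.42,
    "partial_unexpected_list": ["FR", "IT", "FR"]
  }
}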

9. The above steps and lines of code are wrapped in a function and then integrated with AWS Lambda.

10. Finally, the results of the expectations are also sent as alerts to our project’s Slack channel; this is configured as part of the Python method above and specified in the great_expectations.yml file. For the steps to integrate the results with your Slack workspace, please take a look at the official GE documentation and follow a similar process.
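
For orientation, a Slack notification action in the GE configuration looks roughly like the following sketch; the webhook is typically injected via a config variable or environment variable, and its exact placement (under a validation operator or a checkpoint’s action list) depends on the GE version you use:

- name: send_slack_notification_on_validation_result
  action:
    class_name: SlackNotificationAction
    slack_webhook: ${validation_notification_slack_webhook}
    notify_on: all
    renderer:
      module_name: great_expectations.render.renderer.slack_renderer
      class_name: SlackRenderer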

There is a lot of additional work in this use case that is beyond the scope of this blog.

Challenges faced while deploying Great Expectations in AWS Lambda:

AWS Lambda is one of the most widely used serverless environments in projects that aim to build an automated retraining pipeline. Since I was using Great Expectations for the first time, I faced some difficulties while deploying the package and testing the AWS Lambda function. As mentioned above, creating a Data Context is the starting point for Great Expectations to work. Initially, the DataContext used a local filesystem path, which AWS Lambda was not able to resolve. We must remember that AWS Lambda has ephemeral storage and only offers a temporary file system under /tmp for folders and files created at runtime. Although in this use case the AWS Lambda function was wrapped in a Docker container, the same restriction applies in both scenarios. The solution was to replace the local filesystem path of the GE DataContext with the temporary file system path, as shown below:

ge_context = DataContext("/tmp/great_expectations")
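
Since the deployment package itself is read-only at runtime, one workable pattern is to copy the GE project folder into /tmp on cold start before creating the context. This is a sketch under the assumption that the great_expectations folder is shipped alongside the Lambda handler:

import os
import shutil

from great_expectations.data_context import DataContext

GE_SOURCE = os.path.join(os.path.dirname(__file__), "great_expectations")
GE_TMP = "/tmp/great_expectations"

def get_ge_context():
    # Copy the packaged GE project into Lambda's writable /tmp storage once per container
    if not os.path.exists(GE_TMP):
        shutil.copytree(GE_SOURCE, GE_TMP)
    return DataContext(GE_TMP)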

Summary

It is interesting to see how many different components this tool offers from a data quality perspective. It is certainly worthwhile for MLOps practitioners to play around with this tool if they want to set up an automated pipeline.

Some remarkable points to note: Great Expectations can be integrated with different DAG execution tools such as Spark, Airflow, dbt, Prefect, Dagster, Kedro, Flyte, etc., and additionally with Databricks, MySQL, AWS Redshift, AWS S3, Lambda, BigQuery, and more. With this many integrations, productionizing the data quality aspect becomes easier, since in most projects it is all about continuous integration and continuous deployment. Despite the many pros, we can also foresee some cons. Browsing the expectations gallery, some expectation classes or methods are still in Beta or early Beta testing and have not yet been promoted to production. This can be somewhat of a discomfort, as new versions are rolled out frequently and we must make sure to migrate our code to the latest release from the Great Expectations community. For instance, I ran into an issue with their V2 API in the early stages and only learned about the newer V3 API when I noticed that some parts of my code did not work as expected. I therefore had to migrate my code snippets to the V3 API with reference to the new docs. These are cons that one can address early on while prototyping, taking the necessary precautions before running into complications.

Conclusion

Great Expectations is a great framework for performing data quality checks in a very flexible and easily extendable way. Expectations can also be customized, and new Python methods or classes can be defined to perform specific data quality checks if the project or use case demands it. Additionally, this package offers different components that are worth exploring and exploiting! It saves time when testing locally and when productionizing the same checks in the pipeline. The Slack community is always there to guide you along the way while playing with the tool. It is a win-win situation: they continuously develop the Great Expectations API while we actively consume its different features.
