Know Your Data Pipelines with Great Expectations

Prabha Rashmi
Hashmap, an NTT DATA Company
6 min read · Jan 26, 2021

Data Testing, Data Profiling, and Data Validation

I remember bedtime stories having happy endings with the moral “Der Aaye Durust Aaye” (in Hindi), which means “better late than never.”

But in the data world what works is “the sooner, the better.” Otherwise you fall into an endless trap of root cause analysis, rework, frustration and negotiation.


Data pipelines are an inseparable part of the big data world, and so are their problems. Every day we receive batches of data without knowing whether a batch actually contains relevant data, whether it matches our data model and schema expectations, or whether it meets our logical expectations.

A single instance of data drift, an outlier, an edge case, or a change in what the data looks like, and boom: you get unexpected results in your dashboards without knowing the point in the pipeline where the problem first entered. Since a new batch of data arrives daily, the pipeline's data is untested, undocumented, and therefore volatile.

We need a tool that can help test the data in our pipeline, document the results, and notify us of failures and the reasons for these failures, in turn saving us time that would otherwise be invested in root cause analysis.

This is where Great Expectations comes to our rescue.

Great Expectations is a Python framework that helps automate data profiling, testing, and documenting.

Key terms you should know before starting:

  1. Data Source: Connection to data that you want to test.
  2. Batch kwargs: Configuration for generating a batch of data. A batch can be a table or a file — or a subset of one of these objects.
  3. Expectations: Assertions/tests for the data batch. A group of expectations is called an expectation suite (a short example follows this list).
  4. Checkpoint: An object you can trigger to validate your data source against your expectations.
  5. Validations: Validation results for your data against some expectations.
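
To make this concrete, here is a minimal sketch of a few built-in expectations applied to a CSV file using the pandas-backed API. The file name and column names are only placeholders.

import great_expectations as ge

# Load a CSV as a Great Expectations dataset (file and columns are placeholders)
batch = ge.read_csv("orders.csv")

# Built-in expectations; each call returns a result with a success flag
batch.expect_column_values_to_not_be_null("order_id")
batch.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)

# The expectations asserted so far form an expectation suite
suite = batch.get_expectation_suite()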

How to use Great Expectations?

As a prerequisite, you should have Python and/or Jupyter notebook installed in your environment.

Installation:

pip install great_expectations

Project initialization:

great_expectations init

After running this command, you will get a project directory as follows:
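
The exact contents vary by version, but the scaffold looks roughly like this:

great_expectations/
├── great_expectations.yml      # main project configuration
├── expectations/               # expectation suites (stored as JSON)
├── checkpoints/                # checkpoint configurations
├── plugins/                    # custom expectations and extensions
└── uncommitted/                # local-only files: credentials, data docs, validations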

It will walk you through the initial steps of project setup. You can follow along or configure these steps at a later stage.

The CLI then lets you choose among the available data source options: files on a filesystem (processed with Pandas or Spark) or a relational database (accessed through SQLAlchemy).

How do computations happen in Great Expectations?

Great Expectations uses the concept of pushing compute to the data. If your data is on your local system or in cloud storage, you can use Pandas or Spark for computation. If you use a cloud data warehouse, Great Expectations converts your expectations to native SQL using SQLAlchemy and runs that code in your data warehouse.

If you want to test files on a filesystem, Great Expectations lets you choose between pandas and Spark for computations.

If you follow along, you will get automatically profiled expectations and validation results for the data source you selected, both as a web page and as JSON files.

As a domain expert, you may want to write your own expectations or edit the automatically profiled expectations for your data batch. You can do that easily using the built-in expectations, or by writing custom expectations, via the suite commands shown below.
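
Depending on your Great Expectations version, the relevant CLI commands look like this (both open a Jupyter notebook in which you can add or modify expectations; <suite_name> is a placeholder):

great_expectations suite new
great_expectations suite edit <suite_name>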

You can use the following command for profiling data source:

great_expectations profile DATASOURCE_NAME

Create a Checkpoint:

Checkpoints can be triggered to validate your data source against your expectations. They provide a way to embed Great Expectations functionality in your data pipeline.

great_expectations checkpoint new <checkpoint_name> <expectation_suite_name>

Run a checkpoint:

  1. From the CLI, you can run the following command:
 great_expectations checkpoint run <checkpoint_name>

  2. Or, to run your checkpoint from a Python script, first generate the script with the command:

great_expectations checkpoint script <checkpoint_name>

This will create a run_<checkpoint_name>.py file that you can execute to trigger the checkpoint. You can use this file to embed tests in your pipeline.
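
If you would rather call Great Expectations directly from your pipeline code instead of running the generated script, a minimal sketch looks like this. It assumes a recent version in which DataContext.run_checkpoint is available, and "my_checkpoint" is a placeholder name.

from great_expectations.data_context import DataContext

# Load the project configuration from the great_expectations/ directory
context = DataContext()

# Trigger the checkpoint; this validates the configured batches against their suites
result = context.run_checkpoint(checkpoint_name="my_checkpoint")

# Fail the pipeline step if any expectation was not met
# (depending on the version, result.success may also be available)
if not result["success"]:
    raise ValueError("Data validation failed for checkpoint my_checkpoint")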

What will you get from running a checkpoint?

  1. A validation result for the batch, as a JSON file as well as a static web page.
  2. A notification (for example, to Slack), if configured.
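
The validation result JSON is fairly detailed; an abridged, illustrative excerpt (the numbers here are made up) looks roughly like this:

{
  "success": false,
  "statistics": {
    "evaluated_expectations": 12,
    "successful_expectations": 11,
    "unsuccessful_expectations": 1,
    "success_percent": 91.7
  },
  "results": ["... one entry per expectation, with observed values ..."]
}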

Here is a workflow showing how you can leverage the capabilities of Great Expectations to integrate your user acceptance criteria and quality assurance tests in your data pipeline.

(Workflow diagram from greatexpectations.io)

Final Thoughts

Data docs are generated every time a batch is tested, so the documentation never goes stale. You can host these static web pages in the cloud and share them with your team, so everyone always knows what is going on with the data.

Also, rather than making every expectation strictly pass/fail, you can set a tolerance value for an expectation; for example, at least 95% of a column's values should be non-null.
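
In the Python API, this tolerance is expressed with the mostly argument of the built-in expectations; a one-line sketch (reusing the batch object from the earlier example, with a placeholder column name):

# Pass the expectation as long as at least 95% of the values are non-null
batch.expect_column_values_to_not_be_null("customer_id", mostly=0.95)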

Apart from simply showing that a test failed, Great Expectations provides statistics such as how many values passed and failed for a particular expectation, speeding up data quality error detection.

The Great Expectations tool has immense potential, and its community is growing rapidly. Updates and improvements are being continuously made. It can revolutionize the way data pipelines are maintained and tested, saving a lot of time and effort.

Ready to Accelerate Your Digital Transformation?

At Hashmap, we work with our clients to build better together.

If you’d like additional assistance in this area, Hashmap, an NTT DATA Company, offers a range of enablement workshops and consulting service packages as part of our service offerings, and I would be glad to work through your specifics. Reach out to us here.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast, Hashmap on Tap, on Spotify, Apple, Google, and other popular streaming apps.


Prabha Rashmi is a Cloud and Data Engineer with Hashmap, an NTT DATA Company, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure to connect with Prabha on LinkedIn.
