Great Expectations


Introduction

Great Expectations (GX) is an open-source Python library for data validation that allows you to define, manage, and automate the validation of data inside your data pipelines. In this blog post, we will showcase the power and flexibility of Great Expectations by walking through an end-to-end pipeline, using the tool with the Airflow workflow management platform. The goal is to demonstrate the practical application of Great Expectations in a real-world scenario.

What is Great Expectations?

To begin with, let’s go through what Great Expectations is all about. Usually, most pipeline complexity lives in the data, not the code. So instead of just testing code, we should also be testing data. Great Expectations provides a way to define and enforce expectations about data, which can be used to detect errors or unexpected behavior in data pipelines. With Great Expectations, you can easily set up expectations for your data, such as the expected value range of a particular column or the expected number of rows in a dataset.

These expectations can be considered as unit tests for data and can be defined at any point in the data pipeline, from the input source to the output destination. Great Expectations automatically evaluates your data against these expectations, provides you with the results, and alerts you of any discrepancies.
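To make this concrete, here is a minimal sketch of what defining and checking a few Expectations can look like in Python, following the GX 0.16 quickstart pattern; the file name and column names are placeholders and not part of the demo that follows:

import great_expectations as gx

context = gx.get_context()

# Read a CSV into an in-memory Pandas batch; "my_data.csv", "id" and "amount" are placeholders
validator = context.sources.pandas_default.read_csv("my_data.csv")

# Two simple Expectations: no missing ids, and amounts within a plausible range
validator.expect_column_values_to_not_be_null("id")
validator.expect_column_values_to_be_between("amount", min_value=0, max_value=10000)

# Evaluate the batch against the Expectations defined above
results = validator.validate()
print(results.success)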


What does Great Expectations NOT do?

To understand Great Expectations better, it is important to know what it doesn’t do:

  • Great Expectations doesn’t serve as a pipeline execution framework. Instead, it easily integrates with DAG execution tools such as Airflow or Spark. It doesn’t execute pipelines, but instead, validation can be run as a step in the pipeline. For example, when using GX with Airflow, you should simply install the GreatExpectationsOperator, set its parameters, and use it as you would any other operator in Airflow.
  • Great Expectations doesn’t function as a database or storage software. Instead, it operates by processing your data in place, directly at the source.
  • Great Expectations doesn’t restrict itself to Python programming environments. It offers the flexibility to invoke it from the command line without the need for a Python environment.
  • Great Expectations doesn’t serve as a data versioning tool. It doesn’t store data itself but rather deals with metadata about data: Expectations, Validation Results, etc.

Key features

Great Expectations offers several key features that make it a valuable tool for validating and documenting data:

  • Data Validation: Great Expectations allows you to create data validation rules or expectations for your data. The tool checks if the data meets expectations and returns a report on the validation results.
  • Customizable Expectations: It allows you to create custom expectations or rules to suit your specific data needs. You can create expectations based on the data type, format, or even business logic.
  • Automated Testing: Great Expectations offers automated testing functionality, which helps you catch data pipeline issues early in the development process.
  • Documentation: Great Expectations can automatically generate data documentation for your data pipelines, making it easy to keep track of changes over time.
  • Data Profiling: Great Expectations can also automatically profile your data to determine basic statistics such as the mean, standard deviation, and null values.
  • Support for various Datasources and Store backends: Great Expectations currently supports native execution of Expectations against various Datasources, such as AWS Redshift, AWS S3, BigQuery, Google Cloud Platform (GCP), Microsoft Azure Blob Storage, Snowflake, Spark, Trino, etc. (you can find the full list here). It is worth noting that you have the flexibility to validate your data even if it is not stored in a traditional database. For instance, you can utilize the in-memory datasource to load data from a CSV file directly into a Pandas DataFrame, enabling you to evaluate your source data efficiently. Furthermore, Great Expectations allows you to store all relevant metadata, such as the Expectations and Validation Results in file systems, database backends, as well as cloud storage such as S3, Google Cloud Storage, and Azure Blob Storage, by configuring metadata Stores.
  • Integration with DAG execution tools: Great Expectations integrates with DAG execution tools like Spark, Airflow, dbt, Prefect, Dagster, Kedro, etc.
  • Security and transparency: Great Expectations doesn’t ask you to exchange security for your insight. It processes your data in place, at the source, so your security and governance procedures can maintain control at all times.

How does it work?

As already mentioned, Great Expectations works by allowing you to define expectations for your data. Expectations can be defined anywhere in your data pipeline. When the data reaches the validation point, the tool will automatically check your data against these expectations, providing you with the results and alerting you to any discrepancies.

The great_expectations library can be used either through a Python API or a Command Line Interface (CLI). CLI commands either run entirely in the terminal or launch Jupyter notebooks. Notebooks are a simple way of interacting with the Great Expectations Python API. All notebooks you use will automatically be saved in a project. You could also just write all of this in plain Python code, but Great Expectations provides some boilerplate code and additional comments in these notebooks to assist you in completing a task that requires a more complex configuration.

Great Expectations also offers an online interface, currently in beta. The GX Cloud interface is being marketed as a fully managed SaaS that offers expectation editing directly in the UI, built-in trend visualizations, and persistent links for Data Docs and Validation Results.

Now that we have a clear understanding of what Great Expectations is and what we will be doing in this blog post, let’s get started with the demo and explain Great Expectations in more detail.

Demo

Introduction to dbt

Before we dive in, let’s take a moment to introduce dbt (data build tool). While the primary focus of this demo is on Great Expectations, it is also important to understand the role of dbt since it is used in the pipeline.

dbt is a powerful tool that enables you to transform data in warehouses by writing SQL select statements. It acts as a data transformation and modeling tool within your data pipelines. With dbt, you can define, test, and document your data transformation logic using SQL, ensuring a structured and organized approach to your workflows. It connects to various data platforms such as Snowflake, BigQuery, Redshift, Postgres, and Databricks, allowing you to run SQL queries against your chosen data platform.

In our demo, we will use dbt after loading the source files into a database. dbt will be leveraged to create a simple analytical table. To seamlessly transform the data within the database, dbt takes SQL select statements as input and efficiently converts them into the appropriate DDL (Data Definition Language) or DML (Data Manipulation Language) commands. While dbt doesn’t handle data extraction or loading, it excels at transforming data that’s already loaded into your warehouse. In other words, dbt acts as the transformation stage (T) in ELT processes. Each stage of the transformation pipeline in dbt is referred to as a “model” and is defined in a SQL file.

Now that we have a brief understanding of dbt, let’s proceed with exploring how Great Expectations integrates with dbt to provide comprehensive data validation capabilities within our pipeline.

Demo overview

This demo is an adaptation of the official Great Expectations tutorial available on their GitHub repository. However, the original tutorial was based on an older version of GX. To provide you with the most up-to-date experience, this demo has been updated to align with the latest version, which is 0.16.8 at the time of writing. By following this demo, you will experience the latest features and improvements offered by Great Expectations.

For the demo, we will use a simple pipeline that loads the source CSV files into a PostgreSQL database using SQLAlchemy and then uses dbt to create a simple analytical table. Great Expectations is used to add three data validation steps. The first one at the beginning of a pipeline is to validate the input CSV files. The next step comes after the loading of data into the PostgreSQL database. It is used to validate that the data was loaded into the database successfully. The last validation step is added after the transformations to validate the analytical results. The whole pipeline, with its data validations, is automated using the Airflow platform.

To provide a visual representation of the automated pipeline and its data validation steps, the following Airflow DAG illustrates the sequence of tasks in the pipeline, including the Great Expectations validation steps.

Airflow DAG tasks
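Since the diagram is easier to follow with code alongside it, here is a minimal sketch of how such a DAG could be wired. The GreatExpectationsOperator is covered in Step 5; the task names, the load_files_into_db helper, and the dbt command are illustrative assumptions rather than the exact code from the repository:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

GX_ROOT = "great_expectations"  # path to the project's Data Context (use an absolute path in practice)


def load_files_into_db():
    # Hypothetical helper: read the CSVs with pandas and write them to the tutorials_db database
    ...


with DAG(
    dag_id="pipeline_with_gx",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    validate_source_data = GreatExpectationsOperator(
        task_id="task_validate_source_data",
        data_context_root_dir=GX_ROOT,
        checkpoint_name="npi_small_file_checkpoint",
        return_json_dict=True,
    )
    load_files = PythonOperator(
        task_id="task_load_files_into_db",
        python_callable=load_files_into_db,
    )
    validate_db_data = GreatExpectationsOperator(
        task_id="task_validate_db_data",
        data_context_root_dir=GX_ROOT,
        checkpoint_name="npi_small_db_checkpoint",
        return_json_dict=True,
    )
    run_dbt = BashOperator(
        task_id="task_run_dbt_model",
        bash_command="dbt run",
    )
    validate_analytical_output = GreatExpectationsOperator(
        task_id="task_validate_analytical_output",
        data_context_root_dir=GX_ROOT,
        checkpoint_name="providers_by_state_checkpoint",
        return_json_dict=True,
    )

    validate_source_data >> load_files >> validate_db_data >> run_dbt >> validate_analytical_output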

Dataset

As already mentioned, the dataset used in this demo consists of two CSV files.

The npi_small.csv file contains information about healthcare providers. Each provider is identified by a unique number called the National Provider Identifier (NPI). Here is a small sample of the npi_small.csv file:

NPI,Entity_Type_Code,Organization_Name,Last_Name,First_Name,State,Taxonomy_Code
1457900839,2.0,TEXAS CLINIC OF CHIROPRACTIC,,,TX,111N00000X
1255519047,1.0,,BRYANT-JONES,MARIA,FL,261QH0700X
1366091746,1.0,,JONES,EBONY,DC,3747P1801X

The other CSV file is named state_abbreviations.csv, and it contains the USA’s states and territories that are mentioned in the npi_small.csv file. Here is a sample of the state_abbreviations.csv file:

"name","abbreviation"
"Alabama","AL"
"Alaska","AK"
"Florida","FL"

Prerequisites

This demo assumes you have:

  • A supported version of Python (version 3.7 or greater)
  • An empty PostgreSQL database tutorials_db that can be accessed using an SQLAlchemy connection URL
  • Airflow:
    - Make sure you have Airflow installed and set up
    - Point the dags_folder in airflow.cfg to the root directory of this project
  • dbt:
    - Make sure that you have dbt installed and set up
    - Add your database credentials to dbt_profile.yml

Step 1: The setup process

To get started with Great Expectations, we first need to create a new project. Since Great Expectations is a Python library, we will use PyCharm for this demo. As a best practice, it is recommended to use a virtual environment to partition your GX installation from any other Python projects that may exist on the same system. One way to do this is by following these steps:

  1. Open your PyCharm project and go to File -> Settings.
  2. In the left-hand panel, click on Project -> Python Interpreter.
  3. Click on the gear icon in the top-right corner of the panel and select Add Interpreter.
  4. Choose Virtual Environment and select the Python interpreter version you want to use.
  5. Choose a location for your virtual environment and click Create.
  6. Once the virtual environment is created, select it from the list of available interpreters and click OK.

If you’re using Windows, you can activate your new virtual environment by navigating to the root of your project and running the following command:

.\venv\Scripts\activate

Now, you can install the dependencies needed for this demo:

pip install great_expectations==0.16.8
pip install sqlalchemy==1.4.16
pip install apache-airflow==2.6.1
pip install psycopg2==2.9.6
pip install airflow-provider-great-expectations==0.1.1

Step 2: Initializing Data Context

Now that we have installed GX with the necessary dependencies for working with SQL databases and Airflow, we are ready to initialize the Data Context. The Data Context contains the entirety of your Great Expectations project and provides the entry point for all of the primary methods you will use to configure and interact with Great Expectations. The scope of a Data Context object depends on your specific use case and requirements. Generally speaking, you can create a Data Context for a domain, a single database, or multiple databases in a given environment. However, creating different Data Contexts for overlapping sources is discouraged, as it leads to redundancy and increased maintenance. If you have overlapping sources that require different Expectations or configurations, it is better to create separate Suites (collections of Expectations) within the same Data Context rather than multiple Data Contexts. This approach keeps things consistent and avoids duplicated effort while still accommodating the specific requirements of each source.

The simplest way to create a new Data Context is by using Great Expectations’ CLI. From the directory where you want to deploy Great Expectations, run the following command:

great_expectations init

This command creates a new directory called great_expectations with the following structure:

great_expectations 
|-- great_expectations.yml
|-- expectations
|-- checkpoints
|-- plugins
|-- .gitignore
|-- uncommitted
|-- config_variables.yml
|-- data_docs
|-- validations

The layout of the repository above is specific to the current version of GX (0.16.8).

Great Expectations is still a rapidly evolving tool, with new features and updates constantly being released. The repository layout itself has already changed significantly across versions prior to 0.16.8, and it may well change again in the future.

Inside your great_expectations directory, you will find great_expectations.yml (the main configuration of your deployment) and several subdirectories, including expectations (stores all your expectations as JSON files), checkpoints (stores your checkpoints – pairs of expectation suites and datasets), plugins (code for any custom plugins you develop), and uncommitted (files that shouldn’t live in version control, such as validations and data docs). By default, all of the files used in your project (expectations, validations, data_docs) will be stored on a filesystem in these folders, but you can also store them in other supported stores, as we will see later.

There is also a MetricStore for storing metrics computed during validation, but it is still an experimental feature of Great Expectations.
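Once the project is initialized, the Data Context can also be loaded and inspected from plain Python; a quick sketch (method names as of GX 0.16):

import great_expectations as gx

# Load the Data Context defined in the current working directory
context = gx.get_context()

print(context.list_datasources())              # configured Datasources
print(context.list_expectation_suite_names())  # saved Expectation Suites
print(context.list_checkpoints())              # saved Checkpoints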

If you have any credentials you want to use, you can decide where you would like to save them — in a YAML file, environment variables, or a combination. In most cases, it is suggested to use a config variables YAML file because YAML files make variables more visible, easily editable, and allow for modularization (e.g., one file for dev, another for prod). If you choose to use a YAML file, save the desired credentials or configuration values to great_expectations/uncommitted/config_variables.yml. By default, Great Expectations uses this file to store configuration variables and credentials. However, you can configure the file name and location by modifying the great_expectations.yml file. The default configuration is set as follows:

config_variables_file_path: uncommitted/config_variables.yml

For the purposes of this demo, we will have to use credentials for our PostgreSQL database. We will add our credentials later, after creating a Datasource.

Step 3: Connecting to the data

Connecting to your data is built around the Datasource object. Datasources tell Great Expectations where your data lives and how to get it. Our pipeline requires two datasources. One is for the CSV files on a filesystem, and the other is for data inside the relational database. The recommended way for creating datasources is by using the CLI command, which will open a Jupyter Notebook for you to complete with your specific needs.

As we already mentioned, we will use two CSV files for this demo. To get started, we need to copy these files into our project. We will create a data folder at the root of the project repository and place the two CSV files inside it.

Files on a filesystem

For our first datasource, we will run the command: great_expectations datasource new. This will result in a few prompts, the first of which is:

What data would you like Great Expectations to connect to?
1. Files on a filesystem (for processing with Pandas or Spark)
2. Relational database (SQL)

Since we are using CSV files in the data folder, we choose option 1. Then, we will get the second prompt:

What are you processing your files with?
1. Pandas
2. PySpark

After choosing Pandas and selecting the datasource name, a new Jupyter Notebook will open, where we will customize our datasource configuration.

The Data Context initialized by gx.get_context() will be the one defined in your current working directory. It provides the convenience methods used for validating the configuration and adding the datasource to our project.

Then we will create a dictionary for our datasource configuration. At the top level, the Datasource’s configuration will need the following keys:

  • name: The name of the Datasource, which will be used to reference the datasource in Batch Requests.
  • class_name: The name of the Python class to instantiate. Typically, this will be the Datasource class.
  • module_name: The name of the module that contains the Class definition indicated by class_name.
  • execution_engine: A dictionary containing the class_name and module_name of the Execution Engine instantiated by the Datasource.
  • data_connectors: The configurations for any Data Connectors and their associated Data Assets that you want to have available when utilizing the Datasource.

More about all these configurations can be found here.

When the configuration is ready, you can validate it and then add the new datasource to your Data Context by running the rest of the cells in the notebook.

The whole process of creating a datasource will look like this:

from great_expectations.core.yaml_handler import YAMLHandler
import great_expectations as gx

yaml = YAMLHandler()
data_context: gx.DataContext = gx.get_context()

datasource_config: dict = {
    "name": "input_files",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "class_name": "PandasExecutionEngine",
        "module_name": "great_expectations.execution_engine",
    },
    "data_connectors": {
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetFilesystemDataConnector",
            "base_directory": "../data",
            "default_regex": {
                "pattern": "(.*)\\.csv",
                "group_names": ["data_asset_name"],
            },
        },
    },
}

data_context.test_yaml_config(yaml.dump(datasource_config))

# If a Datasource with the same name is already created, then this command will overwrite it
data_context.add_datasource(**datasource_config)
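To confirm that the new datasource can actually reach the files, you can request a batch through it and take a quick look; a minimal sketch, where the scratch suite name exists only for this check:

import great_expectations as gx
from great_expectations.core.batch import BatchRequest

data_context = gx.get_context()

# Reference the Datasource through a Batch Request; the asset name is
# inferred from the file name (data/npi_small.csv -> "npi_small")
batch_request = BatchRequest(
    datasource_name="input_files",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="npi_small",
)

# A throwaway suite, just to obtain a Validator and preview the batch
data_context.create_expectation_suite("scratch.npi_small", overwrite_existing=True)
validator = data_context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="scratch.npi_small",
)
print(validator.head())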

Relational database

For validating data inside a PostgreSQL database, we have to create a second datasource. We will use the same command great_expectations datasource new, but this time we will choose option 2. Relational database (SQL) at the first prompt. At the second prompt, we will choose option 2 again, since we are using a Postgres database.

Which database backend are you using?
1. MySQL
2. Postgres
3. Redshift
4. Snowflake
5. BigQuery
6. Trino
7. Athena
8. other - Do you have a working SQLAlchemy connection string?

A similar notebook will be created as in the previous step, but now we will configure it a bit differently. class_name and module_name are the same as before, but the execution_engine is where we will specify that we want this Datasource to use SQL in the backend. Additionally, the execution_engine dictionary will require values for either connection_string or credentials. Only one of these keys is needed, as they each serve the same purpose: to provide the parameters necessary for the SqlAlchemyExecutionEngine to connect to the desired database. By default, Great Expectations uses credentials and prompts you to input your credentials as strings. These credentials are then hardcoded in the great_expectations.yml file, which is not an ideal solution. Instead, we will utilize the config_variables.yml file to securely store and manage the credentials.

The configuration of the datasource should look like this:

my_postgres_db:
  class_name: Datasource
  module_name: great_expectations.datasource
  execution_engine:
    module_name: great_expectations.execution_engine
    class_name: SqlAlchemyExecutionEngine
    credentials: ${my_postgres_db_yaml_creds}
  data_connectors:
    default_inferred_data_connector_name:
      class_name: InferredAssetSqlDataConnector

We also have to configure credentials for our PostgreSQL database that we have referenced inside our datasource configuration. We will do this by adding my_postgres_db_yaml_creds to our great_expectations/uncommitted/config_variables.yml file. It will be formatted like this:

my_postgres_db_yaml_creds:
  drivername: postgresql
  username: postgres
  password: <password>
  host: localhost
  database: tutorials_db
  port: '5432'

After creating both datasources, you can find their configurations in great_expectations.yml, where you can manually reconfigure them if needed.

Step 4: Creating expectations

When creating expectations for your datasources, it is best to use the CLI to generate a helper notebook using the command: great_expectations suite new. This will result in a prompt:

How would you like to create your Expectation Suite?
1. Manually, without interacting with a sample Batch of data (default)
2. Interactively, with a sample Batch of data
3. Automatically, using a Data Assistant

For this demo, we will choose the default option 1, and then select the name for our expectation suite. The new notebook will be created.

All the notebooks you use inside a project will be saved in the great_expectations/uncommitted folder.

The first cell in the notebook includes importing necessary modules and creating an empty suite, and you should just run it. Afterward, you should add a new cell with an expectation configuration. This configuration specifies the type of expectation being added and its associated arguments.
For example, the code snippet below demonstrates adding an expectation that checks if the columns of a table match an ordered list:

# Create an Expectation
expectation_configuration = ExpectationConfiguration(
    # Name of expectation type being added
    expectation_type="expect_table_columns_to_match_ordered_list",
    # These are the arguments of the expectation
    # The keys allowed in the dictionary are Parameters and
    # Keyword Arguments of this Expectation Type
    kwargs={
        "column_list": [
            "NPI", "Entity_Type_Code", "Organization_Name", "Last_Name", "First_Name", "State", "Taxonomy_Code"
        ]
    },
    # This is how you can optionally add a comment about this expectation.
    # It will be rendered in Data Docs.
    meta={
        "notes": {
            "format": "markdown",
            "content": "Some clever comment about this expectation. **Markdown** `Supported`"
        }
    }
)

# Add the Expectation to the suite
suite.add_expectation(expectation_configuration=expectation_configuration)
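For illustration, here are two more expectations that fit the npi_small data and could be added to the same suite in the same way; the value set is an assumption based on the sample rows shown earlier:

# No provider should be missing its National Provider Identifier
suite.add_expectation(
    expectation_configuration=ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "NPI"},
    )
)

# Entity_Type_Code should only take the values seen in the source data (assumed: 1.0 and 2.0)
suite.add_expectation(
    expectation_configuration=ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_in_set",
        kwargs={"column": "Entity_Type_Code", "value_set": [1.0, 2.0]},
    )
)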

You can add any number of expectations to your suite. After you are finished adding them, you should run the final cell in the helper notebook to save your expectation suite as a JSON file in the great_expectations/expectations folder. Besides JSON files, you can also find HTML files in great_expectations/uncommitted/data_docs/expectations/ for each of your expectation suites.

In the image below, you can see an expectation suite named npi_small_file.critical, which includes the expectation described in the previous paragraph. The image provides a visual representation of the expectation suite and its contents.

Expectations

For this demo, we have created three expectation suites:

  • npi_small_file.critical – for validating the input npi_small.csv file
  • npi_small_db_table.critical – for validating the data loaded into the PostgreSQL database
  • count_providers_by_state.critical – for validating the analytical output

Step 5: Validating data

Once you have defined your expectations, you can use Great Expectations to validate your data. The recommended workflow for validating data is through the use of Checkpoints. A Checkpoint defines which expectations should be run against which datasets. It validates data, saves the validation results, runs any Actions you have specified, and finally generates Data Docs that provide detailed documentation of the validation outcomes.

Creating a Checkpoint

To assist you with creating Checkpoints, the Great Expectations CLI again has a convenience method that will open a Jupyter Notebook with the boilerplate code you need to easily configure and save your Checkpoint. Simply run the following CLI command from your Data Context:

great_expectations checkpoint new <checkpoint_name>

This will open a new notebook that contains cells for creating, customizing, testing, reviewing, and saving the checkpoint to your checkpoint store. Customize your configuration for a specific case.

For example, for the npi_small_file.critical suite, we will create the checkpoint npi_small_file_checkpoint and use the following configuration:

name: npi_small_file_checkpoint
config_version: 1.0
class_name: Checkpoint
run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
validations:
  - batch_request:
      datasource_name: input_files
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: npi_small
      data_connector_query:
        index: -1
    expectation_suite_name: npi_small_file.critical
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction

After executing the cells from the notebook, your checkpoint will be saved to your checkpoint store, which is, in our case, a folder inside the project — great_expectations/checkpoints/.
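If you prefer to skip the notebook, the same test-and-save step can be performed directly in Python; a minimal sketch, assuming yaml_config holds the configuration above as a string:

import great_expectations as gx
from ruamel.yaml import YAML

yaml = YAML()
context = gx.get_context()

# Sanity-check the configuration, then persist the Checkpoint to the Checkpoint Store
context.test_yaml_config(yaml_config=yaml_config)
context.add_checkpoint(**yaml.load(yaml_config))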

In the same way, we create the three checkpoints needed for this demo: npi_small_db_checkpoint, npi_small_file_checkpoint, and providers_by_state_checkpoint.

Running a Checkpoint

Running the checkpoint once it is fully set up is very straightforward. It can be done either from the CLI or with a Python script. One way is to generate the Python script by running the CLI command:

great_expectations checkpoint script npi_small_file_checkpoint

After the command runs, you will see a message about where the Python script was created:

- The script is located in `great_expectations/uncommitted/run_npi_small_file_checkpoint.py`
- The script can be run with `python great_expectations/uncommitted/run_npi_small_file_checkpoint.py`

Alternatively, the Python code generated at that location can be embedded directly in your pipeline.
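For reference, the embedded version boils down to just a few lines; a minimal sketch of running a Checkpoint from Python:

import great_expectations as gx

context = gx.get_context()

# Run the Checkpoint and inspect the overall validation outcome
result = context.run_checkpoint(checkpoint_name="npi_small_file_checkpoint")
print(result.success)  # True only if every Expectation in the suite passed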

Since we are using Airflow to automate the whole pipeline, there is no need to create the script at all. We will simply use the GreatExpectationsOperator. To use it, we only need the Data Context and a Checkpoint, both of which we have already created. For example, it can look like this:

task_validate_source_data = GreatExpectationsOperator(
    task_id="task_validate_source_data",
    data_context_root_dir=great_expectations_context_path,
    checkpoint_name="npi_small_file_checkpoint",
    return_json_dict=True,
    dag=dag
)

In the same way, we will create the other two Airflow tasks that use GreatExpectationsOperator for validating data.

Step 6: Running the demo

To run the Great Expectations demo, simply follow the steps outlined in this blog post. If you want to see the code for yourself, check out the GitHub repository. By following the steps exactly as written, you’ll be able to replicate the repository and get started with Great Expectations.

You can run each individual task with airflow tasks test pipeline_with_gx <task_name>. To run the entire DAG, use airflow dags test pipeline_with_gx.

New HTML files will be created for each execution of a validation task, and they will be saved in great_expectations/uncommitted/data_docs/validations/.
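If you want to rebuild or browse the Data Docs outside of a pipeline run, the Data Context exposes convenience methods for that as well; a short sketch:

import great_expectations as gx

context = gx.get_context()
context.build_data_docs()  # (re)generate the HTML site from the stored Validation Results
context.open_data_docs()   # open the local Data Docs site in your browser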

Here you can see an example of the validation result for npi_small_file_checkpoint:

Validation results

Benefits of using Great Expectations

As we have seen in the previous sections, Great Expectations offers several benefits that make it a great tool for data engineers and scientists. In this section, we will highlight some of them.

One of the significant advantages of using Great Expectations is the amount of time it saves. Traditional methods of data validation are usually time-consuming and require manual intervention, which can lead to errors and delays. With Great Expectations, you can automate the validation process and ensure that your data is always accurate and reliable. This not only saves time but also reduces the risk of errors.

Another benefit of Great Expectations is its flexibility. The tool can be used with a wide range of data sources and platforms. This means that you can use it with different databases and file formats, making it easier for data scientists and engineers to work with. Additionally, Great Expectations integrates with popular data science tools such as Jupyter Notebooks and Apache Spark, making it easy to incorporate into your existing workflow. This allows you to build more robust data pipelines and analyze data more efficiently.

Finally, Great Expectations provides a high level of transparency and visibility into your data pipelines. The tool generates detailed reports and alerts that allow you to quickly identify and address issues with your data. This means that you can easily troubleshoot your data pipelines and make changes as needed to ensure that your data is always accurate and reliable.

Conclusion

Great Expectations is a powerful tool that has big potential in the field of data quality validation and testing. It helps data teams maintain the quality of their data inside their pipelines. By automating the validation of your data, you can save time, reduce errors, and ensure that your data is always accurate and reliable. Although it is still in its early stages, with the current version being 0.16.8 at the time of writing this blog post, ongoing development and changes are taking place to improve its functionality, documentation, and user interface.

As demonstrated in this blog post, Great Expectations can be configured for various use cases, making it a versatile tool. Its open-source nature, ease of use, and flexibility make it a promising preventative measure and insurance against bad data flowing downstream, addressing data quality issues effectively. As Great Expectations continues to evolve, we can expect it to become a go-to tool for data quality validation and testing.

References

https://docs.greatexpectations.io/docs/

https://github.com/syntio/POC-Great-Expectations/tree/master

https://github.com/great-expectations

https://docs.getdbt.com/docs/introduction

Originally published at https://www.syntio.net, July 12, 2023.
