How To Use Great Expectations With BigQuery

Sasakky
5 min read · May 7, 2022


What’s this

While researching tools for data quality control as part of data management, I came across an OSS tool called Great Expectations and tried its tutorial.

What’s Great Expectations

Great Expectations is an OSS tool that lets you define data quality expectations, test whether your data meets them, document the test results, and run automated profiling.

https://greatexpectations.io/
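
For a feel of what an Expectation is, here is a minimal, self-contained sketch using the pandas backend (the rest of this post uses BigQuery instead); the data frame and values are made up purely for illustration.

import great_expectations as ge
import pandas as pd

# a tiny made-up data frame with one null value in application_date
df = ge.from_pandas(pd.DataFrame({"application_date": ["2022-05-07", None]}))

# the Expectation: application_date must never be null
result = df.expect_column_values_to_be_not_null("application_date")
print(result.success)  # False, because one of the two values is null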

Tutorial

Following the tutorial, I ran data quality tests against BigQuery tables.

preparation

create tables

The customer table was created in BigQuery using the data here.
The following SQL then creates a customer_error table whose application_date column is set to NULL.

CREATE OR REPLACE TABLE `{dataset_id}.customer_error`
AS
SELECT
  customer_id,
  customer_name,
  gender_cd,
  gender,
  birth_day,
  age,
  postal_cd,
  address,
  application_store_cd,
  NULL AS application_date,
  status_cd
FROM `{dataset_id}.customer`
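
If you prefer to create the table from Python instead of the BigQuery console, a rough equivalent with the official client library looks like this (the project and dataset IDs below are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project ID

# the CREATE OR REPLACE TABLE statement from above, with the dataset ID filled in
sql = """
CREATE OR REPLACE TABLE `your_dataset.customer_error` AS
SELECT
  customer_id, customer_name, gender_cd, gender, birth_day, age,
  postal_cd, address, application_store_cd,
  NULL AS application_date,
  status_cd
FROM `your_dataset.customer`
"""

client.query(sql).result()  # blocks until the query job finishes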

build docker

I prepared the following Docker setup.
To validate BigQuery tables, gcloud must be available in the environment, and docker-compose.yml and the Dockerfile must be at the same directory level.
Local authentication to GCP should be done in advance:

gcloud auth application-default login
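
To check that the Application Default Credentials are actually visible from Python inside the great_expectations container, a quick sanity check like the following should work (it assumes the google-auth package is present in the container, which the BigQuery SQLAlchemy driver pulls in):

# sanity check (sketch): confirm Application Default Credentials are visible
import google.auth

credentials, project_id = google.auth.default()
print(project_id)  # should print the GCP project you authenticated against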

dockerfile:

FROM python:3.9-slim

RUN apt-get update -y && \
    apt-get install --no-install-recommends -y -q \
    git libpq-dev python3-dev build-essential && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

RUN pip install --upgrade pip && \
    # version 0.14.10
    pip install great_expectations && \
    pip install sqlalchemy==1.4.25

ENV PYTHONIOENCODING=utf-8
ENV LANG C.UTF-8
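
This walkthrough assumes Great Expectations 0.14.10 and SQLAlchemy 1.4.25; a quick check inside the built container can confirm what actually got installed (a small sketch, run from a Python shell in the container):

# print the library versions this walkthrough was written against
import great_expectations as ge
import sqlalchemy

print(ge.__version__)          # 0.14.10 at the time of writing
print(sqlalchemy.__version__)  # 1.4.25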

docker-compose.yml

version: '3.9'

services:
  great_expectations:
    container_name: great_expectations
    build: .
    ports:
      - "8888:8888"
    tty: true
    working_dir: /usr/app
    volumes:
      - ./scr:/usr/app
      - gcloud-config:/root/.config
  gcloud:
    container_name: gcloud
    entrypoint: "gcloud"
    image: google/cloud-sdk:alpine
    volumes:
      - gcloud-config:/root/.config

volumes:
  gcloud-config:

init project

Create a project with the following command.

great_expectations init

If successful, the required directory hierarchy and files are generated.
The definitions are here.

connect datasource

In this case, we want to connect to BigQuery, so we configure it as follows.
Without --no-jupyter, Jupyter starts up automatically. I wanted to specify the IP address, root settings, and so on, so I passed --no-jupyter and launched Jupyter manually afterwards.

root@7ea61c3fa821:/usr/app# great_expectations datasource new --no-jupyter
Using v3 (Batch Request) API
What data would you like Great Expectations to connect to?
1. Files on a filesystem (for processing with Pandas or Spark)
2. Relational database (SQL)
: 2
Which database backend are you using?
1. MySQL
2. Postgres
3. Redshift
4. Snowflake
5. BigQuery
6. other - Do you have a working SQLAlchemy connection string?
: 5
To continue editing this Datasource, run jupyter notebook /usr/app/great_expectations/uncommitted/datasource_new.ipynb

Execute the following command to launch Jupyter, access the displayed link, and open datasource_new.ipynb.

root@7ea61c3fa821:/usr/app# jupyter notebook /usr/app/great_expectations/uncommitted/datasource_new.ipynb --allow-root --ip=0.0.0.0

Run all cells with the following changes.

datasource_name = "{datasource_name}"
connection_string = "bigquery://{project_id}/{dataset_id}"
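
For reference, the remaining cells of datasource_new.ipynb roughly assemble these two values into a datasource YAML and test it before saving. The sketch below is not a verbatim copy of the generated notebook, but it matches the configuration that ends up in great_expectations.yml (shown next):

import great_expectations as ge

context = ge.data_context.DataContext()

datasource_name = "sasakky_bigquery"                        # any name you like
connection_string = "bigquery://{project_id}/{dataset_id}"  # fill in your own IDs

example_yaml = f"""
name: {datasource_name}
class_name: Datasource
execution_engine:
  class_name: SqlAlchemyExecutionEngine
  connection_string: {connection_string}
data_connectors:
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
  default_inferred_data_connector_name:
    class_name: InferredAssetSqlDataConnector
    include_schema_name: true
"""

# validates the config and lists the data assets the connector can see;
# the notebook's final cell then saves this datasource into great_expectations.yml
context.test_yaml_config(example_yaml)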

If it runs without errors, the data source has been registered successfully.
Looking at great_expectations.yml, the registered data source is now present:

datasources:
  "{datasource_name}":
    module_name: great_expectations.datasource
    class_name: Datasource
    data_connectors:
      default_runtime_data_connector_name:
        module_name: great_expectations.datasource.data_connector
        class_name: RuntimeDataConnector
        batch_identifiers:
          - default_identifier_name
      default_inferred_data_connector_name:
        include_schema_name: true
        module_name: great_expectations.datasource.data_connector
        class_name: InferredAssetSqlDataConnector
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: SqlAlchemyExecutionEngine
      connection_string: bigquery://{project_id}/{dataset_id}
config_variables_file_path: uncommitted/config_variables.yml
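
You can also confirm the registration from Python (a small sketch; the printed names depend on what you registered):

# sketch: confirm the new datasource is registered in the project
import great_expectations as ge

context = ge.data_context.DataContext()
print([ds["name"] for ds in context.list_datasources()])
# e.g. ['sasakky_bigquery']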

create suite

A Suite is a collection of Expectations.
When creating a Suite, several built-in Profilers can be used.

root@7ea61c3fa821:/usr/app# great_expectations suite new --no-jupyter
Using v3 (Batch Request) API
How would you like to create your Expectation Suite?
1. Manually, without interacting with a sample batch of data (default)
2. Interactively, with a sample batch of data
3. Automatically, using a profiler
: 3
A batch of data is required to edit the suite - let's help you to specify it.

Select a datasource
1. sasakky_bigquery
: 1
Which data asset (accessible by data connector "default_inferred_data_connector_name") would you like to use?
1. {dataset_name}.customer
2. {dataset_name}.customer_error
Type [n] to see the next page or [p] for the previous. When you're ready to select an asset, enter the index.
: 1

The Profiler needs a dataset to analyze, and here it asks which data to use for that purpose.
Since customer holds the correct data and customer_error is what we want to validate this time, we have the Profiler analyze customer and build the Suite from it.

Name the new Expectation Suite [{dataset_name}.customer.warning]: customer_suite

Great Expectations will create a notebook, containing code cells that select from available columns in your dataset and
generate expectations about them to demonstrate some examples of assertions you can make about your data.

When you run this notebook, Great Expectations will store these expectations in a new Expectation Suite "customer_suite" here:

file:///usr/app/great_expectations/expectations/customer_suite.json

Would you like to proceed? [Y/n]: Y

If you name the Suite to be created and answer Yes to the question that follows, a JSON file for the Suite is created under expectations/.

{
  "data_asset_type": null,
  "expectation_suite_name": "customer_suite",
  "expectations": [],
  "ge_cloud_id": null,
  "meta": {
    "citations": [
      {
        "batch_request": {
          "data_asset_name": "{dataset_name}.customer",
          "data_connector_name": "default_inferred_data_connector_name",
          "datasource_name": "sasakky_bigquery",
          "limit": 1000
        },
        "citation_date": "2022-03-19T14:40:37.902540Z",
        "comment": "Created suite added via CLI"
      }
    ],
    "great_expectations_version": "0.14.10"
  }
}

create expectations

Edit and run edit_customer_suite.ipynb to create Expectations.
Here we use the Python API of Great Expectations via Jupyter Notebook.

The first cell imports the required libraries, loads the data, and creates a Validator.

import datetime
import pandas as pd

import great_expectations as ge
import great_expectations.jupyter_ux
from great_expectations.core.batch import BatchRequest
from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler
from great_expectations.checkpoint import SimpleCheckpoint
from great_expectations.exceptions import DataContextError

context = ge.data_context.DataContext()

batch_request = {
    'datasource_name': 'sasakky_bigquery',
    'data_connector_name': 'default_inferred_data_connector_name',
    'data_asset_name': '{dataset_name}.customer',
    'limit': 1000,
}
expectation_suite_name = "customer_suite"

validator = context.get_validator(
    batch_request=BatchRequest(**batch_request),
    expectation_suite_name=expectation_suite_name
)
column_names = [f'"{column_name}"' for column_name in validator.columns()]
print(f"Columns: {', '.join(column_names)}.")
validator.head(n_rows=5, fetch_all=False)

In the second cell, select the columns that do not need to be validated.
Since we want to check application_date here, comment it out of the ignored list.

ignored_columns = [
    "customer_id",
    "customer_name",
    "gender_cd",
    "gender",
    "birth_day",
    "age",
    "postal_cd",
    "address",
    "application_store_cd",
    # "application_date",
    "status_cd",
]

Instantiate UserConfigurableProfiler, passing some customized parameters, and build the Suite.

profiler = UserConfigurableProfiler(
    profile_dataset=validator,
    excluded_expectations=[
        "expect_table_columns_to_match_ordered_list",
        "expect_column_values_to_be_in_set",
        "expect_table_row_count_to_be_between",
        "expect_column_proportion_of_unique_values_to_be_between",
        "expect_column_values_to_be_in_type_list",
    ],
    ignored_columns=ignored_columns,
    not_null_only=False,
    primary_or_compound_key=["customer_id"],
    semantic_types_dict=None,
    table_expectations_only=False,
    value_set_threshold="many",
)
suite = profiler.build_suite()
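
Before moving on, it can be helpful to see what the Profiler actually generated. A small sketch that continues from the cell above and lists the expectation types in the built Suite:

# sketch: list what the Profiler generated, e.g. a not-null expectation per
# profiled column and a uniqueness expectation for the primary key
for expectation in suite.expectations:
    print(expectation.expectation_type, expectation.kwargs.get("column"))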

The last cell saves the Suite to disk, performs Validation on the loaded data, and creates Data Docs.

print(validator.get_expectation_suite(discard_failed_expectations=False))
validator.save_expectation_suite(discard_failed_expectations=False)

checkpoint_config = {
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": batch_request,
            "expectation_suite_name": expectation_suite_name
        }
    ]
}
checkpoint = SimpleCheckpoint(
    f"_tmp_checkpoint_{expectation_suite_name}",
    context,
    **checkpoint_config
)
checkpoint_result = checkpoint.run()

context.build_data_docs()

validation_result_identifier = checkpoint_result.list_validation_result_identifiers()[0]
context.open_data_docs(resource_identifier=validation_result_identifier)
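
If you want the outcome without opening Data Docs, the result object can also be inspected directly (a sketch that reuses the checkpoint_result from the cell above):

# sketch: inspect the validation outcome programmatically
print(checkpoint_result.success)  # True only if every expectation passed

for validation_result in checkpoint_result.list_validation_results():
    for result in validation_result.results:
        if not result.success:
            print("failed:", result.expectation_config.expectation_type)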

validation

To run the Validation, we create a Checkpoint.
Run the following command in the terminal.

great_expectations checkpoint new checkpoint_customer

Edit the Jupyter Notebook that is up and running.

my_checkpoint_name = "checkpoint_customer"  # This was populated from your CLI command.

yaml_config = f"""
name: {my_checkpoint_name}
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
validations:
  - batch_request:
      datasource_name: sasakky_bigquery
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: {dataset_name}.customer_error  # edited to validate customer_error
      data_connector_query:
        index: -1
    expectation_suite_name: customer_suite
"""
print(yaml_config)

Also, uncomment the following lines.

context.run_checkpoint(checkpoint_name=my_checkpoint_name)
context.open_data_docs()
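
The same validation can also be run as a small standalone script, for example from a scheduled job. A sketch, assuming it runs from the directory that contains great_expectations/:

# run_checkpoint.py: a sketch of running the checkpoint outside Jupyter
import sys

import great_expectations as ge

context = ge.data_context.DataContext()
result = context.run_checkpoint(checkpoint_name="checkpoint_customer")

# exit non-zero if customer_error breaks any expectation, so a scheduler or CI
# job can flag the failure
sys.exit(0 if result.success else 1)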

Once all cells are executed, Data Docs are created.

The application_date column, which is not allowed to be null, fails because it is 100% null in customer_error.
The uniqueness expectation for customer_id, which we specified as the primary key, passes.


Sasakky

Data Engineer, Data Architect and Data Analyst in D2C Startup in Tokyo