What’s this
While researching tools for data quality control as part of data management, I came across an open-source data quality tool called Great Expectations and worked through its tutorial.
What’s Great Expectations
An open-source tool that lets you define data quality expectations, test whether your data meets them, document the test results, and run automated profiling.
Tutorial
Following the tutorial, we run data quality tests against BigQuery tables.
preparation
create tables
A customer table was created in BigQuery using the data here.
Then the following SQL creates a customer_error table with application_date set to NULL.
create or replace table `{dataset_id}.customer_error`
as
SELECT
  customer_id,
  customer_name,
  gender_cd,
  gender,
  birth_day,
  age,
  postal_cd,
  address,
  application_store_cd,
  NULL AS application_date,
  status_cd
FROM `{dataset_id}.customer`
build docker
I prepared the following Docker setup.
To validate BigQuery tables, gcloud must be installed in the environment. docker-compose.yml and the Dockerfile must be in the same directory.
Authenticate to GCP locally in advance:
gcloud auth application-default login
dockerfile:
FROM python:3.9-slim

RUN apt-get update -y && \
    apt-get install --no-install-recommends -y -q \
    git libpq-dev python3-dev build-essential && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

RUN pip install --upgrade pip && \
    # version 0.14.10
    pip install great_expectations && \
    pip install sqlalchemy==1.4.25

ENV PYTHONIOENCODING=utf-8
ENV LANG C.UTF-8
docker-compose.yml
version: '3.9'

services:
  great_expectations:
    container_name: great_expectations
    build: .
    ports:
      - "8888:8888"
    tty: true
    working_dir: /usr/app
    volumes:
      - ./scr:/usr/app
      - gcloud-config:/root/.config
  gcloud:
    container_name: gcloud
    entrypoint: "gcloud"
    image: google/cloud-sdk:alpine
    volumes:
      - gcloud-config:/root/.config

volumes:
  gcloud-config:
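With the compose file above, the containers can be built and entered with commands along these lines (shown for the newer `docker compose` syntax; with the standalone binary, use `docker-compose` instead):

```shell
# Build the image and start the containers in the background
docker compose up -d --build

# Open a shell inside the great_expectations container
docker compose exec great_expectations bash
```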
init project
Create a project with the following command.
great_expectations init
If successful, the required hierarchy and files are generated.
The definitions of each generated file are described here.
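For reference, with version 0.14 the init command generated a scaffold roughly like this (from my run; details may vary by version):

```text
great_expectations/
├── great_expectations.yml      # main project config
├── expectations/               # Expectation Suite JSON files
├── checkpoints/                # Checkpoint configs
├── plugins/                    # custom extensions
└── uncommitted/                # gitignored: credentials, notebooks, Data Docs
    ├── config_variables.yml
    ├── data_docs/
    └── validations/
```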
connect datasource
In this case, we want to connect to BigQuery, so we configure as follows.
Without --no-jupyter, Jupyter starts up automatically. I wanted to specify the IP, root settings, etc., so I ran the command with --no-jupyter and launched Jupyter myself.
root@7ea61c3fa821:/usr/app# great_expectations datasource new --no-jupyter
Using v3 (Batch Request) API
What data would you like Great Expectations to connect to?
1. Files on a filesystem (for processing with Pandas or Spark)
2. Relational database (SQL)
: 2
Which database backend are you using?
1. MySQL
2. Postgres
3. Redshift
4. Snowflake
5. BigQuery
6. other - Do you have a working SQLAlchemy connection string?
: 5
To continue editing this Datasource, run jupyter notebook /usr/app/great_expectations/uncommitted/datasource_new.ipynb
Execute the following command to launch jupyter, access the link displayed, and open datasource_new.ipynb.
root@7ea61c3fa821:/usr/app# jupyter notebook /usr/app/great_expectations/uncommitted/datasource_new.ipynb --allow-root --ip=0.0.0.0
Run all cells with the following changes.
datasource_name = "{datasource_name}"
connection_string = "bigquery://{project_id}/{dataset_id}"
If everything runs without errors, the data source has been registered.
Checking great_expectations.yml, the registered data source was present.
datasources:
  "{datasource_name}":
    module_name: great_expectations.datasource
    class_name: Datasource
    data_connectors:
      default_runtime_data_connector_name:
        module_name: great_expectations.datasource.data_connector
        class_name: RuntimeDataConnector
        batch_identifiers:
          - default_identifier_name
      default_inferred_data_connector_name:
        include_schema_name: true
        module_name: great_expectations.datasource.data_connector
        class_name: InferredAssetSqlDataConnector
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: SqlAlchemyExecutionEngine
      connection_string: bigquery://{project_id}/{dataset_id}
config_variables_file_path: uncommitted/config_variables.yml
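The connection string follows SQLAlchemy's URL format for the BigQuery dialect. A small sketch of how it is assembled (the project and dataset names here are placeholders):

```python
def bigquery_connection_string(project_id: str, dataset_id: str) -> str:
    """Build a SQLAlchemy URL for the BigQuery dialect: bigquery://<project>/<dataset>."""
    return f"bigquery://{project_id}/{dataset_id}"

# Example with placeholder names
print(bigquery_connection_string("my-gcp-project", "my_dataset"))
# bigquery://my-gcp-project/my_dataset
```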
create suite
A Suite defines a combination of Expectations.
In creating a Suite, several built-in Profilers can be used.
root@7ea61c3fa821:/usr/app# great_expectations suite new --no-jupyter
Using v3 (Batch Request) API
How would you like to create your Expectation Suite?
1. Manually, without interacting with a sample batch of data (default)
2. Interactively, with a sample batch of data
3. Automatically, using a profiler
: 3
A batch of data is required to edit the suite - let's help you to specify it.
Select a datasource
1. sasakky_bigquery
: 1
Which data asset (accessible by data connector "default_inferred_data_connector_name") would you like to use?
1. {dataset_name}.customer
2. {dataset_name}.customer_error
Type [n] to see the next page or [p] for the previous. When you're ready to select an asset, enter the index.
: 1
The Profiler needs a dataset to analyze, and here it asks which data to use for that purpose.
Since customer holds the correct data and customer_error is what we want to validate, we profile customer here to create the Suite.
Name the new Expectation Suite [{dataset_name}.customer.warning]: customer_suite
Great Expectations will create a notebook, containing code cells that select from available columns in your dataset and
generate expectations about them to demonstrate some examples of assertions you can make about your data.
When you run this notebook, Great Expectations will store these expectations in a new Expectation Suite "customer_suite" here:
file:///usr/app/great_expectations/expectations/customer_suite.json
Would you like to proceed? [Y/n]: Y
If you name the Suite and answer Yes to the questions that follow, a JSON file for the Suite is created in expectations/.
{
"data_asset_type": null,
"expectation_suite_name": "customer_suite",
"expectations": [],
"ge_cloud_id": null,
"meta": {
"citations": [
{
"batch_request": {
"data_asset_name": "{dataset_name}.customer",
"data_connector_name": "default_inferred_data_connector_name",
"datasource_name": "sasakky_bigquery",
"limit": 1000
},
"citation_date": "2022-03-19T14:40:37.902540Z",
"comment": "Created suite added via CLI"
}
],
"great_expectations_version": "0.14.10"
}
}
create expectations
Edit and run edit_customer_suite.ipynb to create Expectations.
Here we use the Python API of Great Expectations via Jupyter Notebook.
The first cell imports the libraries, loads the data, and creates a Validator.
import datetime

import pandas as pd

import great_expectations as ge
import great_expectations.jupyter_ux
from great_expectations.core.batch import BatchRequest
from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler
from great_expectations.checkpoint import SimpleCheckpoint
from great_expectations.exceptions import DataContextError

context = ge.data_context.DataContext()

batch_request = {
    "datasource_name": "sasakky_bigquery",
    "data_connector_name": "default_inferred_data_connector_name",
    "data_asset_name": "{dataset_name}.customer",
    "limit": 1000,
}

expectation_suite_name = "customer_suite"

validator = context.get_validator(
    batch_request=BatchRequest(**batch_request),
    expectation_suite_name=expectation_suite_name,
)
column_names = [f'"{column_name}"' for column_name in validator.columns()]
print(f"Columns: {', '.join(column_names)}.")
validator.head(n_rows=5, fetch_all=False)
In the second cell, list the columns that do not need to be validated.
Since we want to check application_date, comment that line out.
ignored_columns = [
"customer_id",
"customer_name",
"gender_cd",
"gender",
"birth_day",
"age",
"postal_cd",
"address",
"application_store_cd",
# "application_date",
"status_cd",
]
Instantiate UserConfigurableProfiler with a few customized parameters and build the Suite.
profiler = UserConfigurableProfiler(
    profile_dataset=validator,
    excluded_expectations=[
        "expect_table_columns_to_match_ordered_list",
        "expect_column_values_to_be_in_set",
        "expect_table_row_count_to_be_between",
        "expect_column_proportion_of_unique_values_to_be_between",
        "expect_column_values_to_be_in_type_list",
    ],
    ignored_columns=ignored_columns,
    not_null_only=False,
    primary_or_compound_key=["customer_id"],
    semantic_types_dict=None,
    table_expectations_only=False,
    value_set_threshold="many",
)
suite = profiler.build_suite()
The last cell saves the Suite to disk, performs Validation on the loaded data, and creates Data Docs.
print(validator.get_expectation_suite(discard_failed_expectations=False))
validator.save_expectation_suite(discard_failed_expectations=False)

checkpoint_config = {
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": batch_request,
            "expectation_suite_name": expectation_suite_name,
        }
    ],
}
checkpoint = SimpleCheckpoint(
    f"_tmp_checkpoint_{expectation_suite_name}",
    context,
    **checkpoint_config,
)
checkpoint_result = checkpoint.run()

context.build_data_docs()

validation_result_identifier = checkpoint_result.list_validation_result_identifiers()[0]
context.open_data_docs(resource_identifier=validation_result_identifier)
validation
Create a Checkpoint and run Validation.
Run the following command in the terminal.
great_expectations checkpoint new checkpoint_customer
Edit the Jupyter Notebook that is up and running.
my_checkpoint_name = "checkpoint_customer"  # This was populated from your CLI command.

yaml_config = f"""
name: {my_checkpoint_name}
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
validations:
  - batch_request:
      datasource_name: sasakky_bigquery
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: {dataset_name}.customer_error  # edit here
      data_connector_query:
        index: -1
    expectation_suite_name: customer_suite
"""
print(yaml_config)
Also, uncomment the following lines.
context.run_checkpoint(checkpoint_name=my_checkpoint_name)
context.open_data_docs()
Once all cells are executed, Data Docs are created.
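Once the Checkpoint has been saved, it can also be run from the terminal with the v3 CLI (assuming the Checkpoint name used above):

```shell
great_expectations checkpoint run checkpoint_customer
```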
The application_date column, which must not be null, fails because it is 100% null.
The uniqueness check on customer_id, which we specified as the primary key, passes.
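Conceptually, the failing not-null expectation boils down to a null-ratio check like the following stdlib-only sketch (this is not GE's actual implementation, and the column values are made up):

```python
def expect_column_values_to_be_not_null(values):
    """Return (success, null_fraction): succeed only if no value is None."""
    nulls = sum(1 for v in values if v is None)
    fraction = nulls / len(values)
    return nulls == 0, fraction

# customer_error.application_date is NULL for every row, so the check fails
success, frac = expect_column_values_to_be_not_null([None, None, None, None])
print(success, frac)  # False 1.0
```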