Understanding Great Expectations and How to Use It

A data validation use case for Snowflake

What is Great Expectations?

Great Expectations is a useful tool to profile, validate, and document data. It helps to maintain the quality of data throughout a data workflow and pipeline.

Used with a workflow orchestration service, Great Expectations can help accelerate a data solution project by catching data issues as soon as possible and notifying data engineers to fix the issues.

Why would I use Great Expectations?

For this section, I felt it's best to quote directly from docs.greatexpectations.io. Teams use Great Expectations to get more done with data, faster, by:

  • Saving time during data cleaning and munging.
  • Accelerating ETL and data normalization.
  • Streamlining analyst-to-engineer handoffs.
  • Streamlining knowledge capture and requirements gathering from subject-matter experts.
  • Monitoring data quality in production data pipelines and data products.
  • Automating verification of new data deliveries from vendors and other teams.
  • Simplifying debugging data pipelines if (when) they break.
  • Codifying assumptions used to build models when sharing with other teams or analysts.
  • Developing rich shared data documentation in the course of normal work.
  • Making implicit knowledge explicit.

… and much more

Key features

  • Expectations are like assertions in traditional Python unit tests (a minimal example follows this list).
  • Automated data profiling automates pipeline tests.
  • Data Contexts and Datasources allow you to configure connections to your data.
  • Checkpoints provide tooling for data validation.
  • Data Docs provide clean, human-readable documentation.
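
To make the first point concrete, here is a minimal, self-contained sketch (hypothetical data, run against a Pandas DataFrame) of how an Expectation behaves like an assertion:

import great_expectations as ge
import pandas as pd

# Hypothetical data: one id value is missing
df = ge.from_pandas(pd.DataFrame({"id": [1, 2, None], "amount": [10.5, 7.2, 3.0]}))

result = df.expect_column_values_to_not_be_null("id")
print(result.success)  # False, because the column contains a null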

A Great Expectations Use Case

This use case shows how Great Expectations can be used to validate data in a data pipeline. To orchestrate the workflow, we use the Python framework Prefect. The components in this use case are:

  • MySQL DB
  • Local FS
  • AWS S3 storage
  • Snowflake

The pipeline extracts data from the MySQL DB, stages the data files on the local filesystem and runs transformations, stages the transformed files in AWS S3 storage, and finally loads the files into Snowflake.

The ultimate goal is to load valid data into Snowflake.

Let's Get Started

Install the required packages

The first step is to install the required packages.

pip install great_expectations
pip install SQLAlchemy
pip install psycopg2-binary

Initialize a Great Expectations project

Next, let’s initialize a Great Expectations project.

great_expectations init

Opt for no datasource at this point.

Add the Data Sources

Let’s add the four data sources, MySQL, filesystem, AWS S3, and Snowflake.

MySQL

Install MySQL required packages and add the data source:

pip install PyMySQL
great_expectations datasource new
What data would you like Great Expectations to connect to?
1. Files on a filesystem (for processing with Pandas or Spark)
2. Relational database (SQL)
: 2
Which database backend are you using?
1. MySQL
2. Postgres
3. Redshift
4. Snowflake
5. BigQuery
6. other - Do you have a working SQLAlchemy connection string?
: 1
Give your new Datasource a short name.
[my_mysql_db]:
Next, we will configure database credentials and store them in the `my_mysql_db` section
of this config file: great_expectations/uncommitted/config_variables.yml:
What is the host for the MySQL connection? [localhost]:
What is the port for the MySQL connection? [3306]:
What is the username for the MySQL connection? []:
What is the password for the MySQL connection?:
What is the database name for the MySQL connection? []:
Attempting to connect to your database. This may take a moment...
Great Expectations will now add a new Datasource 'my_mysql_db' to your deployment, by adding this entry to your great_expectations.yml:
my_mysql_db:
  credentials: ${my_mysql_db}
  data_asset_type:
    class_name: SqlAlchemyDataset
    module_name: great_expectations.dataset
  class_name: SqlAlchemyDatasource
  module_name: great_expectations.datasource
The credentials will be saved in uncommitted/config_variables.yml under the key 'my_mysql_db'
Would you like to proceed? [Y/n]:
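
Once the datasource is saved, a quick way to confirm it works is to load a table as a batch from Python. A minimal sketch, assuming the expectation suite ge_mysql_demo_tbl.warning exists (we create it in a later section):

from great_expectations.data_context import DataContext

context = DataContext()

# Load the MySQL table registered above as a batch
batch_kwargs = {"datasource": "my_mysql_db", "table": "test1"}
batch = context.get_batch(batch_kwargs, "ge_mysql_demo_tbl.warning")

# Run a single ad-hoc expectation against the batch
print(batch.expect_table_row_count_to_be_between(min_value=1))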

Filesystem

Add filesystem as a data source.

pip install fsspec
great_expectations datasource new
What data would you like Great Expectations to connect to?
1. Files on a filesystem (for processing with Pandas or Spark)
2. Relational database (SQL)
:1

What are you processing your files with?
1. Pandas
2. PySpark
: 1
Enter the path (relative or absolute) of the root directory where the data files are stored.
: C:\Users\xxx\PycharmProjects\great_expectation_poc\staging
Give your new Datasource a short name.
[staging__dir]:
Great Expectations will now add a new Datasource 'staging__dir' to your deployment, by adding this entry to your great_expectations.yml:

AWS S3

Add AWS S3 as a data source.

Install the required s3fs package:
pip install s3fs

Adding S3 as a data source is a little tricky. The best way is to follow the instructions at docs.greatexpectations.io. Below is how I added the source.

  1. Edit your great_expectations/great_expectations.yml file, update your data sources, and be sure the section includes a PandasDatasource.

datasources:
  pandas_s3:
    class_name: PandasDatasource

2. Load data from S3 using native S3 path-based batch kwargs. Because Pandas natively supports reading from S3 paths, this simple configuration allows loading data from S3 directly.

from great_expectations.data_context import DataContext

context = DataContext()
batch_kwargs = {
    "datasource": "pandas_s3",
    "path": "s3a://my_bucket/my_prefix/key.csv",
}
batch = context.get_batch(batch_kwargs, "existing_expectation_suite_name")

Note: I ran the code above in the Jupyter notebook generated by running:

great_expectations suite edit ge_mysql_demo_tbl.warning

Paste the code into one of the cells and run just that cell.
Note: I added the datasource after creating the MySQL expectation suite.

Snowflake

Install the required Snowflake packages and add the data source:

pip install snowflake-connector-python
pip install snowflake-sqlalchemy
great_expectations datasource new
What data would you like Great Expectations to connect to?
1. Files on a filesystem (for processing with Pandas or Spark)
2. Relational database (SQL)
: 2
Which database backend are you using?
1. MySQL
2. Postgres
3. Redshift
4. Snowflake
5. BigQuery
6. other - Do you have a working SQLAlchemy connection string?
: 4
Give your new Datasource a short name.
[my_snowflake_db]: my_snowflake_db_new
Next, we will configure database credentials and store them in the `my_snowflake_db_new` section
of this config file: great_expectations/uncommitted/config_variables.yml:
What authentication method would you like to use?
1. User and Password
2. Single sign-on (SSO)
3. Key pair authentication
: 1
What is the user login name for the snowflake connection? []:
What is the account name for the snowflake connection (include region -- ex 'ABCD.us-east-1')? []:
What is database name for the snowflake connection? (optional -- leave blank for none) []:
What is schema name for the snowflake connection? (optional -- leave blank for none) []:
What is warehouse name for the snowflake connection? (optional -- leave blank for none) []:
What is role name for the snowflake connection? (optional -- leave blank for none) []:
What is the password for the snowflake connection?:
Attempting to connect to your database. This may take a moment...
Great Expectations will now add a new Datasource 'my_snowflake_db_new' to your deployment, by adding this entry to your great_expectations.yml:
my_snowflake_db_new:
  credentials: ${my_snowflake_db_new}
  data_asset_type:
    class_name: SqlAlchemyDataset
    module_name: great_expectations.dataset
  class_name: SqlAlchemyDatasource
  module_name: great_expectations.datasource
The credentials will be saved in uncommitted/config_variables.yml under the key 'my_snowflake_db_new'
Would you like to proceed? [Y/n]:
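
If the connection attempt fails, it can help to verify the same credentials outside of Great Expectations using snowflake-sqlalchemy directly. A minimal sketch (all values below are placeholders, not real credentials):

from sqlalchemy import create_engine, text
from snowflake.sqlalchemy import URL

# Placeholder values -- use the same ones entered in the CLI prompts above
engine = create_engine(URL(
    account="ABCD.us-east-1",
    user="my_user",
    password="my_password",
    database="my_database",
    schema="my_schema",
    warehouse="my_warehouse",
    role="my_role",
))

with engine.connect() as conn:
    print(conn.execute(text("select current_version()")).fetchone())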

Create a Great Expectations Suite

Let’s create our first Expectations using suite scaffold. The scaffold command automatically creates an Expectation Suite, here called ge_mysql_demo_tbl.warning, with a built-in profiler.

great_expectations suite scaffold ge_mysql_demo_tbl.warning
Heads up! This feature is Experimental. It may change. Please give us your feedback!
Select a datasource
1. my_mysql_db
2. my_snowflake_db
3. pandas_s3
4. staging__dir
: 1
Which table would you like to use? (Choose one)
1. test1 (table)
2. test2 (table)
Do not see the table in the list above? Just type the SQL query
: 1

We will create the expectations suite for a table in MySQL DB.

After selecting the table, Great Expectations will open a Jupyter notebook, which will take you through the next part of the scaffold workflow. Execute all the cells, and Great Expectations will open Data Docs in a browser window.

What just happened? The automated Profiler connected to the MySQL table, took a quick look at the contents, and produced an initial set of Expectations.

By default, Expectation Suites are stored as JSON files in subdirectories of your great_expectations/expectations folder. To edit an expectation suite and open Data Docs, run:

great_expectations suite edit ge_mysql_demo_tbl.warning

To create custom Expectations, check docs.greatexpectations.io
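
For instance, the notebook opened by suite edit exposes the profiled data as a batch object, and hand-written Expectations can be added to the suite there. A short sketch (the column names are hypothetical):

# Run inside the notebook opened by `great_expectations suite edit ge_mysql_demo_tbl.warning`
batch.expect_column_values_to_not_be_null("id")
batch.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)

# Persist the edited suite back to the expectations folder
batch.save_expectation_suite(discard_failed_expectations=False)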

Data Validation

Next, we will create checkpoints for each data source that use the expectation suite to validate data in the filesystem, AWS S3, and Snowflake.

Create a new checkpoint

great_expectations checkpoint new my_checkpoint ge_mysql_demo_tbl.warning
Heads up! This feature is Experimental. It may change. Please give us your feedback!
Select a datasource
1. my_mysql_db
2. my_snowflake_db
3. pandas_s3
4. staging__dir
: 4
Would you like to:
1. choose from a list of data assets in this datasource
2. enter the path of a data file
: 2
Enter the path of a data file (relative or absolute, s3a:// and gs:// paths are ok too)
: C:\Users\xxxx\PycharmProjects\great_expectation_poc\staging\ge_mysql_demo_tbl_202101061644.csv
A checkpoint named `my_checkpoint` was added to your project!
- To edit this checkpoint edit the checkpoint file: C:\Users\18329\PycharmProjects\great_expectation_poc\great_expectations\checkpoints\my_checkpoint.yml
- To run this checkpoint run `great_expectations checkpoint run my_checkpoint`
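
For reference, the generated checkpoint file ties a batch of data to one or more expectation suites. A rough sketch of what my_checkpoint.yml may contain in this (pre-0.13, legacy) checkpoint format:

# Sketch only -- your generated file may differ
validation_operator_name: action_list_operator
batches:
  - batch_kwargs:
      path: C:\Users\xxxx\PycharmProjects\great_expectation_poc\staging\ge_mysql_demo_tbl_202101061644.csv
      datasource: staging__dir
      data_asset_name: ge_mysql_demo_tbl_202101061644
    expectation_suite_names:
      - ge_mysql_demo_tbl.warning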

I created checkpoints for all the data sources.

Run a checkpoint in the CLI

To validate the data, run the checkpoints.

great_expectations checkpoint run my_checkpoint

Data Docs

Data Docs is HTML documentation created automatically by Great Expectations. It displays expectation suites and validation results in a human-readable form.

Use the following CLI command to build the Data Docs site; it will print the path to the index page.

great_expectations docs build --site-name local_site
The following Data Docs sites will be built:
- local_site: file://C:\Users\xxxx\PycharmProjects\great_expectation_poc\great_expectations\uncommitted/data_docs/local_site/index.html
Would you like to proceed? [Y/n]: Y
Building Data Docs...
Done building Data Docs
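
The same can be done programmatically from Python; a small sketch:

from great_expectations.data_context import DataContext

context = DataContext()
context.build_data_docs()  # rebuild all configured Data Docs sites
context.open_data_docs()   # open the local index page in a browser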

Great Expectations in Prefect Workflow

Prefect is a Python framework. Like Airflow, it can be used to build, run, and monitor data workflows and pipelines.

Let's integrate Great Expectations checkpoint validations in a Prefect Workflow.

Add a Prefect task to run checkpoints

pip install prefect

Add a Python file great_expectations_flow.py:
from prefect import Flow, Parameter, task, Client
from prefect.schedules import IntervalSchedule
from prefect.tasks.great_expectations import RunGreatExpectationsValidation
# ...

def register_flow(flow):
    client = Client()
    client.create_project(project_name="great_expectation_poc")
    flow.register(project_name="great_expectation_poc")

# ...

def build_flow():
    with Flow("great expectations example flow") as flow:
        # ... (extract/transform/load tasks elided)

        # Local FS file data validation
        checkpoint_name = Parameter("checkpoint_name")
        ge_task = RunGreatExpectationsValidation(run_name="fs_stage_validation")
        validations_1 = ge_task.map(checkpoint_name)
        validations_1.set_upstream(fs_file)

    return flow

if __name__ == "__main__":
    flow = build_flow()
    flow.run(
        checkpoint_name=["staging.chk"],
        checkpoint_name_2=["cloudstaging.chk"],
        checkpoint_name_3=["destination.chk"],
    )
    register_flow(flow)

Run Prefect Workflow

We will use server orchestration for our use case. The local server requires Docker and Docker Compose to be installed. If you have Docker Desktop on your machine, you already have both.


Step 1: Before running the server for the first time, run:
prefect backend server

Step 2: Next, to start the server, UI, and all required infrastructure, run:
prefect server start

Step 3: Executing flows from the server requires at least one Prefect Agent to be running:
prefect agent local start

Step 4: Register a flow:
cd great_expectations
python.exe C:/Users/xxxx/PycharmProjects/great_expectation_poc/great_expectations/great_expectations_flow.py

Step 5: View the workflow in the browser at localhost:8080.

Step 6: Execute a flow run:
We will use the UI to execute a run. On the flow page, click "Quick Run" in the upper-right corner.

Prefect workflow run schematic.

In the image above, you can see that the cloud checkpoint validation failed and stopped the rest of the pipeline run.

Miscellaneous

  • The data source configurations are stored in great_expectations\great_expectations.yml.
  • The data source secrets are stored in config_variables.yml. The default location of the file is uncommitted\config_variables.yml. The file can be stored in any location by updating the “config_variables_file_path” key value in great_expectations\great_expectations.yml.
  • The expectation suite JSON is in the expectations folder.
  • The checkpoint YAMLs are in the checkpoints folder.
  • The checkpoints and validations can be configured to be stored in any on-prem or cloud store.
  • The suites generated by the scaffold command are not meant to be production suites.
  • By default, Data Docs are stored locally, in an uncommitted directory. This is great for individual work but not good for collaboration. A better approach is to deploy to a cloud-based blob store (S3, GCS, or Azure blob store) configured to share a static website (a configuration sketch follows this list).
  • Great Expectations can be deployed with workflow orchestration services like Airflow, Google Cloud Composer, and Astronomer.
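
As noted in the Data Docs bullet above, hosting the site in a cloud bucket only requires pointing a Data Docs site at a cloud store backend in great_expectations.yml. A sketch for S3 (the bucket name is a placeholder; see the official docs for the full setup, including the bucket policy):

data_docs_sites:
  s3_site:
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: my-data-docs-bucket
    site_index_builder:
      class_name: DefaultSiteIndexBuilder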

Final Thoughts

Data Validation is becoming a part of data workflows and pipelines. Working with correct and valid data is an important criterion for the success of any data solution project.

Great Expectations is a great tool that can help save time and promote analytic integrity for your data pipelines.

Resources

Gitlab Mirror: https://gitlab.com/hashmapinc/oss/great-expectations-poc

Ready to Accelerate Your Digital Transformation?

At Hashmap, we work with our clients to build better, together.

If you are considering moving data and analytics products and applications to the cloud or if you would like help and guidance and a few best practices in delivering higher value outcomes in your existing cloud program, then please contact us.

Hashmap, an NTT Data Company, offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud service offerings. We would be glad to work through your specific requirements.

Feel free to share on other channels, and be sure and keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast Hashmap on Tap as well on Spotify, Apple, Google, and other popular streaming apps.


Jhimli Bora is a Cloud and Data Engineer with Hashmap, an NTT DATA Company, providing Data, Cloud, IoT, and AI/ML services and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Connect with her on LinkedIn.
