Data Validation, Documentation, and Profiling with Great Expectations

Seckin Dinc
9 min read · Mar 31, 2023


Photo by Anoir Chafik on Unsplash

When an organization is small, data quality issues rarely cause big problems. The data team identifies problems with a delay, and data downtime becomes a norm that can be smoothed over with a small bribe to the stakeholders, such as a good lunch or coffee.

As the organization grows, the data is suddenly being used by various stakeholders whom the data team has never even met, and dealing with data quality issues becomes a major problem. Data teams can find themselves staring at the commercial impact of data quality issues like the puppies in the picture above.

In my “Timeless Obstacle for Data Products: Data Quality” article, I categorized data quality issues into two groups: known-unknowns and unknown-unknowns.

Unit testing and data validation are necessary components of data quality, but they only cover the known-unknown issues. We can’t write unit tests or assertions for things we have never observed or expected.

Unknown-unknowns are the biggest data quality issues because we are unaware of them until something breaks, and when it breaks, it breaks big. To prepare for the unpredictable, we need different tools that monitor our systems and collect metadata, establishing a baseline for what is normal and what is abnormal enough to deserve attention.

As we move from data validation to data profiling, opening the doors to data observability and reliability, we have to meet the legend: “Great Expectations”.

What is the “Great Expectations” tool?

Great Expectations is the leading tool for validating, documenting, and profiling your data to maintain quality and improve communication between teams.

Great Expectations is not a built-in Python package. You need to install it from your terminal with the command below;

pip install great_expectations

Screenshot from https://docs.greatexpectations.io/docs/

What is Data Validation?

Data validation is the process of ensuring that data is accurate, complete, and consistent under predefined rules and criteria. It is an important part of data management and crucial for ensuring data accuracy and reliability. For detailed information and examples, you can check my article;
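To make this concrete, here is a minimal illustration of rule-based validation in plain Python (not Great Expectations itself): each predefined rule is a predicate applied to every record, and the hypothetical records and rule names below are illustrative only.

```python
# Toy records standing in for a real data set (illustrative values)
records = [
    {"id": 1, "age": 34, "email": "a@example.com"},
    {"id": 2, "age": -5, "email": "not-an-email"},
]

# Predefined rules and criteria, each a simple predicate
rules = {
    "age_non_negative": lambda r: r["age"] >= 0,
    "email_has_at_sign": lambda r: "@" in r["email"],
}

def validate(record):
    """Return the names of the rules the record violates."""
    return [name for name, rule in rules.items() if not rule(record)]

for r in records:
    print(r["id"], validate(r))
# The first record passes every rule; the second violates both.
```

Tools like Great Expectations generalize exactly this idea: declarative rules, applied to data, producing a pass/fail report per rule.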

What is Data Profiling?

Data profiling is the process of examining and analyzing data from various sources to better understand its structure, content, and quality. It involves gathering statistical and descriptive information about data, such as data types, ranges, patterns, and relationships, and identifying data quality issues, such as completeness, accuracy, consistency, and integrity.
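A quick sketch of what a profiler gathers, using plain pandas on a toy frame (the column names and values are hypothetical; a real profiler runs this over full tables and many more metrics):

```python
import pandas as pd

# Tiny stand-in data set for illustration
df = pd.DataFrame({
    "income": [1200.0, 3400.0, None, 2800.0],
    "segment": ["A", "B", "A", "A"],
})

# The kind of metadata profiling collects: types, completeness,
# value ranges, and frequent values
profile = {
    "dtypes": df.dtypes.astype(str).to_dict(),
    "null_counts": df.isna().sum().to_dict(),
    "income_range": (df["income"].min(), df["income"].max()),
    "segment_top_value": df["segment"].mode()[0],
}
print(profile)
```

Great Expectations uses statistics like these to propose candidate expectations automatically, as we will see later in this article.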

What is Data Documentation?

Data documentation refers to the process of creating and maintaining detailed information about data. It involves documenting various aspects of data, such as its source, format, structure, quality, and usage.

What Makes Great Expectations Different?

There are various reasons why Great Expectations is different from many libraries, tools, and vendors in the market. Here are my thoughts;

Strong Community and Collaborators

For me, this is the most important factor when selecting an open-source project to implement in my operations. I have seen many great ideas and libraries turn into dust over the last few years due to a lack of a continuously supportive community and collaborators. Great Expectations excels here with its strong community and collaborators. A strong community means the library can evolve quicker than expected, more edge cases can be covered easily, bugs are fixed and hotfixes shipped within days, and questions are answered by many other people in less time than a commercial product with 2–5 customer care people could manage.

Extensive Documentation

When you work with an open-source project, you accept that there will be no customer support or help center where you can raise a ticket. In this regard, the documentation of the library defines how much autonomy you can have on your own. Of course, community support is important, but autonomy brings speed. Great Expectations exceeds my expectations on documentation: it covers everything from the basics to the details of each data source connector.

Covering a Wide Range of Problems

Various libraries and tools in the market focus on a single problem such as data validation or data profiling. I am not saying this is wrong; on the contrary, the more dedicated a solution is, the more focused and to the point it becomes for that problem. The surprising fact about Great Expectations is that it covers data validation, data documentation, and data profiling at the same time, with the same quality as, or even better than, tools that focus on a single problem at a time.

End-to-End Project with Great Expectations

Great Expectations is not a lightweight library that you can download and be ready to deploy five minutes later. To show its capabilities and how it positions itself as an enterprise-level solution, I will build a project from scratch.

In this project, I will use the NBFI Vehicle Loan Repayment Dataset from Kaggle. It contains training and test data sets, which makes it a great use case: we can build expectations and tests on the training data set and apply them to the test data set.

Creating Data Context

A Data Context is the primary entry point for the Great Expectations deployment, with configurations and methods for all supporting components.

At our terminal, we can execute the command below to initialize our project configuration;

great_expectations init

After we run the command, we are welcomed with easy-to-follow steps on how to create the project structure.

Image by the author

Once we agree to proceed, Great Expectations automatically creates the project folder structure below;

Image by the author

The structure can be confusing at first, so let’s dive in together;

  • checkpoints: A Checkpoint is the primary means for validating data in a production deployment of Great Expectations. A Checkpoint uses a Validator to run one or more Expectation Suites against one or more Batches provided by one or more Batch Requests. Running a Checkpoint produces Validation Results and will result in optional Actions being performed if they are configured to do so.
  • expectations: An Expectation is a verifiable assertion about data. Unlike traditional unit tests, Great Expectations applies Expectations to data instead of code. For example, you could define an Expectation that a column contains no null values, and Great Expectations would run that Expectation against your data, and report if a null value was found.
  • plugins: Plugins extend Great Expectations’ components and functionality; any custom plugin you write is stored here.
  • profilers: A Profiler generates Metrics and candidate Expectations from data.
  • great_expectations.yml: Contains the configuration of Great Expectations deployment.
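For orientation, here is a trimmed sketch of the kind of content great_expectations.yml holds (the exact keys and defaults vary by version, so treat this as illustrative rather than authoritative):

```yaml
config_version: 3.0
datasources: {}   # populated later by `great_expectations datasource new`
stores:
  expectations_store:
    class_name: ExpectationsStore
  validations_store:
    class_name: ValidationsStore
expectations_store_name: expectations_store
validations_store_name: validations_store
data_docs_sites:
  local_site:
    class_name: SiteBuilder
```

You rarely edit this file by hand at the start; the CLI commands in the following sections fill it in for you.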

Connecting Data

As I stated at the beginning of the project, I will have training and test files to be used. I will store them in the data folder.

Image by the author

For Great Expectations to access the data files we can execute the command below at our terminal;

great_expectations datasource new

Through the CLI, Great Expectations asks where the data is stored, how the data will be processed, and the folder information, as shown below.

Image by the author

After we give the requested input, Great Expectations automatically creates a Jupyter Notebook.

Image by the author

In this notebook, the user is asked to provide the datasource_name. The remaining steps print out the YAML configuration and test it. As seen below, the test step passes, and the YAML configuration can access the datasets.

Image by the author

Creating Expectations

To create the expectations, we need to create an Expectation Suite. An Expectation Suite is a collection of verifiable assertions about data. To create an Expectation Suite we can execute the command below at our terminal;

great_expectations suite new

Similar to the data source creation step, we are asked to answer a few questions to guide Great Expectations. As the initial step, I will create the expectation suite automatically; afterward, I will edit it manually.

Image by the author

After we give the requested input, Great Expectations automatically creates a Jupyter Notebook. Because of the long list of columns, I had to split the notebook into two images.

Image by the author
Image by the author

For the sake of simplicity, I will create expectations for the Client_Income_Type and Own_House_Age columns. As we execute the cells, a report is generated;

Image by the author

Let’s focus on the information in the left panel;

  • Actions: We can visualize either all expectations or only the failed ones. If we want to edit the given expectations, we can run the command below in our terminal;

great_expectations suite edit train_dataset

  • Table of Contents: We generated table- and column-level expectations. From this panel, we can select which expectations to view.

Editing Expectations

These expectations were generated automatically. Even though they are a great starting point, they are not quite enough to go to production. At this stage, we can edit our suite either manually or interactively. I will proceed with the manual edit;

Image by the author

After we give the requested input, Great Expectations automatically creates a Jupyter Notebook. This time, it shows all the automatically generated expectations.

Image by the author

In this notebook, we can edit, add, or delete the predefined expectations to update the Expectation Suite. If we don’t want to proceed with the notebooks, we can edit the JSON file in our project folder instead.

Image by the author

Creating Checkpoints

So far, we have connected to our training data set and created the expectations. Next, we need to apply those expectations to the test data set to validate it. To do that, we create a checkpoint that executes the created Expectation Suite on the new data set. We run the command below in our terminal;

great_expectations checkpoint new train_data_checkpoint

Great Expectations automatically creates a Jupyter Notebook. This time, it shows the checkpoint information and on which data we are going to apply this checkpoint.
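Under the hood, the notebook boils down to a checkpoint configuration roughly like the following (a hedged sketch: the suite and checkpoint names match this project, while the datasource and asset names are placeholders and other keys are version-dependent):

```yaml
name: train_data_checkpoint
config_version: 1.0
class_name: Checkpoint
validations:
  - batch_request:
      datasource_name: my_datasource       # name given during `datasource new`
      data_asset_name: Test_Dataset.csv    # hypothetical test file name
    expectation_suite_name: train_dataset
```

Pairing a batch request with an expectation suite is the whole job of a checkpoint: it says which data to fetch and which assertions to run against it.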

Image by the author
Image by the author

Validating the Test Data with a Checkpoint

After we create the checkpoint, we use it to validate the test data set. There are various ways to execute this step;

  • Use the Jupyter Notebook above and execute the last cell to create the report as below;
Image by the author
Image by the author
  • Use the terminal to execute the saved checkpoint with the code below;

great_expectations checkpoint run first_checkpoint

Image by the author

Conclusion

Great Expectations is one of the most important libraries in the modern data stack. Especially for organizations where data quality, profiling, and documentation are crucial due to a continuously changing environment, it is natural to put Great Expectations at the heart of operations.

Thanks a lot for reading 🙏

If you are interested in data quality topics, you can check my other articles;

If you are interested in data and leadership topics, you can check out my glossary for all other articles.

If you want to get in touch, you can find me on Linkedin and Mentoring Club!
