A Beginner’s Tutorial on Great Expectations

Great Expectations is a tool for validating, documenting, and profiling your data, and for checking its quality, in a data engineering context.

Ketan Sahu
Nerd For Tech
4 min read · Jun 26, 2021

Photo by Mika Baumeister on Unsplash

Data quality is a common problem in the data science world, affecting data engineers, analysts, machine learning engineers, scientists, and every other end-user of data. If the data quality is poor, an analyst may produce inaccurate analytics, and a machine learning model may behave differently than expected. If a management team is given misleading graphs or analytics, it may make poor decisions. Hence, it is vital to validate the data before we use it to make decisions, create analytics, or develop a machine learning model. If the data has known problems, such as redundancy, make sure they are identified beforehand and documented.

Great Expectations is a relatively new tool, much discussed in the data engineering ecosystem, that aims to tackle such issues. I recently looked into it and tried it out. In this article, I will cover some of the basics of Great Expectations with examples.

While designing an ETL or ELT pipeline, many of us have faced the problem of making sure that the data we extract from different sources is accurate. Some common problems are:

🛑 A column might include an unexpected null value or some other unexpected value

🛑 The table might contain extra columns or rows

🛑 The columns might be ordered differently than expected in the table

🛑 The table might have duplicate rows or columns
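
To make the problems listed above concrete, here is a minimal plain-Python sketch of how each one could be detected by hand. It does not use Great Expectations; the rows, the column names, and the helper functions (`find_nulls`, `columns_match`, `find_duplicate_rows`) are hypothetical and exist only for illustration.

```python
# Hypothetical rows standing in for an extracted table; not the article's dataset.
EXPECTED_COLUMNS = ["employee_id", "department", "age"]

rows = [
    {"employee_id": 1, "department": "sales", "age": 31},
    {"employee_id": 2, "department": None, "age": 27},     # unexpected null
    {"employee_id": 1, "department": "sales", "age": 31},  # duplicate of the first row
]


def find_nulls(rows, column):
    """Return the indexes of rows whose value in `column` is None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def columns_match(rows, expected):
    """True only if every row has exactly the expected columns, in that order."""
    return all(list(row.keys()) == expected for row in rows)


def find_duplicate_rows(rows):
    """Return the indexes of rows that repeat an earlier row."""
    seen = set()
    duplicates = []
    for i, row in enumerate(rows):
        key = tuple(row.items())
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates


null_rows = find_nulls(rows, "department")
order_ok = columns_match(rows, EXPECTED_COLUMNS)
duplicate_rows = find_duplicate_rows(rows)
print(null_rows, order_ok, duplicate_rows)  # [1] True [2]
```

Writing and maintaining checks like these by hand gets tedious quickly, which is the gap a dedicated validation tool is meant to fill.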

I’ve listed only a few issues here; many more data quality problems occur in practice. Great Expectations helps us with this validation. It can also document the data validation and quality tests together with their results, and present them in an easy-to-read user interface. A test in Great Expectations is called an expectation (referred to here as an expectation test). Below, I present a simple test case scenario, run some expectation tests against it, and show the generated document.

Test case scenario

Let’s consider a company, XYZ.io, with 20 employees in total. The company has only 3 departments (sales, marketing, and development). The employees come from only 5 states (Alabama, Colorado, Delaware, Florida, and Washington), and their ages are between 20 and 40. Each employee’s salary is between 80,000 and 130,000. The dataset also records each employee’s total prior experience, hire date, and the ID of the customer the employee is currently looking after. See the dataset below.

data by Author

Expectation Test

GistFile by Author
  1. expect_column_values_to_be_in_set: checks that a column includes only expected values.
expect_column_values_to_be_in_set test by Author

In the test above, we validated that the column employee_location includes only Alabama, Colorado, Delaware, Florida, and Washington. The check passed, so success is true.

In the example below, we will see a failed scenario.

expect_column_values_to_be_in_set test by Author

In the example above, we tried to validate that the column department includes only the values sales, marketing, and development. This time the expectation failed: success is false.
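
Both outcomes above can be mimicked with a small plain-Python stand-in for the "in set" check. This is only a sketch of the logic, not the Great Expectations API: the function `expect_values_in_set`, the shape of its result, and the sample rows (including the `finance` value that triggers the failure) are assumptions made for illustration. The real expectation returns a richer result object.

```python
def expect_values_in_set(rows, column, value_set):
    """Plain-Python stand-in for an 'in set' expectation (not the library API).

    Returns a dict with a `success` flag plus the values that fell outside the set.
    None values are skipped here; null checks are treated as a separate test.
    """
    allowed = set(value_set)
    values = [row[column] for row in rows if row[column] is not None]
    unexpected = [v for v in values if v not in allowed]
    return {
        "success": not unexpected,
        "element_count": len(rows),
        "unexpected_count": len(unexpected),
        "unexpected_list": unexpected,
    }


# Hypothetical rows; the article's real dataset is not reproduced here.
employees = [
    {"employee_location": "Alabama", "department": "sales"},
    {"employee_location": "Colorado", "department": "marketing"},
    {"employee_location": "Washington", "department": "finance"},  # not an allowed department
]

location_result = expect_values_in_set(
    employees,
    "employee_location",
    ["Alabama", "Colorado", "Delaware", "Florida", "Washington"],
)
department_result = expect_values_in_set(
    employees, "department", ["sales", "marketing", "development"]
)

print(location_result["success"])  # True
print(department_result["success"], department_result["unexpected_list"])  # False ['finance']
```

Returning the offending values alongside the success flag is a deliberate choice in this sketch: a bare true or false says that something is wrong, while the list of unexpected values says what to fix.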

As you can see, an expectation test reports whether the data is valid with a success value of true or false. In this article, I’ve shown only a single type of expectation test. However, many other expectation tests can be run to validate your data and check its quality. In this GitHub Repository, you can see the different types of expectation tests I performed on this data.
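
As one example of another kind of check, the age and salary rules from the scenario are range tests. The library offers an expectation named expect_column_values_to_be_between for this purpose; the sketch below only imitates the idea in plain Python. The function `expect_values_between`, its inclusive bounds, and the sample rows are assumptions for illustration, not the library's implementation.

```python
def expect_values_between(rows, column, min_value, max_value):
    """Plain-Python stand-in for a range expectation (bounds are inclusive here)."""
    unexpected = [
        row[column]
        for row in rows
        if row[column] is not None and not (min_value <= row[column] <= max_value)
    ]
    return {"success": not unexpected, "unexpected_list": unexpected}


# Hypothetical rows for illustration only.
employees = [
    {"age": 24, "salary": 95_000},
    {"age": 38, "salary": 130_000},
    {"age": 45, "salary": 120_000},  # age outside the 20 to 40 range
]

age_result = expect_values_between(employees, "age", 20, 40)
salary_result = expect_values_between(employees, "salary", 80_000, 130_000)

print(age_result)     # {'success': False, 'unexpected_list': [45]}
print(salary_result)  # {'success': True, 'unexpected_list': []}
```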

It is important to remember that this article focuses only on the basics of Great Expectations. In a production environment, Great Expectations is set up differently. Check this article by the Great Expectations team to understand how to install and set up Great Expectations.
