Great Expectations — A Python Library Built for Data Quality

Ali Anwar
5 min read · Aug 10, 2023


Introduction:

Data engineers build systems to collect, transform, and deliver data for stakeholders to use. A data engineer's role is critical in ensuring that the data being delivered meets the expectations of each stakeholder's use case, whether that is analysis, machine learning, or something else. In this article, we will discuss data quality and useful methods for validating the completeness of your data pipelines using great_expectations.

Data Quality:

In Data Governance: The Definitive Guide (2021), data quality is defined by three main attributes:

Accuracy: Is the data factually correct and free of errors such as duplicate values?

Completeness: Do all required fields contain expected/valid values?

Timeliness: Are records available in a timely fashion?

The main attribute we will focus on is completeness. This involves validating whether the data fits the criteria it is intended for. The main checks to assess the completeness of data using great_expectations include validating:

  • row and column counts
  • whether a column exists and whether columns match an ordered list
  • whether values in a column fall within a given range
  • the range that the minimum and maximum values in a column are expected to fall between
  • expected string values in a column
  • whether the most common value is in a given set
  • whether values in a column are unique (free of duplicates)

Installing great_expectations:

Create and activate a virtual environment. Once it is activated, install great_expectations.

Note: great_expectations works with Python versions 3.7–3.10

pip3 install great_expectations

After great_expectations is installed inside the virtual environment, import it in a Python file.

import great_expectations as ge

# Load the CSV into a great_expectations dataset (a pandas DataFrame subclass)
df = ge.read_csv("employee_data.csv")
df

The data we will work with is in employee_data.csv: a small table of employee records with the columns Employee ID, Name, Age, Role, and Salary.
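
The CSV itself isn't reproduced in this copy of the article. A hypothetical stand-in, consistent with the row counts, minimums, maximums, and most-common values discussed in the checks below, might look like this (all names and figures are invented for illustration):

```python
import pandas as pd

# Hypothetical reconstruction of employee_data.csv: 10 rows, 5 columns,
# minimum Age of 25, maximum Salary of 187,000, and "Solutions Architect"
# as the most common Role. The real file's values are not shown in the article.
df = pd.DataFrame({
    "Employee ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Name": ["Ava", "Ben", "Cara", "Dan", "Eli",
             "Fay", "Gus", "Hana", "Ian", "Jo"],
    "Age": [25, 31, 28, 45, 38, 29, 52, 33, 41, 36],
    "Role": ["Data Engineer", "Solutions Architect", "Software Developer",
             "Solutions Architect", "Data Engineer", "Procurement Analyst",
             "Solutions Architect", "Software Developer", "Product Manager",
             "Solutions Architect"],
    "Salary": [95000, 150000, 88000, 187000, 102000,
               76000, 165000, 91000, 120000, 158000],
})
df.to_csv("employee_data.csv", index=False)
```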

great_expectations Methods:

Now that a virtual environment is running with great_expectations installed and the employee data is inside a dataframe, let’s begin exploring methods to validate the data.

(1) Row and column counts

If the column or row count differs from the expected count, it could indicate data corruption, incomplete data, or formatting issues. We can use the following method to validate the expected column count.

expect_table_column_count_to_be_between

The observed value is 5, indicating a total of 5 columns in the dataframe. However, we expected the total column count to be between 7 and 10, hence the success key produced a value of false.

expect_table_row_count_to_be_between

The observed value is 10, indicating a total of 10 rows in the dataframe. We expected the row count to be between 0 and 20, which it was, hence the success key produced a value of true.

(2) If a column exists and matches an ordered list

If an expected column does not exist, or the order of the columns does not match the expected order, it becomes difficult to identify patterns, trends, and anomalies. We can use the following methods to check whether a column exists and whether the columns match an expected order.

expect_column_to_exist

The success key produced a value of true as the Name column exists in the dataframe.

expect_table_columns_to_match_ordered_list

The order of the columns is as follows: Employee ID, Name, Age, Role, Salary. If the order is switched, the check produces a success value of false, as that is not the order the columns follow.

(3) Values in a column within a certain range

Checking whether the values in a column fall within a certain range helps identify outliers and ensures that values land where we expect them to. great_expectations uses the following method to check whether values fall within the provided range.

expect_column_values_to_be_between

The success key returns a value of true: the values in the Salary column all fall between the min and max values provided.

(4) Range that the minimum and maximum values in a column are to be between

In a particular column, we may expect the lowest and highest values to each fall within a certain range. Not validating the minimum and maximum values may allow outliers in our dataframe to go unnoticed. great_expectations has two methods to verify the range of minimum and maximum values.

expect_column_min_to_be_between

The success key returns a value of false. The lowest value in the Age column is observed as 25, which falls outside the range we provided for the expected minimum, hence the expectation failed.

expect_column_max_to_be_between

The success key has a value of true. The highest value in the Salary column is observed as 187,000. The min_value we provided was 150,000 and the max_value was 200,000, so the observed maximum fell within the range provided.

(5) Expected string values in a column

We may expect a column to contain values from a particular set of strings. The method below helps identify whether the string values in the column belong to that set.

expect_column_values_to_be_in_set

The success key is true. The unexpected count was 2, meaning that 2 of the 10 rows contained values outside the set. These can be observed in the Role column of employee_data.csv as Procurement Analyst and Product Manager.

(6) Most common value in a given set

In a column, you may expect a particular value to be the most common one, i.e., the value that repeats the most. Here's a way to check.

expect_column_most_common_value_to_be_in_set

The success key is false. We expected Data Engineer to be the most common value in the Role column; however, it was Solutions Architect. If we instead add Solutions Architect to the set, the success key value becomes true.

(7) Expected column values to be unique (free of duplicates)

In a column, you may expect values to be unique, i.e., free of duplicates. Duplicate values can distort analysis results or violate constraints such as primary keys that must not repeat. This is how we can utilize the method below.

expect_column_values_to_be_unique

The success key is true. Each of the values in the Employee ID column is unique (does not repeat more than once). If we instead look at the Role column, the success key is false, since values such as Data Engineer, Software Developer, and Solutions Architect repeat more than once.

Conclusion

great_expectations provides a streamlined approach to validating the completeness of a dataframe. It is a valuable tool with built-in methods for quality checks within data pipelines. Accordingly, it allows data engineers to ensure that the data being delivered to stakeholders (i.e., data analysts, data scientists, AI/ML engineers, software developers) fits their respective use cases.

References:

Great Expectations. (2023). Great Expectations documentation. Retrieved from https://docs.greatexpectations.io/

Ashdown, J., Eryurek, E., Gilad, U., Kibunguchy-Grant, A., & Lakshmanan, V. (2021). Data Governance: The Definitive Guide. O'Reilly. Retrieved from https://learning.oreilly.com/library/view/data-governance-the/9781492063483/
