Introduction:
Data engineers build systems to collect, transform, and deliver data for stakeholders to use. A data engineer's role is critical in ensuring that the data being delivered meets the expectations of each stakeholder's use case, whether that is analysis, machine learning, or something else. In this article, we will discuss data quality and useful methods for validating the completeness of your data pipelines using great_expectations.
Data Quality:
In Data Governance: The Definitive Guide (2021), data quality is defined by three main attributes:
Accuracy: Is the data factually correct and free of duplicate values?
Completeness: Do all required fields contain expected/valid values?
Timeliness: Are records available in a timely fashion?
The main attribute we will focus on is completeness. This involves validating whether the data fits the criteria it is intended for. The main checks to assess the completeness of data using great_expectations include validating:
- row and column counts
- whether a column exists and whether columns match an ordered list
- that values in a column fall within an expected range
- the range that a column's minimum and maximum values are expected to fall between
- expected string values in a column
- the most common value in a given set
- that values in a column are unique (free of duplicates)
Installing great_expectations:
Create and activate a virtual environment. Once it is activated, install great_expectations.
Note: great_expectations works with Python versions 3.7–3.10
pip3 install great_expectations
After great_expectations is installed inside the virtual environment, import it in a Python file.
import great_expectations as ge

# read_csv wraps the CSV in a great_expectations dataset backed by pandas,
# so expectation methods can be called directly on df
df = ge.read_csv("employee_data.csv")
df
The data we will work with is in employee_data.csv.
great_expectations Methods:
Now that a virtual environment is running with great_expectations installed and the employee data is loaded into a dataframe, let's begin exploring methods to validate the data.
(1) Row and column counts
If the column or row count differs from the expected count, it could indicate data corruption, incomplete data, or formatting issues. We can use the following methods to validate the expected column and row counts.
expect_table_column_count_to_be_between
expect_table_row_count_to_be_between
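Here is a minimal sketch of how these can be called on the dataframe. The expected counts below are assumptions for illustration, since they depend on the actual shape of employee_data.csv.
# Validate the shape of the table; the bounds below are illustrative assumptions
result = df.expect_table_column_count_to_be_between(min_value=4, max_value=4)
print(result["success"])

result = df.expect_table_row_count_to_be_between(min_value=1, max_value=1000)
print(result["success"])
Each expectation returns a validation result, and its "success" field tells us whether the check passed.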
(2) Whether a column exists and columns match an ordered list
If an expected column does not exist, or the order of the columns does not match the expected order, it becomes difficult to identify patterns, trends, and anomalies. We can use the following methods to check whether a column exists and whether the columns match an expected order.
expect_column_to_exist
expect_table_columns_to_match_ordered_list
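As a sketch, assume the employee data has columns named employee_id, name, department, and salary; these names are hypothetical, chosen only for illustration.
# Check that a single column exists
print(df.expect_column_to_exist("employee_id")["success"])

# Check that the columns appear in exactly this order
result = df.expect_table_columns_to_match_ordered_list(
    ["employee_id", "name", "department", "salary"]
)
print(result["success"])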
(3) Values in a column within an expected range
Checking whether the values in a particular column fall within a certain range helps identify outliers and ensures the values land where we expect them to. great_expectations uses the following method to check whether values fall within the given range.
expect_column_values_to_be_between
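For example, using the hypothetical salary column, we might expect every value to fall between 30,000 and 200,000; both the column name and the bounds are assumptions for illustration.
# Flag any salary outside the expected range
result = df.expect_column_values_to_be_between("salary", min_value=30000, max_value=200000)
print(result["success"])
print(result["result"]["unexpected_count"])  # number of values outside the range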
(4) Range that the minimum and maximum values in a column fall between
In a particular column, we may expect the lowest and highest values to fall within certain ranges. Not checking the minimum and maximum values means outliers in our dataframe can go unnoticed. great_expectations has two methods to verify the ranges of the minimum and maximum values.
expect_column_min_to_be_between
expect_column_max_to_be_between
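Continuing with the hypothetical salary column, a sketch might look like this; the bounds are again assumptions.
# The lowest salary should land between 30,000 and 50,000
print(df.expect_column_min_to_be_between("salary", min_value=30000, max_value=50000)["success"])

# The highest salary should land between 100,000 and 200,000
print(df.expect_column_max_to_be_between("salary", min_value=100000, max_value=200000)["success"])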
(5) Expected string values in a column
We may expect a column to contain only a particular set of strings. The method below checks whether the string values in the column belong to that set.
expect_column_values_to_be_in_set
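For instance, assuming a hypothetical department column that should only contain a fixed set of names (the set itself is an assumption for illustration):
# Every department value must belong to the allowed set
result = df.expect_column_values_to_be_in_set(
    "department", ["Engineering", "Sales", "Marketing", "HR"]
)
print(result["success"])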
(6) Most common value in a given set
In a column, you may expect a particular value to be the most common one, i.e., the value that appears most often. Here's a way to check.
expect_column_most_common_value_to_be_in_set
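A sketch using the same hypothetical department column and an assumed expected value:
# Check that the most frequent department is one we expect
result = df.expect_column_most_common_value_to_be_in_set("department", ["Engineering"])
print(result["success"])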
(7) Expected column values to be unique (free of duplicates)
In a column, you may expect values to be unique, i.e., free of duplicates. Duplicate values can distort analysis results or violate the rule that a primary key must not repeat. The method below checks for this.
expect_column_values_to_be_unique
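For example, assuming a hypothetical employee_id column that acts as a primary key:
# Every employee_id should appear exactly once
result = df.expect_column_values_to_be_unique("employee_id")
print(result["success"])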
Conclusion
great_expectations provides a streamlined approach to validating the completeness of a dataframe. It is a valuable tool with built-in methods for quality checks within data pipelines. Accordingly, it allows data engineers to ensure that the data delivered to other stakeholders (i.e., data analysts, data scientists, AI/ML engineers, software developers) fits their respective use cases.
References:
Great Expectations. (2023). Great Expectations documentation. Retrieved from https://docs.greatexpectations.io/
Ashdown, J., Eryurek, E., Gilad, U., Kibunguchy-Grant, A., & Lakshmanan, V. (2021). Data Governance: The Definitive Guide. O'Reilly. Retrieved from https://learning.oreilly.com/library/view/data-governance-the/9781492063483/