Data quality: Relevance in modern-day data engineering pipelines

Iswarya K
Litmus7 Systems Consulting
Jul 14, 2022

“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so data must be broken down, analyzed for it to have value.” –Clive Humby, 2006

Oftentimes, data contains inaccurate or missing information, falls out of expected bounds, does not match the expected semantics, or does not represent the population it is intended to represent [1]. Bad data in our systems is quite detrimental, because it is the “raw ingredient” for all the analytics, machine learning and AI-based decisioning systems that enterprises use. Because of this, the emphasis on data quality is more important now than ever before, and data quality should ideally be a core component of DataOps and of any data engineering pipeline.

Gartner predicts that by 2022, 70% of organizations will rigorously track data quality levels via metrics, improving it by 60% to significantly reduce operational risks and costs. As per Gartner’s 2020 Magic Quadrant for Data Quality Solutions, poor data quality costs organizations an average of $12.8 million every year [2].

How do we measure data quality?

So what exactly is quality data, and how do we assess whether the data we have is of good quality? At a very high level, data is of good quality if its consumers can carry out their activities and achieve their intended goals with satisfaction and ease; the better the quality, the greater the value that can be derived from the data. However, rather than waiting for consumers to discover data issues, how do we, as data engineers and data architects, assess data quality ourselves? There are five main criteria that can be used to measure data quality [3][4]:

  • Accuracy: Is the data correct? Does it represent real-world scenarios?
  • Relevancy: Do we actually need this information, and does it meet our requirements?
  • Completeness: Is the data comprehensive, covering all necessary details?
  • Timeliness: How up to date is the information?
  • Consistency: Is the data consistent and cross-referenceable across sources?

In line with the “12 Actions to Improve DQ” by Gartner [2], these are some of the steps that a data engineering team can adopt to measure, alert, and act on data quality early in the pipeline:

Data profiling and control of incoming data

  • Data format and data patterns
  • Data consistency
  • Data value distributions and anomalies
  • Completeness of the data

Careful data pipeline design to avoid duplicate data

Enforcement of data integrity

Integration of data lineage into the data pipeline

With this premise we come to the crux of the blog. We were looking at some of the new open-source data quality products and packages available, and a couple of interest were Great Expectations and Deequ. There are others, such as Flyte and Pandera.

We have tried two of these, Great Expectations (with Python and PySpark) and Deequ (with PySpark), on some Kaggle datasets infused with incorrect and inaccurate data. In this blog we will cover Great Expectations and how to use it in a Python-based pipeline.

What is Great Expectations?

Great Expectations is a data quality and validation tool for automated data profiling and for identifying issues in the data with respect to certain “expectations”, or constraints, that we set. Great Expectations also helps with data documentation: many teams do not update their documentation regularly, which makes it go stale, and Great Expectations solves this by generating documentation from the expectations themselves. It also provides a wide range of functions for data validation.

Great Expectations can be used with both Python and PySpark, but it works especially well with Python.

Integrations

Supported data sources — Pandas data frames, Spark data frames, and SQL databases via SQLAlchemy.

By configuring metadata stores, it lets us store the relevant metadata — the expected data type of each column, the expectations (i.e., the rules we want to execute on each column), and the validation results — in file systems, database backends, or cloud storage such as S3 and Google Cloud Storage.

Key features:

Expectations — These are the rules or constraints that we want to assert about our data. Under the hood, these assertions are declared in an expression language implemented as simple Python methods.

Automated data profiling — Great Expectations makes it easier to write pipeline tests by providing automated data profiling. The library profiles the data to obtain basic statistics and automatically generates an expectation suite based on the observed data.
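As a quick illustration, here is a minimal profiling sketch. It assumes the classic v2-style API that the rest of this post uses and the BasicDatasetProfiler that ships with the library; the CSV is the Supermarket Sales file used in the walkthrough below.

import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# Load a CSV as a Great Expectations dataset
dataset = ge.read_csv("supermarket_sales_1.csv")

# Profile it: returns a generated expectation suite plus the
# validation results observed while profiling
suite, profiling_results = BasicDatasetProfiler.profile(dataset)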

Data validation — Once the expectations are created, data can be loaded in one batch or several batches and validated against the expectation suite. Great Expectations produces output that tells us whether each expectation in the suite passed or failed, and returns the unexpected values that caused a failure. This feature helps in debugging data issues.
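For example, here is a sketch of validating a fresh batch against previously declared expectations. It assumes a dataset like the sales_ge created later in this post; supermarket_sales_2.csv is a hypothetical second batch of the same feed.

# Collect the expectations declared so far into a suite
# (by default, only expectations that last succeeded are kept)
suite = sales_ge.get_expectation_suite()

# Validate a fresh batch against that suite
new_batch = ge.read_csv("supermarket_sales_2.csv")
results = new_batch.validate(expectation_suite=suite)
print(results.success)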

Data docs — Great Expectations renders expectations as clean, human-readable documentation called Data Docs. These HTML docs contain both the expectation suites and the data validation results.

Great Expectations can be installed with pip:

pip install great_expectations

To import the library, use:

import great_expectations as ge

Let’s create a data frame using the Supermarket Sales data:

import pandas as pd

sales_ge = ge.read_csv("supermarket_sales_1.csv")
sales_ge.head()

If we want to check whether the source columns are as expected, we can define the expected number of columns and the column names in the metadata, and then write a rule in Great Expectations as below:

sales_ge.expect_table_columns_to_match_ordered_list(
    column_list=['Invoice ID', 'Branch', 'City', 'Customer type', 'Gender',
                 'Product line', 'Unit price', 'Quantity', 'Tax 5%', 'Total',
                 'Date', 'Time', 'Payment', 'cogs', 'gross margin percentage',
                 'gross income', 'Rating'])

Output:

Great Expectations functions return their output in JSON format. In this case, we should look at the success field to see whether it is true or false; if it is false, we need to look into the reasons.
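Illustratively, the returned object is shaped roughly as below. This is a sketch, not actual run output, and the observed column list is truncated here.

{
  "success": true,
  "result": {
    "observed_value": ["Invoice ID", "Branch", "City", "Customer type", "..."]
  }
}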

We can also do a record count check. Suppose we expect the row count to be between 1 and 1000. Let’s check whether the data frame matches our expectation:

sales_ge.expect_table_row_count_to_be_between(min_value=1,max_value=1000)

Output:

Here we received 1006 records from the source, whereas we specified a range of 1 to 1000. Hence, in the output, “success” is false: the observed row count of 1006 is greater than the maximum value.
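A sketch of the shape of that output, with the values described above:

{
  "success": false,
  "result": {
    "observed_value": 1006
  }
}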

We can do a uniqueness check. In our data set, let us assume the “Invoice ID” column should always be unique and that duplicate values might create problems. We can check the uniqueness of the column with:

sales_ge.expect_column_values_to_be_unique(column='Invoice ID',result_format="BOOLEAN_ONLY")

Output:

Here we had infused the dataset with some duplicate records; hence you can see that “success” is false.
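With result_format="BOOLEAN_ONLY", the returned object carries essentially just the success flag, roughly:

{
  "success": false
}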

The output above just lets us know whether the test, i.e. the expectation we set, succeeded or not. If we want further details, we can control the verbosity of the result.

Great Expectations supports four values for result_format:

  • BOOLEAN_ONLY
  • BASIC
  • SUMMARY
  • COMPLETE

Each successive value includes more detail and so can support different use cases for working with Great Expectations, including interactive exploratory work and automatic validation.

To get detailed information in the output, we can set result_format="COMPLETE". The result is then generated with all available justification for why an expectation was or was not met.

sales_ge.expect_column_values_to_be_unique(column='Invoice ID',result_format="COMPLETE")

Note: The output is not displayed here as it is lengthy.
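For reference, a COMPLETE result for the uniqueness check carries fields along these lines. This is an abbreviated sketch with placeholder values, not the actual output.

{
  "success": false,
  "result": {
    "element_count": 1006,
    "unexpected_count": 12,
    "unexpected_percent": 1.19,
    "partial_unexpected_list": ["750-67-8428", "..."],
    "unexpected_list": ["..."],
    "unexpected_index_list": ["..."]
  }
}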

Most of the commonly occurring checks in a data engineering pipeline are available within Great Expectations; some of them are shown below:

(1) We can also check whether the Invoice ID column matches a given regular expression:

sales_ge.expect_column_values_to_match_regex(column='Invoice ID', regex='[0-9]{3}-[0-9]{2}-[0-9]{4}',result_format="COMPLETE")

From the partial_unexpected_list in the result, we can identify the invoice IDs that do not follow the pattern.

(2) We can check whether the values in a categorical column belong to a given value set:

sales_ge.expect_column_values_to_be_in_set(column='Branch', value_set=['A', 'B', 'C'],result_format="SUMMARY")

(3) Missing values can reduce the statistical power of a study and can lead to improper decisions. We can check for missing values in any column using the function available in Great Expectations:

sales_ge.expect_column_values_to_not_be_null(column='Customer type')

(4) We can also check that column entries are strings with lengths between a minimum and maximum value:

sales_ge.expect_column_value_lengths_to_be_between(column="Customer type",max_value=10,result_format="COMPLETE")
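Finally, expectations declared interactively like the ones above can be persisted and replayed against future batches. A minimal sketch, assuming the v2-style API used throughout this post; the suite file name and the second CSV are hypothetical.

# Save the expectations declared on sales_ge to a JSON suite file
sales_ge.save_expectation_suite("sales_expectation_suite.json")

# Later, validate a new batch of the same feed against the saved suite
new_sales = ge.read_csv("supermarket_sales_2.csv")
results = new_sales.validate(expectation_suite="sales_expectation_suite.json")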

Great Expectations is a handy tool when it comes to analysing and profiling data, and we have covered some basics of how to leverage it within a Python-based pipeline. In our next blog we will cover how Great Expectations can be used in a Spark environment, as Spark is increasingly the technology of choice in modern data engineering and machine learning pipelines. To explore the various functions of Great Expectations further, please look into the official documentation.

References:

(1) https://towardsdatascience.com/is-data-really-the-new-oil-in-the-21st-century-17d014811b88#:~:text=%E2%80%9CData%20is%20the%20new%20oil,Clive%20Humby%2C%202006

(2) https://www.gartner.com/smarterwithgartner/how-to-improve-your-data-quality

(3) https://towardsdatascience.com/7-steps-to-ensure-and-sustain-data-quality-3c0040591366

(4) https://www.precisely.com/blog/data-quality/5-characteristics-of-data-quality

