Level Up on Data Quality

Your Next Data Quality Solution Using Databricks & Great Expectations

Christian Gert Hansen
Published in Backstage Stories · 10 min read · Mar 25, 2024


Welcome to another customer story! Join me on a journey filled with ideas, frustrations, learnings, and hopefully, insights you can apply in your own role.

This time, I will take us through my data quality learnings since I started working as a data engineer. Additionally, we will explore a data quality solution in Databricks utilizing the open-source library, Great Expectations.

Topics to be covered:

  1. Standard Data Quality Checks
  2. Data Quality Checks From Downstream Users
  3. Data Quality Does Not Come With “Blind Trust”
  4. Databricks & Great Expectations - Implementing Transparent Data Quality Solutions
  5. The “Traffic Light” Solution
  6. Automating Downstream Applications
  7. Importance of KPIs for Top Management

In my role as a Data Engineer, one of my primary responsibilities is to deliver reliable and trustworthy data to downstream users. I have been dedicated to this mission since the beginning of my career. Today, I believe we are getting close to a very good solution.

Before going into the solution, I would like to provide some context by sharing my journey and, most importantly, the reasons behind my perspective on today’s data quality challenges.

Learning Data Quality Checks from Source to Data Warehouse

My data quality learnings began when I started as an Associate Business Intelligence Consultant, working alongside my mentor and boss, Jakob Fløe Sørensen. He showed me how to become proficient in T-SQL and how to implement source data quality checks into our SSIS packages when extracting, transforming, and loading data from various sources into our customers’ SQL Data Warehouses.

These validation checks covered a spectrum of critical criteria, including identifying primary keys, checking for duplicates, and performing not-null checks. This marked the beginning of my data quality journey.

Learning That Data Quality Checks Come from Both Sides!

My next significant learning in the data quality landscape was realizing that data validation checks did not solely originate from the source side. I began receiving complaints about specific data quality issues, and suddenly, I found myself implementing data quality checks defined by downstream users. At this point, my validation checks came from both the source side and the downstream users.

Having downstream users return with complaints about data is extremely useful. Oftentimes, these users have unique perspectives or requirements that may not have been initially considered in our data validation checks.

Allow me to share two examples:

1. On one occasion, an end user discovered rows where the “gross_amount” column was less than the “net_amount” column. This anomaly was promptly reported by the downstream users.

2. Another, less obvious case was the rule that when the “status” column is marked as “Invoiced,” the “shipment_date” column can never be NULL.

As a data engineer, accommodating and implementing both these rules in the validation checks for the specific tables is crucial to ensure data quality and integrity.
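
To make these rules concrete, here is a minimal PySpark sketch of how the two checks could be expressed before loading the data. The column names come from the examples above; the DataFrame variable and the way violations are surfaced are illustrative assumptions, not the exact implementation described later in this story.

from pyspark.sql import functions as F

# Assumes `df` is the invoice dataset being validated (illustrative only)
# Rule 1: gross_amount must never be smaller than net_amount
gross_violations = df.filter(F.col("gross_amount") < F.col("net_amount")).count()

# Rule 2: an "Invoiced" row must always have a shipment_date
shipment_violations = df.filter(
    (F.col("status") == "Invoiced") & F.col("shipment_date").isNull()
).count()

if gross_violations > 0 or shipment_violations > 0:
    print(f"Rule violations found: gross_amount={gross_violations}, shipment_date={shipment_violations}")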

The End User Does Not Work With Blind Trust

I think people within the business intelligence space understand the critical importance of earning trust in your data. One of our greatest fears is delivering untrustworthy data: it erodes the organization’s trust, and rebuilding that trust is exceptionally challenging.

We should be safe with the data quality checks already described, right? The answer is “No”. Even after implementing data quality checks from both the source side and the downstream user side, we still lack an essential part in building a good data quality setup.

This became apparent to me when a data scientist approached me, believing he had identified a data quality issue. I assured him everything was as it should be and that the data quality checks had successfully passed. He then expressed, with an irritated tone, the significant amount of time he had spent investigating the data. During our conversation, he said:

“Unfortunately, we do not work with blind trust.”

I then decided to address the fact that my previous solutions assumed this “blind trust”. Of course, end users will question and explore the data if they see something odd. But can we prevent them from spending unnecessary time doing so?

Searching For a Better Solution

My colleague Radu Mihut and I thought about approaches we could take to create a transparent, trusted relationship between data and downstream users. Initially, we focused on building dashboards to display data quality metrics for various tables. However, we encountered two problems with this approach:

  1. Diverse Dashboards: Different teams tend to construct their data quality dashboards in varying ways, leading to inconsistency.
  2. Assumed Comfort: We often assume that downstream users are comfortable with the solutions we build, without ensuring that their needs and preferences are met.

In essence, what we perceived as a great solution did not necessarily align with how other data teams preferred to build their solutions. This resulted in different teams creating their own Data Quality Dashboards. Consequently, we encountered the problem of inconsistency across the organization. Moreover, such solutions were prone to accumulating technical debt and eventually fading away as the organization evolved its structure.

I would like to conclude this section with a quote from one of my highly respected colleagues, Febiyan Rachman:

“If you have two places to go and look for data quality, you actually have nowhere to go and look for it.”

The quote gave us an even better understanding of the importance of consistency and centralization in a good data quality setup within an organization.

Solution — Combining Databricks and Great Expectations

To be completely honest, I have a great affinity for Databricks. It is a technology I engage with daily, and I frequently advocate for its adoption in various use cases with new clients. Therefore, it seemed only natural to seek a solution that could be built within the Databricks environment. However, I also aimed to create something capable of checking data quality for datasets residing on different platforms, including Databricks, Synapse, Fabric, and SQL Server. Thus, we embarked on a journey to develop such a solution.

With the guidance and recommendations from my previous colleague, João Tiago Viegas, we decided to invest time and effort into integrating Great Expectations as a fundamental component of our platform.

Let us fast forward to the end result! To give you a small taste of what we built, another colleague of mine, Magnus Mortensen, and I created a demo with the results hosted on an Azure Storage account. Here you will find a website with some table data quality checks. Feel free to give it a look!

Great Expectations Data Quality Checks.

What you see are the results built on top of a simple, modular data quality solution. With this solution, we try to foster transparency and trust, as everyone can see the daily checks on our tables.

If you would like to create data quality checks on your tables and make them transparent for your organization, I have provided you with a link below to get started.

Get Started With Your Data Quality Solution.
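
To give an impression of what such a check can look like in code, here is a minimal sketch using Great Expectations’ legacy dataset API on a Spark DataFrame in a Databricks notebook. It is an illustration under assumed names (the DataFrame `df` and the column “order_id”), not the exact code behind the solution above, and newer Great Expectations (GX) releases favour a context-and-validator workflow instead.

from great_expectations.dataset import SparkDFDataset

# Assumes `df` is a Spark DataFrame loaded in the notebook (illustrative only)
ge_df = SparkDFDataset(df)

# Standard source-side checks: the primary key must be unique and never null
ge_df.expect_column_values_to_be_unique("order_id")
ge_df.expect_column_values_to_not_be_null("order_id")

# Run all registered expectations and inspect the overall outcome
results = ge_df.validate()
print(results.success)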

Learning That Validation Checks Differ — The Traffic Light Solution

After completing the solution using Databricks and Great Expectations, I discovered a crucial aspect that was necessary before considering the solution as a “good solution.” It involved distinguishing between different types of validation checks.

This realization came through the insights shared by another colleague of mine, Silvio de Melo Liborio, who drew from his previous experience and introduced me to the concept of creating a “Traffic Light” solution.

Photo by CARTER SAUNDERS on Unsplash

The “Traffic Light” solution is designed to let us handle data quality checks differently depending on their severity. In the solution using Databricks and Great Expectations, all data would pass through the pipeline, regardless of its quality. The website would then show SUCCESS or FAILURE, and it would be left to the downstream user to decide whether to use the data or not. However, this approach gave too much responsibility and power to the downstream users. Here we need to find the right balance!

RED — CRITICAL FAILURE

As data engineers, we should still implement certain data validation checks where, if the check fails, the pipeline does not proceed and the data is not persisted, making it unavailable to the end user. Examples include duplicate primary keys, missing values in datetime columns, or, more generally, specific areas of missing data.

If such validation checks fail, we see the failure as RED. The pipeline will fail, the data will not be persisted, and an email will be triggered to notify both the data team and the downstream users. Additionally, the modular data quality solution in Databricks and Great Expectations would not proceed until the issue has been resolved.
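
As a minimal sketch, a RED check can stop a Databricks job simply by raising an exception before anything is persisted. The snippet assumes a Great Expectations validation result named `results` for the critical checks; the exact failure handling and email notification in the actual solution may differ.

# Illustrative only: `results` holds the validation outcome of the critical checks
critical_failures = [
    r.expectation_config.expectation_type
    for r in results.results
    if not r.success
]

if critical_failures:
    # Raising stops the notebook/job, so the data is never persisted for downstream users
    raise Exception(f"RED - critical data quality failure(s): {critical_failures}")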

YELLOW — NON-CRITICAL FAILURE

Let us examine a validation check defined as YELLOW. Some data quality checks are important but not critical for all downstream applications. For these checks, we will not fail the data pipeline. Instead, we will let the end user decide whether to use the data or not. This approach provides flexibility for downstream users while the data engineer still retains full control.

The type of validation errors where you leave the decision to the downstream users would typically be less critical or very application-specific. For instance, a data science team might impose expectations on outliers and distributions of certain columns. These data quality checks may not be relevant for other teams using the data. And for such data quality checks, it makes sense not to automatically fail a data load.

We introduce YELLOW FAILURES to ensure one failed data quality check impacting one application does not adversely affect all other dependent downstream applications. Similarly, a failed data quality check on a column that is not relevant for a particular application should not cause the application to fail running on new data.
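
Continuing the sketch above, a YELLOW check does not raise an exception; the failures are only collected so the pipeline can proceed and downstream users can decide for themselves. Again, the variable names are illustrative assumptions.

# Illustrative only: collect non-critical failures instead of failing the pipeline
yellow_failures = [
    r.expectation_config.expectation_type
    for r in yellow_results.results
    if not r.success
]

check_status = "OK" if not yellow_failures else "FAILURE"
print(f"YELLOW check status: {check_status}, failed checks: {yellow_failures}")
# The pipeline continues; the failures are logged to an audit table (see below)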

GREEN — PASS

We see a data quality check as GREEN if it succeeds.

Automating Downstream Applications

With a modular data quality solution implemented in Databricks and Great Expectations, and with the possibility of YELLOW FAILURES, we need to define how our downstream users should utilize the data quality solution.

Can we expect the downstream users to manually check the data quality report before running their pipelines? While this approach may suffice for ad hoc investigation and analysis, downstream applications require automated runs.

To define the process of automating downstream pipelines in response to YELLOW FAILURES, collaboration between data engineers (the creators of the datasets) and downstream users is crucial. This collaboration aims to establish a common agreement on implementing conditional logic within the pipeline orchestration system.

The goal is to allow downstream applications to proceed with data, as YELLOW FAILURES may not have a significant impact. To facilitate this automation, I recommend implementing an audit table to log all YELLOW FAILURES for a given data load. The suggested schema for this table is as follows:

CREATE TABLE yellow_table_validations (
  dw_job_id STRING COMMENT 'ID of the Databricks Data Lakehouse job',
  dw_job_run_id STRING COMMENT 'ID of the Databricks Data Lakehouse job run',
  job_name STRING COMMENT 'Name of the job',
  job_task_name STRING COMMENT 'Name of the job task',
  domain_name STRING COMMENT 'Name of the domain',
  full_table_name STRING COMMENT 'Full name of the table',
  check_status STRING COMMENT 'Status of the data check',
  check_failures STRING COMMENT 'List of failed checks for the given load (comma-separated)',
  executed_by STRING COMMENT 'User who executed the job task',
  table_version INT COMMENT 'Version of the table',
  job_task_start_timestamp_utc TIMESTAMP COMMENT 'Start timestamp of the job task (UTC)',
  job_task_end_timestamp_utc TIMESTAMP COMMENT 'End timestamp of the job task (UTC)'
)

The table incorporates essential information obtained from workflow jobs and delta transaction logs. You can retrieve this information using the following code snippet in your notebook:

import json

# Fetch workflow job metadata from the notebook context
dict_job_run_metadata = json.loads(
    dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()
)

# Fetch metadata from the Delta transaction log (most recent table version first);
# `domain_name` and `table_name` identify the table that was just loaded
run_stats_df = spark.sql(f"DESCRIBE HISTORY hive_metastore.{domain_name}.{table_name}").limit(1)

# User who executed the job task
user_name = run_stats_df.select("userName").collect()[0][0]
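
From there, a row can be appended to the audit table. The snippet below is only a sketch of what that write might look like; the placeholder job values, the `check_status` and `yellow_failures` variables from the YELLOW example, and the target table name are assumptions rather than the exact implementation.

from datetime import datetime, timezone
from pyspark.sql import Row

# Illustrative placeholder values; in practice these come from
# dict_job_run_metadata and the job/task configuration
job_id, job_run_id = "123", "456"
job_name, job_task_name = "load_invoices", "validate_invoices"
job_task_start_ts = datetime.now(timezone.utc)

audit_row = Row(
    dw_job_id=job_id,
    dw_job_run_id=job_run_id,
    job_name=job_name,
    job_task_name=job_task_name,
    domain_name=domain_name,
    full_table_name=f"hive_metastore.{domain_name}.{table_name}",
    check_status=check_status,
    check_failures=",".join(yellow_failures),
    executed_by=user_name,
    table_version=int(run_stats_df.select("version").collect()[0][0]),
    job_task_start_timestamp_utc=job_task_start_ts,
    job_task_end_timestamp_utc=datetime.now(timezone.utc),
)

# Append one audit record per data load
spark.createDataFrame([audit_row]).write.mode("append").saveAsTable("yellow_table_validations")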

The downstream user can utilize the columns “full_table_name,” “check_status,” and “check_failures.”

  1. If the “check_status” is “OK,” downstream applications can proceed.
  2. In the case of “FAILURE,” downstream users can refer to the “check_failures” column and determine if the YELLOW NON-CRITICAL FAILURE affects their specific application. If not, they can proceed accordingly.

The above gives your downstream users the ability to automate their pipeline runs with the data quality checks included as a condition.
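
As an illustration, a downstream application could implement that conditional logic roughly as follows. The table name, the set of relevant checks, and the `run_downstream_pipeline` entry point are hypothetical and would differ per application.

# Illustrative only: the checks this particular application depends on
relevant_checks = {"expect_column_values_to_not_be_null", "expect_column_values_to_be_between"}

latest = (
    spark.table("yellow_table_validations")
    .filter("full_table_name = 'hive_metastore.sales.invoices'")
    .orderBy("job_task_end_timestamp_utc", ascending=False)
    .limit(1)
    .collect()[0]
)

failed_checks = set(latest.check_failures.split(",")) if latest.check_failures else set()

if latest.check_status == "OK" or not (failed_checks & relevant_checks):
    run_downstream_pipeline()  # hypothetical entry point of the downstream application
else:
    print(f"Skipping run: relevant yellow failures detected: {failed_checks & relevant_checks}")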

Do Not Forget KPIs for Top Management!

There is one crucial aspect we must not overlook: The Data Quality Key Performance Indicators (KPIs) for top management.

It is essential to measure and track the current state of data quality and to monitor progress over time. This will help you justify the time spent on data quality and demonstrate the impact the solution generates. How much does it cost us to make decisions on bad-quality data? How much time do we spend investigating data quality? To some extent, you can easily put a monetary value on this.

The data quality checks we have implemented provide valuable insights into issues with source data, enabling the organization to pinpoint areas for improvement and refine processes to enhance overall data quality. With correctly chosen KPIs, we can then effectively communicate the improvements and get the needed support from top management.

The current solution in Databricks and Great Expectations enables you to build KPIs covering:

  1. Data Accuracy: Percentage of data that meets predefined accuracy standards.
  2. Data Completeness: Percentage of required data elements that are present and populated.
  3. Data Integrity: Measures the overall reliability and trustworthiness of data.

You would need to expand the solution to cover:

  1. Data Consistency: Degree to which data values across different sources or systems align.
  2. Timeliness of Data: Average time taken to capture, process, and make data available for analysis.
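
As a rough illustration, a simple completeness KPI could be derived directly from the Great Expectations validation results. Treating the share of passed not-null checks as a completeness proxy is a simplifying assumption, not the exact reporting logic of the solution.

# Illustrative only: derive a completeness KPI from a validation result `results`
completeness_checks = [
    r for r in results.results
    if r.expectation_config.expectation_type == "expect_column_values_to_not_be_null"
]

if completeness_checks:
    passed = sum(1 for r in completeness_checks if r.success)
    completeness_pct = 100.0 * passed / len(completeness_checks)
    print(f"Data completeness: {completeness_pct:.1f}% of not-null checks passed")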

Wrap Up!

I trust that you found my journey relatable and insightful, perhaps even seeing parallels with your own experiences along the way. As this story is deeply personal to me, I welcome your thoughts and reflections. Please feel free to reach out to me directly at cgh@backstagecph.dk. Additionally, I invite you to explore our website at backstagecph.dk.
