Autonomous data observability and trustability within AWS Glue Data Pipeline

Dutta

Data operations and engineering teams spend 30–40% of their time firefighting data issues raised by business stakeholders.

A large percentage of these data errors can be traced back to errors in the source system, or to errors that were introduced in the data pipeline and could have been detected there.

Current data validation approaches for data pipelines are rule-based: data quality rules are designed for one data asset at a time. As a result, implementing these solutions across thousands of data assets, buckets, and containers carries significant cost. This dataset-by-dataset focus often leads to an incomplete set of rules, or to no rules being implemented at all.

With the accelerating adoption of AWS Glue as the data pipeline framework of choice, validating data in real time as it moves through the pipeline has become critical for efficient data operations and for delivering accurate, complete, and timely information.

This blog provides a brief introduction to DataBuck and outlines how to build a robust AWS Glue data pipeline that validates data as it moves along the pipeline.

What is DataBuck?

DataBuck is an autonomous data validation solution purpose-built for validating data in the pipeline. It establishes a data fingerprint for each dataset using its ML algorithms, then validates the dataset against that fingerprint to detect erroneous transactions. More importantly, it updates the fingerprints as the dataset evolves, thereby reducing the effort of maintaining rules.

DataBuck primarily solves two problems:

A. Data Engineers can incorporate data validations into their data pipeline by calling a few Python libraries. They do not need a priori understanding of the data and its expected behaviors (i.e., data quality rules).

B. Business stakeholders can view and control auto-discovered rules and thresholds as part of their compliance requirements. In addition, they will be able to access the complete audit trail regarding the quality of the data over time.

DataBuck leverages machine learning to validate the data through the lens of standardized data quality dimensions as shown below:

1. Freshness — determine if the data has arrived within the expected time of arrival.

2. Completeness — determine the completeness of contextually important fields. Contextually important fields are identified using mathematical algorithms.

3. Conformity — determine conformity to a pattern, length, and format of contextually essential fields.

4. Uniqueness — determine the uniqueness of the individual records.

5. Drift — determine the drift of key categorical and continuous fields from historical values.

6. Anomaly — determine volume and value anomalies in critical columns.
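
To make a couple of these dimensions concrete, the sketch below computes simple completeness and uniqueness scores for a Spark DataFrame using plain PySpark. This is only an illustration of what those dimensions measure, not DataBuck's implementation; the S3 path and key columns are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-dimensions-sketch").getOrCreate()

# Hypothetical dataset and "contextually important" fields.
df = spark.read.parquet("s3://your-bucket/curated/orders/")
key_columns = ["customer_id", "order_date"]

total_rows = df.count()

# Completeness: fraction of non-null values in each key column.
completeness = {
    col: df.filter(F.col(col).isNotNull()).count() / total_rows
    for col in key_columns
}

# Uniqueness: fraction of records that are distinct across the key columns.
uniqueness = df.select(*key_columns).distinct().count() / total_rows

print("completeness:", completeness)
print("uniqueness:", uniqueness)
```

DataBuck automates this kind of measurement across all six dimensions and learns the thresholds from the data itself, rather than requiring you to hand-code checks like these for every table.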

Setting up DataBuck for Glue

Using DataBuck within a Glue job is a three-step process:

Step 1: Authenticate and Configure DataBuck

Step 2: Execute Databuck

Step 3: Analyze the results to decide the next step
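
As an illustration, a minimal sketch of what these three steps could look like inside a Glue job is shown below. The Glue and Spark boilerplate (getResolvedOptions, GlueContext) is standard; the databuck module, the DataBuck client, its configure/validate methods, and the shape of the returned result are assumptions made for illustration and are not DataBuck's actual API.

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Hypothetical DataBuck client; the module and method names below are
# placeholders, not DataBuck's actual API.
from databuck import DataBuck

# Standard Glue job boilerplate.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "DATABUCK_API_KEY"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the dataset to be validated (placeholder S3 path).
df = spark.read.parquet("s3://your-bucket/curated/orders/")

# Step 1: Authenticate and configure DataBuck (hypothetical calls).
client = DataBuck(api_key=args["DATABUCK_API_KEY"])
client.configure(schema="curated", table="orders")

# Step 2: Execute the validation against the dataset's ML fingerprint.
result = client.validate(df)

# Step 3: Analyze the result and decide the next step. Here the job fails
# outright when validation does not pass, so downstream steps never see
# bad data.
if not result.passed:
    raise RuntimeError(f"DataBuck validation failed: {result.summary}")

# ...continue with the rest of the pipeline (transform, write, etc.)
```

Failing the job when validation does not pass is one design choice; depending on your pipeline, you may prefer to quarantine the offending records and let the healthy partition continue downstream.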

Business Stakeholder Visibility

In addition to providing programmatic access for validating AWS datasets within the Glue job, DataBuck provides the following results for compliance and audit-trail purposes:

1. Data Quality of a Schema Over Time

2. Summary Data Quality Results of Each Table

3. Detailed Data Quality Results of Each Table

4. Business Self-Service for Controlling the Rules

Summary

DataBuck provides a secure and scalable approach to validating data within the Glue job. With just a few lines of code, you can validate data on an ongoing basis. More importantly, your business stakeholders will have full visibility into the underlying rules and can control the rules and rule thresholds using a business-user-friendly dashboard.


Dutta

A. Dutta is an entrepreneur, investor and corporate strategist with experience in building software businesses that scale and drive value.