AWS Glue Data Quality: the ultimate guide to turning data into reliable decisions

Giacomo Zarbo
Data Reply IT | DataTech
12 min read · Nov 14, 2023

The modern digital world is flooded with huge amounts of data that come from many sources and vary in quality and reliability. We need efficient tools that ensure that the data are accurate, complete and free of anomalies.

This article explores the AWS Glue Data Quality service, an innovative solution offered by the AWS cloud platform. Through a detailed survey of its features and capabilities, we highlight how Glue Data Quality enables companies to detect, diagnose, and correct data quality issues in an automated manner. In particular, we examine its efficiency in detecting anomalies, its ease of use and its integration capabilities with other AWS solutions, with some examples of use cases.

We conclude with some thoughts on the potential impact of Glue Data Quality in the current data management landscape and its relevance to companies seeking to derive value from increasingly complex and varied data.

Why is Data Quality important?

Data Quality is fundamental for a variety of reasons, spanning business, science, government, and numerous other sectors. There are many reasons why it is essential to maintain high data quality, including:

  • Informing business decisions: business decisions must be based on accurate and reliable data. Low-quality data can lead to incorrect decisions that negatively impact business operations.
  • Precise analyses: data analysis is a fundamental part of many business activities. Low-quality data could lead to inaccurate results and misinterpretations.
  • Regulatory compliance: many companies are subject to strict regulations on data management. Lack of data quality could lead to regulatory violations and financial penalties.
  • Time savings and efficiency: high-quality data simplify business processes. Cleaning and correcting data takes significant time and effort. High-quality data therefore reduce the need for such activities.
  • Customer satisfaction: data quality directly affects customer satisfaction. Incorrect data can lead to errors in customer reports and communications.

What is AWS Glue Data Quality?

AWS Glue Data Quality is a feature of AWS Glue, Amazon’s fully managed extract, transform, and load (ETL) service. This feature provides users with the ability to validate and monitor the quality of data sources, making it easier to maintain high-quality data for analytics and machine learning applications.

Below are the main features of Glue Data Quality.

Automatic recommendations of custom rules for your data

Initiating data quality processes can be challenging since it requires manual data analysis to establish quality standards. However, AWS Glue Data Quality streamlines this by automatically generating statistics for your data sets. Based on these statistics, it suggests quality rules ensuring the data’s freshness, accuracy, and integrity.

Get data quality at rest and in pipelines

Your data exists across various repositories and is frequently transferred from one to another. Ensuring the quality of this data, both upon arrival and during transit, is crucial.

AWS Glue Data Quality allows you to apply quality rules both to data at rest in datasets and data lakes and to data flowing through entire pipelines, and the same rules can be applied across multiple datasets. To this end, when building data pipelines in AWS Glue Studio, you can add a data quality evaluation transform. In addition, rules can be set to halt the pipeline if data quality dips, ensuring that compromised data doesn’t contaminate data lakes.

Serverless, cost-effective, and large-scale data quality

AWS Glue is serverless, allowing you to scale without the burden of infrastructure management. It can handle data of any size and offers a pay-as-you-go billing model, promoting cost efficiency.

Glue Data Quality is built on top of Deequ, an open-source library, built on Apache Spark, that Amazon developed for defining and verifying data quality constraints on large-scale datasets. The use of open source ensures that Glue Data Quality offers flexibility and portability.

Understand and correct data quality problems

AWS Glue Data Quality offers a comprehensive dashboard that allows users to view the outcomes of their data quality assessments, facilitating easier tracking and monitoring of data quality trends.

This information can guide the establishment of new rules and procedures for evaluation and correction, so as to improve data quality in the future.

Using Glue Data Quality to its fullest potential

  1. Data Profiling: AWS Glue Data Quality offers data profiling tools designed to enhance your comprehension of both the structure and the overall quality of your data. With these tools at your disposal, you can effortlessly identify and address various prevalent issues that impact data quality, such as anomalies, duplicate entries, missing values, and more.
  2. Data quality rules composition and execution: with Glue Data Quality, the precise definition of rules is done through the Data Quality Definition Language (DQDL). This specialized language, tailored for the field of data quality, allows for the explicit expression and automation of data quality rules.
    When undertaking a data quality task using AWS Glue Data Quality, the system meticulously evaluates a defined set of rules against the dataset in question. After this evaluation, it then computes a data quality score. This particular score serves as a quantitative representation, illustrating the percentage of data that successfully adheres to and meets the established data quality rules for the given input, thereby providing a clear insight into the dataset’s overall integrity and reliability.
  3. Glue Data Quality within ETL pipelines: In the modern data-driven landscape, ETL (Extract, Transform, Load) pipelines play a critical role in consolidating data from various sources, transforming it into a unified format, and then loading it into a target system for analytics or other operations. AWS Glue Data Quality integrates seamlessly into ETL pipelines to ensure that data flows from source to destination maintain the highest standards of accuracy, consistency, and reliability.
  4. Alerts and monitoring: Whenever discrepancies or anomalies that deviate from established data quality rules are detected, Glue Data Quality promptly issues alerts, enabling timely intervention. This system ensures that stakeholders are always informed about the health and integrity of their data. Moreover, the integrated monitoring features provide a holistic view of data quality trends over time, facilitating proactive measures and continuous improvement.
  5. Integration with other AWS services: Glue Data Quality is not an isolated solution; it can be integrated with a number of other AWS services. For instance, data ingested through AWS Glue can be quality-checked and then effortlessly transferred to services like Amazon Redshift for analytics or Amazon S3 for storage. Additionally, by working in tandem with Amazon CloudWatch, users can receive timely alerts and monitor data quality metrics in real time. Alternatively, through Amazon Athena, you can build tables on top of the metadata produced by Glue Data Quality to provide powerful means of analysis and reporting.

Data Profiling with AWS Glue Data Quality

Data Profiling is a key part of data quality management. It provides a better understanding of the structure and quality of the data, identifying any issues that could affect the accuracy and reliability of the data. Glue Data Quality provides tools for data profiling, simplifying this activity.

What is Data Profiling?

Data profiling is the process of analyzing data that seeks to identify its key characteristics. This can include:

  • Basic statistics such as mean, median and standard deviation.
  • The distribution of values for each column.
  • The presence of missing values.
  • The presence of duplicate values.
  • The length of the values in the text columns.
  • The validity of values with respect to specific rules.

Data profiling is an important first step in understanding the data you are working with.
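Glue Data Quality computes statistics like these automatically, but the kind of per-column metrics involved can be sketched in plain Python. This is a simplified illustration of the idea, not the service's actual implementation; the column values are hypothetical:

```python
import statistics
from collections import Counter

def profile_column(values):
    """Compute basic profiling metrics for one column.

    Nulls are represented as None; everything else counts as a value.
    """
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    return {
        "count": len(values),
        "missing": len(values) - len(non_null),          # missing values
        "distinct": len(counts),                          # distinct values
        "duplicates": sum(c - 1 for c in counts.values()),  # duplicated entries
        "mean": statistics.mean(numeric) if numeric else None,
        "median": statistics.median(numeric) if numeric else None,
        "stdev": statistics.stdev(numeric) if len(numeric) > 1 else None,
    }

profile = profile_column([10, 20, 20, None, 30])
# profile["missing"] is 1, profile["duplicates"] is 1, profile["mean"] is 20
```

Running such a function over every column of a dataset yields exactly the kind of summary that a profiling tool presents in its dashboards.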

How does Data Profiling work with Glue Data Quality?

AWS Glue Data Quality provides features that help users understand and improve the quality of their data. Data profiling is a key aspect of this process. An example of how to properly proceed with data profiling using Glue Data Quality is described below.

  1. Crawling of Data Sources: Before Glue Data Quality can profile data, it needs to know the structure and source of the data. This is done using AWS Glue’s “crawlers” that explore and catalog data from various sources such as Amazon S3, RDS, Redshift and others.
  2. Execution of profiling jobs: Glue Data Quality will run profiling jobs on the data. These jobs analyze the actual data in the specified sources and calculate a set of metrics on it.
  3. Generation of statistics: after profiling, Glue Data Quality will provide statistics on the data. These statistics are detailed information on aspects such as data distribution, outliers, frequent patterns, etc.
  4. Viewing results: Profiling results are presented in dashboards and interactive reports within the Glue Data Quality interface. Users can examine metrics, compare them over time, and identify areas of concern.
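These steps can also be driven programmatically. As a hedged sketch, the request for the boto3 Glue client's start_data_quality_rule_recommendation_run API, which analyzes a cataloged table and recommends rules, might be assembled like this (the database, table, and role names are hypothetical placeholders):

```python
def build_recommendation_run_request(database, table, role_arn):
    """Build the parameters for Glue's StartDataQualityRuleRecommendationRun API.

    The resulting dict would be passed to a boto3 Glue client, e.g.:
        boto3.client("glue").start_data_quality_rule_recommendation_run(**params)
    """
    return {
        "DataSource": {
            "GlueTable": {
                "DatabaseName": database,  # database cataloged by a Glue crawler
                "TableName": table,
            }
        },
        "Role": role_arn,  # IAM role Glue assumes to read the data
    }

params = build_recommendation_run_request(
    "sales_db", "orders", "arn:aws:iam::123456789012:role/GlueDQRole"
)
```

The run produces a recommended DQDL rule set that you can review, edit, and save before evaluating it against the table.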

Benefits of Data Profiling with AWS Glue Data Quality

Among the many benefits of profiling is a deeper understanding of the data: it provides an in-depth view that helps you better understand the data's structure and quality.
Another benefit is the identification of data quality problems: profiling can surface common issues such as missing values, duplicates, and anomalies.
In addition, because it is an automated process, it saves time compared to manual data analysis.

Data Quality rule composition and Data Quality Definition Language

Of course, Glue Data Quality not only suggests rules to be executed, but also allows the user to define new ones through the Data Quality Definition Language.

The Data Quality Definition Language (DQDL) is a specific language used to define metrics and rules related to data quality. It provides a standardized interface for specifying how to assess and monitor data quality.

Structure of DQDL

A DQDL document is case-sensitive and contains a rule set, which groups individual data quality rules.

To construct a rule set, you must create a list called Rules that contains one or more DQDL rules separated by commas.

Rules = [
IsComplete "col_A",
IsUnique "col_B"
]

The structure of a DQDL rule depends on the rule type, but it generally fits the following format:

<RuleType> <Parameter> <Parameter> <Expression>

DQDL supports the logical operators "and" and "or", which can be used to combine rules.

The following examples use these operators to combine DQDL rules.

(IsComplete "col_A") and (IsUnique "col_B")

(RowCount "col_C" > 100) or (IsPrimaryKey "col_D")
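For illustration, here is a slightly fuller rule set that combines several common rule types with expressions (the column names are hypothetical):

```
Rules = [
    IsComplete "order_id",
    IsUnique "order_id",
    ColumnValues "status" in ["OPEN", "SHIPPED", "CLOSED"],
    Completeness "customer_id" > 0.95,
    RowCount > 100
]
```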

A complete list of rule types is available in the DQDL reference in the AWS Glue documentation.

Executing Data Quality Rules

When running a data quality task, Glue Data Quality evaluates a set of rules against the data and calculates a data quality score. This score represents the percentage of data quality rules that passed for the given input.
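As a simplified illustration (not the service's exact implementation), a score of this kind is just the share of rules whose outcome is "Passed":

```python
def data_quality_score(rule_outcomes):
    """Percentage of rules that passed, on the 0-100 scale Glue reports.

    rule_outcomes maps a DQDL rule string to its outcome ("Passed"/"Failed").
    """
    if not rule_outcomes:
        return 0.0
    passed = sum(1 for outcome in rule_outcomes.values() if outcome == "Passed")
    return 100.0 * passed / len(rule_outcomes)

score = data_quality_score({
    'IsComplete "col_A"': "Passed",
    'IsUnique "col_B"': "Passed",
    'RowCount > 100': "Failed",
    'ColumnValues "col_C" in ["X", "Y"]': "Passed",
})
# 3 of 4 rules passed, so score is 75.0
```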

By selecting the rule set you wish to evaluate against the table and running it, you will then be able to choose whether to publish the metrics to Amazon CloudWatch. When this option is selected, Glue Data Quality publishes metrics indicating the number of passed and failed rules. Key metrics are also published on Amazon EventBridge to allow you to set alerts.

You can also choose to run the rule set on demand or on a schedule. In the latter case, a schedule will be created in Amazon EventBridge.

You have the choice to apply a filter to the data source, which helps minimize the volume of data being processed. Filters can also be used to perform incremental validations by choosing specific partitions.

When the run is completed you will be able to view the quality score results.

Viewing data quality score and results

By selecting the table for which you want to perform a data quality task, you will be able to view a snapshot of data quality, which shows the general trend of runs over time.

Data quality snapshot

In the data quality table, each rule set is shown with the last execution (if any), associated with the score. Here you can also see the score history and run status, as well as specific execution information and further details on the result status of individual rules.

Results of a run of a specific rule set

Glue Data Quality within ETL pipelines

AWS Glue Data Quality provides an easy way to measure and monitor the quality of data in ETL pipelines. This section shows how to take action based on data quality results, helping you maintain high data standards and make confident business decisions.

When it comes to building ETL (Extract, Transform, Load) pipelines with AWS Glue, adding data quality checks can be critical to ensure the correctness, completeness, and reliability of the processed data. Here is how to use Glue Data Quality within a Glue ETL pipeline.

ETL Job

Start by creating an ETL job in AWS Glue, defining the data sources, the transformations needed, and the destination of the data.

Integration with Glue Data Quality

Once the job is defined, data quality checks can be integrated. This can be done by creating data quality rules in Glue Data Quality. These rules can include checks such as checking for null values, duplicates, outliers and others.

You define the rules you want to apply to your data. These rules can be based on conditional expressions, data statistics, or comparisons with reference data.

Snapshot of a basic ETL Pipeline with Glue Data Quality step

Running and Monitoring

Next, you can choose whether to output the original data and whether to stop the job if data quality issues are detected. When the original data is output, four columns are added to the output schema: DataQualityRulesPass, DataQualityRulesFail, DataQualityRulesSkip and DataQualityEvaluationResult. You can then use the values of these columns to filter the rows and act according to your needs.
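Assuming rows arrive as dictionaries (in a real Glue job they would be DynamicFrame records), filtering on the appended DataQualityEvaluationResult column might look like this minimal sketch:

```python
def split_by_quality(rows):
    """Split rows on the DataQualityEvaluationResult column that the
    data quality evaluation appends to the original data."""
    passed = [r for r in rows if r.get("DataQualityEvaluationResult") == "Passed"]
    failed = [r for r in rows if r.get("DataQualityEvaluationResult") != "Passed"]
    return passed, failed

rows = [
    {"order_id": 1, "DataQualityEvaluationResult": "Passed"},
    {"order_id": 2, "DataQualityEvaluationResult": "Failed"},
    {"order_id": 3, "DataQualityEvaluationResult": "Passed"},
]
good, bad = split_by_quality(rows)
# good rows continue to the destination; bad rows can be quarantined
```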

You can also choose to output the rules themselves together with their pass or fail status. This option is useful if you subsequently want to take a custom action.

Example of a basic ETL pipeline with Glue Data Quality: on the left, the rowLevelOutcomes, that is, the original data plus the data quality rule information; on the right, the ruleOutcomes, that is, the results for each data quality rule.

Once problems are identified, they can be corrected directly in the data source, and the ETL job can then be re-run.

In summary, using Glue Data Quality in conjunction with a Glue ETL pipeline helps ensure that the data is not only transformed correctly, but also of high quality and conforms to the expectations and needs of the organization.

Alerts and Monitoring with AWS Glue Data Quality

AWS Glue Data Quality provides a robust framework for monitoring and alerting. By integrating with AWS monitoring services, you can set up customized alerts that notify you when data quality issues are detected, allowing for timely intervention. Data quality checks can include validations for completeness, accuracy, conformity, and consistency, and any anomalies can trigger alerts. This proactive approach to data governance helps maintain the integrity of your data ecosystem.

Here is an example of a straightforward architecture for implementing alerts with Glue Data Quality.

Set up alerts with AWS Glue Data Quality

The architecture consists of the following key steps:

  • The initial step includes conducting automated evaluations with AWS Glue Data Quality, orchestrated via Step Functions. This workflow is programmed to start quality checks according to predefined criteria associated with the dataset.
  • Upon completion of AWS Glue Data Quality evaluations, EventBridge receives a notification event that contains the outcomes. This event is examined within EventBridge, which activates a Lambda function to handle the notification.
  • The Lambda function dispatches an SNS alert with data quality metrics to the specified email address. Additionally, if required, the function records the tailored outcome in an Amazon S3 bucket for subsequent analysis or processing steps.
Simple architecture for implementing alerts with Glue Data Quality
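A minimal sketch of the Lambda function's core logic: it extracts the outcome from the EventBridge event and formats the alert message. The event shape used here is a simplified assumption for illustration, and the boto3 sns.publish call is left as a comment so the sketch stays self-contained:

```python
import json

def lambda_handler(event, context):
    """Handle a Glue Data Quality result event relayed by EventBridge
    and build a human-readable alert message.

    The event["detail"] shape below is a simplified assumption.
    """
    detail = event.get("detail", {})
    subject = f"Data quality result: {detail.get('state', 'UNKNOWN')}"
    body = json.dumps(
        {
            "ruleset": detail.get("rulesetNames"),
            "score": detail.get("score"),
            "state": detail.get("state"),
        },
        indent=2,
    )
    # In a real function you would publish via boto3, e.g.:
    # boto3.client("sns").publish(TopicArn=TOPIC_ARN, Subject=subject, Message=body)
    return {"subject": subject, "message": body}

result = lambda_handler(
    {"detail": {"rulesetNames": ["orders_rules"], "score": 0.8, "state": "FAILED"}},
    None,
)
```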

Integration with Other AWS Services

Another significant aspect of Glue Data Quality is its integration with a number of other AWS services. Glue Data Quality is not isolated, and this integration greatly simplifies data processing and analysis. Here are some of the main integrations of Glue Data Quality (but many more exist).

  • Amazon S3: AWS Glue Data Quality can connect directly to Amazon S3 buckets to profile, monitor, and clean data stored on S3.
  • Amazon Redshift: If you use Amazon Redshift for data analysis, you can easily integrate AWS Glue Data Quality to ensure data quality.
  • Amazon RDS: If you use Amazon RDS relational databases, you can connect AWS Glue Data Quality to improve data quality within the databases.
  • Amazon Athena: AWS Glue Data Quality is fully compatible with Amazon Athena, an interactive query service for analyzing data on S3.

Conclusions

Data quality is critical to business success in the modern era. AWS Glue Data Quality provides essential tools and capabilities to ensure data quality, including data profiling, alerting and continuous monitoring. This service is designed to simplify the data quality management process, reducing the cost, the time and the effort required.

If your business depends on data to make decisions, conduct analysis, and serve customers, AWS Glue Data Quality could be a key element in your strategy to ensure high-quality data. Its integration with other AWS services makes it a logical choice for companies using the AWS ecosystem for data processing and analysis.

Managing data quality requires commitment and resources, but investments in this area are often justified by the benefits they bring. With AWS Glue Data Quality, you can simplify the process and focus on what matters most: getting the most value from your enterprise data.

So, if you are looking for a solution to improve the quality of your data and gain a clearer and more reliable view of your business operations, seriously consider adopting AWS Glue Data Quality. High-quality data management is the key to sustained success in the modern business world.
