AI Data Quality Use Case: AlerTable

Giorgio Evola
Data Reply IT | DataTech
7 min read · Apr 2, 2024

Introduction to Data Quality

Data quality is a timeless concept, deeply associated with the history of data collection practices.

As the saying goes, “you can’t manage what you don’t measure,” a maxim that highlights the indispensable role of high data quality in fostering robust analytics programs. Moreover, data quality acts as a potent instrument for assessing how well data aligns with business requirements.

Despite its fundamental importance, data quality often takes a back seat to trendy terms such as machine learning, data science, or analytics, as companies have limited resources and too often settle for “bad data is better than no data.”

However, compromised data integrity undermines the credibility of the entire data platform.

Given the exponential growth of data across organizations, data quality challenges, or “data downtime,” become nearly unavoidable. Nevertheless, by grasping and defining data quality, organizations can effectively measure and mitigate these challenges before they escalate.

In this light, solutions like Data Reply’s AlerTable emerge as indispensable, offering a structured, automated approach to ensuring data quality and enabling informed, reliable decision-making. Continue reading to explore how AlerTable navigates these challenges, providing practical solutions to enhance data quality across diverse business settings.


What is AlerTable?

Drawing on years of experience in Data Quality projects, Data Reply has developed AlerTable, an AI-powered Data Quality solution that simplifies and automates the Data Quality management process. AlerTable autonomously learns data behavior and adapts the detected rules to intrinsic data variation. Key checks such as data freshness, format, cardinality, distribution, categories, and outlier detection can be defined automatically by the solution, with human intervention required only for the final validation step.

Core Functionalities

  • Low time-to-market — designed as an accelerator, that is, a solid framework for building an enterprise solution for data quality controls.
  • Data Quality rule suggestion — processes data from various sources to suggest Data Quality rules.
  • GenAI integration — enables the intuitive implementation of Data Quality rules in natural language.
  • Adaptive thresholding — automatically adjusts suggested thresholds over time, streamlining maintenance efforts.
  • User-friendly interface — lets both technical and business users manage and monitor Data Quality controls seamlessly.
  • Vaadin-based front end — the Vaadin framework provides functions that improve the overall security of the web application.
  • Broad coverage — monitors data quality issues across multiple platforms, including data warehouses, data lakes, ETL processes, and business intelligence tools.
  • PII-safe by design — processed data is never stored, making the solution ideal for managing Personally Identifiable Information (PII).
  • Documentation generation — generates control documentation instantly with a single click.
  • Scalable architecture — Spark-based computation ensures speed and scalability, handling large datasets efficiently.

The solution exposes APIs for notification management, ticketing tool integration, and the export of Data Quality rule outcomes to customized dashboards and reports.
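For instance, a downstream dashboard could pull rule outcomes through such an API. The snippet below is only a sketch: the endpoint path, field names, and authentication scheme are illustrative assumptions, not AlerTable’s documented API.

```python
# Hypothetical sketch: pulling Data Quality rule outcomes over REST for a
# custom dashboard. Every URL, path, and field name here is assumed.
import requests

BASE_URL = "https://alertable.example.com/api/v1"  # assumed base URL
TOKEN = "..."  # assumed bearer token

resp = requests.get(
    f"{BASE_URL}/rules/channel_null_check/outcomes",  # assumed endpoint
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"from": "2024-03-01", "to": "2024-03-31"},
    timeout=30,
)
resp.raise_for_status()

# Feed each run's score and pass/fail flag into the custom report.
for outcome in resp.json()["outcomes"]:
    print(outcome["run_ts"], outcome["score"], outcome["passed"])
```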

Architecture

On the technical side, the architecture of AlerTable is designed for flexibility and rapid deployment, so it runs equally well in cloud and on-premises environments.

Essentially, it requires four components:

  • A virtual machine, pod orchestrator, or physical server to deploy the web application.
  • A Spark environment (e.g., Databricks) or SQL engine (e.g., an RDBMS, BigQuery …) to execute the data quality controls (sketched below).
  • A SQL database for storing metadata and data quality scores.
  • Storage for depositing verbose outputs.

This architecture adapts to virtually any existing infrastructure while providing the scalability and performance needed for effective data quality management.
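To make the execution component concrete, here is a minimal sketch of how a single low-level control could run on the Spark side, assuming PySpark and the “customer_contacts” table from the use case below; the check logic and threshold are illustrative, not AlerTable’s internal implementation.

```python
# Minimal sketch of one data quality control executing on the Spark
# component of the architecture. Table, column, and threshold are assumed.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dq-check").getOrCreate()

# Read the table under test (assumed to be registered in the metastore).
df = spark.read.table("customer_contacts")

total = df.count()
nulls = df.filter(F.col("email").isNull()).count()
null_rate = nulls / total if total else 0.0

# The score and pass/fail outcome would be persisted to the SQL metadata
# database; verbose row-level output would land in the storage component.
threshold = 0.01  # assumed tolerance: at most 1% null emails
print(f"null_rate={null_rate:.4f} passed={null_rate <= threshold}")
```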

Three Approaches to Data Quality

Let us now get to the heart of the matter, illustrating how data quality can be achieved with AlerTable.

There are three distinct approaches to the solution.

As an example, the architectural templates shown below are implemented on AWS, but the same implementations can be deployed on other major platforms, such as Google Cloud, Azure, or Databricks.

AlerTable architecture on AWS

Approach 1: Data Quality for Data Engineers

Automates the management of low-level checks (e.g., null detection, blank values, value subsets, syntax) by suggesting both the checks and their thresholds.

Data Quality for Data Engineers architecture on AWS

This approach is particularly good for introducing a large number of low-level controls with minimal manual effort. It is ideal from the perspective of a data engineer, who must implement technical controls on every column of every table to be ingested into the data platform while having limited knowledge of the data and its contents.

In addition, the ability to use AlerTable without a Web application is particularly helpful, as it allows data engineers to manage data quality checks directly from JSON files. This simplified approach improves efficiency and flexibility, enabling easy integration of data quality processes into existing workflows.

Approach 2: Data Quality for Data Analyst / Data Scientist

It simplifies the management of targeted and in-depth data quality checks for contextual and detailed analysis.

Data Quality for Data Analyst / Data Scientist architecture on AWS

AlerTable, using anomaly detection and root cause analysis models, suggests more targeted controls to make it easier for data analysts or data scientists to perform the necessary checks or tests.

In this case, the web application provided by AlerTable proves valuable for streamlining the execution and management of the profiling engine and for overseeing the various checks. Manually configuring complex checks that span fields across multiple data sources, or that examine the temporal trend of a metric, also becomes straightforward. The web application pages let users define data quality checks quickly using filtering and aggregation features and pre-set templates; users who prefer full control can also define checks from scratch by writing SQL directly, as sketched below.

Web page to set up data quality control
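As an illustration of a check defined from scratch, the hypothetical SQL below (wrapped in Python for consistency with the other snippets) counts contacts whose customer code has no match in an assumed “customers” master table; how AlerTable wraps and schedules such a query is not shown here.

```python
# Hypothetical cross-source check written from scratch in SQL; the
# "customers" master table is an assumed second data source.
CROSS_SOURCE_CHECK = """
SELECT COUNT(*) AS orphan_contacts
FROM customer_contacts c
LEFT JOIN customers m
  ON c.masked_customer_code = m.masked_customer_code
WHERE m.masked_customer_code IS NULL
"""
# A result above the configured threshold (e.g. 0) would fail the check.
```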

In this scenario, the profiling engine utilizes anomaly detection models (e.g., Isolation Forest) in conjunction with the Root Cause Analysis library to statistically identify correlations among the values within the data source’s fields.

The solution applies Machine Learning algorithms to extract statistical correlations from the processed data. The goal is to proactively identify controls and their thresholds, measure their impact, and suggest their implementation.
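As a rough illustration of the anomaly-detection step, the sketch below applies scikit-learn’s Isolation Forest, the model named above, to a synthetic history of daily row counts; AlerTable’s actual features and hyperparameters are not documented here.

```python
# Sketch of anomaly detection with Isolation Forest on a synthetic metric
# history. The data and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Daily row counts for a table: mostly stable, with two injected anomalies.
history = rng.normal(loc=10_000, scale=300, size=60)
history[20] = 2_000   # sudden drop
history[45] = 25_000  # sudden spike

model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(history.reshape(-1, 1))  # -1 marks anomalies

for day, (value, label) in enumerate(zip(history, labels)):
    if label == -1:
        print(f"day {day}: value {value:,.0f} flagged as anomalous")
```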

Approach 3: Data Quality for Data Owner / Business

It introduces Data Quality rules in natural language through the chatbot function. With GenAI integration, the solution can create data quality checks based on the description provided and produce exportable documentation of all implemented checks.

Data Quality for Data Owner / Business architecture on AWS

This approach involves using the web application to talk to AlerTable through the chatbot function. In practice, the chatbot collects the information the user enters in natural language, compiles it into a prompt, and sends it to GenAI along with a schema specifying the format of the requested output. Once GenAI identifies the data quality control to implement, the chatbot presents it to the user and asks for confirmation. If the user confirms, the check is executed.
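The sketch below illustrates that flow under stated assumptions: call_genai is a hypothetical stand-in for whichever GenAI client is actually used, and the output schema and check format are illustrative, not AlerTable’s.

```python
# Hypothetical sketch of the chatbot flow: natural-language request in,
# schema-constrained check definition out, executed only after the user
# confirms. Schema and field names are assumptions.
import json

OUTPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "table": {"type": "string"},
        "column": {"type": "string"},
        "check_type": {"type": "string"},
        "sql": {"type": "string"},
    },
    "required": ["table", "column", "check_type", "sql"],
}

def call_genai(prompt: str, schema: dict) -> dict:
    """Hypothetical stand-in for the real GenAI client; returns a canned
    response here so the sketch runs end to end."""
    return {
        "table": "customer_contacts",
        "column": "email",
        "check_type": "not_null",
        "sql": "SELECT COUNT(*) FROM customer_contacts WHERE email IS NULL",
    }

user_request = "Make sure no customer contact is missing an email address."
prompt = (
    "Translate this data quality request into a check definition matching "
    f"the JSON schema below.\nRequest: {user_request}\n"
    f"Schema: {json.dumps(OUTPUT_SCHEMA)}"
)

check = call_genai(prompt, OUTPUT_SCHEMA)
print("Proposed check:", json.dumps(check, indent=2))
if input("Run this check? [y/N] ").strip().lower() == "y":
    print("Check confirmed; handing off to the execution engine.")
```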

Use Case

Consider a scenario where we aim to apply data quality checks to the “customer_contacts” table.

This table comprises the following fields:

  • interaction_code: a unique sequential code (PK)
  • masked_customer_code: code identifying the customer
  • email: customer’s email address
  • channel: contact channel with the customer (e.g., email, phone, event…)
  • first_level_group: first level of customer grouping

As data engineers, our primary objective is to ensure the reliability and integrity of the data ingested into the data platform. In this instance, we populate a JSON file with references to the “customer_contacts” table and deploy it to the bucket. Subsequently, profiling is executed on all specified columns of the mentioned table.

Example of setting up a data quality check using JSON files
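Since AlerTable’s actual configuration schema is not public, the snippet below only sketches what such a JSON file might look like, written out from Python; all field names are illustrative.

```python
# Hypothetical profiling configuration for the "customer_contacts" table.
# Field names are assumptions, not AlerTable's documented schema.
import json

profiling_config = {
    "table": "customer_contacts",
    "columns": [
        "interaction_code",
        "masked_customer_code",
        "email",
        "channel",
        "first_level_group",
    ],
    "profile_all_columns": True,
}

with open("customer_contacts_profiling.json", "w") as f:
    json.dump(profiling_config, f, indent=2)
# The file would then be uploaded to the bucket watched by the profiling
# engine, e.g. with boto3's s3_client.upload_file(...).
```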

After the profiling engine runs, the recommended rules are written in JSON format to the S3 bucket.

For example, for the “channel” field, the following data quality checks are suggested:

  • Null checks: ensure that the “channel” field does not contain null values.
  • Blank checks: ensure that the “channel” field does not contain blank values.
  • Value Set Validation: verify that the “channel” field accepts only predefined values such as email, phone, or event.
  • Data Type Validation: validate that the “channel” field accepts only string data types.

The same process is repeated for each column of the “customer_contacts” table.
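For illustration, the suggested rules for the “channel” field might land on the bucket in a shape like the following; the key names are hypothetical, not AlerTable’s documented output format.

```python
# Hypothetical shape of the suggested-rules JSON written to the S3 bucket
# for the "channel" field; key names are illustrative assumptions.
import json

suggested_rules = {
    "table": "customer_contacts",
    "column": "channel",
    "rules": [
        {"type": "not_null", "suggested_threshold": 0.0},
        {"type": "not_blank", "suggested_threshold": 0.0},
        {"type": "value_set", "allowed_values": ["email", "phone", "event"]},
        {"type": "data_type", "expected_type": "string"},
    ],
}
print(json.dumps(suggested_rules, indent=2))
```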

For each of these checks, AlerTable suggests a threshold, and each time the check runs it verifies whether that threshold is still appropriate.
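The source does not specify how thresholds are re-validated; one plausible scheme, sketched below, derives the suggested bound from the metric’s recent history (mean plus a few standard deviations) and re-checks it on every run.

```python
# Sketch of one plausible adaptive-thresholding scheme. This is not
# AlerTable's documented algorithm; it only illustrates re-validating a
# suggested threshold against the metric's history on each run.
import statistics

def suggest_threshold(history: list[float], k: float = 3.0) -> float:
    """Suggest an upper bound for a failure-rate metric from its history."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return mean + k * std

null_rates = [0.002, 0.001, 0.003, 0.002, 0.004]  # past runs, illustrative
threshold = suggest_threshold(null_rates)

latest = 0.0025  # current run's null rate
print(f"threshold={threshold:.4f} still appropriate: {latest <= threshold}")
```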

With AlerTable’s dynamic suggestion and storage of data quality checks in JSON format, data engineers can efficiently implement a comprehensive suite of low-level data quality checks. This simplified approach ensures the reliability and integrity of data ingested into the data platform, even with limited manual intervention.

Conclusions

In conclusion, AlerTable stands as a comprehensive solution for efficient data quality management. With its flexible architecture, machine learning algorithms, and GenAI integration, AlerTable simplifies the implementation of low-level checks, eases data contextualization for targeted analyses, and brings data quality rules into natural language through chatbot integration. Through tailored approaches for data engineers, analysts, and business users, AlerTable enables organizations to enhance decision-making and mitigate data quality challenges effectively.
