Streamlining Data Quality Control with a Centralized Service

Nhat · NE Digital · Aug 29, 2023

Background

Like many other teams on Google Cloud Platform (GCP), we use BigQuery (BQ) as our main data warehouse for storing and processing large volumes of data, while Apache Airflow serves as the primary orchestration tool for managing data pipelines. Data Engineers play a key role in creating and maintaining these pipelines on a daily basis. Our responsibilities include ensuring data quality, promptly addressing any data-related issues that arise, and maintaining the overall integrity of the data.

However, Data Engineers encounter several challenges when developing and maintaining Airflow DAG jobs for data quality control. Some of these challenges include:

  1. A Time-Consuming Process: The manual creation and configuration of Airflow DAG jobs for each table can be a time-intensive task, especially when dealing with thousands of tables.
  2. Complexity in Tracking: Managing individual DAG jobs makes it difficult to track the status of data quality checks across multiple tables.
  3. Lack of Standardisation: Creating separate Airflow DAG jobs for each table may result in inconsistent application of rules and standards for data quality checks. This can lead to errors and inconsistencies in the checks.
  4. Maintenance Difficulties: Maintaining many Airflow DAG jobs becomes challenging, particularly when updates to data quality checks are required. Ensuring consistent changes across all DAGs can be time-consuming.
  5. Limited Awareness Downstream: When data quality checks are executed through individual Airflow DAG jobs, downstream users might lack awareness of the quality of the tables they’re utilizing, which can lead to errors in downstream analysis and decision-making.

To address these challenges, our Data Engineering (DE) team decided to develop a centralized Data Quality Control (DQC) system. This system will provide an end-to-end solution for managing data quality checks across all BQ tables. Beyond making day-to-day work easier, the DQC system offers several other benefits:

  1. Centralized Control: The in-house DQC system allows for centralized monitoring of data quality across all tables. This eliminates the need to manage multiple Airflow DAGs and provides a one-stop solution.
  2. Customization: An in-house DQC system can be tailored to the specific requirements of the company. This enables the definition of rules and standards that align with business needs and apply consistently across all tables.
  3. Scalability: A centralized DQC system can be easily scaled. Additional rules and checks can be integrated without the need for creating new Airflow DAGs for each table.

Building the Solution

Creating a centralized data quality control (DQC) system might initially seem complex, but it can be broken down into a few key steps. Among these steps, selecting the right technology is particularly important for the long term, and the data engineering (DE) team decided to adopt an event-driven architecture that relies heavily on Google Cloud Platform’s Pub/Sub.

An event-driven design is a natural fit here: the system reacts to specific events instead of running all the time, which keeps costs down because compute is only used when an event actually arrives.

(1) To begin, we will sink all BigQuery events from the Logging Service to Pub/Sub, using filters such as data modification events (UPDATE, INSERT, or DELETE) so that only relevant data is transmitted.

Example — Logging Service to Pub/Sub
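
As a rough sketch of this step, the sink can be created programmatically with the Cloud Logging client library. The project ID, topic name, and the exact log filter below are illustrative assumptions rather than our production values; the real filter depends on which BigQuery audit events you want to capture.

```python
from google.cloud import logging

PROJECT_ID = "my-project"          # hypothetical project
TOPIC = "bq-logging-events"        # hypothetical Pub/Sub topic

# Example filter: BigQuery data-access audit entries that record table data changes
# (covers INSERT / UPDATE / DELETE style modifications).
BQ_CHANGE_FILTER = f"""
resource.type=("bigquery_project" OR "bigquery_dataset")
logName="projects/{PROJECT_ID}/logs/cloudaudit.googleapis.com%2Fdata_access"
protoPayload.metadata.tableDataChange:*
"""

client = logging.Client(project=PROJECT_ID)
sink = client.sink(
    "bq-table-changes-to-pubsub",
    filter_=BQ_CHANGE_FILTER,
    destination=f"pubsub.googleapis.com/projects/{PROJECT_ID}/topics/{TOPIC}",
)
sink.create()  # remember to grant the sink's writer identity Pub/Sub publisher rights
```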

(2) Once the data flows into Pub/Sub, the next step is to create a consumer (referred to as the BQ Logging Consumer), which is hosted on GKE Autopilot.
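
A minimal skeleton of such a consumer, using the standard google-cloud-pubsub client, might look like the following; the subscription name and the assumed shape of the audit-log payload are illustrative only.

```python
import json

from google.cloud import pubsub_v1

PROJECT_ID = "my-project"                   # hypothetical
SUBSCRIPTION_ID = "bq-logging-events-sub"   # hypothetical

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    """Parse one BigQuery audit-log entry and extract the table it refers to."""
    entry = json.loads(message.data.decode("utf-8"))
    # e.g. "projects/p/datasets/d/tables/t" (field layout assumed for illustration)
    table = entry.get("protoPayload", {}).get("resourceName", "")
    print("table changed:", table)
    message.ack()


streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull_future.result()
    except KeyboardInterrupt:
        streaming_pull_future.cancel()
        streaming_pull_future.result()
```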

(3) With the BigQuery log source events in place, we will develop a service that combines front-end (FE) and back-end (BE/API) components. This keeps all DQC rules in one location:

  • Users can submit and maintain the rules.
  • The worker/consumer can verify that the tables received from Pub/Sub match those submitted to the service.

Image for illustration only.

(3.1) Regarding the frontend, we have developed a simple React-based user interface. This allows end-users to easily select the table and DQC rules they wish to apply. In terms of DQC rules, there are two types: predefined rules and custom rules.

  • Predefined rules are common rules that have already been defined on the backend side. Users do not need to write the logic themselves, as these rules are already available. Examples include NOT NULL, NOT BLANK, and BIGGER, among others.
  • Custom rules, on the other hand, provide users with flexibility to create their own rules using SQL queries. Although we cannot cover all possible scenarios, allowing users to create custom rules themselves can help them address their specific needs.
Image for illustration only.
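
To make the two rule types concrete, here is a hypothetical shape of the payloads the frontend might send to the backend; all field names and values are assumptions for illustration, not our actual schema.

```python
# A predefined rule: the logic (NOT NULL) already lives on the backend.
predefined_rule = {
    "table": "my-project.sales.orders",
    "rule_type": "PREDEFINED",
    "rule_name": "NOT_NULL",
    "column": "order_id",
    "notify": {"email": "data-team@example.com"},
}

# A custom rule: the user supplies the SQL; the check fails if it returns any rows.
custom_rule = {
    "table": "my-project.sales.orders",
    "rule_type": "CUSTOM",
    "sql": """
        SELECT order_id
        FROM `my-project.sales.orders`
        WHERE total_amount < 0
    """,
    "notify": {"webhook": "https://example.com/hooks/dqc"},
}
```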

(3.2) To simplify backend development, we went with Flask (Python). With most of the team’s background being in Python, this choice allows us to quickly create the necessary services.
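
As a stripped-down sketch (endpoints, storage, and field names are assumptions rather than our actual implementation), the Flask API only needs two things at its core: accept rule submissions and let the consumer look up the rules for a table.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory store for brevity; the real service would persist rules in a database.
RULES: dict[str, list[dict]] = {}


@app.post("/rules")
def submit_rule():
    """Called by the React frontend when a user submits a DQC rule."""
    rule = request.get_json()
    RULES.setdefault(rule["table"], []).append(rule)
    return jsonify({"status": "created"}), 201


@app.get("/rules/<path:table>")
def get_rules(table: str):
    """Called by the BQ Logging Consumer to check whether a changed table has rules."""
    return jsonify(RULES.get(table, []))


if __name__ == "__main__":
    app.run(port=8080)
```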

Up to this point, we have obtained the required information from the source and established a centralized repository to store and manage all DQC rules. Our next steps involve developing a flow that can execute these rules and present the resulting DQC outcomes to the users.

(4) Continuing from step (2), the consumer receives data from the BQ logs and verifies whether it belongs to the tables submitted by users in the service from step (3). If it does, events are triggered to another Pub/Sub topic to execute the DQC rules. It is essential that the solution scales effectively, even when users have submitted hundreds or thousands of DQC rules. The main purpose of DQC is to give users feedback on the data quality of their BQ tables, so under this Pub/Sub topic we have consumers (known as executors) that can be automatically scaled up as more events arrive.
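
A sketch of that hand-off: once the consumer finds rules registered for the changed table, it fans out one event per rule to an execution topic, where the auto-scaled executors pick them up. The topic name and payload shape are assumptions for illustration.

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical topic that fans rule executions out to the executors.
EXECUTION_TOPIC = publisher.topic_path("my-project", "dqc-execution")


def handle_table_change(table: str, rules_for_table: list[dict]) -> None:
    """Publish one execution event per DQC rule registered for the changed table."""
    for rule in rules_for_table:
        payload = json.dumps({"table": table, "rule": rule}).encode("utf-8")
        publisher.publish(EXECUTION_TOPIC, payload).result()
```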

(5) After obtaining the results from the executors, we publish them to yet another Pub/Sub topic. It is not advisable to keep all the notification logic in the executor, since that would overload it. As with the previous topics, this Pub/Sub topic also has a consumer, which sends notifications to users via email or webhooks based on what they submitted through the frontend.

Image for illustration only.
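
The notification consumer can stay very small because it only translates a result event into the channel the user registered. A sketch, assuming a simple result payload and a local SMTP relay (both hypothetical):

```python
import json
import smtplib
from email.message import EmailMessage

import requests


def notify(result: dict) -> None:
    """Send a DQC result to the email address or webhook the user registered."""
    target = result.get("notify", {})
    if "webhook" in target:
        requests.post(target["webhook"], json=result, timeout=10)
    if "email" in target:
        msg = EmailMessage()
        msg["Subject"] = f"DQC result for {result['table']}: {result['status']}"
        msg["To"] = target["email"]
        msg["From"] = "dqc@example.com"
        msg.set_content(json.dumps(result, indent=2))
        with smtplib.SMTP("localhost") as smtp:  # hypothetical SMTP relay
            smtp.send_message(msg)
```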

New features

After completing the five major steps of integration, we have a comprehensive DQC service. However, following the launch of the first version, we received requests for new features, such as data-arrival checking (a time-based rule) and the application of machine learning to detect potential data quality issues. Thanks to our decoupled design, it is easy to add new services as needed. For example, to implement a time-based rule, we can add Google Cloud Scheduler after the API to trigger events at specified times (a sketch follows below). Similarly, for machine learning-based detection of potential issues, we can switch the executor engine from BQ to SparkRoc so that we can leverage Spark to run the ML models we want.

Image for illustration only.
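
For the time-based rule, wiring in Cloud Scheduler could look roughly like this; the schedule, location, and trigger URL are placeholders, and the /trigger endpoint is a hypothetical addition to the API rather than something described above.

```python
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = "projects/my-project/locations/asia-southeast1"

# Check every morning that yesterday's data for the table has arrived.
job = scheduler_v1.Job(
    name=f"{parent}/jobs/dqc-arrival-check-orders",
    schedule="0 6 * * *",          # 06:00 daily
    time_zone="Asia/Singapore",
    http_target=scheduler_v1.HttpTarget(
        uri="https://dqc.example.com/rules/my-project.sales.orders/trigger",
        http_method=scheduler_v1.HttpMethod.POST,
    ),
)

client.create_job(parent=parent, job=job)
```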


Nhat
NE Digital

A data engineer who is passionate about data products and platforms