Google Cloud Dataplex -Part 3-Data Quality using Cloud DQ

Introduction

Disclosure: All opinions expressed in this article are my own, and represent no one but myself and not those of my current or any previous employers.

I intend to publish a series of posts on Google Cloud Dataplex, starting from the very basics and eventually executing tasks which become gradually more complex. This is part 3 of the series and describes how to build a sophisticated, managed data quality component for modern data platforms using Dataplex Data Quality tasks.

Here’s how the series looks so far:

Part 1 : Google Cloud Dataplex -Part 1-Lakes, Zones, Assets and discovery

Part 2 : Google Cloud Dataplex — Part 2-gCloud CLI for Dataplex

Part 3: Google Cloud Dataplex -Part 3-Data Quality using Cloud DQ (Current Post)

Building an enterprise data platform almost always requires a strong data governance practice to ensure data trustworthiness. Data quality or DQ remains one of the most important pillars of the data governance practice on a data platform.

Data quality describes the properties of data content that determine its usefulness.

A managed DQ offering provides a set of features and capabilities to the end user to measure, monitor and improve DQ of a data platform.

Data Quality can conceptually be split into multiple stages that present how data quality operations typically flow.

These stages present a conceptual flow of events that helps to categorize various DQ capabilities. They do not necessarily reflect the sequence in which DQ capabilities are implemented. Usually, elements across all four stages need to be in place to provide operational DQ capabilities.

  • Define includes numerous capabilities around establishing DQ benchmarks / rules, which breaks down into manual and automated DQ rules definition, DQ rules deployment, and DQ rules governance.
  • Measure includes processes that measure data against predefined DQ benchmarks/rules. These processes need to be efficient, scalable and easy to manage (ideally transparent) for users. This stage also covers defining how DQ measurement is deployed in data environments.
  • Observe includes capturing DQ measurement outputs and making them accessible and useful for users. This stage also includes identifying negative patterns in DQ outputs which would constitute issues and flagging them for subsequent troubleshooting.
  • Troubleshoot includes addressing previously identified data issues, which includes capturing, communicating, prioritizing, escalating, and remediating them.
Data Quality Lifecycle

Google Cloud Data Quality

Data Quality on Google Cloud is not an isolated capability, but rather interconnected with a number of other processes and capabilities in the data space. Data Quality both depends on them and supports them.

Google Cloud Dataplex enables data governance on data platforms built using Google products and tools. Please go through Part 1 and Part 2 to understand the core capabilities of Dataplex: an intelligent data fabric that provides metadata-led data management with built-in data quality and governance, so data practitioners spend less time wrestling with infrastructure boundaries and inefficiencies, trust the data at rest, and spend more time deriving value from that data.

Data Quality is a key part of data governance that is enabled by Dataplex.

Dataplex enables Data Quality through a built-in Data Quality task that can be used to define rules and execute them against BigQuery tables or Google Cloud Storage (GCS) data.

Data Quality Task in Dataplex provides efficient, scalable, and fully managed DQ validations in BigQuery and GCS for distributed deployment across data teams and assets.

For data quality validations, the Data Quality task uses the CloudDQ open source engine developed by Google Professional Services Organization (here). These tasks can run on a scheduled basis or in an ad-hoc manner, on serverless infrastructure managed by Dataplex.

Key properties of Dataplex DQ tasks are:

  1. Declarative rules configuration and support for CI/CD
  2. Validation in place within BQ and GCS without extracting data, benefitting from BQ scalability and minimizing security risk surface
  3. Endpoints designed for programmatic consumption (persisted BQ storage and Cloud Logging sink), and supporting custom integrations with UI (such as BI and Metadata management tooling)
  4. Managed execution orchestrated by Dataplex

With the knowledge of the key features of Dataplex DQ tasks, let’s build a hands-on demo to illustrate how Dataplex DQ works in practice. In this demo, we’ll build a Dataplex Data Quality task and schedule it to run on a BigQuery table and we’ll interpret the results of the task after it is scheduled and run successfully.

Step-1 : Create a new Google Cloud project

Create a new Google Cloud project ‘dataplex-dq-demo’.

A new Google Cloud Project
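For readers who prefer the command line, the project can also be created with gcloud. This is a minimal sketch, assuming the project ID dataplex-dq-demo is still available and that billing is linked separately:

# Create the demo project and make it the active project for subsequent commands
gcloud projects create dataplex-dq-demo
gcloud config set project dataplex-dq-demo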

Step-2 : Create a new BigQuery dataset

Create a new dataset in BigQuery to hold the sample table that we will create.

Let’s call the dataset dataplex_dq_task_demo

A new BigQuery Dataset
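As a sketch, the same dataset could also be created from the command line with the bq tool, assuming us-central1 as the location to match the Dataplex lake we create later:

# Create the dataset that will hold the demo orders table
bq --location=us-central1 mk --dataset dataplex-dq-demo:dataplex_dq_task_demo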

Let’s create a table called orders using the script below:

CREATE OR REPLACE TABLE `dataplex-dq-demo.dataplex_dq_task_demo.orders` (
  ID STRING NOT NULL,
  LAST_MODIFIED_TIMESTAMP TIMESTAMP,
  CUSTOMER_ID STRING,
  ITEM_ID STRING,
  AMOUNT NUMERIC,
  TRANSACTION_CURRENCY STRING
);
Create a new BigQuery table orders

Step-3 : Insert some sample records

Once the table is successfully created, let’s insert some sample records, on which the data quality tasks will be run.

INSERT INTO `dataplex-dq-demo.dataplex_dq_task_demo.orders`
  (ID, LAST_MODIFIED_TIMESTAMP, CUSTOMER_ID, ITEM_ID, AMOUNT, TRANSACTION_CURRENCY)
VALUES
  ("order1", CURRENT_TIMESTAMP(), "customer1", "ASDWQ123456789012345", 100, "USD"),
  ("order1", CURRENT_TIMESTAMP(), "customer2", "bad_item_id", -10, "XXX"),
  ("order2", CURRENT_TIMESTAMP(), "customer3", "INTNL987654321098765", 50, "GBP"),
  ("order3", CURRENT_TIMESTAMP(), "customer4", "INTNL932716428593847", 50, "GBP");
Insert sample records
Check sample records
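The check in the screenshot is simply a query over the new table; something like the following works (the backticks are needed because the project ID contains hyphens):

SELECT *
FROM `dataplex-dq-demo.dataplex_dq_task_demo.orders`
ORDER BY ID;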

Step-4 : BigQuery dataset as Dataplex asset

Now, we’ll map the BigQuery dataset as an asset in Dataplex. In order to do so, first, we create a Dataplex lake called dq-demo-lake. To know more about how to create Dataplex lakes, zones and assets, please read the first two blogs of this series, Part-1 and Part-2.

Create a new Dataplex lake
Dataplex Lake

We also create a zone called ‘transactions’ within the Dataplex lake and create it as a Raw Zone.

Create a Dataplex Raw Zone

Let’s add the BigQuery dataset created earlier as an asset in the ‘transactions’ Dataplex zone.

BigQuery dataset added as Dataplex asset
Dataplex asset created
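For reference, the lake, zone and asset can also be created with the gcloud commands covered in Part 2. A sketch, assuming us-central1 as the region; the asset name orders-asset is illustrative, and flag values should be checked against the current CLI reference:

# Create the lake
gcloud dataplex lakes create dq-demo-lake --location=us-central1

# Create a raw zone inside the lake
gcloud dataplex zones create transactions \
  --lake=dq-demo-lake --location=us-central1 \
  --type=RAW --resource-location-type=SINGLE_REGION

# Attach the BigQuery dataset as an asset of the zone, with discovery enabled
gcloud dataplex assets create orders-asset \
  --lake=dq-demo-lake --location=us-central1 --zone=transactions \
  --resource-type=BIGQUERY_DATASET \
  --resource-name=projects/dataplex-dq-demo/datasets/dataplex_dq_task_demo \
  --discovery-enabled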

We also create another dataset within the same project called ‘dataplex_dq_task_summary_dataset’ which will hold the results of the data quality task runs.

DQ task summary BigQuery dataset

As a last step, we also create a Google Cloud Storage bucket called ‘dq-task-demo-config’ where the rule configuration file will be stored.

Google cloud storage for DQ config file
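These two resources can likewise be created from the command line; a minimal sketch, assuming the same region:

# Dataset that will hold the dq_summary results table
bq --location=us-central1 mk --dataset dataplex-dq-demo:dataplex_dq_task_summary_dataset

# Bucket for the CloudDQ rule configuration file
gsutil mb -l us-central1 gs://dq-task-demo-config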

With this step, the prerequisites for running the Dataplex data quality task are complete.

Step-5 : DQ rule configuration

We now define the configuration file with rules for data quality checks.

Let’s break down each section of the rule file and understand what exactly it does.

The first section provides the metadata defaults for the data asset, which in this case is a Dataplex entity discovered from the BigQuery dataset.

metadata_registry_defaults:
  dataplex:
    projects: dataplex-dq-demo
    locations: us-central1
    lakes: dq-demo-lake
    zones: TRANSACTIONS

Metadata registry defaults define default entity_uri values for each scheme.

In our case, since we are using Dataplex entities, we’ll use entity_uri to discover the table where data quality rules will be applied. Dataplex Data Quality tasks can also be applied to BigQuery tables which are not discovered using Dataplex. In that case, along with the following four sections, an ‘entities’ section will need to be created which describes the schema and column details of the BigQuery table (a rough sketch follows below).
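As an illustration only (it is not needed for this demo), an entities section for a plain BigQuery table might look roughly like the sketch below; the exact field names should be verified against the CloudDQ documentation:

entities:
  ORDERS:
    source_database: BIGQUERY
    table_name: orders
    dataset_name: dataplex_dq_task_demo
    project_name: dataplex-dq-demo
    columns:
      ID:
        name: ID
        data_type: STRING
        description: order identifier
      AMOUNT:
        name: AMOUNT
        data_type: NUMERIC
        description: transaction amount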

Broadly, apart from the metadata_registry_defaults, there are 4 sections within a configuration file.

ROW FILTERS

row_filters:
  INTERNATIONAL_ITEMS:
    filter_sql_expr: |-
      REGEXP_CONTAINS(item_id, 'INTNL')

Row Filters define a filtering condition for data extracted from the respective entity, to generate a subset of data for validation.

RULE DIMENSIONS

rule_dimensions:
  - consistency
  - correctness
  - duplication
  - completeness
  - conformance
  - integrity

Rule Dimensions define the allowed list of Data Quality rule dimensions that a rule can set in its dimension field.

RULES

rules:
  VALUE_ZERO_OR_POSITIVE:
    rule_type: CUSTOM_SQL_EXPR
    dimension: correctness
    params:
      custom_sql_expr: |-
        $column >= 0

Rules define reusable logic for data quality validation.

RULE BINDINGS

rule_bindings:
  TRANSACTION_AMOUNT_VALID:
    entity_uri: dataplex://projects/dataplex-dq-demo/locations/us-central1/lakes/dq-demo-lake/zones/transactions/entities/orders
    column_id: amount
    row_filter_id: NONE
    incremental_time_filter_column_id: last_modified_timestamp
    rule_ids:
      - VALUE_ZERO_OR_POSITIVE

Rule Bindings define a single Data Quality validation routine. Each value declared in entity_id / entity_uri, column_id, filter_id, and rule_id is a lookup key for the more detailed configurations that must be defined in their respective configuration sections.

Now that we understand the anatomy of a configuration file where rules are written, here is the sample file that we’ll use for applying the rules to the sample BigQuery table.

dq_demo_config.yaml

metadata_registry_defaults:
  dataplex:
    projects: dataplex-dq-demo
    locations: us-central1
    lakes: dq-demo-lake
    zones: TRANSACTIONS
row_filters:
  NONE:
    filter_sql_expr: |-
      True
  INTERNATIONAL_ITEMS:
    filter_sql_expr: |-
      REGEXP_CONTAINS(item_id, 'INTNL')
rule_dimensions:
  - consistency
  - correctness
  - duplication
  - completeness
  - conformance
  - integrity
rules:
  NO_DUPLICATES_IN_COLUMN_GROUPS:
    rule_type: CUSTOM_SQL_STATEMENT
    dimension: duplication
    params:
      custom_sql_arguments:
        - column_names
      custom_sql_statement: |-
        select a.*
        from data a
        inner join (
          select
            $column_names
          from data
          group by $column_names
          having count(*) > 1
        ) duplicates
        using ($column_names)
  VALUE_ZERO_OR_POSITIVE:
    rule_type: CUSTOM_SQL_EXPR
    dimension: correctness
    params:
      custom_sql_expr: |-
        $column >= 0
  VALID_ITEM_ID:
    rule_type: REGEX
    dimension: conformance
    params:
      pattern: |-
        [A-Z]{5}[0-9]{15}
  VALID_CURRENCY_ID:
    rule_type: CUSTOM_SQL_EXPR
    dimension: integrity
    params:
      custom_sql_expr: |-
        $column in ('GBP', 'JPY')
rule_bindings:
  TRANSACTIONS_UNIQUE:
    entity_uri: dataplex://projects/dataplex-dq-demo/locations/us-central1/lakes/dq-demo-lake/zones/transactions/entities/orders
    column_id: id
    row_filter_id: NONE
    incremental_time_filter_column_id: last_modified_timestamp
    rule_ids:
      - NO_DUPLICATES_IN_COLUMN_GROUPS:
          column_names: "id"
  TRANSACTION_AMOUNT_VALID:
    entity_uri: dataplex://projects/dataplex-dq-demo/locations/us-central1/lakes/dq-demo-lake/zones/transactions/entities/orders
    column_id: amount
    row_filter_id: NONE
    incremental_time_filter_column_id: last_modified_timestamp
    rule_ids:
      - VALUE_ZERO_OR_POSITIVE
  TRANSACTION_VALID_ITEM_ID:
    entity_uri: dataplex://projects/dataplex-dq-demo/locations/us-central1/lakes/dq-demo-lake/zones/transactions/entities/orders
    column_id: item_id
    row_filter_id: NONE
    incremental_time_filter_column_id: last_modified_timestamp
    rule_ids:
      - VALID_ITEM_ID
  TRANSACTION_CURRENCY_VALID:
    entity_uri: dataplex://projects/dataplex-dq-demo/locations/us-central1/lakes/dq-demo-lake/zones/transactions/entities/orders
    column_id: transaction_currency
    row_filter_id: INTERNATIONAL_ITEMS
    incremental_time_filter_column_id: last_modified_timestamp
    rule_ids:
      - VALID_CURRENCY_ID

Step-6 : Upload DQ rule config file to Google Cloud Storage

We need to upload the dq_demo_config.yaml rule configuration file to the Cloud Storage bucket created earlier, ‘dq-task-demo-config’.

DQ rule config file to GCS
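A single gsutil command is enough for this, for example:

# Upload the rule configuration to the bucket created earlier
gsutil cp dq_demo_config.yaml gs://dq-task-demo-config/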

Step-7 : Create a Dataplex Data Quality task

With the configuration file uploaded, we can now create a Dataplex Data Quality task from the Google Cloud console. The same task can also be created with gCloud CLI. Click on Dataplex -> Manage Lakes -> Process on the left navigation menu. Then click on Data Quality -> Create Task.

Create Dataplex Data Quality Task
Check Data Quality

Then click on Check Data Quality to open the create task screen. In the Create task screen, the following fields need to be filled in.

Create Task

Along with the task name, the rule configuration file uploaded earlier to the designated Cloud Storage bucket also needs to be selected.

For the results table, let’s select the BigQuery dataset created earlier to store the results summary. Since no table was created in that dataset, we provide a table name and the task will create the table.

Select BigQuery results table

For this demo, we schedule the task to run immediately.

Task scheduled immediately

The task could also be scheduled to run daily, weekly or monthly, as shown below:

Schedule daily or monthly

Or, it could be run on a custom schedule as well, using UNIX cron expressions:

Task CRON schedule

Once the task runs successfully, the Data Quality console shows a ‘Succeeded’ message

Task succeeded
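As noted earlier, the same Data Quality task can also be submitted with the gCloud CLI. The sketch below follows the pattern used in the Dataplex Data Quality documentation; the service account placeholder, the CloudDQ driver and artifact paths, and the TASK_ARGS string are assumptions that should be verified against the current documentation for your region:

# Submit the DQ task on demand (hypothetical task ID dq-demo-task)
gcloud dataplex tasks create dq-demo-task \
  --project=dataplex-dq-demo \
  --location=us-central1 \
  --lake=dq-demo-lake \
  --trigger-type=ON_DEMAND \
  --execution-service-account=<SERVICE_ACCOUNT_EMAIL> \
  --spark-python-script-file="gs://dataplex-clouddq-artifacts-us-central1/clouddq_pyspark_driver.py" \
  --spark-file-uris="gs://dataplex-clouddq-artifacts-us-central1/clouddq-executable.zip","gs://dataplex-clouddq-artifacts-us-central1/clouddq-executable.zip.hashsum","gs://dq-task-demo-config/dq_demo_config.yaml" \
  --execution-args=^::^TASK_ARGS="clouddq-executable.zip, ALL, gs://dq-task-demo-config/dq_demo_config.yaml, --gcp_project_id=dataplex-dq-demo, --gcp_region_id=us-central1, --gcp_bq_dataset_id=dataplex_dq_task_summary_dataset, --target_bigquery_summary_table=dataplex-dq-demo.dataplex_dq_task_summary_dataset.dq_summary,"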

Step-8 : Task summary on BigQuery summary table

Now that the task has run successfully, the BigQuery summary table will hold the summary statistics of the run.

BQ summary table

If we drill down into the table dq_summary, here’s how the task run details are inserted into the table :

Task run details

Each rule and rule_binding_id is recorded along with its success count, success percentage, failed count, failed percentage, and so on.

Rule Success Failure counts
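To inspect individual rule results, a query along these lines can be run against the summary table; the column names follow the CloudDQ summary schema visible in the screenshots and should be adjusted if your CloudDQ version differs:

SELECT
  execution_ts,
  rule_binding_id,
  rule_id,
  column_id,
  success_count,
  success_percentage,
  failed_count,
  failed_percentage
FROM `dataplex-dq-demo.dataplex_dq_task_summary_dataset.dq_summary`
ORDER BY execution_ts DESC;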

Step-9 : Data Quality dashboard with Google Data Studio

Once the BigQuery summary table is populated with the task run summary statistics, a dashboard can be built on top of this data using Google Data Studio, which can be activated by clicking the Export -> Explore with Data Studio link on the table.

Explore with Google Data Studio

Here is a sample data quality dashboard created using Google Data Studio:

Sample data quality dashboard

For this example, four charts are displayed :

  • Percent data successfully validated for each rule_binding_id and dq_run_id
  • Highest validation failure per table_id, column_id, rule_binding_id and rule_id
  • Count of validation failures for each rule_binding_id and dq_run_id
  • Count of validation successes, failures and nulls for each dq_run_id

Similarly, other data quality dashboards could be created from these metrics and dimensions, and an overall quality score could be computed from the success and failure counts, averaged over a designated date range.
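For example, a simple overall score could be computed as the ratio of successful validations to all validations over a chosen date range. A sketch, assuming the same dq_summary columns as in the query above:

SELECT
  rule_binding_id,
  ROUND(100 * SAFE_DIVIDE(SUM(success_count), SUM(success_count) + SUM(failed_count)), 2) AS quality_score_pct
FROM `dataplex-dq-demo.dataplex_dq_task_summary_dataset.dq_summary`
WHERE DATE(execution_ts) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY rule_binding_id;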

To summarize, Dataplex Data Quality tasks can be used to build sophisticated data quality dashboards describing near real-time data quality metrics. Data Quality tasks are natively integrated with Google Cloud data integration products and use serverless Spark jobs to run DQ tasks in a managed way.

Being an open source project, cloud-data-quality evolves with community feedback and remains an easy, declarative way of establishing a data quality component within modern data platforms built on top of Google Cloud.

Read more :

  1. Google Cloud Platform Cloud Data Quality
  2. Google Cloud Dataplex Data Quality Tasks
