Inside the Data Quality Task in Google Dataplex

Vinod Patel
Google Cloud - Community
Dec 27, 2022 · 9 min read

Introduction

Disclosure: All opinions expressed in this article are my own and do not represent those of my current or any previous employer.

Data is one of the most important and valuable assets for any organization: it drives the ability to make informed decisions. However, if data assets do not meet the expected quality in terms of completeness, accuracy, and consistency, the repercussions can be major, because poor data leads to inaccurate business decisions. A data asset becomes a real asset only when it meets the expected quality and drives business value across the organization. Hence, data quality remains one of the most important pillars of the data governance process and a crucial aspect of any modern data platform solution.

Dataplex has two options to validate and measure data quality:

  • Data Quality Task (Generally Available) provides a customizable, scalable, and fully managed experience for DQ validations on BigQuery and GCS data. It uses the open source CloudDQ engine, and tasks run on a schedule on serverless infrastructure managed by Dataplex.
  • Auto data quality (Public Preview) automates and simplifies data quality definitions using UI-driven workflows. It has built-in reports and also drives actions through alerting and troubleshooting.

In this post, I intend to cover the Data Quality Task architecture, its building blocks, and the functionality that validates and provides data quality insights on distributed data assets residing in different storage services such as BigQuery and GCS.

Data Quality Task Architecture Components:

The diagram above illustrates the execution of the DQ Task and summarizes the underpinning DQ validation inputs and outputs.

The user defines DQ rules and filters and binds them to data assets via rule bindings. All of this is captured in a DQ configuration (a YAML file). The user also provides a BigQuery table in which the DQ results will be stored. The DQ Task rule engine consumes the YAML configuration and converts it into the BigQuery SQL dialect using a serverless Spark engine that is fully managed by Dataplex. The validation process leverages BigQuery compute to execute the SQL, measures the data asset content against the defined DQ rules, and stores the measured DQ outputs in the specified BigQuery table.
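
To make this flow concrete, the sketch below pulls the building blocks into a single minimal DQ configuration (each section is explained in detail later in this post; the entity, dataset, and project names are placeholders):

# dq_config.yaml -- consolidated sketch; all names are placeholders
row_filters:
  DATA_TYPE_EMAIL:
    filter_sql_expr: |-
      contact_type = 'email'

rules:
  NOT_NULL_SIMPLE:
    rule_type: NOT_NULL

rule_bindings:
  VALIDATE_CUSTOMER_EMAIL:
    entity_uri: bigquery://projects/<project_id>/datasets/<dataset_name>/tables/customer_table
    column_id: email
    row_filter_id: DATA_TYPE_EMAIL
    rule_ids:
      - NOT_NULL_SIMPLE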

Dataplex DQ Task Salient Features:

  1. Declarative DQ rules configuration:
  • Rule configurations are written as YAML files, which can be managed like any other code base, with version control and a CI/CD process.
  • Rule parameterization is possible for custom-defined rules; parameter values are declared at the point of rule application (i.e., in the rule binding).
  • The BigQuery SQL dialect can be used to define custom rules.

2. In-place validation within BQ and GCS:

  • Dataplex DQ Task carries out validation processing by pushing down the rule instructions to the underlying services (such as BQ).
  • The validation process is performed without data leaving its origin, so no data copies are required.

3. Asynchronous and non-interfering validation

  • Dataplex DQ Task runs in asynchronous mode.
  • It does not interfere with data processing pipelines, nor does it perform any manipulation of the data being validated.

4. Fully managed execution by Dataplex

  • 100% serverless: users do not need to provision any infrastructure (compute, storage, or even a scheduler).

5. Endpoints designed for programmatic consumption (persisted BQ storage and Cloud Logging sink)

  • Users can build BI dashboards or programmatically consume the results for various data management use cases.

Let’s look at some examples of declarative DQ rule configuration.

  1. Simple example of a rule configuration for Dataplex DQ Task:
# rules_config.yaml

# define a rule
rules:
  REGEX_VALID_EMAIL:
    rule_type: REGEX
    dimension: conformance
    params:
      pattern: |-
        ^[^@]+[@]{1}[^@]+$

# apply a rule to data
rule_bindings:
  T2_DQ_1_EMAIL:
    entity_id: DQ_TABLE
    column_id: VALUE
    incremental_time_filter_column_id: TS
    rule_ids:
      - REGEX_VALID_EMAIL

2. Example of rule parameterization:

# rules_config.yaml

rules:
  SQL_LENGTH_LE:
    rule_type: CUSTOM_SQL_EXPR
    dimension: correctness
    params:
      custom_sql_arguments:
        - N # declaring a parameter
      custom_sql_expr: |-
        LENGTH( $column ) <= $N

# apply a rule to data
rule_bindings:
  CHECK_COLUMN_LENGTH1:
    entity_id: DQ_TABLE1
    column_id: STRING_COLUMN1
    rule_ids:
      - SQL_LENGTH_LE:
          N: 50 # declaring the parameter value

  CHECK_COLUMN_LENGTH2:
    entity_id: DQ_TABLE1
    column_id: STRING_COLUMN2
    rule_ids:
      - SQL_LENGTH_LE:
          N: 15 # declaring the parameter value

3. Example of a custom rule via SQL:

FOREIGN_KEY_VALID:
  rule_type: CUSTOM_SQL_EXPR
  dimension: consistency
  params:
    custom_sql_expr: |-
      $column in (select distinct foreign_key from `<dataset-id>.customer_id`)

Let’s dive deeper into the important concepts used to build DQ Task validation.

Rules

Rules capture the measurement criteria against which a data asset is measured. They do not specify which specific data assets they are applied to, so they can be reused across multiple data assets. Fundamentally, rules can be applied at row-level or set-level.

Row-level:

Row-level rules evaluate each row of the data entity separately and report validation outcomes individually for each row (i.e., each row is assigned a pass/fail outcome upon validation); the outcomes are then reported in an aggregated format.

Set-level:

Set-level rules evaluate a set of records as a whole and return a single pass/fail outcome for that set of records. Currently, set-level rules can be defined only through custom SQL templates. The ‘set of records’ can be defined as:

  1. Incremental records (modified or new) in the data entity, identified since the last validation run
  2. All records (non-filtered) within the data entity, in the case of non-incremental validation

The picture below illustrates how different types of rules apply to validated data:

Rule types are classified as follows:

  1. Built-in rule types (such as NOT_BLANK, NOT_NULL, or REGEX)
  • Built-in rule types are defined within the DQ engine, so the platform-specific implementation is hidden from users.
  • All built-in rule types are currently applied at row-level.

2. Custom SQL templates (such as CUSTOM_SQL_EXPR or CUSTOM_SQL_STATEMENT)

  • Custom SQL templates allow users to define arbitrary DQ rule logic in the BQ SQL dialect.
  • CUSTOM_SQL_EXPR is applied at row-level, while CUSTOM_SQL_STATEMENT is applied at set-level.

Both row-level and set-level rules that are based on custom SQL templates are provided with an implicit parameter, $column, which indicates the column being validated. This parameter is instantiated in rule bindings.

The script below provides example definitions of various row-level and set-level rules:

rules:

  # Row-level rule based on a built-in rule type
  VALID_ITEM_ID:
    rule_type: REGEX
    params:
      pattern: |-
        [A-Z]{5}[0-9]{15}

  # Row-level rule based on a custom SQL template
  VALUE_ZERO_OR_POSITIVE:
    rule_type: CUSTOM_SQL_EXPR
    dimension: correctness
    params:
      custom_sql_expr: |-
        $column >= 0

  # Set-level rule based on a custom SQL template with a single parameter
  GAP:
    rule_type: CUSTOM_SQL_STATEMENT
    params:
      custom_sql_arguments:
        - max_gap
      custom_sql_statement: |-
        with lagged_data as (
          select $column, lag($column) over (order by $column ASC) as prev
          from data
        )
        select $column from lagged_data
        where prev is not null and datetime_diff($column, prev, HOUR) > $max_gap

Filters

Filters are reusable logic that specifies which subset of records should be included in or excluded from the validation. Filters are defined separately and attached to rules and data entities in rule bindings.

Below is an example of a filter definition:

row_filters:
  DATA_TYPE_EMAIL:
    filter_sql_expr: |-
      contact_type = 'email'
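
The CloudDQ sample configurations typically also define a pass-through filter, so that rule bindings which should validate every row still have a filter to reference (the name NONE below is just the convention used in those samples):

row_filters:
  NONE:
    filter_sql_expr: |-
      True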

Data entities

Data entities are BigQuery assets that can be queried and validated, such as tables, external tables (where data physically resides in a GCS bucket), and views.

DQ rules are applied to a single data entity. The DQ configuration does not allow defining DQ rules that apply to multiple data entities (e.g., reconciliation rules). However, referencing multiple data entities from a DQ rule is still possible in custom SQL templates.

The data entity is implicit for row-level rules. For custom SQL templates, the data entity is provided through the ‘data’ source, and the validation column is provided through the implicit parameter $column.

Dataplex DQ Task applies several service functions to data entities before exposing them to DQ rules:

  1. Environment-specific details (such as project or dataset) are parameterized, which allows easy deployment of rules across environments.
  2. Dataplex DQ Task performs some common data management operations, such as identifying data changes for incremental validation or bundling multiple validation processes together as an optimization, before passing data to a DQ rule for validation. This is performed transparently to users.
  3. Only the records that pass the user-defined filters are subject to data validation.

Metadata registry

Dataplex DQ Task relies on structural metadata describing data entities to perform validation and various optimizations. This structural metadata is provided by metadata registries. The supported metadata registries are shown in the script below:

# entities YAML metadata registry
entity_id: CUSTOMER_TABLE

# Dataplex metadata registry
entity_uri: dataplex://projects/<project-id>/locations/<region-id>/lakes/<lake-id>/zones/<zone-id>/entities/customer_table

# BigQuery metadata registry
entity_uri: bigquery://projects/<project_id>/datasets/<dataset_name>/tables/customer_table

The open parameters mentioned above (such as project-id) can be omitted and specified via the metadata registry defaults configuration, thus allowing abstraction of the physical environment.
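
As a sketch of how these defaults look in the CloudDQ YAML specification (the key names below are taken from the CloudDQ documentation and should be verified against the current README):

metadata_registry_defaults:
  dataplex:
    projects: <project-id>
    locations: <region-id>
    lakes: <lake-id>
    zones: <zone-id>

# with these defaults in place, a rule binding can reference the entity more compactly, e.g.:
# entity_uri: dataplex://entities/customer_table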

Rule bindings

Rule bindings define the scope of data quality validation. Rule bindings capture the following:

  1. Rule bindings specify which DQ rules apply to which data entities. This is a many-to-many relationship: a single DQ rule can be applied to multiple data entities, and multiple DQ rules can be applied to a single data entity.
  2. Rule bindings declare values for all parameters specified under custom SQL template rules.
  3. Rule bindings capture the parameters that determine incremental validation (via an incremental timestamp column name) or the use of a user-defined filter.
  4. Rule bindings define the granularity at which DQ validation outputs are captured: results are produced at an aggregate level for each rule binding under each validation run.

Below is an example of a rule binding:

VALIDATE_CUSTOMER:
  # data entity referenced via Dataplex metadata registry URI
  entity_uri: dataplex://projects/<project-id>/locations/<region-id>/lakes/<lake-id>/zones/<zone-id>/entities/customer_table # replace variables in <angle brackets> with your own configs
  column_id: email # validation column
  row_filter_id: DATA_TYPE_EMAIL # user-defined filter
  # set of DQ rules connected to the data entity and validation column specified in the current rule binding
  rule_ids:
    - NO_DUPLICATES_IN_COLUMN_GROUPS:
        column_names: "value" # instantiating a parameter defined in the rule
    - NOT_NULL_SIMPLE

Validation

Once users specify the DQ rules configuration (rules, filters, and rule bindings), Dataplex DQ Task keeps the DQ outputs up to date according to the schedule the user creates for the task.

For each rule binding, Dataplex DQ Task generates SQL instructions for data validation and then submits them to the underlying data platform (BigQuery) via push-down execution.

Push-down execution allows data validation without extracting data from the underlying storage platform, improving the performance and security posture of the validation process.
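
As an illustration of what gets pushed down, a row-level REGEX rule like the email example above translates conceptually into an aggregating BigQuery query along these lines (a simplified sketch only, not the exact SQL CloudDQ generates):

-- Simplified illustration; the generated SQL also handles incremental filters,
-- NULL accounting, and writing results to the summary table.
SELECT
  'T2_DQ_1_EMAIL' AS rule_binding_id,
  'REGEX_VALID_EMAIL' AS rule_id,
  COUNT(*) AS rows_validated,
  COUNTIF(REGEXP_CONTAINS(value, r'^[^@]+[@]{1}[^@]+$')) AS success_count,
  COUNTIF(NOT REGEXP_CONTAINS(value, r'^[^@]+[@]{1}[^@]+$')) AS failed_count
FROM `<project-id>.<dataset-id>.dq_table`  -- placeholder table name
WHERE contact_type = 'email'  -- row filter from the rule binding, if one is attached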

Outputs

DQ outputs capture the raw results of DQ validation. Dataplex DQ Task stores the results in a specified BQ table but does not interpret these results or act on them; those steps are carried out by users and are outside the scope of Dataplex DQ Task.

DQ outputs are stored at an aggregated level for each rule_binding and validation run.

  • For row-level rules, DQ outputs capture the total number of records that passed or failed validation for each rule binding and validation run.
  • For set-level rules, DQ outputs capture a binary pass/fail outcome for each rule binding and validation run.

DQ outputs are captured via the following channels:

  1. BigQuery — published to a user-defined DQ outputs table. Typically consumed by programmatic polling of outputs content (pull) to drive data pipelines and consumption decision making, or via BI dashboards established over the outputs table.
  2. Cloud Logging — published as a dedicated event type to Cloud Logging that allows near-real time delivery to consumers. Typically consumed by programmatic subscription (push) to drive data pipelines or DQ incident management workflows.

By using these channels, users can build BI dashboards to visualize DQ outputs, or consume DQ outputs programmatically to drive data pipeline steps, data consumption, or DQ incident management.
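
As a sketch of the pull pattern, a dashboard or pipeline step could poll the outputs table with a query along these lines (the table and column names below are assumptions for illustration; check the schema of the summary table CloudDQ writes in your project):

-- Assumed table and column names; adjust to the actual DQ outputs schema
SELECT
  rule_binding_id,
  rule_id,
  execution_ts,
  success_count,
  failed_count
FROM `<project-id>.<dataset-id>.dq_summary`
WHERE failed_count > 0
ORDER BY execution_ts DESC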

Cost

Dataplex data quality tasks incur charges for BigQuery and Dataproc Serverless (Batches) usage.

  • Dataplex DQ Task converts the YAML configuration file to BigQuery SQL and runs it in the user project, which incurs cost as per BigQuery pricing.
  • Dataplex uses Spark to run the prebuilt, Google-maintained open source CloudDQ driver program that converts the YAML configuration to BigQuery queries, which incurs cost as per Dataproc Serverless pricing.
  • There are no charges for scheduling data quality tasks in Dataplex.

Key Considerations and Limitations for Dataplex DQ Task

There are some considerations and current limitations of the Dataplex DQ Task that might limit its applicability in advanced use cases.

  • Dataplex relies on an open source DQ engine, CloudDQ, to run these checks. Users need an understanding of the CloudDQ capabilities and the CloudDQ YAML specification.
  • Rule specification is an entirely manual process; there is no automatic rule inference.
  • While there is a UI to schedule and monitor data quality runs, there is no UI for rule definition (rules are specified via YAML).
  • There is currently no support for incremental calculation of aggregate functions that might support validation of statistical/profiling properties of data (e.g., standard deviation).
  • At the moment, DQ results are reported at an aggregated level, either for the last data increment determined by when validation was run (for incremental validations) or for the whole data entity (for non-incremental validations). There is currently no ability to define custom boundaries for splitting validated data into batches for aggregated reporting.

Conclusion:

Dataplex Data Quality Task can be used to build reliable and scalable data quality solutions for a modern data platform on Google Cloud. It is well integrated with other GCP data services and uses a 100% serverless mechanism for execution and scheduling. It leverages a YAML specification syntax for DQ rule configurations, which makes code management easy and integrable with CI/CD pipelines. It is built on an open source project (CloudDQ) that receives community feedback and improvements.

I hope you found this post interesting and useful. Happy learning!

References:

https://github.com/GoogleCloudPlatform/cloud-data-quality/blob/main/README.md

https://cloud.google.com/dataplex/docs/data-quality-tasks-overview
