Google Cloud Dataplex - Part 3 - Data Quality using CloudDQ
Disclosure: All opinions expressed in this article are my own, and represent no one but myself and not those of my current or any previous employers.
I intend to publish a series of posts on Google Cloud Dataplex, starting from the very basics and eventually executing tasks that become gradually more complex. This is part 3 of the series and describes how to build a sophisticated, managed data quality component for modern data platforms using Dataplex Data Quality tasks.
Here’s how the series looks so far:
Part 3: Google Cloud Dataplex - Part 3 - Data Quality using CloudDQ (current post)
Building an enterprise data platform almost always requires a strong data governance practice to ensure data trustworthiness. Data quality or DQ remains one of the most important pillars of the data governance practice on a data platform.
Data quality describes the properties of data content that determine its usefulness.
A managed DQ offering provides a set of features and capabilities to the end user to measure, monitor and improve DQ of a data platform.
Data Quality can conceptually be split into multiple stages that present how data quality operations typically flow.
These stages present a conceptual flow of events that helps to categorize various DQ capabilities. They do not necessarily reflect the sequence in which DQ capabilities are implemented. Usually, elements across all four stages need to be in place to provide operational DQ capabilities.
- Define includes numerous capabilities around establishing DQ benchmarks / rules, which breaks down into manual and automated DQ rules definition, DQ rules deployment, and DQ rules governance.
- Measure includes processes that measure data against predefined DQ benchmarks/rules. These processes need to be efficient, scalable and easy to manage (ideally transparent) for users. This stage also covers defining how DQ measurement is deployed in data environments.
- Observe includes capturing DQ measurement outputs and making them accessible and useful for users. This stage also includes identifying negative patterns in DQ outputs which would constitute issues and flagging them for subsequent troubleshooting.
- Troubleshoot includes addressing previously identified data issues, which includes capturing, communicating, prioritizing, escalating, and remediating them.
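The hand-off between the four stages can be sketched as a small pipeline. The rules, records, and 95% threshold below are purely hypothetical illustrations, not part of any product API:

```python
# A deliberately simplified sketch of the four DQ stages; the rules,
# records, and 95% threshold are made up for illustration only.

# Define: declarative rules mapping a column to a validation predicate.
rules = {
    "amount_non_negative": ("amount", lambda v: v is not None and v >= 0),
    "currency_known": ("currency", lambda v: v in ("USD", "GBP", "JPY")),
}

records = [
    {"amount": 100, "currency": "USD"},
    {"amount": -10, "currency": "XXX"},
]

# Measure: evaluate each rule over the data and count outcomes.
summary = {}
for rule_id, (column, predicate) in rules.items():
    passed = sum(1 for r in records if predicate(r.get(column)))
    summary[rule_id] = {"success": passed, "failed": len(records) - passed}

# Observe: flag rules whose success rate falls below a benchmark.
issues = [rid for rid, s in summary.items()
          if s["success"] / (s["success"] + s["failed"]) < 0.95]

# Troubleshoot: flagged rules become issues to investigate and remediate.
print(summary)
print(issues)
```

Real DQ tooling replaces each of these stages with managed, scalable components, but the flow of information between them is the same.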
Google Cloud Data Quality
Data Quality on Google Cloud is not an isolated capability; rather, it is interconnected with a number of other processes and capabilities in the data space. Data Quality both depends on them and supports them.
Google Cloud Dataplex enables data governance on data platforms built using Google products and tools. Please go through Part 1 and Part 2 to understand the core capabilities of Dataplex: an intelligent data fabric that provides metadata-led data management with built-in data quality and governance, so data practitioners spend less time wrestling with infrastructure boundaries and inefficiencies, can trust the data at rest, and spend more time deriving value from that data.
Data Quality is a key part of data governance that is enabled by Dataplex.
Dataplex enables Data Quality through a built-in Data Quality task that can be used to define rules and execute them against BigQuery or Google Cloud Storage (GCS) tables.
Data Quality Task in Dataplex provides efficient, scalable, and fully managed DQ validations in BigQuery and GCS for distributed deployment across data teams and assets.
For data quality validations, the Data Quality task uses the CloudDQ open-source engine developed by the Google Professional Services Organization (here). These tasks can run on a scheduled basis or ad hoc, on serverless infrastructure managed by Dataplex.
Key properties of Dataplex DQ tasks are:
- Declarative rules configuration and support for CI/CD
- Validation in place within BigQuery and GCS without extracting data, benefiting from BigQuery scalability and minimizing the security risk surface
- Endpoints designed for programmatic consumption (persisted BQ storage and Cloud Logging sink), and supporting custom integrations with UI (such as BI and Metadata management tooling)
- Managed execution orchestrated by Dataplex
With the knowledge of the key features of Dataplex DQ tasks, let’s build a hands-on demo to illustrate how Dataplex DQ works in practice. In this demo, we’ll build a Dataplex Data Quality task and schedule it to run on a BigQuery table and we’ll interpret the results of the task after it is scheduled and run successfully.
Step-1 : Create a new Google Cloud project
Create a new Google Cloud project named ‘dataplex-dq-demo’.
Step-2 : Create a new BigQuery dataset
Create a new BigQuery dataset to hold the sample data; let’s call it dataplex_dq_task_demo.
Within it, let’s create a table called orders using the script below:
CREATE OR REPLACE TABLE `dataplex-dq-demo.dataplex_dq_task_demo.orders` (
  ID STRING NOT NULL,
  LAST_MODIFIED_TIMESTAMP TIMESTAMP,
  CUSTOMER_ID STRING,
  ITEM_ID STRING,
  AMOUNT NUMERIC,
  TRANSACTION_CURRENCY STRING
);
Step-3 : Insert some sample records
Once the table is successfully created, let’s insert some sample records on which the data quality tasks will run. Note that the sample deliberately includes defects (a duplicate ID, a negative amount, a malformed item ID, and an unexpected currency code) so the data quality rules have something to catch.
INSERT INTO `dataplex-dq-demo.dataplex_dq_task_demo.orders`
  (ID, LAST_MODIFIED_TIMESTAMP, CUSTOMER_ID, ITEM_ID, AMOUNT, TRANSACTION_CURRENCY)
VALUES
  ("order1", CURRENT_TIMESTAMP(), "customer1", "ASDWQ123456789012345", 100, "USD"),
  ("order1", CURRENT_TIMESTAMP(), "customer2", "bad_item_id", -10, "XXX"),
  ("order2", CURRENT_TIMESTAMP(), "customer3", "INTNL987654321098765", 50, "GBP"),
  ("order3", CURRENT_TIMESTAMP(), "customer4", "INTNL932716428593847", 50, "GBP");
Step-4 : BigQuery dataset as Dataplex asset
Now we’ll map the BigQuery dataset as an asset in Dataplex. To do so, we first create a Dataplex lake called dq-demo-lake. To learn more about how to create Dataplex lakes, zones, and assets, please read the first two posts of this series, Part 1 and Part 2.
We also create a zone called ‘transactions’ within the Dataplex lake and create it as a Raw Zone.
Let’s add the BigQuery dataset created earlier as an asset in the ‘transactions’ Dataplex zone.
We also create another dataset within the same project called ‘dataplex_dq_task_summary_dataset’ which will hold the results of the data quality task runs.
As a last step, we also create a Google Cloud Storage bucket called ‘dq-task-demo-config’ where the configuration file for writing the rules will be stored.
With this step, the prerequisites for running the Dataplex data quality task are complete.
Step-5 : DQ rule configuration
We now define the configuration file with rules for data quality checks.
In this section, let’s break down the rule file and understand what each of its sections does.
The first section provides the metadata of the data asset, which in this case is a Dataplex entity discovered from the BigQuery dataset.
metadata_registry_defaults:
  dataplex:
    projects: dataplex-dq-demo
    locations: us-central1
    lakes: dq-demo-lake
    zones: transactions
Metadata registry defaults define default entity_uri values for each scheme.
For our case, since we are using Dataplex entities, we’ll use entity_uri to discover the table to which the data quality rules will be applied. Dataplex Data Quality tasks can also be applied to BigQuery tables that are not discovered through Dataplex. In that case, along with the following four sections, an ‘entities’ section will need to be created that describes the schema and column details of the BigQuery table.
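To make the role of the defaults concrete, here is a small Python sketch (illustrative only, not CloudDQ’s own resolution code) of how a fully qualified Dataplex entity_uri is assembled from the defaults above plus an entity name:

```python
# Illustrative only: builds a Dataplex entity_uri from the values declared
# under metadata_registry_defaults. This mirrors the URI shape CloudDQ
# expects but is not CloudDQ internals.
defaults = {
    "projects": "dataplex-dq-demo",
    "locations": "us-central1",
    "lakes": "dq-demo-lake",
    "zones": "transactions",
}

def entity_uri(entity_id: str, d: dict = defaults) -> str:
    return ("dataplex://projects/{projects}/locations/{locations}"
            "/lakes/{lakes}/zones/{zones}/entities/".format(**d) + entity_id)

print(entity_uri("orders"))
```

With the defaults in place, a rule binding only needs the entity name; the rest of the URI is filled in for it.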
Broadly, apart from the metadata_registry_defaults, there are 4 sections within a configuration file.
row_filters:
  INTERNATIONAL_ITEMS:
    filter_sql_expr: |-
      REGEXP_CONTAINS(item_id, 'INTNL')
Row Filters define a filtering condition for data extracted from the respective entity, generating a subset of the data for validation.
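The INTERNATIONAL_ITEMS filter above translates directly into a regular-expression match. A quick Python approximation (BigQuery’s REGEXP_CONTAINS behaves like an unanchored regex search) applied to the sample item IDs inserted earlier:

```python
import re

# Approximates REGEXP_CONTAINS(item_id, 'INTNL') over the sample rows;
# only rows passing the filter enter validation for bindings that use it.
item_ids = [
    "ASDWQ123456789012345",
    "bad_item_id",
    "INTNL987654321098765",
    "INTNL932716428593847",
]

international = [i for i in item_ids if re.search(r"INTNL", i)]
print(international)
```

Only the two INTNL-prefixed rows would be validated by a binding that references this filter.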
rule_dimensions:
  - consistency
  - correctness
  - duplication
  - completeness
  - conformance
  - integrity
Rule Dimensions define the allowed list of Data Quality dimensions that a rule can declare in its dimension field.
rules:
  VALUE_ZERO_OR_POSITIVE:
    rule_type: CUSTOM_SQL_EXPR
    dimension: correctness
    params:
      custom_sql_expr: |-
        $column >= 0
Rules define reusable logic for data quality validation.
rule_bindings:
  TRANSACTION_AMOUNT_VALID:
    entity_uri: dataplex://projects/dataplex-dq-demo/locations/us-central1/lakes/dq-demo-lake/zones/transactions/entities/orders
    column_id: amount
    row_filter_id: NONE
    incremental_time_filter_column_id: last_modified_timestamp
    rule_ids:
      - VALUE_ZERO_OR_POSITIVE
Rule Bindings define a single Data Quality validation routine. Each value declared in entity_uri, column_id, row_filter_id, and rule_ids is a lookup key for the more detailed configurations defined in their respective sections.
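Putting the pieces together, a rule binding amounts to: resolve the entity, optionally filter its rows, then apply each referenced rule to the bound column. A Python sketch (again illustrative, not CloudDQ internals) of the TRANSACTION_AMOUNT_VALID binding over the sample data:

```python
# Illustrative evaluation of the TRANSACTION_AMOUNT_VALID binding:
# row_filter_id NONE keeps every row; VALUE_ZERO_OR_POSITIVE checks
# the bound column `amount` with the predicate $column >= 0.
rows = [
    {"id": "order1", "amount": 100},
    {"id": "order1", "amount": -10},
    {"id": "order2", "amount": 50},
    {"id": "order3", "amount": 50},
]

def evaluate_binding(rows, column_id, row_filter, rule):
    candidates = [r for r in rows if row_filter(r)]
    success = sum(1 for r in candidates if rule(r[column_id]))
    return {"success_count": success,
            "failed_count": len(candidates) - success}

result = evaluate_binding(
    rows,
    column_id="amount",
    row_filter=lambda r: True,   # row_filter_id: NONE
    rule=lambda v: v >= 0,       # VALUE_ZERO_OR_POSITIVE
)
print(result)
```

The one failing row is the deliberately bad record with amount -10; the real task computes equivalent counts in BigQuery SQL without moving the data.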
Now that we understand the anatomy of a configuration file where rules are written, here is the sample file that we’ll use for applying the rules on the sample BigQuery table.
Continuing after the metadata_registry_defaults, row_filters, and rule_dimensions sections shown above (note that a NONE row filter, with filter_sql_expr: |- True, must also be declared under row_filters so the NONE references below resolve), the rules and rule_bindings sections are:

rules:
  NO_DUPLICATES_IN_COLUMN_GROUPS:
    rule_type: CUSTOM_SQL_STATEMENT
    dimension: duplication
    params:
      custom_sql_arguments:
        - column_names
      custom_sql_statement: |-
        select a.*
        from data a
        inner join (
          select
            $column_names
          from data
          group by $column_names
          having count(*) > 1
        ) duplicates
        using ($column_names)
  VALUE_ZERO_OR_POSITIVE:
    rule_type: CUSTOM_SQL_EXPR
    dimension: correctness
    params:
      custom_sql_expr: |-
        $column >= 0
  VALID_ITEM_ID:
    rule_type: REGEX
    dimension: conformance
    params:
      pattern: |-
        [A-Z]{5}[0-9]{15}
  VALID_CURRENCY_ID:
    rule_type: CUSTOM_SQL_EXPR
    dimension: integrity
    params:
      custom_sql_expr: |-
        $column in ('GBP', 'JPY')

rule_bindings:
  TRANSACTIONS_UNIQUE:
    entity_uri: dataplex://projects/dataplex-dq-demo/locations/us-central1/lakes/dq-demo-lake/zones/transactions/entities/orders
    column_id: id
    row_filter_id: NONE
    rule_ids:
      - NO_DUPLICATES_IN_COLUMN_GROUPS:
          column_names: "id"
  TRANSACTION_AMOUNT_VALID:
    entity_uri: dataplex://projects/dataplex-dq-demo/locations/us-central1/lakes/dq-demo-lake/zones/transactions/entities/orders
    column_id: amount
    row_filter_id: NONE
    incremental_time_filter_column_id: last_modified_timestamp
    rule_ids:
      - VALUE_ZERO_OR_POSITIVE
  TRANSACTION_VALID_ITEM_ID:
    entity_uri: dataplex://projects/dataplex-dq-demo/locations/us-central1/lakes/dq-demo-lake/zones/transactions/entities/orders
    column_id: item_id
    row_filter_id: NONE
    rule_ids:
      - VALID_ITEM_ID
  TRANSACTION_CURRENCY_VALID:
    entity_uri: dataplex://projects/dataplex-dq-demo/locations/us-central1/lakes/dq-demo-lake/zones/transactions/entities/orders
    column_id: transaction_currency
    row_filter_id: INTERNATIONAL_ITEMS
    rule_ids:
      - VALID_CURRENCY_ID
Step-6 : Upload DQ rule config file to Google Cloud Storage
We need to upload the dq_demo_config.yaml rule configuration file to the Cloud Storage bucket created earlier, ‘dq-task-demo-config’.
Step-7 : Create a Dataplex Data Quality task
With the configuration file uploaded, we can now create a Dataplex Data Quality task from the Google Cloud console. The same task can also be created with the gcloud CLI. Click on Dataplex -> Manage Lakes -> Process on the left navigation menu. Then click on Data Quality -> Create Task.
Then click on Check Data Quality to open the Create task screen, where the following fields need to be filled in.
Along with the task name, the rule configuration file will also need to be selected, which was uploaded earlier in the designated cloud storage bucket.
For the results table, let’s select the BigQuery dataset created earlier to store the results summary. Since no table was created in that dataset, let’s provide a table name and the task will create the table.
For this task, we schedule it to run immediately.
The task could also be scheduled to run daily, weekly, or monthly, as shown below:
Or it could run on a custom schedule using a UNIX cron expression (for example, 0 2 * * * runs the task daily at 2 AM):
Once the task runs successfully, the Data Quality console shows a ‘Succeeded’ message.
Step-8 : Task summary on BigQuery summary table
Now that the task has run successfully, the BigQuery summary table holds the summary statistics of the run.
If we drill down into the table dq_summary, here’s how the task run details are inserted into the table :
For each rule and rule_binding_id, the table records the success count, success percentage, failed count, failed percentage, and so on.
Step-9 : Data Quality dashboard with Google Data Studio
Once the BigQuery summary table is populated with the task run summary statistics, a dashboard can be built on top of this data using Google Data Studio, launched by clicking Export -> Explore with Data Studio on the table.
Here is an example of a data quality dashboard created using Google Data Studio.
For this example, four charts are displayed :
- Percent data successfully validated for each rule_binding_id and dq_run_id
- Highest validation failure per table_id, column_id, rule_binding_id and rule_id
- Count of validation failures for each rule_binding_id and dq_run_id
- Count of validation successes, failures and nulls for each dq_run_id
Similarly, other data quality dashboards could be created from these metrics and dimensions, and an overall quality score could be computed from the success and failure counts, averaged over a designated date range.
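As a sketch of such a score, one simple (hypothetical) formulation is total successes over total validated records across all rule bindings in a run. The summary rows below are made up for illustration; real values come from the dq_summary table:

```python
# Hypothetical overall-quality score: total successes divided by total
# validated rows across all rule bindings. Sample rows are illustrative.
summary_rows = [
    {"rule_binding_id": "TRANSACTION_AMOUNT_VALID",
     "success_count": 3, "failed_count": 1},
    {"rule_binding_id": "TRANSACTION_CURRENCY_VALID",
     "success_count": 2, "failed_count": 0},
]

total_success = sum(r["success_count"] for r in summary_rows)
total = total_success + sum(r["failed_count"] for r in summary_rows)
quality_score = 100.0 * total_success / total
print(round(quality_score, 1))
```

Weighting by row counts (as here) or averaging per-binding percentages are both reasonable choices; which one fits depends on how the score will be consumed.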
To summarize, Dataplex Data Quality tasks can be used to build sophisticated Data Quality dashboards presenting up-to-date Data Quality metrics. Data Quality tasks are natively integrated with Google Cloud data integration products and use serverless Spark jobs to run DQ validations in a managed way.
Being an open-source project, cloud-data-quality continues to receive community feedback and new features, and remains an easy, declarative way of establishing a data quality component within modern data platforms built on top of Google Cloud.