Data Governance with Dataplex — A sneak peek into the latest launches

Mandar Chaphalkar
Google Cloud - Community
4 min read · Dec 27, 2022
Source: Gartner

The challenge of poor data quality is not new. It has been around since the era of legacy data warehouses (DWH) and has grown manifold with modern data lakes. While organisations have reaped the benefits of DWH / data lakes on cloud, the challenges with data quality (DQ) still persist on account of a variety of factors ranging from strategy and technology to cost.

So the question is: how do we solve this in a simple and efficient way?

Stating the obvious — You cannot fix it if you cannot detect it! We are going to focus on the detect part in this blog.

Case:

Let us consider the scenario of a retail company that has implemented its data lake / DWH on Google Cloud with BigQuery. The company is very mindful of governance and has implemented Dataplex to address it.

Raw and Curated data zones in Retail Analytics Data Lake

The company receives orders / transactions data from a bespoke order management application.

Illustrative schema for the orders table

The Data Governance lead suspects that the data in the orders table is not the best from a quality perspective, and that this needs further investigation.

Solution:

Now, there are two primary components to the investigation: A. Profile (Preview) and B. Data Quality (Preview)

A. Profile

Step A1: Click on “Create Data Profile Scan” within the Google Cloud Console > Dataplex > Govern > Profile

Step A2: Select the suspect BigQuery table that has to be scanned for quality issues; here we select the orders table. The scope of the scan can be either full or incremental (based on a specific date column).

Step A3: The scan can be triggered either on demand or on a specific schedule.
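For teams that prefer automation over console clicks, the same profile scan can also be created programmatically. Below is a minimal, unverified sketch using the google-cloud-dataplex Python client; the project, region, dataset, and table names are placeholders, and the exact message names should be confirmed against the current client library reference.

```python
# Minimal sketch (assumptions: google-cloud-dataplex installed, placeholder
# project/region/table names; verify message names against the current docs).
from google.cloud import dataplex_v1

client = dataplex_v1.DataScanServiceClient()

profile_scan = dataplex_v1.DataScan(
    data=dataplex_v1.DataSource(
        # Fully qualified BigQuery resource name of the suspect table
        resource="//bigquery.googleapis.com/projects/my-project/datasets/retail_curated/tables/orders"
    ),
    data_profile_spec=dataplex_v1.DataProfileSpec(),  # full-table profile
    execution_spec=dataplex_v1.DataScan.ExecutionSpec(
        trigger=dataplex_v1.Trigger(on_demand=dataplex_v1.Trigger.OnDemand())
    ),
)

# create_data_scan is a long-running operation; result() waits for completion
operation = client.create_data_scan(
    parent="projects/my-project/locations/us-central1",
    data_scan=profile_scan,
    data_scan_id="orders-profile-scan",
)
print(operation.result().name)
```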

Step A4: The profiling produces standard statistics for the data, such as the null percentage, unique percentage, and other metrics.

In the example above, a 34% null statistic could be of interest from a DQ perspective for an ongoing check: a NULL shipped_at timestamp would lead to an incorrect calculation of a metric like transit duration.
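To see why that matters, here is a small, hypothetical cross-check run directly against BigQuery. The table path and the created_at column are assumptions for illustration; the point is that an aggregate such as average transit time silently skips NULL shipped_at rows.

```python
# Sketch: independently cross-checking the shipped_at null rate in BigQuery.
# The table path and the created_at column are assumed for illustration.
from google.cloud import bigquery

bq = bigquery.Client()
sql = """
SELECT
  COUNTIF(shipped_at IS NULL) / COUNT(*) AS null_ratio,
  AVG(TIMESTAMP_DIFF(shipped_at, created_at, HOUR)) AS avg_transit_hours
FROM `my-project.retail_curated.orders`
"""
row = next(iter(bq.query(sql).result()))
print(f"shipped_at null ratio: {row.null_ratio:.0%}")
# AVG ignores rows where shipped_at is NULL, which is exactly how a
# transit-duration metric becomes misleading when completeness is poor.
print(f"avg transit duration (hours): {row.avg_transit_hours}")
```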

That brings us to the second part of the investigation, the ongoing checks: B. Data Quality

Step B1: Click on “Create Data Quality Scan” within the Google Cloud Console > Dataplex > Govern > Data Quality

Step B2: Define the scan with information such as the table and the incremental / full scope. The next step is to define the DQ rule for validation, and there are multiple options here.

Step B3: In addition to the standard DQ checks such as range, null, value-set et al., it is possible to make a selection based on a recommendation from the data profiling step as well.

Step B4: See below all the recommendations that have been made from the profiling task. For our example we will select a standard one, i.e. the null check, for the DQ validation.
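The same rule can also be expressed in code. The sketch below, again using the google-cloud-dataplex client, defines a data quality scan with a single NOT NULL (completeness) rule on shipped_at; names are placeholders and the rule field names should be verified against the current API reference.

```python
# Sketch: a DQ scan with one NOT NULL rule on shipped_at (placeholder names;
# verify DataQualityRule fields against the current google-cloud-dataplex docs).
from google.cloud import dataplex_v1

client = dataplex_v1.DataScanServiceClient()

dq_scan = dataplex_v1.DataScan(
    data=dataplex_v1.DataSource(
        resource="//bigquery.googleapis.com/projects/my-project/datasets/retail_curated/tables/orders"
    ),
    data_quality_spec=dataplex_v1.DataQualitySpec(
        rules=[
            dataplex_v1.DataQualityRule(
                column="shipped_at",
                dimension="COMPLETENESS",
                non_null_expectation=dataplex_v1.DataQualityRule.NonNullExpectation(),
                threshold=1.0,  # every row must pass for the rule to pass
            )
        ]
    ),
    execution_spec=dataplex_v1.DataScan.ExecutionSpec(
        trigger=dataplex_v1.Trigger(on_demand=dataplex_v1.Trigger.OnDemand())
    ),
)

client.create_data_scan(
    parent="projects/my-project/locations/us-central1",
    data_scan=dq_scan,
    data_scan_id="orders-dq-scan",
).result()
```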

Step B5: We have triggered an on-demand scan and the job has completed successfully. The validation has resulted in a failure for completeness, indicating that the test data does contain nulls in the shipped_at column on an ongoing basis.
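The on-demand run itself can also be triggered from code, which is handy when the scan should gate an orchestrated pipeline. A rough sketch follows, with the scan name as a placeholder and the response fields to be confirmed in the API reference.

```python
# Sketch: kicking off an on-demand run of the DQ scan and grabbing the job
# handle; the job can then be polled (get_data_scan_job) until it completes.
from google.cloud import dataplex_v1

client = dataplex_v1.DataScanServiceClient()
response = client.run_data_scan(
    name="projects/my-project/locations/us-central1/dataScans/orders-dq-scan"
)
job = response.job
print(job.name, job.state)  # poll until the job succeeds, then inspect the DQ result
```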

Based on the outcome above, the DQ action within the ETL pipeline can be defined. Now that we know the shipped_at column can be NULL on an ongoing basis, we can attach a step to the Dataflow, Data Fusion, or BigQuery procedure that handles the ETL operation on this data.
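As one illustration of such a step, the sketch below quarantines the offending rows before the clean data is published. This is only an assumed pattern with placeholder table names, not part of the original pipeline.

```python
# Sketch of a downstream cleansing step: rows with a NULL shipped_at are
# routed to a quarantine table, and only complete rows feed the metrics layer.
# All table names are illustrative placeholders.
from google.cloud import bigquery

bq = bigquery.Client()

bq.query("""
CREATE OR REPLACE TABLE `my-project.retail_curated.orders_quarantine` AS
SELECT * FROM `my-project.retail_curated.orders`
WHERE shipped_at IS NULL
""").result()

bq.query("""
CREATE OR REPLACE TABLE `my-project.retail_curated.orders_clean` AS
SELECT * FROM `my-project.retail_curated.orders`
WHERE shipped_at IS NOT NULL
""").result()
```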

In conclusion,

With the new Profile (Preview) and Data Quality (Preview) functionalities in Dataplex, it is possible to carry out profiling and validation of data within the data lake in a few simple steps. The profiling step is particularly helpful in suggesting the checks that need to be performed on an ongoing basis. Based on the DQ checks, corrective action for cleansing can be determined and defined in the data processing component of the solution.


Mandar Chaphalkar, Data Analytics Specialist at Google | *Views, opinions are personal*