Setting Up Data Quality Monitoring For Cloud Dataprep Pipelines

Published in

Google Cloud - Community

12 min readFeb 1, 2021

Building a Data Quality Dashboard

Building a modern data stack to manage analytic pipelines — such as Google Cloud and a BigQuery data warehouse or data lake — has many benefits. One such benefit is the ability to automatically monitor the quality of your data pipelines. You can ensure that accurate data is fueling your analytics, track data quality trends, and, if a data quality issue does arise, react quickly to resolve it.

Let’s review how to create a beautiful Data Studio dashboard to monitor Cloud Dataprep data quality pipelines.

For this article, let’s assume that you’re responsible for managing a data pipeline for a Sales Cloud Data Warehouse (DWH) and you want to monitor Data Quality (DQ) for this particular Sales DWH. You’ll have to create two distinct but correlated pipelines:

Pipeline 1 for loading the actual Sales DWH
Pipeline 2 for monitoring DQ

When the Sales DWH is loaded, it will trigger the DQ Pipeline.

The solution from a high level perspective follow these principles that you will implement throughout this article:

Switch on job “Profiling Results” and “Data Quality Rules” for the Sales DWH datasets (Sales and Advertising Clickstream).
Execute the Sales Dataprep jobs to load the Sales DWH.
Execute a webhook when the Sales DWH is loaded to trigger the DQ Pipeline
Report on the Google BigQuery tables with Google Data Studio to track the Data Quality trends for the Sales DWH.

Here is an example of a Google Data Studio report you will be creating. This blog explains in more details what it provides.

Figure — Google Data Studio report with Dataprep Data Quality trends

Building the Data Quality Pipeline with Cloud Dataprep

Here is an overview of the entire solution used to capture Cloud Dataprep Data Quality statistics and load them to BigQuery. The automation piece comes from Dataprep jobs, while Data Studio handles reporting.

Figure — Dataprep flows to monitor the Sales Data Warehouse Data Quality Pipeline

The upper lane represents the actual Sales DWH, which is loading sales and advertising data into a DWH.
As soon as the Cloud Dataprep job is finished executed, it executes a webhook (a task) that triggers a second Cloud Dataprep job.
This second job in the lower lane will load the Data Profiling and Data Quality Rules results (in a form of JSON files) into a BigQuery table in order to report on it with Data Studio.

1. Pre-requisites

Be familiar with Profiling & Data Quality Rules:

For the datasets you want to monitor DQ, you must enable “Profiling Results” and “Data Quality Rules”. Please be familiarized with these features. These are the essential statistics we will be leveraging to build the DQ Dashboard. Each time you run a job with Trifacta, some Data Quality statistics are created and can be accessed via APIs or through JSON files. We will be leveraging the JSON files to build the Data Quality database. You will have to enable Profiling & Data Quality Rules in the jobs and flows respectively. If you don’t have a flow yet to monitor, no worries, we will be leveraging the Sales DWH flow provided below in the article.

Downloads:

All the assets for this blog are available in this GitHub

Download flow_Profiling Quality Rules Processing.zip which is the DQ flow that extracts the Profiling and Data Quality results and loads them into BigQuery tables

Download flow_Data Quality_Clickstream_and_Sales.zip which is the Sales DWH flow example used in this blog. If you wish, you can use your own Cloud Dataprep flow for which you want to monitor the Data Quality for.

Download Advertising_Clickstream.csv and download Sales_Data_small.csv which are the source example for the Sales DWH

You can make your own copy of the Data Studio dashboard [PUBLIC] Cloud Dataprep Profiling & Data Quality Dashboard to customize it to your Data Quality monitoring needs

Service Access:

You will need a valid Google account and access to Cloud Dataprep and Google BigQuery in order to try it out. You can go to the Google Console https://console.cloud.google.com/ to activate those services.

Calling APIs:

To call APIs you’ll need an Access Token, which you can generate from your Preferences page in Cloud Dataprep.

Figure — Get the Access Token from the Setting menu

2. Create Cloud Dataprep Data Quality Flow

Figure — Dataprep Flow Parsing Profiling & DQ Rules Results to Load in BigQuery

Import the DQ Flows in Cloud Dataprep

If not done yet, download flow_Profiling Quality Rules Processing.zip DQ flow and import it (without unzipping it) in your Dataprep environment. In the Cloud Dataprep application, click the Flows icon in the left nav bar. Then In the Flows page, select Import from the context menu.

Figure — Import the DQ Flow in your Dataprep environment

Figure — “Profiling & Quality Rules Processing” Flow imported

These two flows parse the JSON files and convert them into columnar BigQuery table format for simplified reporting in Data Studio. If you’re curious, you can take a look at the recipes “Profiler Rules” and “Profiler Check” to understand the logic.

Prepare for the API call

You now need to get the ID of this Flow. This will be used to trigger the 2 jobs from the DQ flow you just imported with an API call later on.

You can get the id from the Flow URL:

Figure — Getting the <flow_id> from the Dataprep flow URL

In my example <flow_id> = 355168

Now, you need to customize this Flow to your own personal Dataprep configuration and settings. In the next section, you acquire the Profiling and Data Quality Rules output that were generated as part of one of your job runs (in our example the Sales DWH). Then, you feed them into your DQ pipeline to load them into BigQuery.

Get the Profiling and Data Quality rules results file names

When Profiling is enabled and if you have Data Quality Rules defined, Dataprep generates 3 JSON files at the end of the job execution located into your Google Cloud Storage staging bucket in this default folder :

/<user_name>/jobrun/<output_name>_<jobid>/.profiler

Where:

<user_name> = GCP user (account email) running the Dataprep job
<output_name> = Dataprep Output object name
<job_id> = Dataprep job id

And the 3 (multi-part) JSON files generated are:

profilerRules.json = Data Quality Rules results
profilerTypeCheckHistograms.json = Profiling information with Valid, Missing an Mismatched from all columns
profilerValidValuesHistograms.json = Profiling information with top 20 and statistics from all columns

For our current example, our <staging_bucket_name> is “dataprep-staging-0b9ad034–9473–4777–98f1–0f3e643d0dce”, and we have used the default <jobrun_folder>. For this example we will only use the two first JSON profiling files. You may want to extend later on the solution with the last data quality file that contains doe extra statistics.

Figure — Profiling and Data Quality JSON files generated in GSC staging bucket

Note that the default <jobrun_folder> used in this example can have been changed in your own environment. Back to Cloud Dataprep application, from the User Preferences menu in the left nav bar you can verify the <jobrun_folder> used for your account. You can find this setting in your User Preferences and Profile information.

Figure — Check the Job Run directory in Cloud Dataprep User Preferences

Update the DQ Flow with new Parameterized Inputs Datasets

We can use the two files from a job execution (profilerRules.json and profilerTypeCheckHistograms.json) as the basis for creating the imported datasets for the DQ flow. These flows are accessed via paths.

With Cloud Dataprep, when you create a new imported datasets, you can parameterize parts of the paths, which allows you to create imported datasets that match all such files from all of your job runs. These datasets are called datasets with parameters.

In the “Profiling & Quality Rules Processing” Flow previously imported, you need to update the two “Profiler input datasets” with your <staging_bucket_name> and <jobrun_folder>.

For this, we advise you to run the job a first time with “Profiling Results” and “Data Quality Rules enabled so you can later on update the flow with an existing path and Profiler json files.

Figure — Path and Parameters of the 2 Profiler input datasets

Edit parameters for the 2 datasets with the proper path.

First from the Input Dataset Details panel, select the “Edit parameters…” item from the right nav menu ‘…’.

Figure — Edit parameters

Then with the “Browse” button, you can find your staging bucket and create the path with the 3 variables <user_name>, <output_name> and <job_id> to find your Profiler json files like below. I advise first to find an existing profilerRules.json file and then replace user, output dataset name and job id, with the 3 variables.

Figure — Define the Parameterized Path

After your 2 input datasets have been updated, check that you can see some data in the Data Preview from the Details pane for each of the recipes.

Figure — Recipe Data Preview

If you cannot see any data, verify that your path is correct and that you can see data in the dataset.

You can also edit the 2 recipes to check input dataset and research what could have gone wrong.

We won’t explain these recipes in detail here, but you can edit and modify them to tune them to your needs.

Update the DQ Flow with BigQuery outputs tables

Finally you need to update the outputs for the 2 recipes to ingest (in Append mode) the Profiling and Data Quality Rules results into 2 BigQuery tables. These 2 BigQuery tables will be automatically created by Dataprep the first time you run the job, then the next run will just append new data in these 2 tables. This way you will keep track of the data quality history.

Figure — Output Destinations in BigQuery tables

Checkpoint: You have successfully imported the data quality flow, customized it for your needs with proper file paths, and run the 2 DQ jobs to populate the BigQuery tables.

You can finally run the job to produce these 2 outputs, and check that your 2 BigQuery tables are populated with Profiling and Data Quality results.

3. Create a Cloud Dataprep Flow and Configure a Webhook

Figure — Calling the DQ Flow with a Webhook when Sales DWH job finishes

Ok, we’re almost done — but not quite. We now need to create the Sales DWH flow available here as an example, which is what we’ll use to monitor our data quality. You may prefer to use one of your existing flows in your own Dataprep project. Within the Flow you have selected to monitor the datasets data quality, we will call the above-mentioned Dataprep DQ Flows to publish the Profiling and Data Quality Results into your Google BigQuery tables.

Calling the Data Quality flow will be done thanks to a Webhook (an external task call — if not familiar with Webhooks, you can read the documentation here) notification letting you define outgoing HTTP messages to any REST API ().

Let’s see what it looks like. Here is how to configure a Webhook task in your flow that will call your “Profiling & Quality Rules Processing” Flow and run the two jobs.

Figure — Creating a Cloud Dataprep flow and configure a Webhook task on a flow

The Webhook task needs to be configured with this information:

<url> : It’s the Cloud Dataprep API endpoint to run the “Profiling & Quality Rules Processing” Flow.

The default Cloud Dataprep API endpoint to run a Flow is https://api.clouddataprep.com/v4/flows/<flow_id>/run

Where <flow_id> is the ID of the “Profiling & Quality Rules Processing” Flow you have retrieved from the previous step.

Headers like showed in the screenshot below with <content-type> and the value <application/json>, <authorization> and the value <Bearer access_token>

The Dataprep <access_token> is used to call the Dataprep API, see “Getting Started” chapter.

Body with the below value as shown in the below screenshot

{
“runParameters”: {
“overrides”: {
“data”:[
{“key”:”user”,”value”:”vcoustenoble@trifacta.com”},
{“key”:”jobid”,”value”:”$jobId”},
{“key”:”output_dataset”,”value”: “Advertising_Clickstream”}
]
}}}

Note that for the <user> you need to put the GCP’s user account (email) running the Dataprep job and for <output_dataset> the name of the output object defined in your current flow.

Trigger event, you can decide to trigger the Webhook only when the job is successful.
Trigger object, you have to trigger the Webhook for only a specific output in the flow (the one you put in the Body parameter).

When you have entered this information you can test your Webhook task.

Figure — Webhook task parameters to call the “Profiling & Quality Rules Processing” Flow

After you save the Webhook task, then it’s ready to be called when the job is executed.

Figure — Webhook task created

Don’t forget to check “Profile results” in Publishing settings of your output. Without checking this, the profiling results will not be generated and it will defeat the whole purpose of the solution.

Figure — Profile result checked

Checkpoint: You have successfully configured a Webhook on the Sales DWH Flow that triggers when the job has finished, the webhook triggers the DQ Flow to populate BigQuery DQ Tables.

You can configure a Webhook for each of outputs of your DWH flows if you want to get and monitor Profiling and Data Quality rules of all your outputs.

4. Testing the end-to-end process

You are now ready to test the end-to-end process by running a job from your Dataprep job and see the Data Quality in your DataStudio report.

From the DW Flow, run a job (by clicking “Run Job” button) on the Advertising_Clickstream output for example.

Figure — Run a Dataprep job

When the “Advertising_Clickstream” job is complete, you can check in Jobs result page, from the “Webhooks” tab of the job, that the Webhook have been well triggered:

Figure — Dataprep job result and Webhook task completed

You can also check in Job results page, that the 2 jobs “Profiler Rules” and “Profiler Check” have been well started:

Figure — Dataprep jobs running from the “Profiling & Quality Rules Processing” Flow

And when these 2 “Profiler Rules” and “Profiler Check” jobs are complete, you can check into your 2 BigQuery tables, than Profiling and DQ Rules results have been well inserted in the tables:

Figure — Data Quality rules results in the Google BigQuery table

Finally, you can compare Data Quality rules in Dataprep UI with Data Quality rules in your Data Studio dashboard. It should be the same values.

Figure — Data Quality Rules result displayed in the Dataprep UI

Figure — Data Quality rules result displayed in the Google Data Studio dashboard

Conclusion

By achieving this step-by-step guide, you can now publish Dataprep Profiling and Data Quality Rules results to Google BigQuery tables, which further allows you to create a Data Studio dashboard that monitors Data Quality trends and displays the information to your broader team.

You have learned some Cloud Dataprep advanced techniques about:

How to leverage Cloud Dataprep Profiling and Data Quality Results
Cloud Dataprep Webhook
Cloud Dataprep API

You can extend the solution to monitor additional Google Cloud services for an end-to-end data quality pipeline monitoring.

You are now ready to automate the monitoring of your entire Data Quality pipelines with Cloud Dataprep.