Monitor your BigQuery Data Warehouse Dataprep Pipeline with Data Studio

Victor Coustenoble
Google Cloud - Community
6 min read · Dec 18, 2020


If you manage your data analytics pipeline on a modern data stack such as Google Cloud, with BigQuery as your data warehouse or data lake, you probably want to monitor it and get a comprehensive view of the end-to-end analytics process, so you can react quickly when something breaks or simply have peace of mind when everything works properly.

In a previous article, I demonstrated how to monitor your Dataprep jobs with a simple Google Sheet, but why not take it to the next level with a beautiful, actionable dashboard in Data Studio?

This article explains how to capture Cloud Dataprep job statuses and details via APIs, leveraging Cloud Functions. We then load this information into a Google BigQuery table so jobs can be monitored from a Data Studio report connected to that table.

To illustrate this concept, we will assume you want to monitor a daily scheduled Dataprep job and, with a quick look at a Data Studio dashboard, get an overview of any potential failures. You will be able to check the output dataset name, the job duration, the user name, and the Dataflow job id, and drill into the Dataprep job details and profile results directly in the Cloud Dataprep console.

This article is a step-by-step guide through the process of triggering a Cloud Function when a Cloud Dataprep job finishes and publishing the job results, status, and direct links into a Google BigQuery table for reporting in Google Data Studio.

Here is an example of a Google Data Studio report monitoring Cloud Dataprep job results.

Fig. 1 — Google Data Studio report with Dataprep Job results

Here is an overview of the entire solution, which captures Dataprep job results and loads them into BigQuery using Cloud Functions and APIs for automation, with Data Studio for reporting.

Fig. 2 — High-level process to trigger a Cloud Function based on a Cloud Dataprep job execution

1. Getting Started

To make this guide practical, we are sharing the Python code for the Cloud Function here on GitHub and the Data Studio dashboard here.

You need a valid Google account and access to Cloud Dataprep, Cloud Functions, and Google BigQuery to try it out. You can start from the Google Cloud Console at https://console.cloud.google.com/ to activate the various services.

REMARK: To call the Dataprep APIs, you need an Access Token, which you can generate from your Preferences page.

Fig. 3 — Get the Access Token from the Settings menu
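To quickly verify that your token works, you can call the Dataprep API with it from Python. This is a minimal sketch, assuming the standard api.clouddataprep.com/v4 base URL referenced in the API documentation; replace the placeholder with your own token:

import requests

token = "<your-dataprep-access-token>"  # placeholder, paste your Access Token here
response = requests.get(
    "https://api.clouddataprep.com/v4/jobGroups",  # lists your job groups
    headers={"Authorization": "Bearer " + token},
)
print(response.status_code)  # 200 means the token is accepted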

2. Create the BigQuery table to store job status and details

In a BigQuery dataset, create a table with the following schema to store all the job information:

CREATE TABLE `default.dataprep_jobs`
(
  job_run_date DATETIME,
  job_id INT64,
  output_name STRING,
  job_status STRING,
  job_url STRING,
  user STRING,
  dataflow_job_id STRING,
  duration STRING
)

Fig. 4 — The BigQuery table schema for your Dataprep jobs
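If you prefer to create the table from code instead of the BigQuery console, the same DDL can be submitted with the BigQuery Python client. This is a sketch that assumes a dataset named default already exists in your default project:

from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials
ddl = """
CREATE TABLE `default.dataprep_jobs`
(
  job_run_date DATETIME,
  job_id INT64,
  output_name STRING,
  job_status STRING,
  job_url STRING,
  user STRING,
  dataflow_job_id STRING,
  duration STRING
)
"""
client.query(ddl).result()  # waits for the DDL statement to complete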

3. Create the HTTP Cloud Function to publish in the Google BigQuery table

First, we need to create the HTTP Cloud Function that will be triggered as a Webhook when a Dataprep job has finished.

Create a Cloud Function from the Google Cloud console here.

The trigger type must be “HTTP”.

Give it a name and get a URL similar to https://us-central1-dataprep-premium-demo.cloudfunctions.net/Job-Result-Google-BigQuery.

Later on, we will need this URL while creating the Webhook in Dataprep.

In our example, we select Python as the Runtime and use the code shared above (on GitHub) in the Source code section.

Fig. 5 — Create the Cloud Function to be called from Dataprep

If you need to explore more about Cloud Functions, I recommend you look through this tutorial.

The Cloud Function code follows this logic:

  1. Retrieve the Job id and status (Failed or Complete) provided by Dataprep.
  2. Leverage the Dataprep user Access Token in the API call in order to be authenticated by Cloud Dataprep.
  3. Get more information about the job (output dataset name, user name, duration, Dataflow job id) with the getJobGroup Dataprep API call. Documentation on this API endpoint can be found at https://clouddataprep.com/documentation/api/#operation/getJobGroup
  4. Insert the job details into the Google BigQuery table (the job result page URL is https://clouddataprep.com/jobs/<jobID>).
Fig. 6 — Python code to get job details and insert job result status in a Google BigQuery table

The full Python code is on GitHub; a condensed sketch of its logic is shown below.
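This sketch only shows the overall shape of the function. The API endpoint, entry point name, and response field handling are assumptions; refer to the GitHub code for the exact implementation:

import datetime
import requests
from google.cloud import bigquery

dataprep_auth_token = "<your-dataprep-access-token>"  # placeholder

def dataprep_job_to_bigquery(request):
    # 1. Retrieve the job id and status posted by the Dataprep Webhook.
    payload = request.get_json()
    job_id = payload["jobid"]
    job_status = payload["jobstatus"]

    # 2. and 3. Call the getJobGroup API, authenticated with the Access Token,
    # to retrieve the output name, user name, duration and Dataflow job id.
    details = requests.get(
        "https://api.clouddataprep.com/v4/jobGroups/" + str(job_id),
        headers={"Authorization": "Bearer " + dataprep_auth_token},
    ).json()

    # 4. Insert one row per job run into the monitoring table.
    client = bigquery.Client()
    table = client.get_table("default.dataprep_jobs")
    row = {
        "job_run_date": datetime.datetime.now().isoformat(),
        "job_id": int(job_id),
        "job_status": job_status,
        "job_url": "https://clouddataprep.com/jobs/" + str(job_id),
        # output_name, user, dataflow_job_id and duration come from `details`;
        # the exact response fields are documented in the getJobGroup reference.
    }
    client.insert_rows_json(table, [row])
    return "Job status published to BigQuery"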

You need to edit and replace the highlighted values in bold with the proper values retrieved from your own Cloud Dataprep project.

  • Access Token to call Dataprep API:

dataprep_auth_token = "eyJhdjkghd……ezhjkdfsghk"

  • Google BigQuery dataset and table name:

# Prepares a reference to the dataset
dataset_ref = bigquery_client.dataset('default')
table_ref = dataset_ref.table('dataprep_jobs')
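For context, here is roughly how those references are then used to append one row per job run. The variables shown (output_name, user, and so on) stand for the values gathered earlier in the function:

# Illustrative continuation (the exact code is in the GitHub repository)
table = bigquery_client.get_table(table_ref)  # fetches the table and its schema
rows_to_insert = [(
    datetime.datetime.now(),  # job_run_date
    job_id,                   # job_id
    output_name,              # output_name
    job_status,               # job_status
    job_url,                  # job_url
    user,                     # user
    dataflow_job_id,          # dataflow_job_id
    duration,                 # duration
)]
bigquery_client.insert_rows(table, rows_to_insert)  # streams the row into BigQuery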

You also need to add the following dependencies to your Python Cloud Function (see the requirements.txt tab as in the screenshot below):

requests==2.24.0
google-cloud-bigquery
datetime

Fig. 7 — Python dependency packages

You then need to deploy the Cloud Function. Once deployed, it is up and running, waiting to be called from Cloud Dataprep when a job is executed. You can learn more about deploying and executing Cloud Functions here.
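Once deployed, you can sanity-check the function before wiring it to Dataprep by posting a test payload yourself, for example from Python. The function URL is a placeholder, and you should use the id of a real, previously executed Dataprep job if you want the end-to-end insert to succeed:

import requests

response = requests.post(
    "https://us-central1-<your-project>.cloudfunctions.net/Job-Result-Google-BigQuery",
    json={"jobid": "12345", "jobstatus": "Complete"},  # placeholder job id and status
)
print(response.status_code, response.text)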

4. Create a Cloud Dataprep flow and Configure a Webhook

Next, you need to create the Cloud Dataprep flow that will call the HTTP Cloud Function to publish the job result and details into your Google BigQuery table.

You then need to create and configure a Webhook task in your flow that will call your HTTP Cloud Function.

Fig. 8 — Creating a Cloud Dataprep flow and configuring a Webhook task on the flow

The Webhook task needs to be configured with this information:

  • URL: the URL of the HTTP Cloud Function you previously created, for example https://us-central1-dataprep-premium-demo.cloudfunctions.net/Dataprep-Webhook-Function
  • Headers: as shown in the screenshot below, with content-type set to application/json
  • Body: the value {"jobid":"$jobId","jobstatus":"$jobStatus"}, as shown in the screenshot below
  • Trigger event: you can decide to trigger the Webhook for any status, or only for failed or completed jobs
  • Trigger object: you can decide to trigger the Webhook only for specific outputs in the flow, or for any job executed in the flow

When you have entered this information, you can test your Webhook task.

Fig. 9 — Webhook task parameters to call the Cloud Function

After you save the Webhook task, it is ready to be called when the job is executed.

Fig. 10 — Webhook task created

5. Testing the end-to-end process

You are now ready to test the end-to-end process by running a job from your Dataprep flow and seeing the job status and details added to your BigQuery table.

Fig. 11 — Run a Dataprep job
Fig. 12 — Dataprep job result and Webhook task completed
Fig. 13 — Job result status and details published in the Google BigQuery table
Fig. 14 — Job result status and details displayed in the Google Data Studio dashboard

Lastly, you can also check the execution details (the API call with its parameters and the Cloud Dataprep job status) by reviewing the Google Cloud Functions logs located here.

Fig. 15 — Cloud Functions logs

Conclusion

By following this step-by-step guide, you have the fundamental building blocks in place to automatically publish Dataprep job results to a Google BigQuery table, so you can easily monitor and share summary information with a broader team through an attractive Data Studio dashboard.

You have learned about:

  • Creating a BigQuery table to store Dataprep job results and details
  • Building an HTTP Cloud Function that calls the getJobGroup Dataprep API and inserts the job details into BigQuery
  • Configuring a Dataprep Webhook task to trigger the Cloud Function when a job finishes
  • Reporting on the resulting table in a Google Data Studio dashboard

You can also extend this solution to monitor additional Google Cloud services for end-to-end data pipeline monitoring.

You are now ready to automate the monitoring of your job status.

You can also automate Cloud Dataprep job execution by leveraging another Cloud Function or an external scheduler. Take a look at these articles explaining how to orchestrate Cloud Dataprep jobs using Cloud Composer and how to automate a Cloud Dataprep pipeline when a file arrives in Cloud Storage.
