Monitor your BigQuery Data Warehouse Dataprep Pipeline with Data Studio
If you manage your data analytics pipeline on a modern data stack such as Google Cloud, with BigQuery as your data warehouse or data lake, you probably want to monitor it and get a comprehensive view of the end-to-end analytics process, so you can react quickly when something breaks, or simply enjoy peace of mind when everything works properly.
In a previous article, I demonstrated how to monitor your Dataprep jobs with a simple Google Sheet, but why not take it to the next level with a beautiful, actionable dashboard in Data Studio?
This article explains how to capture Cloud Dataprep job statuses and details via APIs, leveraging Cloud Functions. We then load this information into a Google BigQuery table and monitor the jobs from a Data Studio report connected to that table.
To illustrate this concept, we will assume you want to monitor a daily scheduled Dataprep job and, with a quick look at a Data Studio dashboard, get an overview of potential failures. You will be able to check the output dataset name, the job duration, the user name, and the Dataflow job id, and drill into Dataprep job details and profile results directly in the Cloud Dataprep console.
This article is a step-by-step guide through the process of triggering a Cloud Function when a Cloud Dataprep job finishes, and publishing the job results, status, and direct links to a Google BigQuery table for reporting in Google Data Studio.
Here is an example of a Google Data Studio report with Cloud Dataprep job results monitoring.
Here is an overview of the entire solution to capture Dataprep job results and load them to BigQuery thanks to Cloud Functions and APIs for the automation and Data Studio for reporting.
1. Getting Started
To make this guide practical, we share the Python code for the Cloud Function here on GitHub and the Data Studio dashboard here.
You need a valid Google account and access to Cloud Dataprep, Cloud Functions and Google BigQuery to try it out. You can start from the Google Console https://console.cloud.google.com/ to activate the various services.
REMARK: To call the APIs, you need an Access Token, which you can generate from your Preferences page.
2. Create the BigQuery table to store job status and details
In a BigQuery dataset, create the following table schema to store all job information:
CREATE TABLE `default.dataprep_jobs`
(
job_run_date DATETIME,
job_id INT64,
output_name STRING,
job_status STRING,
job_url STRING,
user STRING,
dataflow_job_id STRING,
duration STRING
)
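If you prefer to create the table from code rather than the BigQuery console, a minimal sketch is shown below. It runs the same DDL through the `google-cloud-bigquery` client library; the dataset name `default` matches the statement above, so adjust it to your own dataset.

```python
# Sketch: create the dataprep_jobs table programmatically.
# The DDL mirrors the CREATE TABLE statement above; IF NOT EXISTS makes it
# safe to re-run.
DDL = """
CREATE TABLE IF NOT EXISTS `default.dataprep_jobs`
(
  job_run_date DATETIME,
  job_id INT64,
  output_name STRING,
  job_status STRING,
  job_url STRING,
  user STRING,
  dataflow_job_id STRING,
  duration STRING
)
"""

def create_table():
    # Imported inside the function so the snippet loads even where the
    # google-cloud-bigquery library is not installed.
    from google.cloud import bigquery
    client = bigquery.Client()
    client.query(DDL).result()  # blocks until the DDL job finishes
```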
3. Create the HTTP Cloud Function to publish in the Google BigQuery table
First, we need to create the HTTP Cloud Function that will be triggered as a Webhook when a Dataprep job has finished.
Create a Cloud Function from the Google Cloud console here.
The trigger type must be “HTTP”.
Give it a name and you will get a URL similar to https://us-central1-dataprep-premium-demo.cloudfunctions.net/Job-Result-Google-BigQuery.
Later on, we will need this URL while creating the Webhook in Dataprep.
In our example, we use Python as the Runtime and the code provided above in the Source Code section.
If you need to explore more about Cloud Functions, I recommend you look through this tutorial.
The Cloud Function code follows this logic:
- Retrieve the Job id and status (Failed or Complete) provided by Dataprep.
- Leverage the Dataprep user Access Token in the API call in order to be authenticated by Cloud Dataprep.
- Get more information (output dataset name, user name, duration, Dataflow job id) about the job with the getJobGroup Dataprep API call. Documentation on this Dataprep API endpoint can be found here: https://clouddataprep.com/documentation/api/#operation/getJobGroup
- Insert the job details in the Google BigQuery table (the job result page URL is https://clouddataprep.com/jobs/<jobID>).
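The logic above can be sketched as follows. This is a simplified outline, not the full code from GitHub: the endpoint URL and the field names read from the getJobGroup response (`name`, `createdBy`, `dataflowJobId`, `duration`) are assumptions to be checked against the API documentation, and the access token is a placeholder.

```python
import json
import urllib.request
from datetime import datetime

# Placeholders -- substitute your own Access Token; the endpoint follows the
# getJobGroup documentation linked above.
DATAPREP_AUTH_TOKEN = "<your-access-token>"
JOBGROUPS_ENDPOINT = "https://api.clouddataprep.com/v4/jobGroups/"

def get_job_details(job_id):
    """Fetch extra metadata for a finished job via the getJobGroup endpoint."""
    req = urllib.request.Request(
        JOBGROUPS_ENDPOINT + str(job_id),
        headers={"Authorization": "Bearer " + DATAPREP_AUTH_TOKEN},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def build_row(job_id, job_status, details):
    """Shape one row matching the dataprep_jobs schema.

    The keys read from `details` are assumptions about the getJobGroup
    response; verify them against the API reference.
    """
    return {
        "job_run_date": datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S"),
        "job_id": int(job_id),
        "output_name": details.get("name", ""),
        "job_status": job_status,
        "job_url": "https://clouddataprep.com/jobs/{}".format(job_id),
        "user": details.get("createdBy", ""),
        "dataflow_job_id": details.get("dataflowJobId", ""),
        "duration": details.get("duration", ""),
    }

def job_result(request):
    """HTTP Cloud Function entry point, called by the Dataprep webhook."""
    payload = request.get_json()
    row = build_row(payload["jobid"], payload["jobstatus"],
                    get_job_details(payload["jobid"]))
    # Imported here so the module loads even without the library installed.
    from google.cloud import bigquery
    client = bigquery.Client()
    errors = client.insert_rows_json("default.dataprep_jobs", [row])
    return "OK" if not errors else json.dumps(errors)
```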
The full Python code is on GitHub.
You need to edit the code and replace the highlighted values in bold with the proper ones retrieved from your Cloud Dataprep project.
- Access Token to call Dataprep API:
dataprep_auth_token = "eyJhdjkghd……ezhjkdfsghk"
- Google BigQuery dataset and table name:
# Prepares a reference to the dataset
dataset_ref = bigquery_client.dataset('default')
table_ref = dataset_ref.table('dataprep_jobs')
You also need to add the following dependencies to your Python Cloud Function (see the requirements.txt tab shown in the screenshot below):
requests==2.24.0
google-cloud-bigquery
datetime
You then need to deploy the Cloud Function. Once deployed, the Cloud Function is up and running, waiting to be called from Cloud Dataprep when a job is executed. You can learn more here about deploying and executing Cloud Functions.
4. Create a Cloud Dataprep flow and Configure a Webhook
Next, you need to create the Cloud Dataprep flow that will call the HTTP Cloud Function to publish the job result and details into your Google BigQuery table.
Then create and configure a Webhook task in your flow that will call your HTTP Cloud Function.
The Webhook task needs to be configured with this information:
- URL: the URL of the HTTP Cloud Function you previously created. For example https://us-central1-dataprep-premium-demo.cloudfunctions.net/Dataprep-Webhook-Function
- Headers: as shown in the screenshot below, with content-type and application/json
- Body: the value {"jobid":"$jobId","jobstatus":"$jobStatus"}, as shown in the screenshot below
- Trigger event: you can trigger the Webhook for any status, or only for failed or completed jobs.
- Trigger object: you can trigger the Webhook only for specific outputs in the flow, or for any job executed in the flow.
When you have entered this information, you can test your Webhook task.
After you save the Webhook task it is then ready to be called when the job is executed.
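You can also exercise the Cloud Function directly, without waiting for a Dataprep run, by sending it the same POST body the webhook would send. A minimal sketch, assuming a hypothetical function URL and sample job values (the request line is left commented out so nothing is sent until you deploy):

```python
import json
import urllib.request

# Hypothetical Cloud Function URL -- substitute the one from your deployment.
FUNCTION_URL = ("https://us-central1-your-project.cloudfunctions.net/"
                "Dataprep-Webhook-Function")

# The same body the Dataprep webhook sends: $jobId and $jobStatus are filled
# in by Dataprep at run time; here we hardcode sample values.
payload = json.dumps({"jobid": "1234567", "jobstatus": "Complete"}).encode("utf-8")

req = urllib.request.Request(
    FUNCTION_URL,
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment to send the request once the function is deployed:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode())
```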
5. Testing the end-to-end process
You are now ready to test the end-to-end process by running a job from your Dataprep flow and watching the job status and details get added to your BigQuery table.
Lastly, you can also check the execution details (the API call with its parameters and the Cloud Dataprep job status) by reviewing the Google Cloud Functions logs located here.
Conclusion
By following this step-by-step guide, you have learned the fundamental principles for automatically publishing Dataprep job results in a Google BigQuery table, so you can easily monitor and share summary information with a broader team through an attractive Data Studio dashboard.
You have learned about:
- Cloud Dataprep API
- Cloud Dataprep Webhook
- HTTP Cloud Functions calling an API
You can also extend this solution to monitor additional Google Cloud services for an end-to-end data pipeline monitoring.
You are now ready to automate the monitoring of your job status.
You can also automate Cloud Dataprep leveraging another Cloud Function or an external scheduler. Take a look at these articles explaining how to orchestrate Cloud Dataprep jobs using Cloud Composer and how to automate a Cloud Dataprep pipeline when a file arrives in Cloud Storage.