Update Google Data Catalog Tags with Cloud Dataprep Metadata

Victor Coustenoble
Google Cloud - Community
Feb 23, 2021

This post accompanies the GitHub repository https://github.com/victorcouste/google-data-catalog-dataprep, which explains how to create or update Google Cloud Data Catalog tags on BigQuery tables with Cloud Dataprep metadata and column profiles via a Python Cloud Function.

The 2 Data Catalog tags created or updated:

  • Dataprep Job Metadata tag, attached to the BigQuery table and containing information from the Dataprep job used to create or update that table: the user, the Dataprep job (id, name, url, timestamp), the Dataprep dataset (id, name, url), the Dataprep flow (id, name, url), the job profile (url and number of valid, invalid and empty values) and the Dataflow job (id, url).
Example of a Cloud Dataprep Metadata Tag in Data Catalog
Example of a Cloud Dataprep Column Profile Tag in Data Catalog

To enable Cloud Data Catalog, learn about it and start using it, go to https://cloud.google.com/data-catalog and https://console.cloud.google.com/datacatalog.

The GitHub repository contains the Python code of the Cloud Function, which is triggered from a Dataprep Webhook to create or update the 2 Data Catalog tags.

This Cloud Function uses:

In your Cloud Function, you need the 5 files:

Before running the Cloud Function (and creating or updating tags), you need to create the 2 Data Catalog tag templates for Dataprep (Job Metadata and Job Column Profile).

Cloud Dataprep Metadata Tag Template
Cloud Dataprep Column Profile Tag Template
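As an illustration of what creating such a template can look like in code, here is a minimal sketch using the `google-cloud-datacatalog` Python client. The field IDs, labels and types in `JOB_METADATA_FIELDS` are hypothetical, loosely derived from the job metadata listed earlier in this post; the templates in the repository may differ. The project, location and template ID in the usage note are placeholders as well.

```python
# Hypothetical field list: field_id -> (display name, primitive type name).
# These IDs are illustrative only and not taken from the repository.
JOB_METADATA_FIELDS = {
    "user": ("User", "STRING"),
    "job_id": ("Dataprep Job ID", "STRING"),
    "job_url": ("Dataprep Job URL", "STRING"),
    "job_timestamp": ("Dataprep Job Timestamp", "TIMESTAMP"),
    "valid_values": ("Valid values", "DOUBLE"),
}


def template_parent(project_id, location):
    """Build the resource name under which the tag template is created."""
    return f"projects/{project_id}/locations/{location}"


def create_tag_template(project_id, location, template_id, display_name, fields):
    """Create a Data Catalog tag template from a simple field description."""
    # Imported here so the module loads without the client library installed.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    template = datacatalog_v1.TagTemplate()
    template.display_name = display_name

    for field_id, (label, type_name) in fields.items():
        field = datacatalog_v1.TagTemplateField()
        field.display_name = label
        # Map the type name ("STRING", "DOUBLE", ...) to the client enum.
        field.type_.primitive_type = getattr(
            datacatalog_v1.FieldType.PrimitiveType, type_name
        )
        template.fields[field_id] = field

    return client.create_tag_template(
        parent=template_parent(project_id, location),
        tag_template_id=template_id,
        tag_template=template,
    )
```

With valid credentials, a call such as `create_tag_template("my-project", "us-central1", "dataprep_job_metadata", "Dataprep Job Metadata", JOB_METADATA_FIELDS)` would create the template; all of these argument values are placeholders.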

For this action, you can use:

Then, once the Cloud Function has been created, you only need to pass it the Dataprep job ID in JSON format, for example {"job_id":"7827359"}.
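For a quick manual test outside Dataprep, the same JSON body can be sent from a small Python script. The sketch below uses only the standard library; `FUNCTION_URL` is a placeholder for your own Cloud Function trigger URL, and the script assumes the function accepts unauthenticated HTTP calls (otherwise an identity token header would have to be added).

```python
import json
import urllib.request

# Placeholder: replace with the trigger URL of your own Cloud Function.
FUNCTION_URL = "https://REGION-PROJECT_ID.cloudfunctions.net/FUNCTION_NAME"


def build_request(job_id, url=FUNCTION_URL):
    """Build a POST request whose JSON body carries the Dataprep job ID."""
    body = json.dumps({"job_id": str(job_id)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_function(job_id, url=FUNCTION_URL):
    """Send the request and return the HTTP status code and response text."""
    with urllib.request.urlopen(build_request(job_id, url), timeout=60) as resp:
        return resp.status, resp.read().decode("utf-8")
```

Calling `call_function("7827359")` with a real endpoint should then trigger the tag creation or update for that job.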

To trigger it from a Cloud Dataprep flow, add a Webhook that calls the Cloud Function endpoint with {"job_id":"$jobId"} in the POST body.

Cloud Dataprep Webhook to call the Data Catalog Cloud Function
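On the receiving side, an HTTP-triggered Python Cloud Function gets a Flask request object, so reading the webhook body could look roughly like the sketch below. This is not the repository's code: `handle_webhook` and `extract_job_id` are hypothetical names, and the actual work (reading job metadata from Dataprep and writing the Data Catalog tags) is deliberately left out.

```python
def extract_job_id(payload):
    """Return the job ID as a string, or None if the payload lacks one."""
    if not isinstance(payload, dict):
        return None
    job_id = payload.get("job_id")
    if job_id is None or str(job_id).strip() == "":
        return None
    return str(job_id).strip()


def handle_webhook(request):
    """HTTP entry point sketch: validate the body before doing any work."""
    # silent=True returns None instead of raising on a missing or invalid body.
    payload = request.get_json(silent=True)
    job_id = extract_job_id(payload)
    if job_id is None:
        return ("Missing job_id in JSON body", 400)
    # The real function would fetch the job metadata from Dataprep here and
    # write the Data Catalog tags; that part is omitted in this sketch.
    return (f"Received Dataprep job {job_id}", 200)
```

Rejecting a body without `job_id` early gives a clear 400 response in the Dataprep Webhook log instead of a failure deeper in the function.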

Once the Data Catalog tag templates exist and tags have been created or updated on BigQuery tables, you can find all of them in the GCP console at https://console.cloud.google.com/datacatalog.

Finally, you can also search Cloud Data Catalog for BigQuery tables that carry a Dataprep tag from your own application, as in https://github.com/victorcouste/dataprep-datacatalog-explorer

Happy wrangling and happy tagging!
