Update Google Data Catalog Tags with Cloud Dataprep Metadata
This post accompanies the GitHub repository https://github.com/victorcouste/google-data-catalog-dataprep, which explains how to create or update Google Cloud Data Catalog tags on BigQuery tables with Cloud Dataprep metadata and column profiles via a Python Cloud Function.
The 2 Data Catalog tags created or updated:
- Dataprep Job Metadata tag, attached to the BigQuery table and containing information from the Dataprep job used to create or update the BigQuery table: the user, Dataprep Job (id, name, url, timestamp), Dataprep Dataset (id, name, url), Dataprep Flow (id, name, url), Job Profile (url and number of valid, invalid and empty values) and the Dataflow job (id, url).
- Dataprep Job Column’s Profile tag, attached to all BigQuery table columns and containing the number of valid, invalid and empty values for each column.
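To make the column-level tag more concrete, here is a minimal sketch of the JSON body such a tag could have when sent to the Data Catalog REST API. The `template`, `column` and `fields` keys follow the REST `Tag` resource; the field IDs (`valid_values`, `invalid_values`, `empty_values`), the location and the template ID are hypothetical placeholders, not necessarily the names used in the repository.

```python
def column_profile_tag(project, location, template_id, column, valid, invalid, empty):
    """Build a REST-style Tag body attached to one BigQuery column.

    Field IDs below are illustrative assumptions; use the IDs defined
    in your own tag template.
    """
    template = f"projects/{project}/locations/{location}/tagTemplates/{template_id}"
    return {
        "template": template,   # full resource name of the tag template
        "column": column,       # attaches the tag to this column, not the whole table
        "fields": {
            "valid_values": {"doubleValue": float(valid)},
            "invalid_values": {"doubleValue": float(invalid)},
            "empty_values": {"doubleValue": float(empty)},
        },
    }


tag = column_profile_tag(
    "my-project", "us-central1", "dataprep_column_profile", "price", 98, 1, 1
)
```

Leaving out the `column` key would attach the same kind of tag to the table itself, which is how the Job Metadata tag differs from the Column’s Profile tag.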
To enable, learn about and use Cloud Data Catalog, go to https://cloud.google.com/data-catalog and https://console.cloud.google.com/datacatalog.
The GitHub repository contains the Python code of the Cloud Function, which is triggered from a Dataprep Webhook to create or update the 2 Data Catalog tags.
This Cloud Function uses:
In your Cloud Function, you need these 5 files:
- main.py
- config.py, where you need to update your GCP project name (where the Tag Templates are created) and the Dataprep Access Token (to use the Dataprep API). You can also update the 2 tag template IDs if needed.
- datacatalog_functions.py to get or update Data Catalog objects.
- dataprep_metadata.py to get Cloud Dataprep metadata.
- requirements.txt
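As an illustration of the kind of settings config.py holds, here is a small sketch. Every name in it (`GCP_PROJECT`, `DATAPREP_ACCESS_TOKEN`, the two template ID defaults) is a hypothetical placeholder, and it reads values from environment variables instead of hard-coding them in the file, which is a variation on what the repository does, not a copy of it.

```python
import os


def load_config(env=os.environ):
    """Collect the settings the function needs (all names are assumptions)."""
    return {
        "project": env.get("GCP_PROJECT", ""),
        "dataprep_token": env.get("DATAPREP_ACCESS_TOKEN", ""),
        # Placeholder template IDs; replace with the IDs of your own templates.
        "job_template_id": env.get("JOB_TEMPLATE_ID", "dataprep_metadata"),
        "column_template_id": env.get("COLUMN_TEMPLATE_ID", "dataprep_column_profile"),
    }


def missing_settings(config):
    """Return the names of required settings that are still empty."""
    return [key for key in ("project", "dataprep_token") if not config[key]]


config = load_config({"GCP_PROJECT": "my-project"})
problems = missing_settings(config)
```

Checking for missing settings at startup gives a clear error before any Dataprep or Data Catalog call is attempted.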
Before running the Cloud Function (and creating or updating tags), you need to create the 2 Data Catalog Tag Templates for Dataprep (Job Metadata and Job Column Profile).
For this action, you can use:
- Cloud Console where you can manage your Tag Templates.
- gcloud, with the command gcloud data-catalog tag-templates create. You can find the full command line in gcloud_tag-templates_create.sh, and more details in the GCP documentation, with an example and a reference. Be aware that with the gcloud command line you cannot control the order of the tag template fields: they will be in alphabetical order.
- REST API, with the 2 tag template JSON files dataprep_metadata_tag_template.json and dataprep_column_profile_tag_template.json. The GCP documentation explains how to use the REST API, with an example and a reference.
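Since field order is the main reason to prefer the REST API here, the following sketch shows how a tag template body with an explicit `order` per field could be assembled. The `displayName`, `fields`, `type.primitiveType` and `order` keys follow the REST `TagTemplate` resource; the field IDs and labels are hypothetical, and the sketch assumes that fields with a higher `order` value are meant to come first.

```python
import json


def tag_template_body(display_name, fields):
    """Build a REST-style TagTemplate body.

    `fields` is a list of (field_id, label, primitive_type) tuples given in
    the order you want them displayed. Earlier fields get a higher `order`
    value (an assumption about how the order is interpreted).
    """
    count = len(fields)
    return {
        "displayName": display_name,
        "fields": {
            field_id: {
                "displayName": label,
                "type": {"primitiveType": primitive_type},
                "order": count - position,
            }
            for position, (field_id, label, primitive_type) in enumerate(fields)
        },
    }


body = tag_template_body(
    "Dataprep Job Column Profile",
    [
        ("valid_values", "Valid values", "DOUBLE"),      # hypothetical field IDs
        ("invalid_values", "Invalid values", "DOUBLE"),
        ("empty_values", "Empty values", "DOUBLE"),
    ],
)
print(json.dumps(body, indent=2))
```

The two JSON files in the repository remain the reference for the actual templates; this only shows where the ordering information lives.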
Then, once the Cloud Function has been created, you just have to pass it the Dataprep Job ID in JSON format, like {"job_id":"7827359"}.
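On the receiving side, an HTTP Cloud Function in Python gets a Flask request object, so reading the job ID could look like the sketch below. The function names `extract_job_id` and `handle_request` are hypothetical, and the place where the repository's tagging logic would run is only marked with a comment.

```python
import json


def extract_job_id(payload):
    """Return the Dataprep job ID from a decoded JSON body, or raise ValueError."""
    if not isinstance(payload, dict) or "job_id" not in payload:
        raise ValueError('expected a JSON body like {"job_id": "7827359"}')
    job_id = str(payload["job_id"]).strip()
    if not job_id:
        raise ValueError("job_id is empty")
    return job_id


def handle_request(request):
    """Hypothetical HTTP entry point; `request` is the Flask request object."""
    try:
        job_id = extract_job_id(request.get_json(silent=True))
    except ValueError as err:
        return (str(err), 400)
    # The calls that read Dataprep metadata and update the tags would go here.
    return (json.dumps({"job_id": job_id}), 200)
```

Returning a 400 with a short message makes a badly formed Webhook body easy to spot in the Dataprep Webhook logs.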
To trigger it from a Cloud Dataprep flow, you can use a Webhook on the Cloud Function endpoint with {"job_id":"$jobId"} in the POST body.
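To test the function without running a flow, you can build the same POST request by hand. This is a minimal sketch using only the standard library; the endpoint URL is a made-up placeholder, and it assumes the function accepts unauthenticated calls (otherwise an authorization header would be needed).

```python
import json
import urllib.request


def build_trigger_request(function_url, job_id):
    """Build the POST request a Dataprep Webhook would send for this job."""
    body = json.dumps({"job_id": str(job_id)}).encode("utf-8")
    return urllib.request.Request(
        function_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL: replace it with your own Cloud Function endpoint.
req = build_trigger_request(
    "https://us-central1-my-project.cloudfunctions.net/dataprep-tags", 7827359
)
# To actually send it: urllib.request.urlopen(req)
```

In the Webhook itself, Dataprep replaces $jobId with the ID of the job that just ran, so the body the function receives has the same shape as the one built here.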
Once the Data Catalog tag templates exist and tags have been created or updated on BigQuery tables, you can find all the results in the GCP console at https://console.cloud.google.com/datacatalog.
Finally, you can also search Cloud Data Catalog for BigQuery tables that have a Dataprep tag from your own application, as in https://github.com/victorcouste/dataprep-datacatalog-explorer
Happy wrangling and happy tagging!