Google Cloud Data Catalog hands-on guide: templates & tags with Python

Ricardo Mendes
Google Cloud - Community
5 min read · Jul 4, 2019

This quickstart guide is part of a series that brings a practitioner approach to Data Catalog, a recently announced member of Google Cloud’s Data Analytics services family.

Data Catalog is a fully managed and scalable metadata management service that empowers organizations to quickly discover, understand, and manage their data in Google Cloud.

The content below shows how to use Data Catalog’s tagging feature with the Python gRPC client library. This story is a sequel to Data Catalog hands-on guide: search, get & lookup with Python, so I recommend reading that one before starting this. And if you need more conceptual context before getting into practice, please take a look at the article I wrote describing my mental model of these features.

Environment setup

The environment required to run the samples is the same described in Data Catalog hands-on guide: search, get & lookup with Python. Please refer to the Environment setup section there for further details.

Templates & Tags using Python

As we’ve already seen, Search Catalog, Lookup Entry, and Get Entry are useful tools for discovering data in Google Cloud projects. But once we know and understand such data, what else can we do? How can we take advantage of Data Catalog to better manage it?

Data Catalog allows users and automated processes to tag data in GCP projects. To understand how it works, let’s explore the Templating and Tagging features.

Template

All tags created by Data Catalog are based on templates. TagTemplate is a Data Catalog native entity that represents a metadata schema, but not the same kind of metadata handled by Entry: tag templates represent custom/user-defined metadata. Or, in other words: Entry deals with technical metadata while TagTemplate deals with business metadata.

Let’s describe a TagTemplate to classify the tables we found using search and lookup. It will have only one field for now: Has PII, a boolean. The JSON representation, according to Data Catalog’s Tag Template specification, is shown below:

{
  "name": "...",
  "displayName": "...",
  "fields": {
    "has_pii": {
      "displayName": "Has PII",
      "type": {
        "primitiveType": "BOOL"
      }
    }
  }
}

Simple, isn’t it? Now, take a look at the code to create it using the Python client library:

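from google.cloud import datacatalog

# Same client used in the previous guide.
datacatalog_client = datacatalog.DataCatalogClient()

# Replace <project-id> with your project ID.
location = 'projects/<project-id>/locations/us-central1'

tag_template = datacatalog.TagTemplate()
tag_template.display_name = 'A Tag Template to be used in the hands-on guide'
field = datacatalog.TagTemplateField()
field.display_name = 'Has PII'
field.type_.primitive_type = datacatalog.FieldType.PrimitiveType.BOOL
tag_template.fields['has_pii'] = field

# Keep the returned object: it carries the template's resource name.
tag_template = datacatalog_client.create_tag_template(
    parent=location,
    tag_template_id='quickstart_classification_template',
    tag_template=tag_template)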

Any questions? The code is intended to be self-explanatory :). We could use similar code as many times as required to create a template composed of several fields.
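To illustrate the "several fields" idea without calling the service, here is a minimal sketch that expands a compact field spec into the JSON shape shown earlier. The helper name `build_template_json`, the spec format, and the extra `data_owner` field are assumptions made for illustration only; the same loop could just as well populate `datacatalog.TagTemplateField` objects.

```python
import json


def build_template_json(display_name, primitive_fields):
    """Expand {field_id: (display_name, primitive_type)} into template JSON."""
    fields = {
        field_id: {
            'displayName': field_display_name,
            'type': {'primitiveType': primitive_type},
        }
        for field_id, (field_display_name, primitive_type)
        in primitive_fields.items()
    }
    return {'displayName': display_name, 'fields': fields}


template_json = build_template_json(
    'A Tag Template to be used in the hands-on guide',
    {
        'has_pii': ('Has PII', 'BOOL'),
        # Hypothetical second field, not part of the guide's template.
        'data_owner': ('Data owner', 'STRING'),
    })
print(json.dumps(template_json, indent=2))
```

Each entry in the spec becomes one key under "fields", which mirrors how each TagTemplateField is stored under its ID in tag_template.fields.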

To get the expected results, the service account needs at least the Data Catalog TagTemplate Creator IAM role.

Data Catalog’s API has methods that allow us to modify a Tag Template, namely create_tag_template_field, delete_tag_template_field, rename_tag_template_field, and update_tag_template_field. As we are exploring the API, let’s see how create_tag_template_field works by using it to add a second field to the template: PII Type, an enum with values EMAIL and SOCIAL SECURITY NUMBER.

field = datacatalog.TagTemplateField()
field.display_name = 'PII Type'
# >>>
email_value = datacatalog.FieldType.EnumType.EnumValue()
email_value.display_name = 'EMAIL'
field.type_.enum_type.allowed_values.append(email_value)
ssn_value = datacatalog.FieldType.EnumType.EnumValue()
ssn_value.display_name = 'SOCIAL SECURITY NUMBER'
field.type_.enum_type.allowed_values.append(ssn_value)
# <<<
# tag_template_name is the resource name of the template created above.
datacatalog_client.create_tag_template_field(
    parent=tag_template_name,
    tag_template_field_id='pii_type',
    tag_template_field=field)

Please notice the lines between the >>> and <<< marks, where we set the enum values through the allowed_values Repeated Message Field. After running this code we have the final TagTemplate:

name: "projects/<project-id>/locations/us-central1/tagTemplates/quickstart_classification_template"
display_name: "A Tag Template to be used in the hands-on guide"
fields {
  key: "has_pii"
  value {
    display_name: "Has PII"
    type {
      primitive_type: BOOL
    }
  }
}
fields {
  key: "pii_type"
  value {
    display_name: "PII Type"
    type {
      enum_type {
        allowed_values {
          display_name: "EMAIL"
        }
        allowed_values {
          display_name: "SOCIAL SECURITY NUMBER"
        }
      }
    }
  }
}
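The other modification methods mentioned earlier address a single field by its own resource name, which is the template name followed by /fields/<field-id>. Below is a minimal sketch of renaming a field with rename_tag_template_field. The helpers `field_name` and `rename_field`, the project ID `my-project`, and the new field ID `pii_kind` are hypothetical; the client is passed in as an argument so the helper can be tried without credentials.

```python
def field_name(template_name, field_id):
    """Build the resource name of a field that belongs to a tag template."""
    return f'{template_name}/fields/{field_id}'


def rename_field(client, template_name, old_field_id, new_field_id):
    """Rename a template field; `client` is a DataCatalogClient instance."""
    return client.rename_tag_template_field(
        name=field_name(template_name, old_field_id),
        new_tag_template_field_id=new_field_id)


# 'my-project' is a placeholder project ID used only for this illustration.
template_name = ('projects/my-project/locations/us-central1'
                 '/tagTemplates/quickstart_classification_template')
print(field_name(template_name, 'pii_type'))
```

With the client from the previous snippets, a call would look like rename_field(datacatalog_client, tag_template_name, 'pii_type', 'pii_kind').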

Since we’re done with the template, let’s see next how to use it to create tags.

Tags

Tag is another Data Catalog native entity. It allows users/service accounts to attach business metadata to a given Entry, based on a TagTemplate.

In Data Catalog hands-on guide: search, get & lookup with Python we saw how to get the catalog entries for table_1 and table_2. And we created a template in the previous section, so all the information needed to create a Tag is available.

Take a look at the next piece of code:

# (1) table_1: no PII
tag_1 = datacatalog.Tag()
tag_1.template = tag_template.name
has_pii_field = datacatalog.TagField()
has_pii_field.bool_value = False
tag_1.fields['has_pii'] = has_pii_field
# table_1_entry is the Entry retrieved for table_1 in the previous guide.
tag_1 = datacatalog_client.create_tag(parent=table_1_entry.name, tag=tag_1)
print(tag_1.name)

# (2) table_2: stores e-mails
tag_2 = datacatalog.Tag()
tag_2.template = tag_template.name
has_pii_field = datacatalog.TagField()
has_pii_field.bool_value = True
tag_2.fields['has_pii'] = has_pii_field
pii_type_field = datacatalog.TagField()
pii_type_field.enum_value.display_name = 'EMAIL'
tag_2.fields['pii_type'] = pii_type_field
# table_2_entry is the Entry retrieved for table_2 in the previous guide.
tag_2 = datacatalog_client.create_tag(parent=table_2_entry.name, tag=tag_2)
print(tag_2.name)

It shows how to attach tags with different values to the tables while using the same template. In brief, a Data Catalog user will now know that table_1 has no PII and table_2 stores e-mail addresses.

A Tag can be attached to a specific table column, instead of the table itself, by setting tag.column = column_name.

To successfully attach tags, the service account needs at least the Data Catalog TagTemplate User IAM role. Additional permissions/custom roles will be required depending on the type of data asset the entry refers to, e.g., bigquery.datasets.updateTag for BigQuery Datasets, bigquery.tables.updateTag for BigQuery Tables, and pubsub.topics.updateTag for Pub/Sub Topics.

The expected output (names) for this snippet is presented below. Please notice tags are resources that belong to entries, not to templates, and tags’ IDs are system generated.

projects/<project-id>/locations/US/entryGroups/@bigquery/entries/<table-1-entry-id>/tags/<tag-1-id>
projects/<project-id>/locations/US/entryGroups/@bigquery/entries/<table-2-entry-id>/tags/<tag-2-id>
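Because a tag name embeds the name of the entry it belongs to, the parent entry can be recovered from the tag name alone. Here is a minimal sketch using only the standard library; the helper names and the sample IDs (`my-project`, `entry-123`, `tag-456`) are made up for illustration.

```python
import re

# Pattern that mirrors the tag names printed above.
TAG_NAME_PATTERN = re.compile(
    r'^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)'
    r'/entryGroups/(?P<entry_group>[^/]+)/entries/(?P<entry>[^/]+)'
    r'/tags/(?P<tag>[^/]+)$')


def parse_tag_name(tag_name):
    """Split a tag resource name into its named components."""
    match = TAG_NAME_PATTERN.match(tag_name)
    if match is None:
        raise ValueError(f'Not a tag resource name: {tag_name!r}')
    return match.groupdict()


def parent_entry_name(tag_name):
    """Return the name of the entry that owns the tag."""
    parse_tag_name(tag_name)  # Validate the format first.
    return tag_name.rsplit('/tags/', 1)[0]


# Hypothetical IDs, shaped like the output shown above.
sample = ('projects/my-project/locations/US/entryGroups/@bigquery'
          '/entries/entry-123/tags/tag-456')
print(parse_tag_name(sample))
print(parent_entry_name(sample))
```

Note that the template does not appear anywhere in the name, which is consistent with tags being resources that belong to entries.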

The same approach is used to attach tags to BigQuery Datasets or Pub/Sub Topics. No tricks at this point.

Search Catalog revisited

Since we tagged catalog entries, how can this information be used for future searches? Do tags bring any advantage for Search Catalog? Yes, of course! The tag search qualifier allows us to look for assets tagged with a given template or value, and this search capability may be used to easily manage/audit data as more and more classification work is done.

For example:

datacatalog_client.search_catalog(scope=scope, query='tag:quickstart_classification_template')

will return search results for table_1 and table_2, while

datacatalog_client.search_catalog(scope=scope, query='tag:quickstart_classification_template.has_pii=True')

will return only a result for table_2.
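The two queries above follow the same pattern, so they can be assembled by a small helper. This is a minimal sketch: `tag_query` and `search_tagged` are hypothetical names, and the client is passed in as an argument so the query-building part runs without credentials.

```python
def tag_query(template_id, field_id=None, value=None):
    """Build a `tag:` search qualifier like the ones used above."""
    if value is not None and field_id is None:
        raise ValueError('A value can only be matched against a field.')
    query = f'tag:{template_id}'
    if field_id is not None:
        query += f'.{field_id}'
    if value is not None:
        query += f'={value}'
    return query


def search_tagged(client, scope, template_id, field_id=None, value=None):
    """Return the relative resource names of the matching search results."""
    query = tag_query(template_id, field_id, value)
    return [result.relative_resource_name
            for result in client.search_catalog(scope=scope, query=query)]


print(tag_query('quickstart_classification_template'))
print(tag_query('quickstart_classification_template', 'has_pii', True))
```

With the client and scope from the previous guide, search_tagged(datacatalog_client, scope, 'quickstart_classification_template', 'has_pii', True) would issue the second query shown above.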

Done!

This series gives an initial overview of Google Cloud Data Catalog through an end-to-end case study, from data discovery to user-defined business metadata management, with a mental model as background. Together, these features compose the basic set of capabilities that make Data Catalog the ideal tool for Data Governance support in companies of any size.

Basic Python client library usage was also covered. The samples can be easily migrated to Java, NodeJS, or other languages (check the official docs for availability). Also, consider using Data Catalog in Cloud Console when getting started; you might get useful insights there.

That’s all, folks!

The source files used to generate the above code samples are available on GitHub: https://github.com/ricardolsmendes/gcp-datacatalog-python.

Changelog

  • 2020-10-03: Updated the code snippets to enforce compliance with version 2.0.0 of the client library.
  • 2021-02-15: Updated the code snippets to enforce compliance with version 3.0.0 of the client library.
