Dataplex — Data Catalog | Tagging Operational and Business Metadata| Part — 3.2

Nishit Kamdar
Google Cloud - Community
8 min readOct 18, 2023
Photo by Angèle Kamp on Unsplash

This is Part-3.2 of the Dataplex blog series focussed on Data Catalog. Please refer to Part-1 & Part-2 and Part-3.1 for more details on the Dataplex service.

Tagging:

Tagging involves adding additional pieces of metadata at a Table or a Column level to enrich, contextualize, classify, secure and analyze the underlying data.

Types of metadata:

  • Technical metadata represents the technical aspects of data, including attributes such as source and target systems, database table and field structures, data types, storage, format, location and so on. In our Part-3.1 — we saw how Dataplex automatically collects the Technical Metadata of the registered assets.
  • Business metadata provides a business context and a business name for assets that are created and managed. Business metadata includes terms, information governance rules, labels, ownership etc that helps with the business interpretation of data elements in the data warehouse.
  • Operational metadata captures the events that occurred during the run of the process, including the number of rows written, process run status, rows read, written to, various stages etc

Following is a pictorial view of a table cust_tbl tagged with various metadata tags at Table and Column level, that capture more context about the underlying data.

Types of Metadata Tags

At the Table level, it captures the ETL status of the table updates, its lifecycle details and Quality indicators while at the Column level, it captures if the Column has PII information etc.

Data Catalog’s Tagging features provides multiple ways to create and add such metadata tags that help with the overall Data Governance of the Data platform.

Within Dataplex, Technical Metadata is collected automatically as part of the Discovery process. For capturing Additional Business and Operational metadata, Dataplex provides Tag Templates.

1. Tag Templates:

Dataplex provides ready-to-go Tag templates that are pre-defined for the Business and Operational scenarios and you can also create your own, through an intuitive UI.

Go to Dataplex → Manage Catalog → Tag Templates → Create Tag Template

We will create a new template.

1.1. Create a Tag Template

In the create template screen above, fill in the Template details.

In the Fields section below → click Add fields.

Add the following 3 entries.

Our Template is created! It captures the name of the Steward (Business Metadata) as well as creation_date, record_count (Operational Metadata).

Lets add this to a Table.

1.2. Associate the Tag Template to an Asset:

Go to Dataplex -> Manage ->oda lake →raw zone->crime_staging

We will add the Tag created to the following table. As shown below, the Tags(0) section at the bottom does not have any tags associated with it.

Click the Attach Tags button at the bottom right. Choose the Template we created above.

Add the Business and Operational Metadata to the fields. Click Save.

This will create an instance of the Tag from the Tag Template and add it to the crimes_stage table. The Tag section of the page now shows the attached tag. You can similarly create Column level templates and attach it to the corresponding columns.

Apart from UI, you can also add tags programatically via API or SDK

Using any of the above approaches, you can create and attach tags on the assets and add additional metadata. You can also embed it into the ETL/ELT cycles and update the Tables/Columns metadata of the assets that are managed by this processes.

2. Bulk Tagging:

Some of the most common customer queries around tagging are:

  1. How do I do the tagging of the historical assets that may not be touched by the daily ETL/ELT process?
  2. How do I do large scale Tagging? Writing scripts to do is extremely cumbersome and time-consuming
  3. Are there any Google tools or accelerators to help with it?

Yes there is!

2.1. Tag Engine:

Tag Engine is an open-source extension to Google Cloud’s Data Catalog developed by Shirley Cohen from Google to automate the tagging of BigQuery tables, Views and data lake files.

The tool supports populating tags that have static metadata or dynamic metadata. Tag Engine also supports the autotagging of new assets and the refreshing of existing tags as the underlying data changes.

Following Github links provides complete details of the Tag Engine Architecture, Terraform Deployment scripts and Examples.

https://github.com/GoogleCloudPlatform/datacatalog-tag-engine

Lets use Tag Engine to do bulk tagging dynamically!

2.1.1. Usecase

One of the most common tagging usecase is attaching Business definitions to the Columns. The Column names are often too technical or generic and needs to be tagged to have more Business context about them.

So we will use Tag Engine to solve this.

2.1.1.1 — Create the Template:

2.1.1.2 — Select the Table or Columns to be tagged.

I have a Taxi Dataset (Bigquery public Dataset) and quite a few tables in this dataset have Tables with Tolls_Amount and Tip_Amount columns.

I would like to update the description of these columns across my dataset with a Business definition of the Column.

2.1.1.3 — Set up the Business Metadata Model to be attached to Tables or Columns.

The Business metadata that needs to be applied to the Tags can be externalized as a BigQuery Table.

e.g — I have set up a enterprise_glossary table in Bigquery. It consists of the actual column names in the Column_name field and its Business Definition in the formal_name field that I have gathered from my business team.

Using this information, I need to create a Column Level Tags and attach it to the Columns with its Business Definition automatically.

2.1.1.4— Tag Engine Configuration:

I have installed Tag Engine as an “App Engine” service. Click on the App Engine URL and it will open the home page as shown below.

Enter Tag Template ID, Project Name and region — you can find all the info in the template.

Tag Engine will bring up the Tag Template and provide different options for creating Static and Dynamic Table and Column level tags along with import and restore options.

Select the “Create Dynamic Column Tags” option as we want to generate Column level tags with dynamic values from the database.

Tag Engine provides a query input option against the fields defined in the template to fetch their values from Bigquery.

Enter the query shown below for the column_description placeholder. This query will be executed on the Bigquery glossary table to fetch the formal_name column which has the Business Definition. The $Column value of the Column_name will be dynamically passed by the Tag Engine during execution.

When the tag engine encounters Tolls_Amount or Tip_Amount as Column_name — the query will resolve to their Business Definitions as the output which the Tag engine will use to create a Tag from Tag Template and attach it to the Column.

In the Included Columns queries field, provide the Columns Query as shown above. This query allows you to choose the columns that you want the Tag Engine to tag.

The Included Tables URIs field provides the scope of the Tag Engine Execution. e.g. I have specified it as dataset/taxi_dataset/*. This means Tag engine will search the columns across the entire dataset. For doing it for a specific table, you can replace star with the specific tablename.

Scheduling options: Tag Engine also provides option to execute it as on-demand or set it up as a recurring job.

Additionally, it also provides an option to create history of the tag updates and store it in Bigquery. This is very useful for operational metadata to track the changes that have happened to the Table or Column.

Click Submit.

A job will be created and executed by Tag Engine that will create the tag with their Business Definitions.

Lets verfiy.

Go the Dataplex Search and type one of the business definitions “Govt Taxes on Highways”. Dataplex searches through all the Business Metatags and list all the tables with the above string.

Click on one of the tables and click on the schema.

You will see the Column Tags created for Tips_Amount and Tolls_Amount.

Click on the template and it will show the Business Definition.

In summary, Tag Engine is a great accelerator and an extremely flexible tool to configure and execute static and dynamic Table or Column level tags for Business and Operational metadata at scale.

Conclusion:

Tagging is an integral part of the Data Governance process and Data Catalog provides multiple ways to Tag Tables and Columns through its intuitive UI, SDKs and APIs.

It also provides an open-source tool to automate the large scale dynamic tagging through its Tagging Engine.

For additional details please visit:

https://cloud.google.com/dataplex

--

--

Nishit Kamdar
Google Cloud - Community

Data and Artificial Intelligence specialist at Google. This blog is based on “My experiences from the field”. Views are solely mine.