Business Glossary support in Google Data Catalog

A custom model

Ricardo Mendes
Nov 11, 2020 · 10 min read

More and more companies migrate their data workloads to the cloud every day. At the same time, new data protection regulations are coming into effect around the globe. Together, these trends push companies to enforce their Data Governance standards in order to avoid legal penalties for mishandling data.

Data Governance frameworks are hence becoming more common, but providers do not necessarily speak the same language, which is to be expected since each player has its own strategy for tackling problems. On the other hand, customers have distinct requirements that no single platform fulfills. This brings something new to the market: the need to integrate complementary tools in order to build stronger Data Governance ecosystems.

Metadata Management is a common building block of Data Governance frameworks, and the so-called Business Glossary component helps me to bring up a clear example of the scenario described in the previous paragraph.

A Business Glossary differs from a Data Dictionary in that its focal point, Data Governance, goes beyond a Data Warehouse or database. A Business Glossary is a means of sharing internal vocabulary within an organization. Most Business Glossaries share certain characteristics such as standard Data Definitions and documentation of them; clear definitions with explanation of exceptions, synonyms, or variants.

— datadiversity

Among other reasons, companies use it to:

Although it is a first-class citizen of products such as Alex, Ataccama, IGC, and Informatica, a Business Glossary is not part of Google Cloud’s Data Catalog, which focuses more on simplifying Data Discovery at any scale.

One might wonder whether Data Catalog Tags address the issue of describing and classifying data assets. I agree, they do. So why do we need a Glossary? Because a Glossary is more than this. Let me use two additional features to clarify: (1) its resources are usually organized in a category-based hierarchy for a better user experience; and (2) there are particular relationships between Glossary Terms that can result in collaboratively built knowledge, e.g. synonym lists. Such features are not natively supported by Data Catalog.

Well, what if a company interested in Google Data Catalog has Business Glossary support as a mandatory requirement? Is it a blocking issue? Not at all! Data Catalog’s flexible entity model plus fine-grained IAM Roles are helping hands for those who want to build elementary Business Glossary support on top of Google Data Catalog. This is what I’m going to cover in the next sections.

Disclaimer: all opinions expressed are my own, and represent no one but myself… They come from the experience of being an early adopter of both Google Data Catalog and Egeria.

An Egeria-based custom model

There are multiple Business Glossary providers, each with its own proprietary entity model. This is where Egeria comes into the picture: it is an Open Metadata and Governance project that promotes metadata exchange between tools and platforms.

Egeria’s metadata types are open, which is why I’m going to use them as a reference to explain the custom model in Google Data Catalog. That said, the model is meant to be adaptable and extensible to fit real-life requirements, even when the glossary metadata comes from other sources. Hopefully, my reasoning will be good enough to make that clear for you :).

The Open Metadata Types are organized into 7 areas, with Glossary and Semantics covered in Area 3.

More than providing the Open Metadata Types, Egeria actually works as an enterprise metadata “broker”. Although it is possible to connect Egeria and Data Catalog through the Open Connector Framework (OCF), this is out of this article’s scope. The focus here is on designing a custom model for Business Glossary support in Data Catalog, simply leveraging Egeria types as a reference.

Mapping Entities and Relationships into Entries, Templates, and Tags

Egeria has an extensive set of classes to fully represent a Business Glossary no matter what tool it comes from. For the sake of clarity, I will keep the sample model concise yet practical, describing how a few introductory Open Metadata Entities and Relationships are mapped into Data Catalog Custom Entries, Templates, and Tags. Mapping entities from one end to another is usually the most difficult part of the job; extending the model is then a matter of adding new classes to the set below.

Let me map the Glossary, Glossary Term, and Semantic Assignment from Egeria to Data Catalog, starting with a brief explanation of each for readers who are new to Egeria:

Please take a look at the diagrams below, which give a visual presentation of the proposed mappings. Bear in mind that the grayed classes’ stereotypes are actual Data Catalog types, while the class names are clues to the Open Metadata Types their instances refer to.

Glossary and Glossary Term

The Glossary itself is mapped as a Custom Entry with userSpecifiedType = business_glossary. Nothing special here…

Class diagram 1: Egeria Glossary and GlossaryTerm mapped as Google Data Catalog entities

Glossary Terms are mapped as Custom Entries with userSpecifiedType = glossary_term. The standard fields of a Custom Entry are not enough to persist all Glossary Term information and need to be extended at some point. To fulfill this requirement, we can leverage Data Catalog’s flexibility and use Tags created from a particular Tag Template called Glossary Term Specification — their fields increase the Custom Entry metadata storage capabilities. Each Glossary Term Entry should have a Tag created from this Template to store metadata gathered from its underlying Glossary Term in Egeria.
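To make the mapping more concrete, here is a minimal sketch of the JSON bodies that could represent the two kinds of Custom Entries and the Glossary Term Specification Template. The keys (displayName, userSpecifiedType, userSpecifiedSystem, fields, primitiveType) follow the Data Catalog v1 REST resources as I understand them. The helper names, the "egeria" system value, and the sample display names are assumptions made for illustration; glossaryName and guid are the two Template fields this article names.

```python
from typing import Dict

# Assumption for this sketch: the source system label is not fixed by the
# article, only the two userSpecifiedType values are.
SOURCE_SYSTEM = "egeria"


def make_custom_entry(display_name: str, user_specified_type: str,
                      description: str = "") -> Dict[str, str]:
    """Build the JSON body of a Data Catalog Custom Entry."""
    return {
        "displayName": display_name,
        "description": description,
        "userSpecifiedType": user_specified_type,
        "userSpecifiedSystem": SOURCE_SYSTEM,
    }


def make_term_specification_template() -> dict:
    """Build the JSON body of the Glossary Term Specification Tag Template."""
    string_type = {"primitiveType": "STRING"}
    return {
        "displayName": "Glossary Term Specification",
        "fields": {
            # Name of the parent Glossary, used to group Terms.
            "glossaryName": {
                "displayName": "Glossary name",
                "type": string_type,
                "isRequired": True,
            },
            # Globally unique identifier of the Term in Egeria.
            "guid": {
                "displayName": "GUID",
                "type": string_type,
                "isRequired": True,
            },
        },
    }


# Hypothetical sample values.
glossary_entry = make_custom_entry("Sample Glossary", "business_glossary")
term_entry = make_custom_entry("Customer name", "glossary_term")
specification_template = make_term_specification_template()
```

A real model would likely add more Template fields (for example, a summary or usage notes); they are left out here to keep the sketch short.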

This design also enables us to support additional features. Linking Terms to their parent Glossary, for instance: the glossaryName field of a Glossary Term Specification Tag can be used to find all Terms that belong to a given Glossary. Another example: the guid field, which uniquely identifies a Glossary Term no matter where it is stored, can be used to keep the Terms synchronized between Egeria and Data Catalog.
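The two uses of the Specification Tag fields can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the Tags are represented as dictionaries shaped like the REST Tag resource (fields holding stringValue entries), the function names are hypothetical, and the sample GUIDs and Glossary names are made up.

```python
from typing import Dict, Iterable, List, Set


def field_value(tag: dict, field_id: str) -> str:
    """Read a string field from a Tag represented as a dictionary."""
    return tag["fields"][field_id]["stringValue"]


def terms_in_glossary(spec_tags: Iterable[dict], glossary_name: str) -> List[dict]:
    """Return the Specification Tags whose glossaryName matches."""
    return [tag for tag in spec_tags
            if field_value(tag, "glossaryName") == glossary_name]


def plan_sync(egeria_guids: Set[str], spec_tags: Iterable[dict]) -> Dict[str, List[str]]:
    """Compare GUIDs on both sides and decide what to create, keep, or delete."""
    catalog_guids = {field_value(tag, "guid") for tag in spec_tags}
    return {
        "create": sorted(egeria_guids - catalog_guids),
        "keep": sorted(egeria_guids & catalog_guids),
        "delete": sorted(catalog_guids - egeria_guids),
    }


def make_spec_tag(glossary_name: str, guid: str) -> dict:
    """Helper to build sample Specification Tags for this sketch."""
    return {"fields": {"glossaryName": {"stringValue": glossary_name},
                       "guid": {"stringValue": guid}}}


# Hypothetical sample data.
catalog_tags = [
    make_spec_tag("Sales", "guid-1"),
    make_spec_tag("Sales", "guid-2"),
    make_spec_tag("Finance", "guid-3"),
]
sales_terms = terms_in_glossary(catalog_tags, "Sales")
sync_plan = plan_sync({"guid-2", "guid-3", "guid-4"}, catalog_tags)
```

Here guid-4 exists only in Egeria and would be created in Data Catalog, while guid-1 exists only in Data Catalog and is a candidate for deletion.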

Semantic Assignment

Other predefined Tag Templates are derived from the Glossary Terms, one Template per Term, to enable Semantic Assignments. This means that if there are two Glossary Terms, Customer name and Street name, there will be two Glossary Term Semantic Assignment Tag Templates, identified by something like semantic_customer_name and semantic_street_name. They are used to create Tags that represent the Semantic Assignments, linking Glossary Terms to their related assets.

Class diagram 2: Egeria SemanticAssignment mapped as Google Data Catalog entities

Please note that the Glossary Term Semantic Assignment Tags, created from those particular Templates, are used to describe the meaning of ordinary Data Catalog Entries or Schema Columns.

The rationale behind this design is that, at the time of writing (November 2020), it is a feasible way to set more than one Semantic Assignment on an asset: Data Catalog currently supports attaching only a single Tag per Template to a given asset.
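The one-Template-per-Term convention can be sketched as follows. The semantic_ prefix comes from the examples above; the exact slug rule, the 64-character cap (the limit I am aware of for Template IDs), the function names, the termGuid field, and the project and location values are assumptions for illustration. The column key follows the REST Tag resource as I understand it.

```python
import re
from typing import Optional

# Assumption: Template IDs accept lowercase letters, digits, and underscores,
# up to 64 characters.
MAX_TEMPLATE_ID_LENGTH = 64


def semantic_template_id(term_name: str) -> str:
    """Derive a Template ID such as semantic_customer_name from a Term name."""
    slug = re.sub(r"[^a-z0-9]+", "_", term_name.lower()).strip("_")
    if not slug:
        raise ValueError("term name has no usable characters")
    template_id = f"semantic_{slug}"
    if len(template_id) > MAX_TEMPLATE_ID_LENGTH:
        raise ValueError(f"template id too long: {template_id}")
    return template_id


def make_semantic_assignment_tag(project: str, location: str, term_name: str,
                                 fields: dict, column: Optional[str] = None) -> dict:
    """Build the JSON body of a Tag that assigns a Term to an Entry or Column."""
    template_id = semantic_template_id(term_name)
    tag = {
        "template": f"projects/{project}/locations/{location}"
                    f"/tagTemplates/{template_id}",
        "fields": fields,
    }
    if column:
        # When set, the Tag describes a single Schema Column, not the Entry.
        tag["column"] = column
    return tag


# Hypothetical sample values.
customer_name_tag = make_semantic_assignment_tag(
    "my-main-project", "us-central1", "Customer name",
    {"termGuid": {"stringValue": "guid-1"}}, column="ctm_nm")
```

Because each Term has its own Template, a second Tag built for Street name could be attached to the same asset without hitting the single-Tag-per-Template limit.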

Supporting more features

Now you’ve got the foundation to extend the model by adding more features to the Data Catalog Glossary according to your needs. Would you like to try bringing the Glossary Category entity and its related Term Categorization relationship to Data Catalog as an exercise? Or maybe something related to Synonyms?

Is automation required?

Yes, it is, because there’s no way to manually create Data Catalog Custom Entries through the UI. The good news is that simple Python scripts or curl requests are enough to get started.
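As a rough idea of how small such a script can be, the sketch below assembles the REST call that creates a Custom Entry and renders it as a curl command. The endpoint shape (entries under an Entry Group, with an entryId query parameter) follows the Data Catalog v1 REST API as I understand it. The function names, project, location, Entry Group, and Entry IDs are hypothetical, and the command relies on gcloud auth print-access-token to obtain a token.

```python
import json
import shlex
from urllib.parse import urlencode

API_ROOT = "https://datacatalog.googleapis.com/v1"


def create_entry_request(project: str, location: str, entry_group: str,
                         entry_id: str, body: dict) -> dict:
    """Describe the HTTP request that creates a Custom Entry."""
    parent = (f"projects/{project}/locations/{location}"
              f"/entryGroups/{entry_group}")
    query = urlencode({"entryId": entry_id})
    return {
        "method": "POST",
        "url": f"{API_ROOT}/{parent}/entries?{query}",
        "body": body,
    }


def to_curl(request: dict) -> str:
    """Render the request as a curl command that can be pasted in a shell."""
    return " ".join([
        "curl", "-X", request["method"],
        # Double quotes so the shell expands the gcloud command substitution.
        "-H", '"Authorization: Bearer $(gcloud auth print-access-token)"',
        "-H", shlex.quote("Content-Type: application/json"),
        "-d", shlex.quote(json.dumps(request["body"])),
        shlex.quote(request["url"]),
    ])


# Hypothetical sample values.
term_request = create_entry_request(
    "my-main-project", "us-central1", "business_glossary", "customer_name",
    {
        "displayName": "Customer name",
        "userSpecifiedType": "glossary_term",
        "userSpecifiedSystem": "egeria",
    })
curl_command = to_curl(term_request)
```

Printing curl_command gives a one-liner for quick experiments; a scheduled job would more likely send the same request through an HTTP client or the official client library.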

The automation level to pursue depends on business requirements and technical concerns, especially in terms of API availability and event notification mechanisms. I have seen at least three levels:

Setting up a read-only Glossary

Depending on how the “Data Catalog Glossary” integrates with an externally-managed Glossary, users might want to have it totally or partially read-only in GCP. This requirement can be fulfilled with appropriate Projects and IAM setup, as follows.

GCP Architecture: suggested projects structure for read-only Business Glossary support in Google Data Catalog

The Main Project hosts all Glossary Entries, Specification and Relationship Templates, and Specification Tags, which are ideally managed by automated processes through a Service Account, with no human interaction. The Service Account is expected to have elevated privileges: DataCatalog entryGroup Owner, DataCatalog entry Owner, Data Catalog TagTemplate Owner, and Data Catalog Tag Editor.

Glossary users coming from Dependent Projects, such as Data Engineers or Analysts, should not have more than Data Catalog Viewer and Data Catalog TagTemplate User IAM Roles in the Main Project. This means they can view the Glossary Entries, Specification and Relationship Templates, and Specification Tags, but not edit them.

People might be able to use the managed Templates to create Data Catalog Tags, though. Tags that represent Semantic Assignments are a good example. Such Tags are attached to Entries that belong to the Dependent Projects, where users must have the Data Catalog Viewer and Data Catalog Tag Editor IAM Roles to get the job done.
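The role split described above can be written down as data and turned into IAM policy bindings. In this sketch the role IDs are the predefined Data Catalog roles as I know them; the persona names, the make_bindings helper, and the member identifiers are hypothetical.

```python
from typing import Dict, List

# Roles granted in the Main Project, per persona.
MAIN_PROJECT_ROLES = {
    "automation": [
        "roles/datacatalog.entryGroupOwner",
        "roles/datacatalog.entryOwner",
        "roles/datacatalog.tagTemplateOwner",
        "roles/datacatalog.tagEditor",
    ],
    "glossary_user": [
        "roles/datacatalog.viewer",
        "roles/datacatalog.tagTemplateUser",
    ],
}

# Roles granted in each Dependent Project, per persona.
DEPENDENT_PROJECT_ROLES = {
    "glossary_user": [
        "roles/datacatalog.viewer",
        "roles/datacatalog.tagEditor",
    ],
}


def make_bindings(role_map: Dict[str, List[str]],
                  members_by_persona: Dict[str, List[str]]) -> List[dict]:
    """Turn a persona-to-roles map into IAM policy bindings."""
    members_by_role: Dict[str, set] = {}
    for persona, roles in role_map.items():
        for role in roles:
            members_by_role.setdefault(role, set()).update(
                members_by_persona.get(persona, []))
    return [{"role": role, "members": sorted(members)}
            for role, members in sorted(members_by_role.items()) if members]


# Hypothetical members.
members = {
    "automation": [
        "serviceAccount:glossary-sync@my-main-project.iam.gserviceaccount.com"],
    "glossary_user": ["group:data-analysts@example.com"],
}
main_bindings = make_bindings(MAIN_PROJECT_ROLES, members)
dependent_bindings = make_bindings(DEPENDENT_PROJECT_ROLES, members)
```

Note that glossary users get Tag Editor only in the Dependent Projects, which is what keeps the Glossary itself read-only for them in the Main Project.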

A proof of concept

I’ve created a GitHub repository to host a Python package that anyone can use to see the proposed model in action: github.com/ricardolsmendes/datacatalog-custom-model-manager.

By the way, that piece of code kind of addresses the level-1 automation strategy explained before and can be used to quickly validate hypotheses when adding new features to custom models.

There are sample input files in the sample-input/egeria-business-glossary folder. I’ve set up a project in GCP and run the code, getting the results presented below:

Google Data Catalog screenshot 1: Custom Entry and Tag mapping an Egeria Glossary Term

The blue box shows metadata from a Custom Entry mapped from an Egeria Glossary Term. The yellow box shows the specification Tag used to enrich its metadata, adding fields that are not available in the Entry.

Google Data Catalog screenshot 2: Tag mapping an Egeria Semantic Assignment

The above popup and the yellow arrow show a Semantic Assignment. It allows users to know the column ctm_nm, from a BigQuery Table, stores names and it was validated by someone. The same feature allows users to know the ctm_es column stores educational stage information (please notice the Semantic — Education Tag).

That’s pretty much it to prove the proposed model works in Data Catalog :).

Remarks

Although helpful, working with such a custom model has its caveats…

Wrapping up

Google Data Catalog has no native Business Glossary support, but this is not a blocking issue for companies that are adopting Google Cloud and need a Glossary to fulfill their Data Governance requirements. Data Catalog is very flexible, so users can build custom metadata models on top of it and reduce the impact of missing features.

I hope the ideas in this article help your company deploy a Business Glossary in Data Catalog and leverage it to strengthen its Data Governance practices.

Cheers,

Google Cloud - Community

Google Cloud community articles and blogs

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Written by Ricardo Mendes

head of data @ ciandt.com • hobbyist tech writer • dad • birder
