Business Glossary support in Google Data Catalog

A custom model

Ricardo Mendes
Google Cloud - Community
Nov 11, 2020

06/30/2023 note: Business Glossaries are now natively supported by Dataplex, which makes the approach proposed in the present article obsolete.

More and more companies migrate their data workloads to the cloud every day. At the same time, new data protection regulation programs become effective around the globe. Such simultaneous events lead companies to enforce their Data Governance standards intending to avoid legal charges due to inappropriate data management.

Data Governance frameworks are hence becoming more common, but providers do not necessarily speak the same language, which is to be expected since each player has its own strategy to tackle problems. On the other hand, customers have distinct requirements that no single platform fulfills. This brings something new to the market: the need to integrate complementary tools in order to build stronger Data Governance ecosystems.

Metadata Management is a standard building block of Data Governance frameworks, and the so-called Business Glossary component helps me to bring up a clear example of the scenario described in the previous paragraph.

“A Business Glossary differs from a Data Dictionary in that its focal point, Data Governance, goes beyond a Data Warehouse or database. A Business Glossary is a means of sharing internal vocabulary within an organization. Most Business Glossaries share certain characteristics such as standard Data Definitions and documentation of them; clear definitions with explanation of exceptions, synonyms, or variants.”

Source: DATAVERSITY

Among other reasons, companies use it to:

  • Enable understanding of the core business concepts and terminology.
  • Highlight how vocabulary may differ across business functions.
  • Increase trust in a company’s data.
  • Reduce the risk that data will be misused due to an inconsistent understanding of the business concepts.

Although the Business Glossary is a first-class citizen of products such as Alex, Ataccama, IGC, and Informatica, it is not part of Google Cloud’s Data Catalog, which focuses on simplifying Data Discovery at any scale.

One might wonder whether Data Catalog Tags address the issue of describing and classifying data assets. I agree; they do. So why do we need a Glossary? Because a Glossary is more than this. Let me use two additional features to clarify: (1) its resources are usually organized in a category-based hierarchy for better user experience; and (2) there are particular relationships between Glossary Terms that can result in collaboratively built knowledge, e.g. synonyms lists. Data Catalog does not natively support such features.

Well, what if a company interested in Google Data Catalog has Business Glossary support as a mandatory requirement? Is it a blocking issue? Not at all! Data Catalog’s flexible entity model plus fine-grained IAM Roles are helping hands for those who want to build elementary Business Glossary support on top of Google Data Catalog. This is what I’m going to cover in the following sections.

Disclaimer: all opinions expressed are my own, representing no one but myself… They come from the experience of being an early adopter of both Google Data Catalog and Egeria.

An Egeria-based custom model

There are multiple Business Glossary providers, each one with its proprietary entity models. Here Egeria comes into the scene: it is an Open Metadata and Governance project which promotes metadata exchange between tools and platforms.

Egeria’s metadata types are open, which is why I’m going to use them as a reference to explain the custom model in Google Data Catalog. Anyway, the model is expected to be adaptable and extensible to fit real-life requirements, even when the glossary metadata comes from other sources. Hopefully, my reasoning will be good enough to make it clear for you :).

The Open Metadata Types are organized in 7 areas, with Glossary and Semantics covered in Area 3.

More than providing the Open Metadata Types, Egeria actually works as an enterprise metadata “broker”. Although it is possible to connect Egeria and Data Catalog through the Open Connector Framework (OCF), this is out of this article’s scope. The focus here is on designing a custom model for Business Glossary support in Data Catalog, simply leveraging Egeria types as a reference.

Mapping Entities and Relationships into Entries, Templates, and Tags

Egeria has an extensive set of classes to fully represent a Business Glossary, no matter what tool it comes from. For the sake of clarity, I will keep the sample model concise yet practical to describe how introductory Open Metadata Entities and Relationships are mapped into Data Catalog Custom Entries, Templates, and Tags. Mapping entities from one end to another is usually the most challenging part of the job, and extending the model is a matter of adding new classes to the below set.

Let me map the Glossary, Glossary Term, and Semantic Assignment from Egeria to Data Catalog, starting by briefly explaining them to readers who are new to Egeria:

  • A Glossary is a collection of related semantic definitions, represented as the Glossary entity in the Open Metadata Type System;
  • “The vocabulary for the Glossary is documented using Terms. Each Term represents a concept of a short phrase in the vocabulary.” A Term, represented as the GlossaryTerm entity, is owned by a Glossary;
  • The Semantic Assignment, represented as the SemanticAssignment relationship, is used to assign a Term to a given asset (e.g. BigQuery table or column), which means the Term describes the meaning of that asset.

Please take a look at the diagrams below, which present the proposed mappings visually. Bear in mind that the grayed classes’ stereotypes are actual Data Catalog types, while the class names are clues to the Open Metadata Types their instances refer to.

Glossary and Glossary Term

The Glossary itself is mapped as a Custom Entry with userSpecifiedType = business_glossary. Nothing special here…

Class diagram 1: Egeria Glossary and GlossaryTerm mapped as Google Data Catalog entities

Glossary Terms are mapped as Custom Entries with userSpecifiedType = glossary_term. The standard fields of a Custom Entry are insufficient to persist all Glossary Term information and need to be extended at some point. To fulfill this requirement, we can leverage Data Catalog’s flexibility and use Tags created from a particular Tag Template called Glossary Term Specification — their fields increase the Custom Entry metadata storage capabilities. Each Glossary Term Entry should have a Tag created from this Template to store metadata gathered from its underlying Glossary Term in Egeria.

This design also enables us to settle additional features. Linking Terms to their parent Glossary, for instance: the glossaryName field of a Glossary Term Specification Tag can be used to find all Terms that belong to a given Glossary. Another example: the guid field, which uniquely identifies a Glossary Term no matter where it is stored, can be used to keep the Terms synchronized between Egeria and Data Catalog.
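
To make the mapping more concrete, here is a minimal sketch of the JSON bodies one could send to the Data Catalog REST API for a Glossary Term Entry, the Glossary Term Specification Template, and its Tag. Nothing is sent anywhere; the functions only build dictionaries. The project, location, Template ID, field IDs (glossary_name, guid), the "egeria" system label, and the GUID values are assumptions of mine, so please adapt them to your own model.

```python
# Hypothetical identifiers, used only for illustration.
PROJECT = "my-glossary-project"
LOCATION = "us-central1"
SPEC_TEMPLATE_ID = "glossary_term_specification"


def glossary_term_entry(display_name, description):
    """Request body for a Custom Entry that represents a Glossary Term."""
    return {
        "displayName": display_name,
        "description": description,
        "userSpecifiedSystem": "egeria",  # assumption: a label for the source system
        "userSpecifiedType": "glossary_term",
    }


def specification_template():
    """Request body for the Glossary Term Specification Tag Template."""
    string_type = {"primitiveType": "STRING"}
    return {
        "displayName": "Glossary Term Specification",
        "fields": {
            # Field IDs are assumptions; the article only names the concepts.
            "glossary_name": {"displayName": "Glossary name", "type": string_type, "isRequired": True},
            "guid": {"displayName": "GUID", "type": string_type, "isRequired": True},
        },
    }


def specification_tag(glossary_name, guid):
    """Request body for the Tag that extends a Glossary Term Entry."""
    template = f"projects/{PROJECT}/locations/{LOCATION}/tagTemplates/{SPEC_TEMPLATE_ID}"
    return {
        "template": template,
        "fields": {
            "glossary_name": {"stringValue": glossary_name},
            "guid": {"stringValue": guid},
        },
    }


def terms_of(glossary_name, tagged_terms):
    """Return the display names of the Terms that belong to one Glossary.

    tagged_terms is a list of (entry_body, tag_body) pairs held in memory.
    """
    return [
        entry["displayName"]
        for entry, tag in tagged_terms
        if tag["fields"]["glossary_name"]["stringValue"] == glossary_name
    ]


# Made-up GUIDs, for illustration only.
pairs = [
    (glossary_term_entry("Customer name", "Full name of a customer."),
     specification_tag("Sales", "guid-0001")),
    (glossary_term_entry("Street name", "Name of a street in an address."),
     specification_tag("Logistics", "guid-0002")),
]
print(terms_of("Sales", pairs))
```

The terms_of helper filters in memory just to show how the glossary name field links Terms to their parent Glossary; in a real deployment the same lookup would more likely rely on Data Catalog search.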

Semantic Assignment

Other predefined Tag Templates derive from each Glossary Term to enable their Semantic Assignments, one Template per Term. This means that if there are two Glossary Terms, Customer name and Street name, there will be two Glossary Term Semantic Assignment Tag Templates, identified by something like semantic_customer_name and semantic_street_name. They are used to create Tags representing the Semantic Assignments, linking Glossary Terms to their related assets.

Class diagram 2: Egeria SemanticAssignment mapped as Google Data Catalog entities

Please notice the Glossary Term Semantic Assignment Tags, created from those particular Templates, are used to describe the meaning of ordinary Data Catalog Entries or Schema Columns.

The rationale behind this design is that, as of November 2020, Data Catalog supports attaching only a single Tag per Template to a given asset. Deriving one Template per Glossary Term is therefore a feasible way to set more than one Semantic Assignment on the same asset.
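
The naming scheme and the one-Template-per-Term idea can be sketched as follows. This is an illustration under my own assumptions: the slug rule, the status field, the project and location, and the column name are hypothetical, and the dictionaries only mirror the shape of a Tag body in the REST API without sending anything.

```python
import re


def semantic_template_id(term_name):
    """Derive a Tag Template ID such as semantic_customer_name from a Term name."""
    slug = re.sub(r"[^a-z0-9]+", "_", term_name.lower()).strip("_")
    return f"semantic_{slug}"


def semantic_assignment_tag(project, location, term_name, column=None):
    """Build a Tag body that assigns a Glossary Term to an Entry or to one column."""
    template = (
        f"projects/{project}/locations/{location}"
        f"/tagTemplates/{semantic_template_id(term_name)}"
    )
    tag = {
        "template": template,
        # Assumption: a single string field; real Templates may carry more.
        "fields": {"status": {"stringValue": "Validated"}},
    }
    if column:
        tag["column"] = column  # column-level Tag
    return tag


# Two Terms assigned to the same column yield Tags from two different
# Templates, which respects the one-Tag-per-Template rule.
tags = [
    semantic_assignment_tag("my-project", "us-central1", term, column="address")
    for term in ("Customer name", "Street name")
]
for tag in tags:
    print(tag["template"])
```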

Supporting more features

Now you’ve got the foundation to extend the model by adding more features to the Data Catalog Glossary according to your needs. Would you like to try bringing the Glossary Category entity and its related Term Categorization relationship to Data Catalog as an exercise? Or maybe something related to Synonyms?

Is automation required?

Yes, it is, because there’s no way to create Data Catalog Custom Entries manually through the UI. The good news is that simple Python scripts or curl requests are enough to get started.
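
As a rough idea of how small such a script can be, the sketch below prepares (but does not send) the HTTP request that creates a Glossary as a Custom Entry through the Data Catalog v1 REST API. The project, location, Entry Group, Entry ID, the "egeria" system label, and the token are placeholders I made up; obtaining a real OAuth access token is left out.

```python
import json
import urllib.request


def glossary_entry(display_name, description):
    """Request body for a Custom Entry that represents a Glossary."""
    return {
        "displayName": display_name,
        "description": description,
        "userSpecifiedSystem": "egeria",  # assumption: a label for the source system
        "userSpecifiedType": "business_glossary",
    }


def create_entry_request(project, location, entry_group, entry_id, body, token):
    """Prepare the POST request that creates an Entry; nothing is sent here."""
    url = (
        "https://datacatalog.googleapis.com/v1"
        f"/projects/{project}/locations/{location}"
        f"/entryGroups/{entry_group}/entries?entryId={entry_id}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = create_entry_request(
    "my-project", "us-central1", "glossaries", "sales_glossary",
    glossary_entry("Sales Glossary", "Vocabulary of the Sales department."),
    token="PLACEHOLDER_TOKEN",
)
print(request.get_method(), request.full_url)
# With a valid access token, urllib.request.urlopen(request) would send it.
```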

The automation level to pursue depends on business requirements and technical concerns, especially regarding API availability and event notification mechanisms. I have seen at least three levels:

  • Level 1, on-demand ingestion: fits occasional metadata ingestion and can be used to deliver one-way sync, i.e. the Business Glossary is read from an external source and copied into Data Catalog from time to time.
  • Level 2, scheduled or real-time ingestion: a second layer is required if the external source is an information system that keeps the corporate Business Glossary alive and kicking. Scheduled jobs or a real-time event bus that gets metadata from that system and keeps Data Catalog copies synchronized are usually the way to fulfill such a requirement.
  • Level 3, two-way sync: more sophisticated automation is needed to deliver two-way synchronization, which means the Business Glossary can be modified both in the source system and Data Catalog, and they must be in sync. More than having scheduled jobs or real-time event notifications, it requires appropriate access control on both sides.
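
For the one-way scenarios (levels 1 and 2), the core of the job is comparing what the source holds with what Data Catalog holds and deciding what to create, update, or delete. Below is a minimal sketch of that comparison, assuming both sides can be loaded as dictionaries keyed by the Term guid; the function name and the sample data are hypothetical.

```python
def plan_one_way_sync(source_terms, catalog_terms):
    """Compare two {guid: term} dictionaries and plan a one-way sync.

    source_terms comes from the external Glossary (the source of truth);
    catalog_terms reflects what is currently stored in Data Catalog.
    """
    create = [guid for guid in source_terms if guid not in catalog_terms]
    update = [
        guid
        for guid in source_terms
        if guid in catalog_terms and source_terms[guid] != catalog_terms[guid]
    ]
    delete = [guid for guid in catalog_terms if guid not in source_terms]
    return {"create": sorted(create), "update": sorted(update), "delete": sorted(delete)}


# Made-up data, for illustration only.
source = {
    "guid-1": {"name": "Customer name"},
    "guid-2": {"name": "Street name", "summary": "Updated summary"},
}
catalog = {
    "guid-2": {"name": "Street name"},
    "guid-3": {"name": "Retired term"},
}
print(plan_one_way_sync(source, catalog))
```

Two-way sync (level 3) cannot be reduced to such a comparison, since it also needs conflict resolution and access control on both sides.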

Setting up a read-only Glossary

Depending on how the “Data Catalog Glossary” integrates with an externally-managed Glossary, users might want to have it totally or partially read-only in GCP. This requirement can be fulfilled with appropriate Projects and IAM setup, as follows.

GCP Architecture: suggested projects structure for read-only Business Glossary support in Google Data Catalog

The Main Project hosts all Glossary Entries, Specification and Relationship Templates, and Specification Tags, which are ideally managed by automated processes through a Service Account, with no human interaction. The Service Account is expected to have elevated privileges: DataCatalog entryGroup Owner, DataCatalog entry Owner, Data Catalog TagTemplate Owner, and Data Catalog Tag Editor.

Glossary users coming from Dependent Projects, such as Data Engineers or Analysts, should not have more than Data Catalog Viewer and Data Catalog TagTemplate User IAM Roles in the Main Project. This means they can view the Glossary Entries, Specification and Relationship Templates, and Specification Tags, but not edit them.

People might be able to use the managed Templates to create Data Catalog Tags, though. Tags that represent Semantic Assignments are a good example. Such Tags are attached to Entries that belong to the Dependent Projects, where users must have the Data Catalog Viewer and Data Catalog Tag Editor IAM Roles to get the job done.
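
The role assignments described in this section can be summarized as IAM policy bindings. The sketch below only builds the bindings as dictionaries; the role IDs are my mapping of the role titles mentioned above, and the member identities are hypothetical.

```python
# Role IDs mapped from the role titles used in this section.
SERVICE_ACCOUNT_ROLES_MAIN = [
    "roles/datacatalog.entryGroupOwner",
    "roles/datacatalog.entryOwner",
    "roles/datacatalog.tagTemplateOwner",
    "roles/datacatalog.tagEditor",
]
USER_ROLES_MAIN = [
    "roles/datacatalog.viewer",
    "roles/datacatalog.tagTemplateUser",
]
USER_ROLES_DEPENDENT = [
    "roles/datacatalog.viewer",
    "roles/datacatalog.tagEditor",
]


def bindings(roles, members):
    """Build IAM policy bindings granting each role to all given members."""
    return [{"role": role, "members": sorted(members)} for role in roles]


# Hypothetical identities, for illustration only.
service_account = "serviceAccount:glossary-sync@main-project.iam.gserviceaccount.com"
analysts = "group:data-analysts@example.com"

main_project_policy = (
    bindings(SERVICE_ACCOUNT_ROLES_MAIN, [service_account])
    + bindings(USER_ROLES_MAIN, [analysts])
)
dependent_project_policy = bindings(USER_ROLES_DEPENDENT, [analysts])

for binding in main_project_policy:
    print(binding["role"], binding["members"])
```

Notice that analysts get no owner or editor role in the Main Project, which is what keeps the Glossary read-only for them.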

A proof of concept

I’ve created a GitHub repository to host a Python package that anyone can use to see the proposed model in action: github.com/ricardolsmendes/datacatalog-custom-model-manager.

By the way, that piece of code roughly implements the level-1 automation strategy explained before and can be used to quickly validate hypotheses when adding new features to custom models.

There are sample input files in the sample-input/egeria-business-glossary folder. I set up a project in GCP and ran the code, getting the results presented below:

Google Data Catalog screenshot 1: Custom Entry and Tag mapping an Egeria Glossary Term

The blue box shows metadata from a Custom Entry mapped from an Egeria Glossary Term. The yellow box shows the specification Tag used to enrich its metadata, adding fields that are not available in the Entry.

Google Data Catalog screenshot 2: Tag mapping an Egeria Semantic Assignment

The above popup and the yellow arrow show a Semantic Assignment. It allows users to know the column ctm_nm, from a BigQuery Table, stores names, and was validated by someone. The same feature allows users to know the ctm_es column stores educational stage information (please notice the Semantic — Education Tag).

That’s pretty much proof that the proposed model works in Data Catalog :).

Remarks

Although helpful, working with such a custom model has its caveats…

  • A considerable number of API calls is required to perform the initial load, depending on the number of Glossary Terms and Semantic Assignments to be ingested into Data Catalog: 1 Create Entry + 2 Create Tag Template + 1 Create Tag requests per Glossary Term, plus 1 Create Tag request per Semantic Assignment.
  • The Data Catalog UI might leave something to be desired when it comes to Business Glossary capabilities that the product does not natively support yet, such as parent-child Entry browsing.
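
To get a feeling for the first caveat, the request counts listed above can be turned into a quick estimate. The function name and the sample volumes are mine, just for illustration.

```python
def initial_load_requests(glossary_terms, semantic_assignments):
    """Estimate the API requests needed for the initial load.

    Per Glossary Term: 1 Create Entry + 2 Create Tag Template + 1 Create Tag.
    Per Semantic Assignment: 1 Create Tag.
    """
    requests_per_term = 1 + 2 + 1
    return glossary_terms * requests_per_term + semantic_assignments


# Hypothetical volumes: 100 Terms and 250 Semantic Assignments.
print(initial_load_requests(100, 250))  # 100 * 4 + 250 = 650
```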

Wrapping up

Business Glossary has no native support in Google Data Catalog, but this is not a blocking issue for companies that are adopting Google Cloud and need a Glossary to fulfill their Data Governance requirements. Data Catalog is very flexible; thus, users can build custom metadata models on top of it and reduce the impact of missing features.

I hope the ideas I brought up in the present article help your company to deploy a Business Glossary in Data Catalog and leverage it to strengthen its Data Governance practices.

Cheers,
