Business Glossary support in Google Data Catalog
More and more companies migrate their data workloads to the cloud every day. At the same time, new data protection regulation programs become effective around the globe. Such simultaneous events lead companies to enforce their Data Governance standards intending to avoid legal charges due to inappropriate data management.
Data Governance frameworks are hence becoming more common, but providers do not necessarily speak the same language, which is to be expected since each player has its own strategy to tackle problems. On the other hand, customers have distinct requirements that are not fulfilled by a single platform. This brings something new to the market: the need to integrate complementary tools in order to build stronger Data Governance ecosystems.
Metadata Management is a common building block of Data Governance frameworks, and the so-called Business Glossary component helps me to bring up a clear example of the scenario described in the previous paragraph.
A Business Glossary differs from a Data Dictionary in that its focal point, Data Governance, goes beyond a Data Warehouse or database. A Business Glossary is a means of sharing internal vocabulary within an organization. Most Business Glossaries share certain characteristics, such as standard, documented Data Definitions and clear definitions that explain exceptions, synonyms, or variants.
Among other reasons, companies use it to:
- Enable understanding of the core business concepts and terminology.
- Highlight how vocabulary may differ across business functions.
- Increase trust in a company’s data.
- Reduce the risk that data will be misused due to an inconsistent understanding of the business concepts.
Although a Business Glossary is a first-class citizen of products such as Alex, Ataccama, IGC, and Informatica, it is not part of Google Cloud’s Data Catalog, which is more focused on simplifying Data Discovery at any scale.
One might wonder whether Data Catalog Tags address the issue of describing and classifying data assets. I agree, they do. So why do we need a Glossary? Because a Glossary is more than this. Let me use two additional features to clarify: (1) its resources are usually organized in a category-based hierarchy for better user experience; and (2) there are particular relationships between Glossary Terms that can result in collaboratively built knowledge, e.g. synonyms lists. Such features are not natively supported by Data Catalog.
Well, what if a company interested in Google Data Catalog has Business Glossary support as a mandatory requirement? Is it a blocking issue? Not at all! Data Catalog’s flexible entity model plus fine-grained IAM Roles are helping hands for those who want to build elementary Business Glossary support on top of Google Data Catalog. This is what I’m going to cover in the next sections.
Disclaimer: all opinions expressed are my own, and represent no one but myself… They come from the experience of being an early adopter of both Google Data Catalog and Egeria.
An Egeria-based custom model
There are multiple Business Glossary providers, each with its own proprietary entity model. This is where Egeria comes in: it is an Open Metadata and Governance project that promotes metadata exchange between tools and platforms.
Egeria’s metadata types are open, which is why I’m going to use them as a reference to explain the custom model in Google Data Catalog. In any case, the model is expected to be adaptable and extensible to fit real-life requirements, even when the glossary metadata comes from other sources. Hopefully, my reasoning will be good enough to make this clear for you :).
More than providing the Open Metadata Types, Egeria actually works as an enterprise metadata “broker”. Although it is possible to connect Egeria and Data Catalog through the Open Connector Framework (OCF), this is out of this article’s scope. The focus here is on designing a custom model for Business Glossary support in Data Catalog, simply leveraging Egeria types as a reference.
Mapping Entities and Relationships into Entries, Templates, and Tags
Egeria has an extensive set of classes to fully represent a Business Glossary no matter what tool it comes from. For the sake of clarity, I will keep the sample model concise yet practical to describe how introductory Open Metadata Entities and Relationships are mapped into Data Catalog Custom Entries, Templates, and Tags. Mapping entities from one end to another is usually the most difficult part of the job and extending the model is a matter of adding new classes to the below set.
- A Glossary is a collection of related semantic definitions, represented as the Glossary entity in the Open Metadata Type System;
- “The vocabulary for the Glossary is documented using Terms. Each Term represents a concept of short phrase in the vocabulary.” A Term, represented as the GlossaryTerm entity, is owned by a Glossary;
- The Semantic Assignment, represented as the SemanticAssignment relationship, is used to assign a Term to a given asset (e.g. BigQuery table or column), which means the Term describes the meaning of that asset.
Please take a look at the diagrams below, which present the proposed mappings visually. Bear in mind that the grayed classes’ stereotypes are actual Data Catalog types, while the class names are clues to the Open Metadata Types their instances refer to.
Glossary and Glossary Term
The Glossary itself is mapped as a Custom Entry with userSpecifiedType = business_glossary. Nothing special here…
Glossary Terms are mapped as Custom Entries with userSpecifiedType = glossary_term. The standard fields of a Custom Entry are not enough to persist all Glossary Term information, so they need to be extended. To fulfill this requirement, we can leverage Data Catalog’s flexibility and use Tags created from a particular Tag Template called Glossary Term Specification: their fields increase the metadata storage capabilities of the Custom Entry. Each Glossary Term Entry should have a Tag created from this Template to store metadata gathered from its underlying Glossary Term in Egeria.
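The mapping above can be sketched as REST request bodies, written here as plain Python dictionaries. This is a minimal illustration rather than the exact model used in this article: the JSON keys follow the Data Catalog REST resources, while the template ID `glossary_term_specification`, the `egeria` system label, the choice of only two fields, and the function names are assumptions made for the example.

```python
"""Sketch: request bodies for a Glossary Term Entry and its specification Tag."""

SPEC_TEMPLATE_ID = "glossary_term_specification"  # assumed template ID


def glossary_term_entry(display_name, description):
    """Body for an entries.create call: a Custom Entry representing a Term."""
    return {
        "displayName": display_name,
        "description": description,
        "userSpecifiedType": "glossary_term",
        "userSpecifiedSystem": "egeria",  # assumed label for the source system
    }


def specification_template():
    """Body for a tagTemplates.create call: fields the Entry cannot hold itself."""
    string_type = {"primitiveType": "STRING"}
    return {
        "displayName": "Glossary Term Specification",
        "fields": {
            "glossaryName": {"displayName": "Glossary name", "type": string_type},
            "guid": {"displayName": "GUID", "type": string_type},
        },
    }


def specification_tag(project, location, glossary_name, guid):
    """Body for a tags.create call, attached to the Glossary Term Entry."""
    template = (
        f"projects/{project}/locations/{location}"
        f"/tagTemplates/{SPEC_TEMPLATE_ID}"
    )
    return {
        "template": template,
        "fields": {
            "glossaryName": {"stringValue": glossary_name},
            "guid": {"stringValue": guid},
        },
    }
```

Keeping the three bodies in separate functions mirrors the three kinds of API requests involved: one for the Entry, one for the shared Template, and one for the Tag that binds them.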
This design also enables us to settle additional features. Linking Terms to their parent Glossary, for instance: the glossaryName field of a Glossary Term Specification Tag can be used to find all Terms that belong to a given Glossary. Another example: the guid field, which uniquely identifies a Glossary Term no matter where it is stored, can be used to keep the Terms synchronized between Egeria and Data Catalog.
Other predefined Tag Templates derive from each Glossary Term to enable their Semantic Assignments in a 1:1 relationship: if there are two Glossary Terms, Customer name and Street name, there will be two Glossary Term Semantic Assignment Tag Templates, identified by IDs such as semantic_street_name. They are used to create Tags that represent the Semantic Assignments, linking Glossary Terms to their related assets.
Please notice the Glossary Term Semantic Assignment Tags, created from those particular Templates, are used to describe the meaning of ordinary Data Catalog Entries or Schema Columns.
The rationale behind this design is that, at the time of writing (November 2020), Data Catalog supports attaching only a single Tag per Template to a given asset. Having one Template per Glossary Term is therefore a feasible way to set more than one Semantic Assignment on the same asset.
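As a rough sketch of this one-Template-per-Term idea, the snippet below derives a Template ID from a Term name and builds the body of a Semantic Assignment Tag. The helper names, the `validated_by` field, and the truncation to 64 characters (my understanding of the Template ID length limit) are assumptions for illustration, not part of the original model.

```python
import re


def semantic_template_id(term_name):
    """Derive a Tag Template ID such as semantic_street_name from a Term name."""
    # Keep only lowercase letters, digits, and underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", term_name.lower()).strip("_")
    return f"semantic_{slug}"[:64]


def semantic_assignment_tag(project, location, term_name, validated_by,
                            column=None):
    """Body of a Tag linking a Term to an Entry or to one of its columns."""
    template = (
        f"projects/{project}/locations/{location}"
        f"/tagTemplates/{semantic_template_id(term_name)}"
    )
    tag = {
        "template": template,
        # Hypothetical field: a Tag needs at least one field value.
        "fields": {"validated_by": {"stringValue": validated_by}},
    }
    if column is not None:
        tag["column"] = column  # omit to describe the whole Entry
    return tag
```

Because each Term resolves to a different Template, two Tags built for `Customer name` and `Street name` can coexist on the same asset without hitting the single-Tag-per-Template restriction.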
Supporting more features
Now you’ve got the foundation to extend the model by adding more features to the Data Catalog Glossary according to your needs. Would you like to try bringing the Glossary Category entity and its related Term Categorization relationship to Data Catalog as an exercise? Or maybe something related to Synonyms…
Is automation required?
Yes, it is, because there’s no way to manually create Data Catalog Custom Entries through the UI. The good news is that simple Python scripts or curl requests are enough to get started.
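To give an idea of how small such a script can be, here is a hedged sketch that prepares (but does not send) the HTTP request for creating a Custom Entry through the Data Catalog REST API, using only the Python standard library. The function name and the sample values are assumptions; the access token is expected to come from somewhere else, for example `gcloud auth print-access-token`.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://datacatalog.googleapis.com/v1"


def build_create_entry_request(project, location, entry_group, entry_id,
                               entry, access_token):
    """Prepare the POST request that creates a Custom Entry in an Entry Group."""
    parent = f"projects/{project}/locations/{location}/entryGroups/{entry_group}"
    query = urllib.parse.urlencode({"entryId": entry_id})
    return urllib.request.Request(
        url=f"{API_ROOT}/{parent}/entries?{query}",
        data=json.dumps(entry).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would actually send the request. The same URL, headers, and JSON body can be used in an equivalent curl command.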
The automation level to pursue depends on business requirements and technical concerns, especially in terms of API availability and event notification mechanisms. I have seen at least three levels:
- Level 1, on-demand ingestion: fits occasional metadata ingestion and can be used to deliver one-way sync, i.e. the Business Glossary is read from an external source and copied into Data Catalog from time to time.
- Level 2, scheduled or real-time ingestion: a second layer is required if the external source is an information system used to keep the corporate Business Glossary alive and kicking. Scheduled jobs or a real-time event bus that gets metadata from that system and keeps Data Catalog copies synchronized are usually the way to fulfill such a requirement.
- Level 3, two-way sync: more sophisticated automation is needed to deliver two-way synchronization, which means the Business Glossary can be modified both in the source system and Data Catalog, and they must be in sync. More than having scheduled jobs or real-time event notifications, it requires appropriate access control on both sides.
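Whatever the level, the core of a one-way sync is deciding which Terms to create, update, or delete in Data Catalog. A minimal sketch of that decision, assuming both sides have already been read into dictionaries keyed by the guid of each Glossary Term (the function name and the dictionary layout are hypothetical):

```python
def plan_one_way_sync(source_terms, catalog_terms):
    """Compare Terms by guid and list the changes Data Catalog needs.

    Both arguments map a Term guid to a dict with that Term's metadata.
    The source system is treated as the single source of truth.
    """
    to_create = sorted(g for g in source_terms if g not in catalog_terms)
    to_update = sorted(
        g for g in source_terms
        if g in catalog_terms and source_terms[g] != catalog_terms[g]
    )
    to_delete = sorted(g for g in catalog_terms if g not in source_terms)
    return {"create": to_create, "update": to_update, "delete": to_delete}
```

An on-demand run and a scheduled job could share this function and differ only in what triggers it. Two-way sync would additionally need conflict handling, which this sketch does not attempt.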
Setting up a read-only Glossary
Depending on how the “Data Catalog Glossary” integrates with an externally-managed Glossary, users might want to have it totally or partially read-only in GCP. This requirement can be fulfilled with appropriate Projects and IAM setup, as follows.
The Main Project hosts all Glossary Entries, Specification and Relationship Templates, and Specification Tags, which are ideally managed by automated processes through a Service Account, with no human interaction. The Service Account is expected to have elevated privileges: Data Catalog EntryGroup Owner, Data Catalog Entry Owner, Data Catalog TagTemplate Owner, and Data Catalog Tag Editor.
Glossary users coming from Dependent Projects, such as Data Engineers or Analysts, should not have more than the Data Catalog Viewer and Data Catalog TagTemplate User IAM Roles in the Main Project. This means they can view the Glossary Entries, Specification and Relationship Templates, and Specification Tags, but not edit them.
People can still use the managed Templates to create Data Catalog Tags, though. Tags that represent Semantic Assignments are a good example. Such Tags are attached to Entries that belong to the Dependent Projects, where users must have the Data Catalog Viewer and Data Catalog Tag Editor IAM Roles to get the job done.
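The role assignments described in this section can be summarized as IAM policy bindings. The sketch below only builds the `bindings` list as data; it does not call any API. The role IDs are the predefined Data Catalog roles matching the names above, while the persona labels, the function name, and any member addresses passed in are placeholders.

```python
# Roles granted in the Main Project, per persona (persona names are made up).
MAIN_PROJECT_ROLES = {
    "automation": [
        "roles/datacatalog.entryGroupOwner",
        "roles/datacatalog.entryOwner",
        "roles/datacatalog.tagTemplateOwner",
        "roles/datacatalog.tagEditor",
    ],
    "glossary_user": [
        "roles/datacatalog.viewer",
        "roles/datacatalog.tagTemplateUser",
    ],
}

# Roles granted to the same users in each Dependent Project.
DEPENDENT_PROJECT_ROLES = {
    "glossary_user": [
        "roles/datacatalog.viewer",
        "roles/datacatalog.tagEditor",
    ],
}


def policy_bindings(roles_by_persona, members_by_persona):
    """Build IAM policy bindings: one entry per role, with its members."""
    members_by_role = {}
    for persona, roles in roles_by_persona.items():
        for role in roles:
            members = members_by_persona.get(persona, [])
            members_by_role.setdefault(role, []).extend(members)
    return [
        {"role": role, "members": sorted(members)}
        for role, members in sorted(members_by_role.items())
        if members
    ]
```

Keeping the two role maps separate makes the read-only intent easy to review: glossary users never receive an Owner or Editor role in the Main Project.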
A proof of concept
I’ve created a GitHub repository to host a Python package that anyone can use to see the proposed model in action: github.com/ricardolsmendes/datacatalog-custom-model-manager.
By the way, that piece of code kind of addresses the level-1 automation strategy explained before and can be used to quickly validate hypotheses when adding new features to custom models.
There are sample input files in the sample-input/egeria-business-glossary folder. I set up a project in GCP and ran the code, getting the results presented below:
The blue box shows metadata from a Custom Entry mapped from an Egeria Glossary Term. The yellow box shows the specification Tag used to enrich its metadata, adding fields that are not available in the Entry.
The above popup and the yellow arrow show a Semantic Assignment. It lets users know that the ctm_nm column, from a BigQuery Table, stores names and that this assignment was validated by someone. The same feature lets users know that the ctm_es column stores educational stage information (please notice the Semantic — Education Tag).
That’s pretty much it to prove the proposed model works in Data Catalog :).
Although helpful, working with such a custom model has its caveats…
- A considerable number of API calls is required to perform the initial load, depending on the number of Glossary Terms and Semantic Assignments to be ingested into Data Catalog: 1 Create Entry + 2 Create Tag Template + 1 Create Tag requests per Glossary Term; 1 Create Tag request per Semantic Assignment.
- The Data Catalog UI might leave something to be desired when it comes to Business Glossary capabilities that are not natively supported by the product yet, such as parent-child Entry browsing.
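To get a feeling for the first caveat, the request counts quoted above can be turned into a quick estimate. The function name is made up for this example, and the Glossary Entry itself and any retries are ignored.

```python
def initial_load_requests(glossary_terms, semantic_assignments):
    """Estimate the API requests needed for the initial load.

    Per Glossary Term: 1 Create Entry + 2 Create Tag Template + 1 Create Tag.
    Per Semantic Assignment: 1 Create Tag.
    """
    requests_per_term = 1 + 2 + 1
    return glossary_terms * requests_per_term + semantic_assignments
```

For instance, `initial_load_requests(100, 250)` returns 650, so API quotas are worth checking before loading a large Glossary.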
Business Glossary has no native support in Google Data Catalog, but this is not a blocking issue for companies that are adopting Google Cloud and need a Glossary to fulfill their Data Governance requirements. Data Catalog is very flexible, so users can build custom metadata models on top of it and reduce the impact of missing features.
I hope the ideas I brought up in this article help your company deploy a Business Glossary in Data Catalog and leverage it to strengthen its Data Governance practices.
References
- What is a Business Glossary?: dataversity.net/what-is-a-business-glossary
- Business Glossary Basics: dataversity.net/business-glossary-basics
- Egeria — Open Metadata and Governance: egeria.odpi.org
- Egeria — Open Metadata Type System: egeria.odpi.org/open-metadata-publication/website/open-metadata-types
- Egeria — Open Metadata Models for Glossary and Semantics: egeria.odpi.org/open-metadata-publication/website/open-metadata-types/Area-3-models.html
- Creating custom Data Catalog entries: https://cloud.google.com/data-catalog/docs/how-to/custom-entries