Metamodels — Foundation of Data Governance platforms — Part 1

Anand Govindarajan
8 min read · Sep 18, 2023


Capabilities needed to manage metamodels and their Design considerations

Metamodels, or metadata models, are one of the foundational components of Data Governance (DG) platforms. Just as data models are foundational for applications to persist and query data, the metamodel supports metadata storage and querying in DG platforms.

Metamodel design is driven by the DG use cases and the metadata they operate on, just as data models are driven by application use cases. Metadata is typically categorized as Business metadata (Business Term, Business Process, Metric/KPI, etc.), Data asset metadata (Data Entity, Data Attribute, Table, Column, and reference data such as code sets and code values), Governance metadata (Policy, Business Rule, Data Quality rule and metric), and Technical/Technology metadata (Server, Database, Application, System).

Two key expectations from the metamodel design are (a) the ability to support metadata from a variety of sources, whether data and technology metadata from various platforms or business metadata from various processes, and (b) the ability to provide user-friendly nomenclature for these metadata objects, so that users can easily search and understand the metadata and relate it to the platform it was sourced from. For example, when sourcing metadata from Tableau, naming the objects ‘Tableau Workbooks’, ‘Tableau Worksheets’, etc. makes them more meaningful and relatable than naming them just Reports or Dashboards.

In Part 1 of this article, we will look at the various capabilities a DG platform should provide to manage these metamodels. I will also review some key considerations when designing the metamodel, based on experience in large DG implementations.

Apart from the core metamodel design capability, features such as managing the organization of metadata within the DG platform and managing the metadata life cycle through stewardship workflows, mapped to relevant roles and permissions, cannot be left out, so I include them here as well.

Here goes the list:

Richness of the built-in Metadata types, attributes and relations — The richness of the metadata types a DG platform provides built-in, covering the Business, Data, Governance and Technology metadata types, is a very good starting point for metamodel design. Like an industry data model that accelerates data model design, the built-in metamodel guides the setup of the initial metamodel. For example, when we look to define a Glossary, built-in metadata types such as Business Term or Acronym guide the DG team in deciding what attributes a Business Term needs to carry: the definition is key, and perhaps a relation to a synonym Business Term. Similarly, for an Acronym, the expanded form of the acronym captured as the definition, and a relation to a Business Term that provides the business definition, are key.

This is a good start to stand up a glossary. As users gain experience with the glossary, additional attributes and relations can evolve and be included in the metamodel.
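To make the glossary example concrete, here is a minimal Python sketch of how a Business Term and an Acronym might be modeled, with the Acronym relating back to the Term that carries the business definition. The class and attribute names are illustrative, not any vendor's API.

```python
from dataclasses import dataclass, field

@dataclass
class BusinessTerm:
    name: str
    definition: str                               # the key attribute for a Term
    synonyms: list = field(default_factory=list)  # relation to synonym Terms

@dataclass
class Acronym:
    name: str
    expanded_form: str               # captured as the Acronym's definition
    stands_for: BusinessTerm = None  # relation to the defining Business Term

# Example: an acronym resolved through its related Business Term
clv = BusinessTerm("Customer Lifetime Value",
                   "Projected net revenue from a customer over the relationship")
acronym = Acronym("CLV", "Customer Lifetime Value", stands_for=clv)
print(acronym.stands_for.definition)
```

The point of the sketch is the shape of the metamodel: the definition lives on the Term, while the Acronym carries only its expansion plus a relation.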

Considerations: It is good practice to leverage as much built-in metadata as possible, given that it evolves with the product/platform and integrates well with the rest of the product features. However, if the built-in richness is limited, restricting ourselves to it is not a good idea.

Ability to define custom Metadata types — This is a key capability for making your DG platform extensible and meeting the two expectations from metamodel design mentioned above. The nature of metadata sources varies, and force-fitting them into standard built-in asset types is not a good idea. I see this as one of the key differentiators among DG platforms: force-fitting different metadata into the same type not only hurts the user experience but also makes it difficult to vary attributes and relations based on what the object represents. For example, Alation as a platform provides a generic Article metadata/object type. If you want to define a ‘Metric’ or even an ‘Application’ metadata type, the Article object has to be used and customized to handle both. When viewing a Metric, we need to hide the attributes and relations of an Application, and vice versa.
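The difference between a generic catch-all type and purpose-built custom types can be sketched as a small type registry. This is a hypothetical illustration, not any platform's API; the type and attribute names are made up.

```python
from dataclasses import dataclass, field

@dataclass
class MetadataType:
    name: str
    attributes: list
    builtin: bool = False

class Metamodel:
    """A toy registry of metadata types in a DG platform."""
    def __init__(self):
        self.types = {}

    def register(self, mtype: MetadataType):
        self.types[mtype.name] = mtype

model = Metamodel()
# Built-in generic type (the "Article"-style catch-all)
model.register(MetadataType("Article", ["title", "body"], builtin=True))
# Custom types carry attributes that actually fit what they describe
model.register(MetadataType("Metric", ["formula", "unit", "refresh_frequency"]))
model.register(MetadataType("Application", ["vendor", "version", "owner_team"]))

print(sorted(t.name for t in model.types.values() if not t.builtin))
```

With distinct types, a Metric never needs to hide an Application's attributes; each type declares only what is relevant to it.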

Considerations: Challenges arise when a custom metadata type we defined is later also released as part of the product capability. We need to maintain a clear distinction between the custom and the built-in type so that end users are not confused or impacted.

Ability to define custom attributes and relations — As an organization matures in its DG journey, the richness of the metadata captured also goes up. The more attributes and relations captured, the better the metadata can be searched and the better the context it provides about the asset it describes, which is the whole point of metadata management. Say you start with a metadata type called Data Product, whose purpose is to describe the actual data products in the organization. As a concept this is still evolving, and people are figuring out the best way to build these products. From a governance standpoint, we want to capture their metadata, so we start with an initial list of attributes and relations commonly seen across these data products. As these objects mature, we find a need to capture more metadata characteristics around them. This is when the capability to define custom attributes and relations becomes key. I kept this separate from the previous item on custom metadata types because some platforms allow defining a custom metadata type only by reusing the built-in attributes and relations, with limited custom ones. Flexibility to add custom attributes/relations to built-in metadata types as well as custom metadata types is important.
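The Data Product evolution described above can be sketched as a type that accepts new attributes over time. Again a hypothetical illustration; the attribute names (`sla`, `quality_score`, etc.) are invented for the example.

```python
class MetadataType:
    """A toy metadata type whose attribute list can grow after definition."""
    def __init__(self, name, attributes):
        self.name = name
        self.attributes = list(attributes)

    def add_attribute(self, attr):
        # Custom attributes extend the type in place, without redefining it
        if attr not in self.attributes:
            self.attributes.append(attr)

# Start the "Data Product" type with a common initial attribute set
data_product = MetadataType("Data Product",
                            ["name", "description", "owner", "domain"])

# As the concept matures, richer characteristics are added incrementally
for attr in ["sla", "consumption_endpoints", "quality_score"]:
    data_product.add_attribute(attr)
```

The `add_attribute` guard against duplicates mirrors the consideration below: reuse an existing characteristic instead of proliferating near-identical ones.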

Considerations: Here again, leveraging the built-in characteristics is good practice unless it is limiting. When defining custom attributes/relations, consider how to make them generic enough to be reused, thus avoiding a proliferation of such characteristics.

Support for simple and complex relations — This terminology comes from the Collibra platform, but it is one of the most powerful features in metamodel design. A simple relation relates two metadata objects, whereas a complex relation relates more than two. Linking a Business Term to a Column to describe the column is a simple relation, whereas linking the sources and targets of a data pipeline along with its field-level mapping is a complex relation. A complex relation can sometimes be viewed as a combination of simple relations, but most of the time it is a relation between two objects in the context of a third (for example, a SQL query used for a Metric cannot be related in isolation, without the context of the System it is executed on).
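The structural difference between the two relation kinds can be shown with two small data classes. This is a generic sketch of the concept, not Collibra's actual model; the role and object names are invented.

```python
from dataclasses import dataclass

@dataclass
class SimpleRelation:
    # Relates exactly two metadata objects, e.g. Business Term -> Column
    head: str
    role: str
    tail: str

@dataclass
class ComplexRelation:
    # Relates three or more objects in a single context, e.g. a Metric's
    # SQL query together with the System it is executed on
    relation_type: str
    members: dict  # role name -> metadata object

describes = SimpleRelation("Business Term: Net Sales", "describes",
                           "Column: SALES.NET_AMT")

metric_query = ComplexRelation(
    "metric implementation",
    {"metric": "Net Sales",
     "query": "SELECT SUM(net_amt) FROM sales",
     "system": "Finance Data Warehouse"})
```

Splitting `metric_query` into two simple relations (metric-to-query, query-to-system) would lose the fact that all three belong to one context.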

Considerations: Most scenarios warrant only a simple relation, and this is the practice suggested even by the products, as simple relations make it easier to load content and easier for users to understand. However, genuinely complex relation scenarios cannot be replaced by simple relations.

Support for defining attributes on relations — As an extension of the complex relation capability above, it is useful to define attributes that describe the relations themselves. In the data pipeline example, it helps to capture the actual transformation logic as an attribute on the field mapping between source and target systems. These attributes cannot be captured as part of either the source system or the target; they exist only in the context of the source-to-target field mapping. For example, say a Transaction table in a POS system maps to the Sales Rollup table in a Retail Data Mart, with field mappings between the two. For each mapped column pair, we can record the transformation logic: for transaction_amt to the Sales column, we can capture ‘select sum(e.transaction_amt) from ePOS.Transaction e…’
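The POS-to-Data-Mart example can be sketched as a relation object that carries its own attribute. The field and system names are taken from the example above; the structure itself is a hypothetical illustration.

```python
from dataclasses import dataclass

@dataclass
class FieldMapping:
    # A relation between a source and a target column. The transformation
    # logic is an attribute of the mapping itself: it belongs to neither
    # the source column nor the target column alone.
    source: str
    target: str
    transformation: str

mapping = FieldMapping(
    source="ePOS.Transaction.transaction_amt",
    target="RetailMart.SalesRollup.Sales",
    transformation="select sum(e.transaction_amt) from ePOS.Transaction e ...")
```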

Ability to organize metadata in a well-defined structure — Now we look at capabilities that make the metadata foundation robust enough to support any data governance use case. One such capability is the taxonomy, which defines how the different metadata objects are organized within the platform for user consumption. There are two levels of organization: (a) organizing metadata objects by access and ownership, e.g. Metrics belonging to the Finance function should be accessible only by that team, whereas enterprise-level Metrics are accessible to all users; and (b) organizing by metadata type, e.g. grouping business terms as a glossary, data models as a logical data dictionary, and schemas/tables/columns under a physical data dictionary. Both are key to the user experience within the data governance platform.
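Both levels of organization can be captured in one tree of domains, where each node groups metadata and also carries an access rule. A minimal sketch, with invented domain and group names:

```python
class Domain:
    """A taxonomy node: groups metadata objects and controls access to them."""
    def __init__(self, name, allowed_groups, children=None):
        self.name = name
        self.allowed_groups = set(allowed_groups)
        self.children = children or []

    def can_access(self, group):
        return group in self.allowed_groups

# Enterprise-level content is open to all; the Finance Metrics domain
# is restricted to the Finance team
taxonomy = Domain("Enterprise", {"all-users"}, [
    Domain("Glossary", {"all-users"}),
    Domain("Finance Metrics", {"finance-team"}),
])

finance = taxonomy.children[1]
print(finance.can_access("finance-team"))    # True
print(finance.can_access("marketing-team"))  # False
```

In a real platform the same tree would also anchor the type-based grouping (glossary, logical dictionary, physical dictionary) mentioned above.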

Considerations: Defining the taxonomy is as important as the metamodel design and has to be done in parallel and in sync with it. When a new metadata type is defined, there should be clarity on where objects of that type would sit, what roles would manage them, and so on.

Richness of the built-in roles and permissions — This is another key capability, governing how metadata objects may be created, read, updated, or deleted in the platform. Roles exist at two levels: (a) product module/capability level, to manage a specific module's capability, e.g. setting up a metadata connector requires the relevant roles to set up and manage connectors; and (b) resource level, at the metadata object level or the taxonomy folder/structure level, which governs all metadata within them. This provides granular permissions for a role to perform CRUD operations on objects as well as to participate in other stewardship activities.
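The two-level role model can be sketched as a permission table where module-level and resource-level permissions share one check. The role and permission strings are invented for the illustration:

```python
# Module-level permissions (e.g. managing connectors) and resource-level
# CRUD permissions on metadata objects, keyed by role
ROLES = {
    "connector-admin": {"module:connectors:manage"},
    "steward":         {"resource:glossary:create",
                        "resource:glossary:update",
                        "resource:glossary:read"},
    "consumer":        {"resource:glossary:read"},
}

def is_allowed(role, permission):
    """Check whether a role holds a given permission."""
    return permission in ROLES.get(role, set())

print(is_allowed("steward", "resource:glossary:update"))   # True
print(is_allowed("consumer", "resource:glossary:update"))  # False
```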

Ability to define custom roles and permissions — In large organizations, the number and variety of roles will warrant the ability to extend built-in roles and their corresponding permissions.

Considerations: As a best practice, it is better to create a custom role even if only a single permission differs from the built-in role; that makes management easier.

Richness of built-in Stewardship workflows — Stewardship workflows are key for enforcing data governance policies and standards and for managing the life cycle of metadata objects within the DG platform. Some platforms provide very basic capability, such as simply approving the creation or update of objects, whereas others go further to support multi-step approvals with voting mechanisms, assigning ownership/roles to metadata objects, triggering metadata ingestion connectors, and even interfacing with external systems through APIs.

Ability to build custom stewardship workflows — As we have seen above, what is offered built-in can jump-start the DG implementation, whereas as the DG use cases mature, the ability to extend the built-in capability becomes key. Custom workflows let us imagine scenarios where a metadata object's life cycle is not restricted to the platform but extends to drive actions in external platforms. For example, a built-in workflow for managing access to a table or data set might assign access to specific user groups after validating the relevant criteria, with the rest of the process assumed to be manual and outside the DG platform. A custom workflow can go a step further: initiate the actual access provisioning in the target platform, say Snowflake, or place a request with the relevant IT team in ServiceNow, and then reflect the provisioning status back in the DG platform. That is the power of custom workflows.
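The access-request scenario above can be sketched as a workflow function. `provision_access` is a stand-in for a real Snowflake or ServiceNow API call; the function names, dataset, and statuses are all invented for the illustration.

```python
def provision_access(system, dataset, user_group):
    # In a real custom workflow this would call the target platform's API
    # (e.g. Snowflake grants or a ServiceNow request); here we simulate
    # a successful provisioning so the sketch stays self-contained.
    return {"system": system, "dataset": dataset,
            "group": user_group, "status": "PROVISIONED"}

def access_request_workflow(dataset, user_group, approved):
    """Approve an access request, drive external provisioning, and
    reflect the resulting status back into the DG platform."""
    if not approved:
        return {"status": "REJECTED", "dataset": dataset}
    # Step beyond the built-in workflow: act on the external platform...
    result = provision_access("Snowflake", dataset, user_group)
    # ...then record the provisioning status against the metadata object
    return {"status": result["status"], "dataset": dataset}

print(access_request_workflow("SALES_MART.TXN", "analysts", approved=True))
```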

Considerations: Custom workflows should be built within the boundaries imposed by the DG product/platform and must not act as a loophole to break them. For example, the DG product might be a cloud-based platform that commits to not accessing an organization's on-prem systems directly; the workflows, despite their ability to make external API calls, cannot violate this boundary.

In the next part (Part 2) of this article, we will look at how three leading DG platforms — Collibra, Alation and MS Purview — support the above capabilities. Till then, I am eager to hear your thoughts and experiences on this topic!


Anand Govindarajan

Anand is a CDMP, CBIP, TOGAF 9 professional, with more than 28 years solving Data Management and Governance challenges for several Fortune 100 customers