Implementing the Metadata Lake…

Anand Govindarajan
Feb 10, 2023 · 4 min read


In my previous blog, ‘Building a case for a Metadata Lake’, I wrote about how integrating metadata from several platforms generates value in much the same way that a data integration store like a Data Lake or a Lakehouse does.

Before we decide on the options, let us list the key characteristics expected of this metadata store:

  • Flexibility to store different metadata object types — The metamodel should be customizable and extensible to support a variety of metadata object types, since we are looking to persist Business, Technical and Operational/Security metadata here. Managing these metadata types, i.e. the metamodel, is as challenging as managing an application data model or a data warehouse model. There needs to be a balance between which metadata types are standardized and which stay specific to their source platform. For example, should we standardize the report metadata from platforms such as Power BI, Tableau and Qlik as a single ‘Report’ type, or call them a ‘Report page’, ‘Worksheet’ or ‘Sheet’ respectively to keep them relevant for end users of each source platform?
  • Ability to extend the metadata object types — This is key because organizations onboard platforms into the lake incrementally, and we cannot foresee all metadata types upfront. It could be ‘Story’ metadata coming from a dashboarding platform, a ‘Sensitivity label’ from a Data Protection platform or an ‘Access policy’ from a Data Security platform.
  • Ability to link or relate metadata object types — This is the key characteristic that drives ‘metadata stitching’, the foundation for data lineage and related analysis. Whether it is a link between an ETL job and its source/target table structures, or a link between a metric in a report and the underlying data sets, these relationships drive root-cause analysis of data quality issues by tracing data back to its origin. Impact/dependency analysis, i.e. knowing what will break if a developer changes a table structure, is also driven by traversing these links. Another useful feature is propagating metadata along these links: if I know that one of the source columns holds PII data, that information can be propagated through the data pipelines all the way to the report attribute that uses it, enabling the attribute to be masked/obfuscated depending on the user role.
  • Ability to store rich and extensible properties/attributes for both the objects and their relations — This drives powerful searches and insights, since users can query on the properties of objects as well as of their relations. For example, you can retrieve all reports assigned to a specific owner (captured as a report property), or all reports linked to a business unit, where the business unit is itself another metadata object filtered on the required value. A quick sketch of such a model and its queries follows this list.
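
To make these characteristics concrete, here is a minimal sketch of such a store as an in-memory property graph, written in Python with networkx purely for illustration. The object types, property names and relationship names (Report, BusinessUnit, Column, BELONGS_TO, FEEDS, owner and so on) are hypothetical placeholders, not a prescribed metamodel.

```python
# A minimal, illustrative sketch (not an actual metadata lake implementation).
# All object types, properties and relationship names below are hypothetical.
import networkx as nx

g = nx.MultiDiGraph()

# Nodes carry an extensible 'type' plus an arbitrary set of properties.
g.add_node("rpt-001", type="Report", name="Customer Churn Analysis", owner="jane.doe")
g.add_node("bu-01", type="BusinessUnit", name="Retail Banking")
g.add_node("col-42", type="Column", name="email", pii=True)

# New object types can be onboarded later without any schema change.
g.add_node("story-07", type="Story", name="Quarterly Churn Story")

# Relationships are first-class and can carry their own properties.
g.add_edge("rpt-001", "bu-01", key="BELONGS_TO")
g.add_edge("col-42", "rpt-001", key="FEEDS", dq_score=0.92)

# Search on an object property: all reports assigned to a specific owner.
reports_by_owner = [
    n for n, attrs in g.nodes(data=True)
    if attrs.get("type") == "Report" and attrs.get("owner") == "jane.doe"
]

# Search across a relationship: all reports linked to a given business unit.
reports_in_bu = [
    src for src, dst, rel in g.edges(keys=True)
    if rel == "BELONGS_TO" and g.nodes[dst].get("name") == "Retail Banking"
]

print(reports_by_owner)  # ['rpt-001']
print(reports_in_bu)     # ['rpt-001']
```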

Let us look at three common options to address the above needs: a Relational database (e.g. SQL Server), a NoSQL Document database (e.g. MongoDB) and a Graph database (e.g. Neo4j).

[Image: Comparing the metadata store options, with a key for the comparison]

The Graph database clearly stands out as the option of choice.

In fact, a couple of powerful open-source Data Governance platforms/frameworks, Apache Atlas and Amundsen, use a graph-based metadata repository. Apache Atlas is the foundation for some popular commercial platforms such as Microsoft Purview and Atlan, whereas Amundsen was built by Lyft engineering for internal use and then open-sourced.

Let us now depict some of the metadata elements integrated into the lake (as discussed in the last blog) as objects in a typical graph model, and see how this enables some of the metadata integration use cases.

[Image: Partial graph model depicting the interlinking of metadata objects]

In the diagram, the circles or nodes represent instances of the metadata objects. The capsule next to each node indicates its object type, and the properties or attributes are listed next to the nodes. The relationships are depicted as arrows linking the nodes; the arrow labels indicate the nature of each relationship, and a couple of them carry their own properties (such as the DQ score).

Table and Column metadata are ingested from a Data Catalog platform. The data classifications (Data Class) and the associated policies (Access Policy) come from a Data Protection platform. The data quality rules (DQ Rule) applied based on the Data Class, together with their data quality scores (DQ Score), come from the Data Quality platform. The implemented policy (Access Policy) comes from the Data Security / Access Governance platform.
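
As a rough sketch of how these objects from the different platforms could land in a graph store such as Neo4j, the snippet below merges a few of the nodes and relationships from the diagram using the Neo4j Python driver. All labels, relationship types, property names and connection details are my own illustrative assumptions, not the actual metamodel of any of these platforms.

```python
# Illustrative sketch only: hypothetical labels, relationship types, properties
# and connection details for loading the ingested metadata into Neo4j.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LOAD_METADATA = """
// From the Data Catalog platform: table and column metadata
MERGE (t:Table {name: $table})
MERGE (c:Column {name: $column})
MERGE (t)-[:HAS_COLUMN]->(c)

// From the Data Protection platform: classification and sensitivity
MERGE (dc:DataClass {name: $data_class, pii: $pii})
MERGE (c)-[:CLASSIFIED_AS]->(dc)

// From the Data Quality platform: the rule applied to the column and its score
MERGE (rule:DQRule {name: $dq_rule})
MERGE (rule)-[:APPLIED_TO {dq_score: $dq_score}]->(c)

// From the Data Security / Access Governance platform: the implemented policy
MERGE (p:AccessPolicy {name: $policy})
MERGE (p)-[:PROTECTS]->(dc)
"""

with driver.session() as session:
    session.run(
        LOAD_METADATA,
        table="CUSTOMER", column="EMAIL",
        data_class="Email Address", pii=True,
        dq_rule="Email format check", dq_score=0.92,
        policy="Mask PII for non-privileged roles",
    )

driver.close()
```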

Leveraging the power of these linkages, we are able to propagate the ‘DQ Score’ and the ‘PII flag’ (or the sensitivity labels) downstream. The quality of the consuming report (Customer Churn Analysis) can thus be calculated from the DQ scores propagated from its sources, and any data issue in the report can be analyzed by tracing back to its source through these linkages, i.e. its ‘Lineage’.
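
What might that propagation look like as queries? Below is a hedged Cypher sketch (again via the Neo4j Python driver), assuming a hypothetical FEEDS lineage relationship from columns through pipelines to report attributes and reports, plus the dq_score and pii properties used in the earlier snippet; taking the minimum score is just one possible way to roll up report quality.

```python
# Sketch of lineage-based propagation queries. The FEEDS relationship, the
# ReportAttribute label and the dq_score/pii properties are assumed names,
# and min() is only one possible roll-up of report quality.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Roll up a report's quality from the DQ scores of every upstream column
# reachable over the lineage (FEEDS) relationships.
REPORT_DQ = """
MATCH (c:Column)-[:FEEDS*1..5]->(r:Report {name: $report})
MATCH (:DQRule)-[a:APPLIED_TO]->(c)
RETURN r.name AS report, min(a.dq_score) AS propagated_dq_score
"""

# Propagate the PII flag: which report attributes trace back to a PII column
# and therefore need to be masked/obfuscated?
PII_DOWNSTREAM = """
MATCH (c:Column)-[:CLASSIFIED_AS]->(:DataClass {pii: true}),
      (c)-[:FEEDS*1..5]->(attr:ReportAttribute)
RETURN DISTINCT attr.name AS attribute_to_mask
"""

with driver.session() as session:
    for rec in session.run(REPORT_DQ, report="Customer Churn Analysis"):
        print(rec["report"], rec["propagated_dq_score"])
    for rec in session.run(PII_DOWNSTREAM):
        print(rec["attribute_to_mask"])

driver.close()
```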

Do you now see the power of integrating all of these metadata objects into one Lake or Repository, the ‘1+1=3’ scenario? Let me know your thoughts and similar experiences.


Anand Govindarajan

Anand is a CDMP, CBIP and TOGAF 9 certified professional with more than 28 years of experience solving Data Management and Governance challenges for several Fortune 100 customers.