Building a case for a “Metadata Lake”

Anand Govindarajan
3 min read · Jan 2, 2023


Large organizations typically have a proliferation of platforms addressing similar use cases — whether due to M&As or decisions made by siloed groups for their own project needs. Then one day, when there is a top-down push to cut costs, a hurried exercise begins to review and rationalize the platforms.

Similarly, in the space of Data Governance, Privacy and Protection, there is a sudden proliferation of platforms. Apart from the common set of features and use cases they address, each platform also brings its own unique strengths. More often than not, these unique features are based on the platform's provenance.

A data privacy platform that has evolved into more of a Data Governance platform comes with strong features for data discovery, classification, and privacy/protection processes addressing key regulations (for example, the BigID Data Intelligence Platform). Platforms that have evolved from a metadata management solution come with a robust, extensible metamodel for managing many types of metadata under governance, along with governance workflows and automated metadata management processes (for example, Collibra Data Intelligence Cloud).

When reviewing these platforms, organizations see an opportunity for them to complement each other and enable more powerful use cases than either can deliver independently. Thus we are seeing many bi-directional connectors/integrations emerge between these platforms to exchange metadata.

Having been part of such connector developments, I see these integrations as point-to-point integrations. We know from our application integration experience that this is neither a scalable nor a preferred integration pattern.

Point-to-Point metadata integration
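The scaling problem is easy to make concrete: with point-to-point integration, every pair of platforms needs its own connector, so the connector count grows quadratically, while a hub-style setup needs only one connector per platform. A quick back-of-the-envelope sketch:

```python
def p2p_connectors(n: int) -> int:
    # Point-to-point: every pair of platforms needs its own connector.
    return n * (n - 1) // 2

def hub_connectors(n: int) -> int:
    # Hub (lake): each platform needs just one connector, to the hub.
    return n

for n in (3, 5, 10):
    print(f"{n} platforms: {p2p_connectors(n)} p2p connectors vs {hub_connectors(n)} hub connectors")
```

At ten platforms, that is 45 point-to-point connectors to build and maintain versus 10 against a central hub, which is the essence of the argument that follows.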

Don’t we see a case for a “Metadata Lake”, where each platform contributes the rich metadata it hosts, based on its capabilities, into the lake and consumes back what it needs from the lake? Truly a ‘1 + 1 = 3’ scenario.

A representative “Metadata Lake”
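To make the contribute/consume idea more tangible, here is a minimal sketch of such a shared store. All class and facet names (`MetadataLake`, `classification`, `glossary`) are hypothetical, chosen only to illustrate the pattern of each platform writing the metadata it is strongest at producing and reading back what others contributed:

```python
from collections import defaultdict


class MetadataLake:
    """Illustrative shared metadata store (names are hypothetical).

    Each platform contributes the metadata facets it specializes in
    and consumes facets contributed by the other platforms.
    """

    def __init__(self):
        # asset id -> {facet name -> facet payload}
        self._assets = defaultdict(dict)

    def contribute(self, asset_id: str, facet: str, payload: dict) -> None:
        self._assets[asset_id][facet] = payload

    def consume(self, asset_id: str, facet: str):
        # Returns the payload for the facet, or None if nothing was contributed.
        return self._assets[asset_id].get(facet)


lake = MetadataLake()

# A privacy platform contributes its classification results...
lake.contribute("sales.orders", "classification", {"pii": True, "level": "confidential"})
# ...while a governance platform contributes the glossary definition.
lake.contribute("sales.orders", "glossary", {"term": "Order", "definition": "A confirmed purchase"})

# A third platform (say, a BI tool) consumes both facets for the same asset.
print(lake.consume("sales.orders", "classification"))
print(lake.consume("sales.orders", "glossary"))
```

The key design point is that no platform talks to another directly; each one only knows the lake's contribute/consume interface.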

This enriched metadata can then be pushed back to the metadata sources (data platforms, data integration platforms, BI platforms, etc.), where it shows up right in the platform of use, rather than users navigating individually to each of these governance/DQ platforms. For example, a report in the BI platform shows the data quality metrics for its data assets along with their glossary definitions. A data integration pipeline leverages observability metrics to decide whether to proceed with a load or terminate it to avoid downstream issues. Analytics on top of this lake, a data marketplace portal, operational dashboards that help us understand the health of the data platforms… the use cases are endless.
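The pipeline example above can be sketched as a simple gate: the pipeline consumes observability metrics from the lake and checks them against thresholds before loading. The metric names (`row_count`, `null_ratio`) and threshold values here are purely illustrative assumptions:

```python
def should_load(metrics: dict, min_row_count: int = 1, max_null_ratio: float = 0.05) -> bool:
    """Decide whether to proceed with a load based on observability
    metrics consumed from the lake (metric names and thresholds are
    illustrative, not from any specific platform)."""
    if metrics["row_count"] < min_row_count:
        return False  # empty or missing extract: terminate the load
    if metrics["null_ratio"] > max_null_ratio:
        return False  # too many nulls: terminate to avoid downstream issues
    return True


# A healthy extract passes the gate; a degraded one is stopped upstream.
healthy = {"row_count": 120_000, "null_ratio": 0.01}
degraded = {"row_count": 120_000, "null_ratio": 0.30}
print("load" if should_load(healthy) else "terminate")
print("load" if should_load(degraded) else "terminate")
```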

Some of the cataloging tools do talk about their platform playing the role of this “Metadata Lake”. Now, based on our data integration/lake experience, should this lake be confined to one platform, or should it really be an independent store that provides the flexibility to expand the scope of what it hosts as we imagine new use cases?

Let us explore the possible implementation options in subsequent posts. Till then, I am eager to hear your thoughts and comments on the above topic!


Anand is a CDMP, CBIP, and TOGAF 9 professional with more than 29 years of experience solving Data Management and Governance challenges for several Fortune 100 customers.