Metadata as Big Data

Introducing Pathfinder

Sean Rooney
EnterpriseDataMap
3 min read · Feb 24, 2022


What is the Problem?

Successful enterprise data processing systems tend to grow out of their well-managed walled gardens. This happens because they need additional resources and tools found outside the garden, e.g. those offered by a cloud provider, or because they start to connect to other enterprise systems.

Traditional metadata catalogs do not scale to this organic growth: they depend on static, predefined processes to manage and curate data within a well-defined perimeter. Enterprise data management thus becomes an intractable problem, as data is constantly moved and processed across heterogeneous systems by a large number of independently managed tools.

Significant value can be unlocked by gathering, linking, and enriching metadata at enterprise scale. It allows costs to be reduced, for example by identifying cold data, removing unnecessary duplication, or scheduling workflows more efficiently. It also allows new value-added services to be introduced with the guarantee that sensitive data is being used as intended and is properly managed and protected.

This is analogous to the way enterprises have unlocked hidden value in their data through big data processing. The processing of heterogeneous metadata at scale is in fact itself a big data problem, exhibiting the classic 5 Vs: volume, velocity, variety, veracity, and value.

Introducing the Enterprise Data Map

Metadata is extracted from source systems into a central location where it is gathered, linked, and enriched. This location can be thought of as the analogue of a data lake in a big data system.

Most of the value in metadata comes from understanding the complex relationships between entities, so the metadata is stored as a graph that maps the entire enterprise's data. Questions such as where data is stored, how those storage systems are protected, and where data flows can then be answered at the level of the enterprise rather than of a single, isolated processing system.
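To make this concrete, here is a toy sketch of such a graph in plain Python; the entity types, relationship labels, and names are purely illustrative, not Pathfinder's actual schema:

```python
from collections import defaultdict

# Toy property graph: nodes are (type, name) pairs, edges are labeled and
# directed. Illustrative only; Pathfinder's real schema is richer than this.
class DataMap:
    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(label, neighbour), ...]

    def relate(self, src, label, dst):
        self.edges[src].append((label, dst))

    def reachable(self, start, label):
        """All nodes reachable from `start` over edges with this label."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for lbl, nxt in self.edges[node]:
                if lbl == label and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

m = DataMap()
m.relate(("dataset", "customers"), "STORED_IN", ("bucket", "s3://crm-raw"))
m.relate(("dataset", "customers"), "FLOWS_TO", ("dataset", "customers_clean"))
m.relate(("dataset", "customers_clean"), "FLOWS_TO", ("report", "churn_dashboard"))

# "Where does the customers data flow?" is a transitive traversal that
# crosses system boundaries an isolated catalog could not.
print(m.reachable(("dataset", "customers"), "FLOWS_TO"))
```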

Moreover, metadata in the graph can be enriched, just as in a classic data lake, by multiple independent processes that add entities and relationships not present in the raw data. For example, an enrichment process might discover that two data pipelines resemble each other and could be merged to save cost, or that data with a certain classification is not being handled properly, exposing the company to possible litigation. Because the data map is open and scalable, advanced machine learning techniques can be brought to bear to extract latent information hidden in this complexity.
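As a hypothetical example of such an enrichment process (the pipeline names, step sets, and similarity threshold below are invented for illustration), a process might flag pipelines with near-identical steps as merge candidates and write the discovered relationships back into the map:

```python
# Hypothetical enrichment: pipelines whose sets of processing steps are
# nearly identical are linked as candidates for consolidation.
def jaccard(a, b):
    return len(a & b) / len(a | b)

pipelines = {
    "nightly_crm_load": {"extract_crm", "dedupe", "mask_pii", "load_dwh"},
    "hourly_crm_load":  {"extract_crm", "dedupe", "mask_pii", "load_dwh"},
    "weblog_ingest":    {"tail_logs", "sessionize", "load_dwh"},
}

derived = []  # new relationships to write back into the data map
names = sorted(pipelines)
for i, p in enumerate(names):
    for q in names[i + 1:]:
        if jaccard(pipelines[p], pipelines[q]) > 0.8:
            derived.append((("pipeline", p), "RESEMBLES", ("pipeline", q)))

print(derived)  # merge candidates, and hence potential cost savings
```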

We enable the creation of the Enterprise Data Map with a system called Pathfinder.

How Does Pathfinder Work?

We take a new, event-based approach to data management, one that discovers what is actually occurring in the data ecosystem. Metadata is collected from heterogeneous data processing and storage systems, distributed through streaming, and combined and analyzed by dedicated enrichment processes. Collectors extract the metadata from the sources, serialize it into a graph of entities and relationships, and store it as a sequence of events in a Kafka-based change log. The events are generated and propagated in real time. The enrichment processes consume these events and write new metadata back to the change log. The compacted log contains the canonical set of metadata for the enterprise and its evolution over time, and can be materialized in different processing systems for specific downstream applications.
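As a minimal sketch of what emitting one such event might look like, assuming a JSON encoding and the kafka-python client (the topic name, broker address, and event fields here are placeholders, not Pathfinder's actual wire format):

```python
import json
from kafka import KafkaProducer  # pip install kafka-python; client choice is ours

# One metadata event: an upsert of an entity, keyed by a stable entity ID so
# that a compacted topic retains the latest state of every entity.
event = {
    "kind": "entity",
    "type": "dataset",
    "id": "dataset:crm/customers",
    "properties": {"classification": "PII", "owner": "crm-team"},
    "source": "collector:postgres",
    "ts": "2022-02-24T10:15:00Z",
}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder broker address
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The topic would be created with cleanup.policy=compact, so the change log
# stays bounded while remaining the canonical record of current metadata.
producer.send("metadata-changelog", key=event["id"], value=event)
producer.flush()
```

Keying every event by a stable entity ID is what allows Kafka's log compaction to keep the log bounded while preserving the latest state of each entity and relationship.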

Further information:

Pathfinder demo:

Technical Paper: ACM Middleware 2022

https://www.researchgate.net/publication/365360912_Revisiting_Data_Lakes_the_Meta-data_Lake

Technical Paper: IEEE BigData 2021
