Musings #1: Problem of “Big Metadata”

Shakti Garg
Nov 20, 2022

--

In my current job as a data engineer/architect, this is the title I have in mind for the challenge I have been facing lately. The name is a tongue-in-cheek nod to the problem of “Big Data”, which drew me into the world of big data 8–9 years ago.

What is it? In all big data solutions, I think there are two major factors: one is obviously distributed systems (with all their intricacies), and the other, often ignored, is “metadata about data”. This metadata always gives an edge, whether it is metadata on tables and partitions in the Hive Metastore for Hive/Spark query optimisation or taxonomies/tags on datasets for business mapping.

I have been using the “Hive Metastore” for a long time as the source of metadata for Hive, Spark and Presto/Trino systems. AWS Glue is another one that I have used for a while in some projects. The point is that metadata is growing at such a rapid rate that we have scenarios where the data in the “Hive Metastore” is now in TBs and, in some cases, querying the metadata (over an RDBMS) takes more time than querying the actual big data. There are always the ever-growing tables and their partitions to blame, and that is why data lifecycle management has come to the top of the priorities, with data archiving and a re-evaluation of the partition strategy.

But over the weekend, I realized that this will only increase centralization, with platform teams bringing more red tape and a single point of failure :)

Then, how do we usher democracy and socialism into the system instead? Maybe with a federation of metadata, the way federal republics like India and the USA work: a hierarchy of metadata units in a tree-like data structure that gives O(log n) insert and search performance (as long as the tree stays balanced), along with transactional integrity!
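
To make the idea a little more concrete, here is a rough sketch of what such a hierarchy of metadata units could look like. It is only an illustration under my own assumptions: the class `MetadataUnit`, its `owner` and `properties` fields, and the example bucket path are all hypothetical and do not come from Hive Metastore, Glue, or any other real catalog. Each unit owns its own subtree, and a lookup only walks the path from the root, so its cost depends on the depth of the tree rather than on the total number of tables and partitions. Transactional integrity is left out of the sketch entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class MetadataUnit:
    """One node in the federation: a domain, database, table, or partition."""

    name: str
    owner: str = "unassigned"  # hypothetical: the team responsible for this unit
    properties: Dict[str, str] = field(default_factory=dict)
    children: Dict[str, "MetadataUnit"] = field(default_factory=dict)

    def insert(
        self,
        path: Tuple[str, ...],
        owner: Optional[str] = None,
        **properties: str,
    ) -> "MetadataUnit":
        """Create or update the unit at `path`, creating parents as needed."""
        node = self
        for part in path:
            # A new child inherits the owner of its parent unit.
            node = node.children.setdefault(part, MetadataUnit(part, node.owner))
        if owner is not None:
            node.owner = owner
        node.properties.update(properties)
        return node

    def find(self, path: Tuple[str, ...]) -> Optional["MetadataUnit"]:
        """Walk down from this unit; return None if any part of the path is missing."""
        node: Optional[MetadataUnit] = self
        for part in path:
            node = node.children.get(part)
            if node is None:
                return None
        return node


# Hypothetical usage: the sales domain owns everything below it.
root = MetadataUnit("federation")
root.insert(("sales",), owner="sales-team")
root.insert(
    ("sales", "orders", "dt=2022-11-20"),
    location="s3://example-bucket/orders/dt=2022-11-20",  # made-up path
)

hit = root.find(("sales", "orders", "dt=2022-11-20"))
missing = root.find(("marketing",))
```

In this sketch each level is a plain dictionary, so a lookup touches one small map per path segment. In a real federation, every subtree could live in a separate store run by the team that owns it, with the root holding only pointers to those stores.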
