Metadata ≥ Data

Jonathan Seidman
Published in 97 Things
Jul 12, 2019

My first real experience in the “big data” world was helping to deploy Apache Hadoop clusters at Orbitz Worldwide, a heavily trafficked online travel site. One of the first things we did was deploy Apache Hive on our clusters and provide access to our developers to start building applications and analyses on top of this infrastructure.

This was all great, in that it allowed us to unlock tons of value from all of the data we were collecting. After a while, though, we noticed that we had ended up with numerous Hive tables that basically represented the same entities. From a resource standpoint, this wasn’t that awful, since even in the dark ages of the aughts, storage was pretty cheap. However, our users’ time was not cheap, so all the time they spent creating new Hive tables, or searching our existing tables to find the data they needed, was time they weren’t spending on getting insights from the data.

The lesson we learned at Orbitz was that it’s a mistake to leave data management planning as an afterthought. Instead, it’s best to start planning your data management strategy early, ideally in parallel with any new data initiative or project.

Having a data management infrastructure that includes things like metadata management isn’t only critical for allowing users to perform data discovery and make optimal use of your data. It’s also crucial for things like complying with existing and new government regulations around data. It’s difficult, for example, to comply with a customer’s request to delete their data if you don’t know what data you actually have and where it is.

While it’s probably relatively straightforward to identify data sets to capture metadata for, and define what that metadata should contain, the bigger challenge can be putting in place processes and tools to capture that metadata and make it available. The fact is you don’t have to find the perfect tool to manage all of your data, and it’s possible there isn’t a single tool that will allow you to effectively manage your data across all your systems. This is definitely a case where even a non-optimal solution will put you far ahead of having no solution in place.
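To make the "non-optimal solution" point concrete, here is a minimal sketch of what a roll-your-own starting point might look like. It is not what we built at Orbitz; the `DatasetEntry` and `Catalog` names, the fields, the paths, and the `customer_data` tag are all hypothetical and chosen only for illustration. It shows the two uses discussed above: discovering an existing data set before creating a duplicate, and locating every data set that holds customer data.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata for one data set (all field names are illustrative)."""
    name: str
    location: str
    owner: str
    description: str
    tags: set[str] = field(default_factory=set)


class Catalog:
    """In-memory registry; a real one would persist entries somewhere shared."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        # Reject duplicate names so users are pointed at the existing entry.
        if entry.name in self._entries:
            raise ValueError(f"{entry.name} is already registered")
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[DatasetEntry]:
        # Case-insensitive match on name or description, for data discovery.
        needle = keyword.lower()
        return [
            e for e in self._entries.values()
            if needle in e.name.lower() or needle in e.description.lower()
        ]

    def with_tag(self, tag: str) -> list[DatasetEntry]:
        # For example, find every data set touched by a deletion request.
        return [e for e in self._entries.values() if tag in e.tags]


# Hypothetical entries; the names and paths are made up for this sketch.
catalog = Catalog()
catalog.register(DatasetEntry(
    name="bookings",
    location="hdfs:///warehouse/bookings",
    owner="data-eng",
    description="Hotel and flight bookings per customer",
    tags={"customer_data"},
))
catalog.register(DatasetEntry(
    name="page_views",
    location="hdfs:///warehouse/page_views",
    owner="web-analytics",
    description="Raw site clickstream events",
))

matches = [e.name for e in catalog.search("booking")]
to_review = [e.location for e in catalog.with_tag("customer_data")]
```

Even something this small only helps if registering a data set is part of the process for creating one, which is the harder problem the tooling has to support.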

Whether you use vendor-provided tools, third-party tools, or even decide to roll your own, the important thing is to have a process and plan in place early, and to ensure that you carry that process throughout your data projects.

Thanks to Ted Malaska for some of the ideas that motivated this post.
