Where’s your metadata? Part 2
As I continue my research into different metadata systems, I want to share my experiences and commentary along the way. My exploration of metadata systems is actually a part of my overall exploration of the modern data stack.
During my career, I have been involved in building several platforms and data products. I feel that discussions on data platforms and stacks don’t actively include metadata systems. Perhaps we all take metadata for granted, which would be a mistake.
Metadata should be at the heart of the modern data stack, not at the periphery, not an afterthought.
Development of and interest in metadata management have been growing steadily, with many contenders, both open source and commercial. Data-first companies (like those listed below) saw an explosion of data and of demand for access to that data, and therefore needed to invent ways to manage and govern it. In the process, each created its own iteration of a metadata system to support data discovery, exploration, enrichment, compliance, governance, and usage. Here is an incomplete list of data discovery/metadata projects:
- Uber’s Databook
- LinkedIn’s Datahub
- Lyft’s Amundsen
- Netflix’s Metacat
- WeWork’s Marquez
- Spotify’s Lexikon
- Airbnb’s Dataportal
- Shopify’s Artifact
As you can see, many teams have been at it for a while — some for several years and some more recently. Commercial products like Alation and Collibra (and perhaps others?) have been around for years. I don’t have experience with them, so can’t say much and won’t cover them. Cloud providers offer their own “data catalog” solutions: Google Cloud Data Catalog, AWS Glue Data Catalog, and Azure Data Catalog. I don’t plan to research these solutions either as they are specific to their ecosystems.
My focus is on open-source systems. I am interested in the needs of teams similar to the ones that I have led, teams that build data platforms and data stacks, and I am trying to understand what the considerations for metadata management should be as part of a modern data stack.
If you are involved in building data platforms and stacks, I am sure you too have seen your share of the challenges and needs that resulted in these metadata systems. The need for a metadata system can be felt regardless of scale, size, and complexity. Everyone building a data-first ecosystem will need to think seriously about metadata management. Pay attention to metadata now rather than wait and regret it later.
Of all the projects in this space, Open Metadata (from ex-Uber folks) and Datahub (from LinkedIn) seem the most interesting and contemporary to me.
Open Metadata
Last time, I wrote about Open Metadata, which appears to be the newest of this group. I have installed it and studied the APIs. You can play with their demo sandbox without installing anything.
Datahub
At the time of this writing, Datahub is ahead in terms of development, maturity, and community. Highlights of what I liked:
- Extensible schema-based approach, which is super important (it uses PDL, though I prefer standard JSON-based schemas)
- Rich set of GraphQL APIs — it is smart to use GraphQL
- An impressive list of sources it can ingest from out of the box
- Demo site to play around with
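To make the first highlight a bit more concrete, here is a minimal sketch of what a schema-based, extensible metadata record could look like. This is not Datahub's PDL model or its API. The aspect names (`ownership`, `retention`), the helper functions, and the sample dataset are all hypothetical, and the schemas are plain Python dictionaries in a JSON-like shape rather than a real JSON Schema implementation.

```python
# Hypothetical sketch: a tiny schema registry where each "aspect" of a
# dataset's metadata is described by a JSON-like schema. None of these
# names come from Datahub or PDL; they are illustrative assumptions.

# Map the simple type names used in the schemas to Python types.
TYPE_MAP = {"string": str, "integer": int, "array": list}

# Each aspect schema lists its fields, their types, and which are required.
SCHEMAS = {
    "ownership": {
        "fields": {"owners": "array", "team": "string"},
        "required": ["owners"],
    },
}


def register_aspect(name, fields, required=()):
    """Extend the model with a new aspect without touching existing ones."""
    SCHEMAS[name] = {"fields": dict(fields), "required": list(required)}


def validate(aspect_name, payload):
    """Return a list of problems; an empty list means the payload is valid."""
    schema = SCHEMAS.get(aspect_name)
    if schema is None:
        return [f"unknown aspect: {aspect_name}"]
    errors = []
    for field in schema["required"]:
        if field not in payload:
            errors.append(f"missing required field: {field}")
    for field, value in payload.items():
        expected = schema["fields"].get(field)
        if expected is None:
            errors.append(f"unexpected field: {field}")
        elif not isinstance(value, TYPE_MAP[expected]):
            errors.append(f"field {field} should be {expected}")
    return errors


# A team adds its own aspect later; this is the extensibility in question.
register_aspect("retention", {"days": "integer"}, required=["days"])

dataset = {
    "name": "orders",
    "aspects": {
        "ownership": {"owners": ["data-platform"], "team": "analytics"},
        "retention": {"days": 90},
    },
}

for aspect_name, payload in dataset["aspects"].items():
    print(aspect_name, validate(aspect_name, payload))
```

The point of the sketch is that the schema, not the code, decides what metadata is valid, so a new aspect can be added without changing the validator.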
Shirshanka Das, who founded and architected Datahub while at LinkedIn¹, wrote this great article on different metadata systems: Popular metadata architectures explained. He compares and contrasts the various architectures (1st, 2nd, and 3rd generation) at play among these systems. It is important to understand the various architectures and their tradeoffs; do not automatically write off a solution just because it is not the latest generation. An architecture cannot be judged good or bad without understanding the overall context and the value it delivers. Architecture aside, my focus as a user/adopter of these systems is on the value they can deliver to data teams and users.
So far, I have tried to lay the foundation for exploring metadata systems here and in my previous article.
Going forward, I want to discuss several topics and concepts related to data, metadata, data platforms, and data stacks: use cases, user personas, the role of metadata in modern data stacks, how it relates to Data Mesh², and how metadata aids automation, machine learning, and AI. I also want to dig into topics around metadata including versioning, dependency management, and standardization. Let me know if these topics interest you.
What are you doing about metadata? Share your experiences and opinions.
1 Shirshanka Das (ex-LinkedIn) and Swaroop Jagadish (ex-Airbnb) are cofounders of Acryl Data, the company behind Datahub.
2 I first heard about Data Mesh from Todd Fast as we embarked on building our new data platform at OpenGov. Todd joined us from Intuit, which is a firm believer in Data Mesh. Recommended read: Data Mesh by Zhamak Dehghani.
Originally published at https://deepakalur.substack.com on October 27, 2021.