Why OpenMetadata is taking the right approach to metadata cataloging

Published in cisco-fpie · 5 min read · Oct 21, 2021

For at least the past decade, companies have wanted to label themselves as data-driven, machine-learning-empowered, and fully data compliant, but the path to get there has been rocky. Organizations face a whole host of roadblocks that make it difficult for AI/ML engineers and analysts to get their hands on important data. Poor discoverability, fragile Extract-Transform-Load (ETL) pipelines, and Personally Identifiable Information (PII) regulations can all stand in the way.

Enter the data mesh concept.

The data mesh concept is quickly becoming a major part of the current tech zeitgeist.

In some ways, it represents the utopian data stack where all data is perfectly cataloged and documented…
…where data domains are separated and managed by experts who know the data inside and out…
…where data quality issues are spotted and remediated within minutes.

It is a beautiful thing to imagine, but it is a ton of work to actually achieve. Companies all over the world are putting forth massive efforts to develop their own internal data mesh systems that work for their own individual use-cases.

Data Cataloging

One of the core components of a functional data mesh is having a centralized and indexed metadata catalog. A metadata catalog serves as a repository for knowledge of the data within the mesh. It can help analysts answer important questions about the data such as:

Where is the database that contains our online order information?
What is the meaning of this very obscure looking column name?
What is the quality of this data? Is this data fresh or stale?

Not only are these catalogs important for analysts, but they also serve as an important resource for managing regulatory compliance. They let data engineers tag data sources that could contain PII or other sensitive information, giving teams visibility into which resources are safe to share and which aren’t.

There are a handful of projects that are already doing great open-source work in this space. The Linux Foundation has been working on their Egeria project for quite some time. WeWork open-sourced their Marquez project. Lyft open-sourced their Amundsen project in 2019. LinkedIn open-sourced their DataHub project in 2020. Now Suresh Srinivas (ex-Hortonworks, ex-Uber), Sriharsha Chintalapani, and their team are taking a unique approach to the metadata catalog concept with their OpenMetadata project.

Do you even Metadata?

When dealing with metadata, you often have two concepts that you have to juggle simultaneously:

Schema information: the rules that tell developers how to integrate data with cross-platform services
Data tagging: flexible metadata chunks that can be used to categorize, filter, and search for data

Both of these concepts deal with the description of data, but there is an important distinction: schema information often exists to be coupled with outside services and needs to be appropriately communicated in developer-land.

If someone changes the description of a table or column, it usually doesn’t lead to anything terrible. Maybe a developer gets confused by how something works, which is not a catastrophic problem. However, if someone changes the type of a column or removes it entirely, it could have drastic effects on the quality of downstream data products and pipelines.
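
To make that distinction concrete, here is a minimal sketch that compares two versions of a table's column metadata and separates breaking changes (removed columns, changed types) from cosmetic ones (changed descriptions). The column layout, the `diff_columns` helper, and the sample columns are all hypothetical and are not part of OpenMetadata.

```python
from typing import Dict, List

# Hypothetical, simplified column metadata: column name -> {"type": ..., "description": ...}
Columns = Dict[str, Dict[str, str]]


def diff_columns(old: Columns, new: Columns) -> Dict[str, List[str]]:
    """Split differences between two table versions into breaking and cosmetic changes.

    Columns that only exist in `new` are ignored here, on the assumption
    that additive changes do not break existing consumers.
    """
    breaking: List[str] = []
    cosmetic: List[str] = []
    for name, old_col in old.items():
        new_col = new.get(name)
        if new_col is None:
            # A removed column breaks every downstream query that selects it.
            breaking.append(f"column removed: {name}")
            continue
        if old_col.get("type") != new_col.get("type"):
            breaking.append(
                f"type changed: {name} ({old_col.get('type')} -> {new_col.get('type')})"
            )
        if old_col.get("description") != new_col.get("description"):
            cosmetic.append(f"description changed: {name}")
    return {"breaking": breaking, "cosmetic": cosmetic}


old = {
    "order_id": {"type": "BIGINT", "description": "Primary key"},
    "total": {"type": "DECIMAL", "description": "Order total"},
    "cpn_cd": {"type": "VARCHAR", "description": ""},
}
new = {
    "order_id": {"type": "BIGINT", "description": "Unique order identifier"},
    "total": {"type": "VARCHAR", "description": "Order total"},
}

report = diff_columns(old, new)
print(report["breaking"])
print(report["cosmetic"])
```

In a pipeline, a non-empty `breaking` list is the kind of signal you would want to surface to downstream owners before the change ships, while `cosmetic` changes can simply be logged.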

OpenMetadata stands out by taking a JSON Schema-first approach to metadata.

JSON Schema-focused workflow

To provide the best developer experience, OpenMetadata leans heavily on JSON Schema for its schema metadata. All modern languages can deserialize JSON into their own data structures, so using JSON as the core schema structure is a no-brainer.

OpenMetadata encourages developers to fetch these schemas from the web and incorporate them as typings in their own applications. By serving as a centralized schema store, OpenMetadata can help your team ensure that changes in complex data pipelines and integrations are quickly identified and acted upon. This positions OpenMetadata as the single source of truth for schema metadata.
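
As a rough illustration of treating a shared schema as the contract inside an application, the sketch below loads a JSON Schema document and checks a payload against it using only the Python standard library. The schema is a hypothetical, heavily simplified stand-in and is not the actual OpenMetadata user schema. The `check` helper is also an assumption for illustration: it only understands the `required` keyword and the `string` and `boolean` types, whereas a real project would rely on a full JSON Schema validator or generated types.

```python
import json
from typing import Any, Dict, List

# Hypothetical, heavily simplified schema for illustration only.
# It is NOT the actual OpenMetadata user schema.
USER_SCHEMA_JSON = """
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "User",
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "email": {"type": "string"},
    "isBot": {"type": "boolean"}
  },
  "required": ["name", "email"]
}
"""

# Only the two primitive types used above are mapped; other types are not checked.
JSON_TYPES = {"string": str, "boolean": bool}


def check(instance: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in `instance` (empty list means none found)."""
    problems: List[str] = []
    for field in schema.get("required", []):
        if field not in instance:
            problems.append(f"missing required field: {field}")
    for field, rules in schema.get("properties", {}).items():
        if field not in instance:
            continue
        expected = JSON_TYPES.get(rules.get("type"))
        if expected is not None and not isinstance(instance[field], expected):
            problems.append(f"wrong type for {field}: expected {rules['type']}")
    return problems


schema = json.loads(USER_SCHEMA_JSON)
good = {"name": "ada", "email": "ada@example.com", "isBot": False}
bad = {"name": "ada", "isBot": "no"}

print(check(good, schema))
print(check(bad, schema))
```

Because every consumer checks against the same schema document, a change to that document shows up as a failing check in each integration rather than as silently corrupted data.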

Figure: an example of what an OpenMetadata JSON user schema contains.

APIs, APIs, APIs

OpenMetadata is built from the ground up to be powered by SAML-protected REST APIs. This means that it is easy to build bots, integrations, and automation workflows which query and manipulate the metadata store.

Want to fetch a list of tables for a Slack bot? There’s an API for that.
Want to automagically apply a tag to a database after some event? We’ve got you covered.
Want to check the metadata for a Superset dashboard via your terminal? Just cURL it!
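
As a sketch of the first idea, the snippet below builds an authenticated request for a list of tables using only the Python standard library. Several details are assumptions that you should check against the OpenMetadata API documentation for your version: the `/api/v1/tables` path, the `limit` query parameter, bearer-token authentication, the `data` key in the response, and the `localhost:8585` address. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import List


def build_list_tables_request(base_url: str, token: str, limit: int = 10) -> urllib.request.Request:
    """Build (but do not send) a GET request for a page of tables."""
    query = urllib.parse.urlencode({"limit": limit})
    # Assumed endpoint path; confirm it against your server's API docs.
    url = f"{base_url.rstrip('/')}/api/v1/tables?{query}"
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


def list_table_names(request: urllib.request.Request) -> List[str]:
    """Send the request and return table names.

    Assumes the response body is a JSON object with a "data" list
    whose entries each carry a "name" field.
    """
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return [table["name"] for table in payload.get("data", [])]


# Building the request needs no running server; sending it does.
request = build_list_tables_request("http://localhost:8585", token="example-token", limit=5)
print(request.get_method(), request.full_url)
```

Calling `list_table_names(request)` against a reachable server would return the names that a Slack bot could then post to a channel.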

APIs provide an immense amount of flexibility for building powerful workflows, and the great documentation provided by the OpenMetadata team helps when it comes time for your team to build integrations that rely on metadata.

Lineage

One area where I could see OpenMetadata improving is data lineage. We’re seeing a lot of awesome lineage work being done by OpenLineage and DataHub.

OpenMetadata has its own lineage functionality planned for v0.5, so it’s worth keeping an eye on how the team decides to implement it. I expect lineage to become more and more important as internal data meshes continue to grow in complexity.

While it’s not yet as feature-rich as Amundsen or DataHub, I am impressed with how OpenMetadata is taking a developer-friendly approach to the metadata store. Ultimately, a lot of the work in this space happens between engineers and analysts, so facilitating and improving communication there can boost productivity, simplify debugging, and generally smooth out the integration and adoption process.

Although OpenMetadata is still practically in its infancy, it shows a great deal of promise. I am very excited to see where Suresh, Sriharsha, and the rest of the team take this project in the future.

If you and your team are looking for a metadata platform, I strongly recommend giving OpenMetadata a shot. You can check out the sandbox environment, attend a weekly meeting, chat with the team on the OpenMetadata Slack, or even contribute to the code on the GitHub page.