Data Observability & Discovery Platform: OpenMetadata

Amit Singh Rathore
Published in Geek Culture
3 min read · Aug 16, 2022

Managing Data about data

Data discovery is a crucial first step in the data consumption workflow. It answers questions about a dataset such as: what is its source, where is it stored, what does it mean, how recent and relevant is it, how do others use it, and how did it arrive at its current form (lineage)? This makes data discovery an essential part of a data platform.

A metadata platform typically combines four major capabilities: search (e.g., Solr), attribute lookup (databases), entity relationships (graph databases), and regular refresh of metadata (schedulers/queues). Depending on the tools chosen for each of these, multiple companies have built their own versions of metadata platforms. A few of the major ones are Amundsen, DataHub, Atlas, Metacat, Databook, and Marquez. Each product has its own approach and specification for collecting metadata. Some support a fair number of sources, while others have very limited integrations.

In general, the catalog/metadata segment of the data platform has the following shortcomings.

  1. Non-standardized metadata collection
  2. Incompatibility of data catalogs (the need to recollect data)
  3. Limited, not truly company-wide end-to-end data lineage
  4. Absent or insufficient data quality and observability
  5. Undiscoverable ML assets

An open standard for collecting metadata could become a sound solution to the lack of efficient discovery and observability and a solid foundation for the next-gen data platform.

Open Data Discovery Specification (ODD Spec) is an attempt at creating an open-source, industry-wide metadata standard that would enable engineers to collect and export metadata from cloud-native applications, infrastructures, and other data sources.

OpenMetadata

OpenMetadata is touted as an open standard for metadata: a single place to discover, collaborate on, and get your data right.

OpenMetadata has its own specification, published with the project. Each schema definition maps to a data/asset entity type.

Five major Pillars

Metadata schemas: OpenMetadata takes a JSON Schema-first approach to metadata. The schemas define the core abstractions and vocabulary for metadata, covering Types, Entities, and Relationships between entities. This is the foundation of the Open Metadata Standard.
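
To make the schema-first idea concrete, here is a minimal sketch in Python. The `TABLE_SCHEMA` dictionary is a hypothetical, heavily simplified schema written for this illustration and is not copied from the OpenMetadata specification. `check_entity` is a toy check of required fields and primitive types, not a full JSON Schema validator.

```python
import json

# Hypothetical, simplified schema for a "table" entity (illustration only).
TABLE_SCHEMA = {
    "title": "Table",
    "type": "object",
    "required": ["id", "name", "columns"],
    "properties": {
        "id": {"type": "string"},
        "name": {"type": "string"},
        "description": {"type": "string"},
        "columns": {"type": "array"},
    },
}

# Map JSON Schema primitive type names to Python types.
_JSON_TYPES = {"string": str, "array": list, "object": dict}


def check_entity(entity: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the toy check passed."""
    problems = []
    for field in schema.get("required", []):
        if field not in entity:
            problems.append(f"missing required field: {field}")
    for field, value in entity.items():
        spec = schema["properties"].get(field)
        if spec is None:
            problems.append(f"unknown field: {field}")
        elif not isinstance(value, _JSON_TYPES[spec["type"]]):
            problems.append(f"wrong type for {field}: expected {spec['type']}")
    return problems


good = json.loads('{"id": "t-1", "name": "orders", "columns": []}')
bad = json.loads('{"id": 42, "name": "orders"}')

print(check_entity(good, TABLE_SCHEMA))
print(check_entity(bad, TABLE_SCHEMA))
```

Because the schema is plain data, the same definition could drive validation, documentation, and client code generation, which is the appeal of starting from schemas rather than from code.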

Metadata APIs: for producing and consuming metadata. They are built on the schemas and serve user interfaces as well as integrations with tools, systems, and services.
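
As a rough sketch of what consuming such an API could look like from Python, the snippet below builds a request that lists table entities. The base URL, the `/tables` path, the `limit` parameter, and the bearer-token header are assumptions made for this illustration; check them against the API documentation of your own deployment before relying on them.

```python
import json
import urllib.request

# Assumptions for illustration: a local server and a placeholder token.
BASE_URL = "http://localhost:8585/api/v1"
TOKEN = "<your-token>"


def build_list_tables_request(limit: int = 10) -> urllib.request.Request:
    """Build (but do not send) a GET request that lists table entities."""
    return urllib.request.Request(
        url=f"{BASE_URL}/tables?limit={limit}",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Accept": "application/json",
        },
        method="GET",
    )


def list_tables(limit: int = 10) -> dict:
    """Send the request and decode the JSON body; needs a running server."""
    with urllib.request.urlopen(build_list_tables_request(limit)) as response:
        return json.load(response)


request = build_list_tables_request(limit=5)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the example usable without a running server; `list_tables()` only works once a real endpoint and token are in place.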

Metadata store: organizes the graph of entities and relationships that connects data assets, users, and tool-generated metadata.
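
The entity and relationship graph can be pictured with a small in-memory model. This is only a sketch: the `Entity` and `MetadataGraph` classes and the relationship names `feeds` and `owns` are hypothetical and do not reflect how OpenMetadata stores its graph internally.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A node in the metadata graph, e.g. a table, dashboard, or user."""
    entity_type: str
    name: str


class MetadataGraph:
    """Toy in-memory store of entities and typed relationships."""

    def __init__(self) -> None:
        # (source entity, relationship name) -> list of target entities
        self._edges = defaultdict(list)

    def relate(self, source: Entity, relationship: str, target: Entity) -> None:
        self._edges[(source, relationship)].append(target)

    def related(self, source: Entity, relationship: str) -> list:
        return list(self._edges.get((source, relationship), []))

    def downstream(self, source: Entity) -> list:
        """Walk 'feeds' edges breadth-first to collect everything downstream."""
        seen, queue = [], [source]
        while queue:
            current = queue.pop(0)
            for nxt in self.related(current, "feeds"):
                if nxt not in seen:
                    seen.append(nxt)
                    queue.append(nxt)
        return seen


raw = Entity("table", "raw_orders")
clean = Entity("table", "clean_orders")
dash = Entity("dashboard", "sales_overview")
alice = Entity("user", "alice")

graph = MetadataGraph()
graph.relate(raw, "feeds", clean)
graph.relate(clean, "feeds", dash)
graph.relate(alice, "owns", clean)

print([e.name for e in graph.downstream(raw)])
```

Keeping ownership and lineage as edges of the same graph is what lets one query answer both "who owns this table?" and "what breaks if it changes?".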

Ingestion framework: a pluggable framework for integrating tools and ingesting metadata into the metadata store. It already supports 50+ connectors, including well-known data warehouses (Google BigQuery, Snowflake, Amazon Redshift, Apache Druid, and Apache Hive) and databases (MySQL, Postgres, Oracle, and MSSQL). It also has connectors for Airbyte, Airflow, and dbt.
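
"Pluggable" here means that each source is a self-contained connector behind a common interface. The sketch below shows one generic way to build such a registry in Python. The `connector` decorator, the `CONNECTORS` registry, and the hard-coded records are all hypothetical; this is not the OpenMetadata ingestion API.

```python
from typing import Callable, Dict, Iterable, List

# Registry of connector name -> function that yields metadata records.
CONNECTORS: Dict[str, Callable[[], Iterable[dict]]] = {}


def connector(name: str):
    """Decorator that registers a source connector under a name."""
    def register(func):
        CONNECTORS[name] = func
        return func
    return register


@connector("mysql")
def mysql_source() -> Iterable[dict]:
    # A real connector would query the database's information schema;
    # these records are hard-coded for illustration.
    yield {"service": "mysql", "table": "orders", "columns": ["id", "total"]}


@connector("snowflake")
def snowflake_source() -> Iterable[dict]:
    yield {"service": "snowflake", "table": "customers", "columns": ["id", "email"]}


def run_ingestion(source_names: List[str], sink: Callable[[dict], None]) -> int:
    """Pull records from each named connector and push them to the sink."""
    count = 0
    for name in source_names:
        for record in CONNECTORS[name]():
            sink(record)
            count += 1
    return count


store: List[dict] = []  # stand-in for the metadata store
ingested = run_ingestion(["mysql", "snowflake"], store.append)
print(ingested, [r["table"] for r in store])
```

Adding a new source then means writing one more decorated function; the ingestion loop and the sink stay untouched.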

OpenMetadata User Interface: an easy-to-use interface for users to discover and collaborate on all data.

OpenMetadata components

  • Server: UI & API
  • Elasticsearch: search & analytics engine
  • MySQL: storage layer for entities, their attributes & relationships
  • Ingestion: Airflow

OpenMetadata features

  • Support for personas using RBAC
  • Support for Keyword & Advanced Search
  • Support for Table, Column & Pipeline Lineage
  • Providing usage metadata
  • Support for entities like Topics, Dashboards, and Pipelines
  • Support for custom Labels for asset importance
  • Support for Glossary — universal language to define, standardize, and contextualize data assets
  • Activity Feeds — shows all change events linked to assets in a single view
  • Task workflow for raising change requests to data owners
  • Quality, Profiler, and metrics, with quality tests supported through Great Expectations, dbt, or other data quality tools
  • Metadata versioning
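
Two of the features above, activity feeds and metadata versioning, can be pictured as the result of diffing successive snapshots of an entity. The following is a minimal sketch of that idea. The field names, the event shape, and the rule of bumping the version by 0.1 on every change are assumptions for illustration, not a description of OpenMetadata's implementation.

```python
def diff_entity(old: dict, new: dict) -> list:
    """Return change events describing how `new` differs from `old`."""
    events = []
    for key in sorted(set(old) | set(new)):
        if key == "version":
            continue  # the version is bookkeeping, not a tracked field
        if key not in old:
            events.append({"field": key, "change": "added", "new": new[key]})
        elif key not in new:
            events.append({"field": key, "change": "deleted", "old": old[key]})
        elif old[key] != new[key]:
            events.append(
                {"field": key, "change": "updated", "old": old[key], "new": new[key]}
            )
    return events


def apply_update(current: dict, changes: dict) -> tuple:
    """Apply changes; bump the version only if something actually changed."""
    updated = {**current, **changes}
    events = diff_entity(current, updated)
    if events:
        updated["version"] = round(current["version"] + 0.1, 1)
    return updated, events


v1 = {"name": "orders", "description": "", "tags": [], "version": 0.1}
v2, feed = apply_update(v1, {"description": "Customer orders", "tags": ["PII"]})

for event in feed:
    print(event["field"], event["change"])
print(v2["version"])
```

The list of events is what an activity feed would display, and the version number gives each snapshot a stable handle to refer back to.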

Happy cataloging!!!!

Amit Singh Rathore
Staff Data Engineer @ Visa. Writes about Cloud | Big Data | ML