Data Observability & Discovery Platform: OpenMetadata

Published in Geek Culture, Aug 16, 2022

Managing Data about data

Data discovery is a crucial first step of the data consumption workflow. It answers the basic questions about a dataset: what its source is, where it is stored, what it means, how recent and relevant it is, how others use it, and how it came into its current form (lineage). That makes data discovery an essential part of a data platform.

A metadata platform needs four major capabilities: search (e.g. Solr), attribute lookup (databases), entity relations (graph databases), and regular refresh of metadata (schedulers/queues). Depending on the tools they chose for each of these, multiple companies have built their own metadata platforms. A few of the major ones are Amundsen, DataHub, Atlas, Metacat, Databook, and Marquez. Each product has its own way and specification of collecting metadata. Some support a broad range of sources, while others have very limited integration.

In general, the catalog/metadata segment of the data platform has the following shortcomings:

Non-standardized metadata collection
Incompatibility of data catalogs (the need to recollect data)
Limited, not truly company-wide end-to-end data lineage
Absent or insufficient data quality and observability
Undiscoverable ML assets

An open standard for collecting metadata could become a sound solution to the lack of efficient discovery and observability and a solid foundation for the next-gen data platform.

Open Data Discovery Specification (ODD Spec) is an attempt at creating an open-source, industry-wide metadata standard that would enable engineers to collect and export metadata from cloud-native applications, infrastructures, and other data sources.

OpenMetadata

OpenMetadata is touted as an open standard for metadata: a single place to discover, collaborate on, and get your data right.

OpenMetadata has its own specification. Each schema definition in it maps to a data/asset entity type.

Five major Pillars

OpenMetadata takes a JSON Schema-first approach to metadata. Metadata schemas define the core abstractions and vocabulary for metadata, with schemas for types, entities, and relationships between entities. This is the foundation of the open metadata standard.
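To make the schema-first idea a bit more concrete, here is a minimal Python sketch. The `TABLE_SCHEMA` below is a hypothetical, heavily simplified schema written for this post, not the actual OpenMetadata table schema, and `validate` is a toy checker that only understands the `required` and `type` keywords. A real setup would more likely use a full JSON Schema validator.

```python
import json

# Hypothetical, simplified JSON Schema for a "table" entity (illustration only).
TABLE_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Table",
    "type": "object",
    "required": ["name", "columns"],
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"},
        "columns": {"type": "array"},
    },
}

# Map JSON Schema type names to Python types for the toy validator.
_TYPES = {
    "string": str,
    "array": list,
    "object": dict,
    "number": (int, float),
    "boolean": bool,
}


def validate(instance, schema):
    """Return a list of error messages; an empty list means the instance passed."""
    if not isinstance(instance, _TYPES[schema["type"]]):
        return [f"expected {schema['type']}"]
    errors = []
    # Check that every required property is present.
    for key in schema.get("required", []):
        if key not in instance:
            errors.append(f"missing required property: {key}")
    # Check the type of every declared property that is present.
    for key, sub in schema.get("properties", {}).items():
        if key in instance and not isinstance(instance[key], _TYPES[sub["type"]]):
            errors.append(f"{key}: expected {sub['type']}")
    return errors


good = json.loads('{"name": "orders", "columns": [{"name": "id"}]}')
bad = json.loads('{"name": 42}')

print(validate(good, TABLE_SCHEMA))
print(validate(bad, TABLE_SCHEMA))
```

The point of the sketch is that the schema, not the code, is the source of truth: any producer or consumer that agrees on the schema can exchange the same entity documents.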

Metadata APIs — for producing and consuming metadata. They are built on the schemas and serve user interfaces as well as the integration of tools, systems, and services.
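As a rough sketch of what consuming metadata over HTTP could look like, the snippet below builds (but does not send) a request that lists table entities. The base URL, the port, the `/api/v1/tables` path, the `limit` parameter, and the optional bearer token are all assumptions about a default local deployment, so check them against the API documentation of your own server. `build_list_request` is a hypothetical helper, not part of any OpenMetadata client.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed address of a local server; adjust to your own deployment.
BASE_URL = "http://localhost:8585/api/v1"


def build_list_request(entity, limit=10, token=None):
    """Build (but do not send) a GET request listing entities of one type."""
    url = f"{BASE_URL}/{entity}?{urlencode({'limit': limit})}"
    req = Request(url, method="GET")
    req.add_header("Accept", "application/json")
    if token:  # assumption: only needed when the server enforces authentication
        req.add_header("Authorization", f"Bearer {token}")
    return req


req = build_list_request("tables", limit=5)
print(req.get_method(), req.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send the request. That step is left out here so the sketch runs without a server.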

Metadata store — holds the entity and relationship graph that connects data assets with user-generated and tool-generated metadata.
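One way to picture the entity and relationship graph is a small in-memory model like the one below. This is only a sketch: the `MetadataGraph` class, the entity ids, and the `feeds` and `owns` relation names are made up for illustration and say nothing about how OpenMetadata stores its graph internally.

```python
from collections import defaultdict, deque


class MetadataGraph:
    """Toy in-memory store: entities are nodes, relationships are typed edges."""

    def __init__(self):
        self.entities = {}              # entity id -> attribute dict
        self.edges = defaultdict(list)  # from id -> [(relation, to id)]

    def add_entity(self, entity_id, **attrs):
        self.entities[entity_id] = attrs

    def relate(self, from_id, relation, to_id):
        self.edges[from_id].append((relation, to_id))

    def downstream(self, start, relation="feeds"):
        """Breadth-first walk over one relation type, e.g. to trace lineage."""
        seen, queue, order = {start}, deque([start]), []
        while queue:
            node = queue.popleft()
            for rel, target in self.edges.get(node, []):
                if rel == relation and target not in seen:
                    seen.add(target)
                    order.append(target)
                    queue.append(target)
        return order


graph = MetadataGraph()
graph.add_entity("table.raw_orders", type="table")
graph.add_entity("pipeline.clean_orders", type="pipeline")
graph.add_entity("table.orders", type="table")
graph.add_entity("dashboard.sales", type="dashboard")
graph.add_entity("user.alice", type="user")

graph.relate("table.raw_orders", "feeds", "pipeline.clean_orders")
graph.relate("pipeline.clean_orders", "feeds", "table.orders")
graph.relate("table.orders", "feeds", "dashboard.sales")
graph.relate("user.alice", "owns", "table.orders")

# Everything that depends, directly or indirectly, on the raw table.
print(graph.downstream("table.raw_orders"))
```

Because ownership and lineage are just different edge types on the same graph, the same traversal answers both "what breaks if this table changes?" and "what does this user own?".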

Ingestion framework — a pluggable framework for integrating tools and ingesting metadata into the metadata store. It already supports 50+ connectors, including well-known data warehouses (Google BigQuery, Snowflake, Amazon Redshift, Apache Druid, and Apache Hive) and databases (MySQL, Postgres, Oracle, and MSSQL). It also has connectors for Airbyte, Airflow, and dbt.
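"Pluggable" here roughly means that each source system gets its own connector behind a common interface. The sketch below shows that pattern in plain Python. All names (`Source`, `Sink`, `register`, `FakeMySQLSource`, `run_workflow`) are hypothetical and are not the classes of the real ingestion framework.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    """A connector that knows how to read metadata from one kind of system."""

    @abstractmethod
    def fetch(self):
        """Yield metadata records as plain dicts."""


class Sink(ABC):
    """A destination for metadata records, e.g. the metadata store."""

    @abstractmethod
    def write(self, record):
        """Persist a single record."""


# Registry that maps a connector name to its class.
SOURCES = {}


def register(name):
    """Class decorator that plugs a Source into the registry."""
    def wrap(cls):
        SOURCES[name] = cls
        return cls
    return wrap


@register("fake_mysql")
class FakeMySQLSource(Source):
    """Stand-in connector that returns hard-coded tables instead of querying a DB."""

    def fetch(self):
        yield {"type": "table", "name": "orders"}
        yield {"type": "table", "name": "customers"}


class InMemorySink(Sink):
    def __init__(self):
        self.records = []

    def write(self, record):
        self.records.append(record)


def run_workflow(source_name, sink):
    """Look up the source by name, copy its records to the sink, return the count."""
    source = SOURCES[source_name]()
    count = 0
    for record in source.fetch():
        sink.write(record)
        count += 1
    return count


sink = InMemorySink()
print(run_workflow("fake_mysql", sink))
```

Adding support for a new system then means writing one more `Source` subclass and registering it. The workflow runner and the sink stay untouched.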

OpenMetadata User Interface — an easy-to-use interface for users to discover and collaborate on all data.

OpenMetadata components

Server — UI & API
Elasticsearch — Search & analytics engine
MySQL — Storage layer for entities, their attributes & relationships
Ingestion — Airflow

OpenMetadata features

Support for personas using RBAC
Support for keyword & advanced search
Support for table, column & pipeline lineage
Providing usage metadata
Support for entities like topics, dashboards, and pipelines
Support for custom Labels for asset importance
Support for Glossary — universal language to define, standardize, and contextualize data assets
Activity Feeds — shows all change events linked to assets in a single view
Task workflow for raising requests to data owners for any changes
Quality, profiler, and metrics — quality tests supported by Great Expectations, dbt, or other data quality tools
Metadata versioning

Happy cataloging!!!!

Written by Amit Singh Rathore