Empowering Data-Driven Organizations Part 1: Data Observability with OpenMetadata

Natalie Zeller
Published in NI Tech Blog
6 min read · Jun 25, 2023

In this two-part blog post series, we will explore the concept of Data Observability and reveal how Natural Intelligence implemented it using OpenMetadata and OpenLineage.

In Part 1, we will discuss the significance of data observability tools in the current data landscape, and explore their ability to address challenges faced by data-driven organizations, focusing on how Natural Intelligence leveraged OpenMetadata to enhance data observability and discovery.

In Part 2, we will tackle the challenge of extracting lineage automatically, revealing a game-changing integration between OpenLineage and OpenMetadata, implemented by Natural Intelligence.

Motivation

As organizations collect and process more data, it becomes increasingly important to track and manage metadata effectively. Metadata provides essential information about data assets, including their source, structure and usage. This information is critical for understanding the data path, ensuring data quality and making informed decisions.

Effective metadata management becomes challenging when data is distributed across multiple platforms, consisting of numerous tables and fields. Several pain points arise in this context:

  • Keeping documentation accurate and up-to-date becomes difficult as metadata undergoes frequent modifications.
  • Locating specific fields dispersed across multiple tables and databases becomes a daunting task without adequate tools for data asset discovery.
  • When troubleshooting data issues, it becomes crucial to accurately identify the schema modifications that may have caused the problem.
  • A comprehensive understanding of data flow and usage is essential for making well-informed decisions and identifying potential risks proactively.

To address these challenges, we need a centralized metadata platform that provides discovery and observability capabilities.

Data observability, an essential aspect of effective metadata management, offers the ability to measure, monitor, and comprehend data behavior, quality, usage and dependencies.

Starting our journey towards enhanced metadata management

At Natural Intelligence, we navigate a diverse data ecosystem with various storage technologies. Our data lake relies on S3, with AWS Glue serving as the metadata store and Redshift as our data warehouse. To maintain infrastructure operations and support applications and data science pipelines, we use dedicated Airflow servers to execute Spark pipelines across multiple data assets, including the data lake, Redshift and MySQL databases. We also have other data producers and consumers, such as Salesforce and Tableau.

Given the importance of data discoverability and observability in our work, we sought an effective and user-friendly solution to manage metadata across our organization.

During this search, we came across OpenMetadata, an open-source platform offering centralized, collaborative and automated metadata management capabilities. It effectively addresses the pain points associated with data management, allowing organizations to overcome challenges related to data asset discovery, troubleshooting, understanding data flows and tracing lineage.

Our decision to adopt OpenMetadata was based on several key benefits: a simple architecture that is easy to use and maintain, a broad and continuously evolving feature set, and comprehensive support and documentation.

Diving into OpenMetadata

OpenMetadata offers an integrated metadata management solution through several components: a web-based interface, an API server that handles requests, a metadata store (using MySQL/PostgreSQL), a search and analytics engine (using Elasticsearch), and an ingestion framework running on Airflow.

Furthermore, it enables metadata and lineage to be pushed from external systems via APIs or plugins.

Overview of the OpenMetadata components and high-level interactions, sourced from OpenMetadata documentation

Metadata ingestion

The easiest way to start the pull-based ingestion and extract metadata from your data sources to OpenMetadata is to use the provided connectors through the UI.

In just a few clicks, you can set up a new service and establish a connection to your data source. From there, you can choose the desired ingestion type, such as metadata, profiler, usage, lineage, and more. The configuration allows you to define filter patterns to include or exclude specific data assets. The ingestion can be scheduled to run at your preferred interval (hourly, daily or weekly) or triggered manually.

The UI provides visibility into the status of the ingestion process, giving you the ability to start or stop it as needed, and access the logs for monitoring and troubleshooting purposes.

Creating a MySQL ingestion
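
To make the include/exclude filter patterns mentioned above more concrete, here is a minimal Python sketch of how such filtering could behave. It is not OpenMetadata code: the function name `is_selected`, the sample table names, and the assumed semantics (exclusion wins, and an empty include list means "include everything") are illustrative assumptions only.

```python
import re


def is_selected(name, includes=None, excludes=None):
    """Return True if `name` passes the include/exclude regex filters.

    Assumed semantics (illustrative only): a match on any exclude
    pattern rejects the asset; otherwise the asset must match at least
    one include pattern, unless no include patterns were given.
    """
    if excludes and any(re.search(p, name) for p in excludes):
        return False
    if includes:
        return any(re.search(p, name) for p in includes)
    return True


# Hypothetical table names in a source database
tables = ["orders", "orders_tmp", "customers", "stg_events"]

selected = [
    t
    for t in tables
    if is_selected(t, includes=[r"^orders", r"^customers$"], excludes=[r"_tmp$"])
]
print(selected)
```

With these patterns, the temporary table and the staging table are skipped, while the two business tables would be ingested.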

Discovering data assets

After setting up and connecting the ingestion process, discovering data assets becomes an easy task. You can search, discover, and explore tables, columns, pipelines, lineage and other entities. The ingestion system collects metadata from various data sources such as Redshift, MySQL, Glue, Airflow, Kafka and more, enabling discovery across all ingested assets.

Searching for cataloged entities in the OpenMetadata UI

Data Entity Versioning

Versioning of data entities simplifies the troubleshooting process. By examining the version history, you can determine whether a recent change caused a data issue, or use the alerting system to receive automatic notifications of such changes.

Table’s version history
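
The idea behind using version history for troubleshooting can be sketched in a few lines of Python. This is an illustration of comparing two schema versions, not how OpenMetadata stores or computes versions; the `diff_schema` helper and the sample columns are hypothetical.

```python
def diff_schema(old, new):
    """Compare two {column_name: data_type} mappings.

    Returns the columns that were added, removed, or whose type changed
    between the two versions.
    """
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    type_changed = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return {"added": added, "removed": removed, "type_changed": type_changed}


# Hypothetical consecutive versions of the same table
v1 = {"id": "INT", "email": "VARCHAR(100)", "created_at": "DATETIME"}
v2 = {"id": "BIGINT", "email": "VARCHAR(100)", "signup_ts": "DATETIME"}

changes = diff_schema(v1, v2)
print(changes)
```

A removed or renamed column such as `created_at` in this example is exactly the kind of change that tends to break downstream queries, which is why a recorded history of schema changes shortens the search for a root cause.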

Table Lineage

Relationships and dependencies between tables and views can be tracked and visualized through the OpenMetadata UI. Table lineage shows how data flows through tables and views, helping users understand how data is transformed, combined, or derived on its way from a table to a view.

To capture table lineage, OpenMetadata uses pull-based ingestion, in which metadata is extracted from the source databases and ingested into the platform.

Once the table lineage information is ingested into OpenMetadata, it can be accessed and visualized using the user interface.

Table lineage extracted from MySQL pull-based ingestion
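
Conceptually, table lineage is a directed graph in which each edge points from an upstream table to a table or view derived from it. The sketch below models that idea in plain Python with hypothetical table names; it does not use the OpenMetadata API, and the helpers `downstream` and `upstream` are illustrative.

```python
from collections import deque

# Hypothetical lineage edges: upstream asset -> assets derived from it
LINEAGE = {
    "raw.orders": ["staging.orders_clean"],
    "raw.customers": ["staging.customers_clean"],
    "staging.orders_clean": ["reporting.v_revenue"],
    "staging.customers_clean": ["reporting.v_revenue"],
}


def downstream(edges, start):
    """Breadth-first walk returning every asset reachable from `start`."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def upstream(edges, target):
    """Reverse the edges, then reuse the same walk to find all sources."""
    reversed_edges = {}
    for parent, children in edges.items():
        for child in children:
            reversed_edges.setdefault(child, []).append(parent)
    return downstream(reversed_edges, target)


impacted = downstream(LINEAGE, "raw.orders")
sources = upstream(LINEAGE, "reporting.v_revenue")
print(impacted)
print(sorted(sources))
```

The downstream walk answers "what is affected if this table changes?", and the upstream walk answers "where does this view's data come from?", the two questions a lineage view is typically used for.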

Pipeline Lineage

Another crucial aspect is understanding the data movement through pipelines, analytics platforms and other data processing frameworks. This type of lineage is visualized in the user interface as well, after being ingested into OpenMetadata.

To handle pipeline lineage, OpenMetadata uses push-based ingestion, where metadata and lineage information are pushed from external systems, such as data pipeline services. For instance, ingesting this data via Airflow can be accomplished using a lineage backend plugin provided by OpenMetadata.

However, this approach requires explicitly specifying the input and output tables for each data processing task, which is error-prone and difficult to maintain.
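
To illustrate why explicit declarations are hard to maintain, the following Python sketch models tasks whose inputs and outputs are listed by hand. The field names `inlets` and `outlets` mirror the Airflow operator arguments of the same name; everything else (task ids, table names, and the helper functions) is hypothetical and is not the OpenMetadata plugin's code.

```python
# Hypothetical task declarations: each task lists its inputs and outputs by hand.
TASKS = [
    {
        "task_id": "clean_orders",
        "inlets": ["datalake.raw.orders"],
        "outlets": ["redshift.staging.orders_clean"],
    },
    {
        "task_id": "build_revenue",
        "inlets": ["redshift.staging.orders_clean", "mysql.crm.customers"],
        "outlets": ["redshift.reporting.revenue"],
    },
    # A task whose author forgot the declarations: it yields no lineage at all.
    {"task_id": "export_to_crm"},
]


def lineage_edges(tasks):
    """Expand declared inlets/outlets into (source, task_id, target) edges."""
    return [
        (src, t["task_id"], dst)
        for t in tasks
        for src in t.get("inlets", [])
        for dst in t.get("outlets", [])
    ]


def undeclared(tasks):
    """Return the ids of tasks missing inlets or outlets."""
    return [t["task_id"] for t in tasks if not t.get("inlets") or not t.get("outlets")]


edges = lineage_edges(TASKS)
missing = undeclared(TASKS)
print(len(edges), "edges;", "tasks without lineage:", missing)
```

Every new table read or written by a pipeline requires a matching edit to these lists, and nothing forces the declarations to stay in sync with what the Spark job actually does.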

In the next post in the series, you will find out how we automated this process using OpenLineage customizations.

Embracing OpenMetadata

At Natural Intelligence, we understand the importance of data observability and have integrated OpenMetadata into our data management processes. The power of this tool allows us to achieve better visibility into our data landscape, by cataloging all data assets across different data sources within the organization. With this centralized catalog, we can now quickly search and discover data assets, view their schema and understand their usage patterns. Having this comprehensive view of the data helps us make informed decisions and establish policies/procedures to manage data quality, security, and compliance effectively.

In this blog post, we’ve taken a sneak peek at some of the key features offered by OpenMetadata. If you’re intrigued, go ahead and explore more features in the OpenMetadata documentation.

Stay tuned for part 2, where we’ll reveal how we took the lineage to the next level!
