Data Cataloging and Governance using OpenMetadata at Thndr

Published in

Thndr

7 min readJul 21, 2024

Thndr App is a digital investment platform with the purpose to simplify investing. Thndr facilitates access to the right tools and resources that can empower investors with the means to achieve financial freedom.

At Thndr, we have growing number of users and with that we constantly enhance our schema configs and it’s hard to keep track of changes happening within metadata. Since it’s a financial application, we have a core responsibility to protect sensitive customer data; hence, underscoring the significance of data governance framework.

When we were looking for solutions, we tried different tools and we came across OpenMetadata, which is an open-source tool to get complete insights regarding our data from all sources, keep track of changes happening within the schema and tables and for detecting PII sensitive and non-sensitive PII columns, building glossary, and for having data quality alerts. Thus, it’s an all in one tool for data cataloging, data governance, data quality, data lineage, and advance monitoring of data using custom metrics.

When the data increases, it’s difficult to manually keep track of sensitive columns within data, data being fetched and updated on time, quality of data, and changes happening within the data present in data lake or data warehouse. Thus, it’s the responsibility of the data engineering teams to set and communicate a responsible data governance model and data cataloging framework. This helps the data teams to be aware about the changes happening within their schema, tables, stored procedures, and views.

OpenMetadata Features

Metadata Versioning: OpenMetadata captures the changes within the schema along with changes in the volume of the data. It tracks all the changes happening within the schemas. For instance, if there is a new column added into the table, it will detect schema changes.
Data Profiling: Openmetadata has an embedded data profiler which compute metrics on tables, such as unique records, null proportion, row count, data trends, etc.
Data Lineage: Supports a comprehensive lineage for all data assets by capturing the relation between entities. Traces the path of data across tables, pipelines, and dashboards.
Central Repository for Metadata: A central store to integrate metadata from different sources in the data ecosystem. Uses standardized Schemas and APIs. Brings together the power of multiple applications — data discovery, quality, observability, profiling, user collaboration, lineage, and much more. It fetches DDLs, stored procedures, views, tables, and column details for the configured schemas.
Column Level Lineage: In OpenMetadata, we primarily compute column-level lineage through SQL query analysis. Lineage information is available from various sources, such as ETL pipelines, data warehouse, databases, dashboards, and so on.

Getting Started with OpenMetada:

There are two ways to deploy OpenMetadata. The first method is to deploy on Kubernetes using helm charts and the second method is to deploy using docker image on virtual machines using docker swarm. OpenMetadata uses different types of internal storage (PostgreSQL, MySQL) and for indexing and searching, it uses Elastic search. It is possible to deploy database and elastic search on the same machine in dockerfile or use AWS based managed RDS and Elastic search. The good thing about OpenMetadata is that it also provides google authentication and role based access control.

Metadata Ingestions with OpenMetadata

OpenMetadata provide multiple connectors through which it can fetch metadata from and run multiple types of ingestions on it. Some of the connectors include postgres, Mysql, superset, Glue, S3, Redshift, etc.

In order to create catalog in OpenMetadata, create connection with the database by clicking on Settings and then select Databases and click on Add New Service. Select service as redshift or the service that is needed for ingestion. Enter the credentials of redshift and click on save, it will test the connection initially.

Openmetadata provide five types of ingestion based on the need. For data cataloging, click on settings->databases and select the database and click on ingestions. Select Add Metadata Ingestion and choose the ingestion workflow. Initially, start with metadata ingestion.

After choosing metadata ingestion, specify database name, schema name, and table. Regex pattern can also be provided. This ingestion will be scheduled and it will record all changes happening within the metadata.

Run the ingestions and if it fails, the error can seen from the logs if logs were enabled during ingestion.

Data Cataloging with OpenMetadata

Data Catalog is built on top of all the metadata from the databases or data warehouses to make it easy for the users to search by schema name, table name, column name, and filter them based on multiple options. This will help them to flag sensitive columns. OpenMetadata offers inbuilt data cataloging feature with many different datasets and tables, each with varying access levels for different users, a data catalog provides a single, unified user experience for quickly discovering these datasets.

Once the metadata and profiler ingestions are completed, the metadata, DDL, views, and stored procedures will be available and can be explored through data catalog. In the search bar, one can discover data through table name, schema name, column name, etc. It can search from multiple datasets at a time. It also has few filters like if you want to filter on the basis of database, domain, tier, service, schema, column, or table type (regular, view).

Data Profiling with OpenMetadata

One of the best feature of OpenMetadata is that it provides all the information regarding the tables itself. It not only fetches the metadata for the schema tables but also compute metrics

Profiling data and communicating quality across the organization is core to OpenMetadata. While numerous tools exist, they are often isolated and require users to navigate multiple interfaces. In OpenMetadata, these tests and data profiles are displayed alongside your assets (tables, views) and allow you to get a 360-degree view of your data.

When the profiler is completed, it shows column profile for each table with inbuilt metrics like Null percentage, Unique percentage, distinct percentage, row count, value count, size of data, etc.

Data Governance with OpenMetadata

Data governance plays an important role in safe guarding confidential data especially PII columns. This is important so that the access to this data can be prevented across the organization and access should be granted to only relevant people. Manually identifying PII columns is time consuming and can lead to errors or leaving behind certain data; thus, OpenMetadata uses machine learning to auto classify PII columns into sensitive and non-sensitive. It uses different policies like GDPR to flag these columns.

id_number is marked as sensitive using PII auto tagging during profiler ingestion. Likewise, it classifies email, phone number, address, account details, etc. as PII Sensitive. On the other hand, it marks first name, last name, etc as PII Non-sensitive.

Data Quality with OpenMetadata

Data is received in raw format from different applications/ operational databases in production and variety of sources; hence, it may have missing values, whitespaces, invalid characters, outliers, incorrect data types, duplicates, etc., which can be problematic for Data Analyst and Data Scientist. Noisy data can ruin the ML Pipelines and skew the analytical results and won’t help in drawing conclusion from the data. Thus, with OpenMetadata, we can detect different types of data quality issues like duplicates, null count, and any custom SQL query to check any constraint like missing records, etc. If the test case fails, an alert can be configured on slack, email, teams, etc. Based on these alert, transformations can be done on the data.

Data Lineage with OpenMetadata

OpenMetadata offers a detailed solution for tracking and visualizing data lineage, which is crucial for understanding data transformations and usage within an organization. It supports lineage for data warehouses, data lakes, databases on both table and column-level traceability. The lineage can be added using an ingestion and once the profiler is completed, we get a visualization representing the connectivity and relationship between the data. This is helpful when you want to track origin of your data.

And… To Conclude!

I hope the article provides detailed overview of the latest features of OpenMetadata and how important it is. The latest version has a lot of improvements and they also offer APIs. The best thing is that it’s free and open source.