Data Governance, Catalogs, Discovery and Popular Open Source Tools

Devesh K Chaubey
6 min read · Aug 9, 2022

Modern organizations generate enormous amounts of data, which can be hard to manage and keep track of. This article covers the following concepts:

  1. What is Data Discovery
  2. What is a Data Catalog
  3. What is Data Governance
  4. Data Governance vs Data Catalog
  5. Top tools for Data Governance (Open Source)

What is Data Discovery

Data discovery is the business-user-oriented data science process of visually navigating data and applying advanced analytics in order to detect patterns, gain insight, answer highly specific business questions, and derive value from business data.

What is a Data Catalog

A data catalog is an inventory of all data assets in an organization. It helps analysts and other data professionals find the most relevant data for an analytical or business purpose, and it provides the information needed to evaluate whether that data is fit for its intended use.
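To make the idea of an inventory concrete, here is a minimal sketch of an in-memory catalog with keyword search. It is only an illustration: the `DataAsset` and `DataCatalog` classes, their fields, and the sample entries are all hypothetical and do not correspond to any of the tools discussed in this article.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataAsset:
    """One catalog entry; every field here is illustrative."""
    name: str
    location: str  # where the asset lives, e.g. a warehouse table path
    owner: str
    description: str = ""
    tags: list[str] = field(default_factory=list)


class DataCatalog:
    """A toy in-memory inventory of data assets."""

    def __init__(self) -> None:
        self._assets: dict[str, DataAsset] = {}

    def register(self, asset: DataAsset) -> None:
        """Add an asset to the inventory, replacing any entry with the same name."""
        self._assets[asset.name] = asset

    def search(self, keyword: str) -> list[DataAsset]:
        """Return assets whose name, description, or tags mention the keyword."""
        needle = keyword.lower()
        hits = []
        for asset in self._assets.values():
            haystack = " ".join([asset.name, asset.description, *asset.tags]).lower()
            if needle in haystack:
                hits.append(asset)
        return sorted(hits, key=lambda a: a.name)


catalog = DataCatalog()
catalog.register(DataAsset("orders", "warehouse.sales.orders", "sales-team",
                           "One row per customer order", ["sales", "finance"]))
catalog.register(DataAsset("web_clicks", "lake/raw/web_clicks", "web-team",
                           "Raw clickstream events", ["marketing"]))

for asset in catalog.search("sales"):
    print(asset.name, "->", asset.location)
```

The description, owner, and tags are what let a user judge whether an asset suits their purpose, while the location tells them where to find it. Real catalogs add lineage, schemas, and usage statistics on top of this basic shape.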

What is Data Governance

Data governance is the process of organizing data and managing its availability, usability, security, and integrity across an organization so that data can be used efficiently in the company’s strategic activities.

It involves various tools, specialists, regulations, and performance metrics to ensure consistent and reliable insights. With well-designed data flows, organizations can define who can access data, what actions they can take in relation to it, by what methods, and much more.
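The "who can access data, and what actions they can take" part of governance can be sketched as a small policy check. This is a simplified illustration under assumed names: the roles, the dataset name, and the `Policy` and `is_allowed` helpers are hypothetical, and real governance tools typically model policies in much more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Grants one role a set of actions on one dataset (all names hypothetical)."""
    role: str
    dataset: str
    actions: frozenset


POLICIES = [
    Policy("analyst", "sales.orders", frozenset({"read"})),
    Policy("data_steward", "sales.orders", frozenset({"read", "update_metadata"})),
    Policy("data_engineer", "sales.orders", frozenset({"read", "write"})),
]


def is_allowed(role: str, dataset: str, action: str) -> bool:
    """Deny by default; allow only if some policy explicitly grants the action."""
    return any(
        p.role == role and p.dataset == dataset and action in p.actions
        for p in POLICIES
    )


print(is_allowed("analyst", "sales.orders", "read"))   # True
print(is_allowed("analyst", "sales.orders", "write"))  # False
```

Denying by default is a deliberate choice in this sketch: access that nobody has explicitly granted is treated as not permitted.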

Data Governance vs Data Catalog

While data governance identifies data owners, stewards, and users, the data catalog shows the data assets of an organization and where they’re located. In a nutshell, the catalog helps users get a handle on their data, so different data users know exactly where to go when data questions arise. With the growth in data volume, the data catalog has moved from a “nice to have” to a “must have” in the arsenal of data governance capabilities, and it is now a core component of data governance.

Popular Open Source Data Governance Tools

  1. Amundsen
  2. DataHub
  3. Atlas
  4. TrueDat
  5. OpenMetadata
  6. Magda
  7. Egeria

Amundsen

Amundsen is a data discovery and metadata engine developed at Lyft to improve the productivity of data analysts, data scientists, and engineers when interacting with data. It does so by indexing data resources (tables, dashboards, streams, etc.) and powering a PageRank-style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables). Think of it as Google search for data. The project is named after the Norwegian explorer Roald Amundsen, who led the first expedition to reach the South Pole. Amundsen is hosted by the LF AI & Data Foundation.

More details can be found here. At the time of writing, the latest release of Amundsen is 7.1.2, and the project has strong community support. You can keep an eye on new developments here.
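The usage-based ranking idea described for Amundsen can be illustrated with a toy example. This is not Amundsen's actual implementation or API. It is a minimal sketch that assumes a hypothetical dictionary of per-table query counts and simply sorts matching tables so that the most-queried ones come first.

```python
from __future__ import annotations

# Hypothetical usage statistics: table name -> number of queries in some period.
QUERY_COUNTS = {
    "sales.orders": 1200,
    "sales.orders_backup": 3,
    "sales.order_items": 450,
    "marketing.campaigns": 800,
}


def search_tables(term: str, query_counts: dict[str, int]) -> list[str]:
    """Return table names containing the term, most-queried first."""
    term = term.lower()
    matches = [name for name in query_counts if term in name.lower()]
    # Sort by usage (descending), then by name so ties are deterministic.
    return sorted(matches, key=lambda name: (-query_counts[name], name))


print(search_tables("order", QUERY_COUNTS))
# ['sales.orders', 'sales.order_items', 'sales.orders_backup']
```

The rarely queried backup table still matches the search term, but it sinks to the bottom of the results, which is the behaviour the "highly queried tables show up earlier" description refers to.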

DataHub

DataHub is an open-source metadata management platform for the modern data stack that enables data discovery, data observability, and federated governance. It was originally built at LinkedIn to meet the evolving metadata needs of their modern data stack. Read about the architectures of different metadata systems and why DataHub excels here. The project also points readers to the LinkedIn Engineering blog post, the Strata presentation, and the Crunch Conference talk. The DataHub Architecture page gives a better understanding of how DataHub is implemented.

Several organizations already run DataHub in production. The latest release at the time of writing, 0.8.42, came out in August 2022. More details can be found here.

Apache Atlas

Atlas is a scalable and extensible set of core foundational governance services that enables enterprises to meet their compliance requirements within Hadoop effectively and efficiently, and that allows integration with the whole enterprise data ecosystem. It provides open metadata management and governance capabilities so that organizations can build a catalog of their data assets, classify and govern those assets, and give data scientists, analysts, and the data governance team a way to collaborate around them. More details about features can be found here.

The latest release of Apache Atlas, 2.2.0, was in August 2021. Development cycles are slow compared to other tools. Click here for download details.

TrueDat

TrueDat is arguably the only full-fledged open-source data governance tool on this list. It was created by BlueTab (now an IBM company), a data solutions provider that understood the market’s needs and found a gap in the data governance space.

It is an open source data governance tool that helps clients become data-driven companies. The solution is built on open source technologies that are well established in the marketplace, and it covers a large share of typical data governance requirements.

With its latest stable version, v4.48, released in July 2022, this is one of the most mature open-source data governance tools out there. More details.

OpenMetadata

OpenMetadata enables metadata management end-to-end, giving you the ability to unlock the value of data assets in the common use cases of data discovery and governance, but also in emerging use cases related to data quality, observability, and people collaboration. Learn how OpenMetadata tries to solve the metadata problem and the features it provides in the following video.

OpenMetadata is a new and fast-evolving community. You can follow the official roadmap here.

Magda

MAGDA is an acronym that stands for Making Australian Government Data Available. Magda was developed by Data61, the data sciences arm of CSIRO (Australia’s Commonwealth Scientific and Industrial Research Organisation). With Magda, data analysts, scientists, and engineers can easily find useful data with powerful discovery features and make data-informed decisions with confidence.

Magda is definitely under active development, as the roadmap suggests. More details can be found here.

Egeria

Egeria was launched in 2019, and is maintained by the Linux Foundation’s AI & Data arm. It is an open source project dedicated to enabling teams to collaborate by making metadata open and automatically exchanged between tools and platforms, no matter which vendor they come from.

Development has moved at a swift pace since its launch. Egeria is currently at version 3.10. You can check out the upcoming features and fixes in the official roadmap.

Conclusion

Here’s a concise matrix that summarizes the major data governance features you might be looking for in your data governance tool. For simplicity’s sake, the matrix values have been kept to Yes and No; however, these tools implement the same features with differing levels of sophistication and maturity.

Source: https://atlan.com/open-source-data-governance-tools/

Licensed Products

There are also many licensed products available for data governance. Some are listed below:

  1. ASG Technologies
  2. Ataccama
  3. Collibra
  4. erwin by Quest Software
  5. IBM
  6. Informatica
  7. Io-Tahoe
  8. OvalEdge
  9. SAP
  10. Talend

Below are some reference articles:

Top 10 Data Governance Tools for 2021 | Spiceworks

Apache Atlas Alternatives — Amundsen, DataHub, Metacat, Databook

7 Popular Open-Source Data Governance Tools in 2022 | Atlan

Thanks for reading!

If you liked this article and would like to see more updates like it, please follow me on Medium.
