In a Data-Driven World, Data Catalogs Are Necessary Tools

Published in Poatek, Feb 8, 2023

A data catalog is a piece of software that specializes in documenting everything related to the data belonging to a company or a project. Think of the following scenario: you've been working on a project for a long time, built most of the database architecture, know every column name of every table in it, and then you are reassigned to another project. Would the people remaining on the project be able to work with the database with ease? This is where a data catalog comes in handy.

So, how would a data catalog help in this kind of situation? The idea is that the data catalog centralizes the database's metadata, which is information about the data in the database. This can include who created a given table, who its most frequent users are, a sample of its rows, the data type of each column, and which columns are primary keys. All of this is useful when exploring a new data set: think of all the time you spent learning about a data set when joining a new team.
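To make the idea of metadata concrete, here is a minimal sketch of the kind of information a catalog collects, using Python's built-in `sqlite3` module. The `employees` table and the `collect_table_metadata` helper are hypothetical and exist only for illustration; a real catalog would gather this from many sources and store it centrally.

```python
import sqlite3


def collect_table_metadata(conn, table, sample_size=2):
    """Return column types, primary keys, and a few sample rows for one table."""
    # PRAGMA table_info yields one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    sample = conn.execute(
        f'SELECT * FROM "{table}" LIMIT ?', (sample_size,)
    ).fetchall()
    return {
        "table": table,
        "columns": {name: col_type for _, name, col_type, _, _, _ in columns},
        "primary_keys": [name for _, name, _, _, _, pk in columns if pk > 0],
        "sample_rows": sample,
    }


# Hypothetical example table, created in memory purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, team TEXT)")
conn.executemany(
    "INSERT INTO employees (name, team) VALUES (?, ?)",
    [("Ana", "marketing"), ("Bruno", "hr"), ("Carla", "accounting")],
)

metadata = collect_table_metadata(conn, "employees")
print(metadata)
```

A newcomer to the project could read this output instead of asking whoever built the table, which is the gap a data catalog is meant to close.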

Data catalogs are useful not just for documenting a project's data, but also for democratizing data access inside a company. A company might have different sources of data coming from different sectors and teams. Think of marketing, HR, accounting and even the specialized data team. All of these teams might gain value from accessing data generated by another team. HR might have information on whether or not employees are satisfied with the office facilities, and that could be valuable for a marketing campaign, for example.

Knowing this, I have gathered four data catalog services and software options: three deal with scenarios where cloud-based solutions are in place, and one is a build-over-buy solution. Let's have a look at them.

Microsoft Purview

Our first data catalog service is in fact a data governance service. Microsoft Purview is one of many services provided in the Azure cloud stack and is intended to support a data governance framework in enterprise scenarios. Its goal is to gather metadata from multiple sources, including on-premises and multicloud data sources, and to facilitate data discovery and access management for this data.

Diagram taken from Microsoft Learn website

Microsoft Purview has received a lot of attention in recent years as its list of features has grown, with new connectors being released at a relatively fast pace. However, as of this blog post's publication, Purview is generally considered best suited to the Microsoft stack, and many data engineers point out that its data lineage feature needs to improve. To learn more about the Microsoft Purview platform you can check [1].

AWS Glue Data Catalog

The next data catalog service we will discuss is AWS Glue. In contrast with Microsoft Purview, AWS Glue is a data integration service mostly focused on gathering data from various sources and making it easy for data engineers and data analysts to transform and store that data. Admittedly, this makes Glue more of an ETL service than a data catalog. However, the service is most useful with good data aggregation, so that you can work with all your data in one place. That is where Glue serves our data cataloging goals: the Glue crawlers connect to all configured data sources, infer schemas, gather metadata, and then store this metadata in Glue tables. To read more about Glue and follow a full walk-through of the service you can check [2].

Diagram taken from AWS Glue documentation
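Once crawlers have populated the catalog, the stored schemas can be read back programmatically. The sketch below assumes the `boto3` library and its Glue `get_tables` call, whose response contains a `TableList` with each table's `StorageDescriptor` columns. The helper names and the `orders` table are hypothetical, and the response shown is a hand-written stand-in rather than real output.

```python
def summarize_glue_tables(response):
    """Map each table name to {column name: type} from a get_tables response."""
    summary = {}
    for table in response.get("TableList", []):
        columns = table.get("StorageDescriptor", {}).get("Columns", [])
        summary[table["Name"]] = {
            col["Name"]: col.get("Type", "unknown") for col in columns
        }
    return summary


def fetch_glue_tables(database_name):
    """Fetch one page of table metadata (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the summarizer runs without AWS access

    glue = boto3.client("glue")
    return glue.get_tables(DatabaseName=database_name)


# Hand-written stand-in for a get_tables response; the table is hypothetical.
fake_response = {
    "TableList": [
        {
            "Name": "orders",
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "order_id", "Type": "bigint"},
                    {"Name": "created_at", "Type": "timestamp"},
                ]
            },
        }
    ]
}

glue_summary = summarize_glue_tables(fake_response)
print(glue_summary)
```

Against a real account you would pass the result of `fetch_glue_tables("your_database")` instead of `fake_response`. Note that `get_tables` is paginated, so a database with many tables needs several calls or a paginator.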

It is important to note that Glue, even though it can connect to external data sources such as several JDBC-compatible databases, is mostly helpful in AWS-based projects, even more than Purview is with Microsoft solutions. In both cases there is a subtle vendor lock-in that is not trivial to avoid when using platforms that aim to integrate many data sources. This specific problem is targeted by our last two data catalog solutions.

Databricks Unity Catalog

Founded by the creators of Apache Spark, Databricks is a company that focuses on providing products that enable analysts and engineers to use Spark and IPython notebooks. One of its products is Unity Catalog, which aims to be a vendor-agnostic solution for data discovery and governance. Unity has several connectors that can integrate with most data stacks, being able to pull data from AWS S3, Azure Data Lake Storage, Hive, and more. As a result, Unity doesn't privilege any one cloud provider, and users don't need to worry about finding a new home for their data catalog in case of a data migration, for example.

Unity has all the cataloging features of the previous solutions we mentioned, such as lineage, multiple connectors, and easy data discovery. You can read more about it in [3].

Amundsen

Now, for our last data catalog, we have a build-over-buy solution. The solutions mentioned so far are all proprietary, closed-source software provided as a service by Microsoft, Amazon, and Databricks, respectively. This means users cannot fully fine-tune the data catalog to their needs in, say, an extreme edge-case scenario. The workaround is to use an open-source solution.

Amundsen is an open-source data catalog created by Lyft and based on microservices. It has a web interface just like the others mentioned before and aggregates information such as the most frequent users of a given table, sample queries with all columns of each table, and tags assigned to tables so that they are easy to search.
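As a rough illustration of what "most frequent users" and "tags" mean in practice, here is a toy sketch in plain Python. It is not Amundsen's data model or API: the query log, the tag assignments, and both helper functions are hypothetical and only show the kind of aggregation a catalog performs behind its search page.

```python
from collections import Counter, defaultdict

# Hypothetical query log: (user, table) pairs.
query_log = [
    ("ana", "sales.orders"),
    ("bruno", "sales.orders"),
    ("ana", "sales.orders"),
    ("carla", "hr.survey"),
    ("ana", "hr.survey"),
    ("carla", "hr.survey"),
    ("carla", "hr.survey"),
]

# Hypothetical tags assigned to tables by their owners.
table_tags = {
    "sales.orders": {"sales", "finance"},
    "hr.survey": {"hr", "satisfaction"},
}


def frequent_users(log, table, top_n=2):
    """Return the users who queried a table most often, most frequent first."""
    counts = Counter(user for user, queried in log if queried == table)
    return [user for user, _ in counts.most_common(top_n)]


def build_tag_index(tags_by_table):
    """Invert {table: tags} into {tag: tables} so tables can be found by tag."""
    index = defaultdict(set)
    for table, tags in tags_by_table.items():
        for tag in tags:
            index[tag].add(table)
    return index


tag_index = build_tag_index(table_tags)
print(frequent_users(query_log, "sales.orders"))
print(sorted(tag_index["hr"]))
```

Knowing who queries a table most often tells a newcomer whom to ask about it, and the inverted tag index is what lets a search for a tag return every matching table.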

The microservices architecture, along with Amundsen's open-source nature, allows the community to contribute extensively to its development. As a result, Amundsen has a myriad of connectors ready to catalog the data stacks most users will need.

Diagram taken from Amundsen documentation of its Architecture, accessed in January 2023

Amundsen is a great data catalog platform that is changing and developing fast. I studied its implementation in December 2022 and had the impression that its documentation was a bit lacking. Coming back to it in January 2023, I can see great improvements. Check Amundsen's repository at [4].

I hope these four data catalog options can help readers implement a data governance framework and facilitate data discovery in any enterprise scenario. With more and more data being generated and gathered every single day, it is important that we take care not to drown in it.

References

[1] Microsoft Purview, its Deployment and how it works across the Microsoft 365 stack | by Andre Camillo | Microsoft Azure | Medium
[2] AWS Glue 101: All you need to know with a full walk-through | by Kevin Bok | Towards Data Science
[3] Unity Catalog in Databricks. What is Unity Catalog ? | by Harun Raseed Basheer | Medium
[4] GitHub: amundsen-io/amundsen