Technical vs Product Data Catalogs: Which one is best for you?
What is a data catalog and why do organizations use them?
Data catalogs are managed repositories of metadata that can be used for data governance, data security, and general management of data assets.
A catalog gives an organization a single place to see exactly what data it has and to enforce rules consistently across teams, which is especially useful when working with multiple stores and formats.
The many catalogs on the market offer a plethora of features, but the important ones are:
- Lineage: Tracing from source to consumer, including the different transformations and “forks” of a given piece of data across its lifecycle.
- Security: Notoriously difficult on its own, but row/column-level security and attribute-based access control (ABAC) are a must given the sheer volume of data being stored (see the sketch after this list).
- Analytics/AI Enablement: Catalogs primarily work with metadata, which requires considerably fewer computational resources than processing source data and lets you reach all of your data across departments and teams.
- Unification: If your catalog supports federated querying, multi-cloud deployments, and/or geo-distributed stores, you can unify not just structured and unstructured data but also differing formats and vendor lock-in.
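To make the security point concrete, here is a minimal sketch of how catalog metadata could drive column-level filtering with ABAC. Everything in it (the tags, the user attributes, and the policy rule) is hypothetical; real catalogs expose these concepts through their own policy engines and APIs.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMeta:
    name: str
    tags: set = field(default_factory=set)          # e.g. {"pii"}


@dataclass
class User:
    name: str
    attributes: set = field(default_factory=set)    # e.g. {"clearance:pii"}


def visible_columns(columns, user):
    """Column-level security: hide any column tagged 'pii' unless the user
    carries the matching clearance attribute."""
    allowed = []
    for col in columns:
        if "pii" in col.tags and "clearance:pii" not in user.attributes:
            continue
        allowed.append(col.name)
    return allowed


# An analyst without PII clearance only sees the non-sensitive columns.
schema = [ColumnMeta("order_id"), ColumnMeta("amount"),
          ColumnMeta("customer_email", tags={"pii"})]
analyst = User("ana", attributes={"dept:finance"})
print(visible_columns(schema, analyst))              # ['order_id', 'amount']
```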
Technical Data Catalog vs Product Data Catalog
Data catalogs often come in two flavours: technical (Glue, Metastore, Gravitino, etc.) and consumer-facing (Informatica, Collibra, etc.).
A technical data catalog has features built around big data and distributed use cases, but it is ultimately meant for infra teams and tends to be API-heavy. A data product catalog, on the other hand, has features geared toward self-service, so the expectations around how you interface with the catalog, and the promises around the user experience, are completely different.
In practice, a data product catalog (one-way serving) should be synced to and pull directly from a technical catalog (backend management). The way you interact with the product catalog, especially its UI, is expected to be much clearer and simpler, since the catalog then acts as a marketplace.
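As a rough illustration of that one-way relationship, the sketch below pulls schema and ownership metadata from a technical catalog and publishes a read-only view of it to a product catalog. Both clients and every method on them are hypothetical stand-ins for whatever APIs your catalogs actually expose.

```python
def sync_product_catalog(technical_catalog, product_catalog):
    """One-way sync: read from the technical catalog, publish to the product
    catalog, never the other way around."""
    for table in technical_catalog.list_tables():   # hypothetical API
        product_catalog.publish(                     # hypothetical API
            name=table.name,
            schema=table.schema,
            owner=table.owner,
            description=table.description,
            # Physical details (connections, partitions, file paths) are
            # deliberately left out of the consumer-facing view.
        )
```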
A technical catalog also includes physical information such as data connections and partition management, in conjunction with security features like fine-grained access control, which is important for lakehouse federation and supporting multiple query engines. Apache (incubating) Gravitino, for instance, is focused on lake serving and data unification for RAG-based data consumption, which has very different needs from traditional data governance. A data product catalog, on the other hand, is likely to be centered around schemas, views, and ownership, with metadata fields that are more useful in business and operational contexts. You will also often see data product catalogs focused on detailed lineage tracing and knowledge graph building.
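For a feel of how a technical catalog is typically driven through its API rather than a curated UI, here is a minimal sketch that lists the catalogs registered under a Gravitino metalake over REST. The host, metalake name, endpoint path, and response shape are assumptions for illustration; check the Gravitino documentation for the exact API in your version.

```python
import requests

GRAVITINO_URI = "http://localhost:8090"   # assumed local deployment
METALAKE = "demo_metalake"                # hypothetical metalake name

# List every catalog registered under the metalake (Hive, Iceberg, MySQL, ...).
resp = requests.get(f"{GRAVITINO_URI}/api/metalakes/{METALAKE}/catalogs")
resp.raise_for_status()
for identifier in resp.json().get("identifiers", []):
    print(identifier)
```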
Which one is right for you?
It is likely that any organization dealing with large amounts of data will have some variation of both a technical data catalog and a data product catalog; the two are not mutually exclusive. However, depending on company size and the amount of data involved, organizations may invest in data product and technical data catalogs differently.
For example, a technical data catalog is a must-have for any company managing large amounts of data across various stores, formats, and distributed systems. This is a natural evolution: as more and more people need to access the data with varying permissions, it becomes very difficult to streamline the management of data assets. The key to deciding which one is best for you lies in what your team is struggling with.
A data product catalog may be best for you if:
- Your data is already well modeled and managed, but you have trouble serving it in a self-contained way
- You already have a strong data product culture and data marts, but require a centralized place to manage them
- Your consumers are not technical, and there is a hard requirement for a large degree of separation between your infrastructure and analytics teams
- Your organization has very strict zero-trust policies and requires that consumers interact only through a one-way endpoint
- The ideal managers of your catalog are designated data stewards and data product managers who have strong business intelligence requirements
A technical data catalog may be best for you if:
- Your organization is actively struggling with data quality, ingestion, observability, and formatting across many different types of data sources
- You have heavy needs for federation due to legal requirements around where data can be stored and who can access it
- Your organization runs on a mix of legacy and modern data infrastructure across multiple clouds, making data difficult to manage and serve
- Your consumers need end-to-end visibility of the data and have requirements around data masking and artifact management
- The ideal managers of your catalog are a data infrastructure team that also has performance and optimization requirements
As you can see, the “serving” aspect is a huge part of data product catalogs: the one-directional relationship is emphasized and is safer for those who interact with the catalog. This is incredibly important if your data is relatively straightforward and does not require the catalog to provide the backend for quality and ingestion frameworks. The downside, however, is that data product catalogs can be limiting in high-performance, high-velocity data systems such as stream processing for real-time analytics or unstructured datasets for large language models.
Behind a data product catalog, though, could easily sit a technical data catalog. Technical data catalogs are made to solve two of the biggest issues data engineers face: interoperability and compliance. An effective data catalog can, and should, shield users from having to interface with cumbersome, messy data sources spread across different clouds, data centers, and formats. An effective technical data catalog takes it a step further by helping engineers feed frameworks into each other and find bottlenecks, providing insight into the physical aspects of the data and its transformations. It is much less about providing a curated consumer experience than about providing a system for end-to-end visibility and management.
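As a rough example of that physical insight, the sketch below walks upstream through lineage metadata from a consumer table and flags any transformation whose recorded runtime looks like a bottleneck. The lineage structure and the runtime field are hypothetical; real catalogs expose lineage through their own APIs.

```python
# dataset -> list of (upstream_dataset, transformation_runtime_seconds);
# in a real system this would come from the catalog's lineage API.
lineage = {
    "dashboard.sales_daily": [("staging.orders_clean", 45)],
    "staging.orders_clean": [("raw.orders", 1800), ("raw.customers", 30)],
}


def find_bottlenecks(dataset, threshold_s=600):
    """Walk upstream depth-first, reporting transformation hops slower than
    the threshold."""
    slow = []
    for upstream, runtime in lineage.get(dataset, []):
        if runtime > threshold_s:
            slow.append(f"{upstream} -> {dataset} ({runtime}s)")
        slow.extend(find_bottlenecks(upstream, threshold_s))
    return slow


print(find_bottlenecks("dashboard.sales_daily"))
# ['raw.orders -> staging.orders_clean (1800s)']
```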
All in all, only you can make the best decision for your organization’s data. It is extremely important to consider the requirements you have, and the solutions you need from a catalog, before fully committing to a product.
Interested in deploying a technical data catalog yourself? Check out Apache (incubating) Gravitino, an open source technical data catalog optimized for big data & AI.