A Practitioner’s Guide to the Data Catalog

Petr Travkin
9 min read · Aug 2, 2023

--

Like any transformation, a Data Governance journey involves three obvious major components: people, processes and technologies. So, where to start? It depends on the business goals, what your company culture is and how fast your company prefers to move in each of these three areas. Some companies choose to launch an enterprise program and start with people (e.g. organisational structures, ownership) and processes (e.g. policies, standard operating procedures); others create a small enthusiastic data management group and start a data democratisation initiative, promoting offensive Data Governance in a practical way: through a Data Catalog implementation. Each of these styles has its own challenges, advantages and disadvantages, but the good news is that the technology market today can offer a tool for any Data Governance implementation style, type of user community and required adoption speed.

Before going into details I would like to make a disclaimer that this article neither promotes any particular market player or type of solution nor aims to provide a comprehensive market overview. It is practitioner-to-practitioner knowledge sharing, and “I’d far rather be happy than right any day.” So, if you are of a different opinion on anything written below, odds are you are right as well, based on your experience. It’s not mathematics.

Types of Data Catalogs

Let’s start by looking into what the Data Governance technology market can offer and what types of Data Catalogs exist. Roughly, there are four main categories of Data Catalogs.

  1. Stand-alone solutions offer key and additional data cataloguing components within a single tool. Both commercial and open-source offerings are available; examples include Alation, Atlan, data.world, Zeenea, Amundsen and DataHub.
  2. Platform solutions offer key data cataloguing functions with modules providing additional capabilities like Data Quality, Data Privacy and, in some cases, even MDM. Examples include Ataccama, Collibra, IBM, Informatica, Precisely and Talend.
  3. Cloud-native Data Catalogs provide key components, mostly limited to the cloud service provider’s environment. Use cases such as orchestration and ETL processes are the main focus. Examples include AWS Glue, Azure Purview and Google Data Catalog (part of Dataplex).
  4. Tool-specific Data Catalogs (add-ons) support a specific tool, for example within the area of business intelligence, by providing key components as well as purpose-related additional cataloguing features. A good example is Tableau Catalog.

Looking at the last two categories, Databricks Unity Catalog, which is gaining traction at the speed of light, is an interesting case: initially it could be considered tool-specific, but with all the latest developments it is now closer to the cloud-native or even stand-alone category.

There is also a fifth category worth a brief mention, which can be called a data services catalog for agile software development or data engineering teams. This type of catalog provides not only metadata about the various types of products available, but also connection points, be it a Kafka topic or an API, and can serve as a developer portal as well. A good example is backstage.io, created by Spotify. Since it is not a Data Governance tool, I will not go into details.

Data Catalog maturity levels

This division into maturity levels is indicative, and the borders can be blurred. However, in practice these four main levels have been observed.


L1 — Technical metadata hub. A metadata registry for data available in the data platform, with ad-hoc curation based on crowdsourcing by advanced users. It mostly performs metadata ingestion from various on-prem and cloud data sources, with ad-hoc data modelling, and is used by advanced users (e.g. data analysts) to find data for building advanced analytics applications. Sometimes it can be a good start for enabling data democratisation, especially in agile environments following the “from chaos to structure” implementation approach, which carries certain risks (see below).
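To make the ingestion part tangible: at L1 the catalog is essentially a harvester that introspects source systems and stores technical metadata (schemas, tables, columns). Below is a minimal illustrative sketch, not any vendor’s API, using SQLite as a stand-in source system:

```python
import sqlite3

def harvest_technical_metadata(conn: sqlite3.Connection) -> list[dict]:
    """Introspect a SQLite database and return one metadata record per column."""
    records = []
    # Table names come from sqlite_master, the database's own schema registry
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        for cid, name, col_type, notnull, default, pk in conn.execute(
            f"PRAGMA table_info({table})"
        ):
            records.append({
                "table": table,
                "column": name,
                "type": col_type,
                "nullable": not notnull,
                "primary_key": bool(pk),
            })
    return records

# Example: harvest metadata from an in-memory source system
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
metadata = harvest_technical_metadata(conn)
```

A real connector would do the same against JDBC/ODBC endpoints, cloud warehouses or file systems, and push the records into the catalog’s store.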

L2 — Curated data inventory. A curated data registry with foundational governance capabilities, data classification and user collaboration. Metadata can be fetched from various places, including other data catalogs (e.g. cloud-native ones). Integration with communication systems (e.g. Slack) is possible via API and plays a key role in data curation. Since the data becomes more structured, data developers can leverage it for data search and understanding context. Data Lineage becomes more important and should be provided up to the level of analytics applications.
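As an illustration of the Slack integration mentioned above, a catalog can post a message to a channel whenever an asset needs curation. The sketch below only builds an incoming-webhook payload using Slack’s Block Kit message structure; the dataset, issue and user values are invented for the example:

```python
import json

def curation_request_payload(dataset: str, issue: str, steward: str) -> str:
    """Build a Slack incoming-webhook payload asking a steward to curate a dataset."""
    message = {
        "text": f"Curation needed for *{dataset}*",
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"<@{steward}> please review *{dataset}*: {issue}",
                },
            }
        ],
    }
    return json.dumps(message)

payload = curation_request_payload("sales.orders", "missing column descriptions", "U123")
# To send, POST this payload to your Slack incoming-webhook URL
# (e.g. with urllib.request or the requests library).
```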

L3 — Data Governance Platform. The catalog is integrated with Data Governance processes, tasks are automated within it, and it becomes a single point for data onboarding, assessment and metrics collection. Data Governance brings several new requirements such as Data Quality, Data Classification and workflow execution. These features can either belong to the catalog itself or be provided by third-party tools via API integration. Since the data is curated and governed, it can be used in business applications consumed by business users.

L4 — Enterprise Data Marketplace. A single point of data discovery and access in the enterprise for all categories of data users. The Data Marketplace can be internal only or span multiple external data consumers and providers, in which case API integration with external systems is required.

Moving from one level to another might require additional capabilities to enable growth and sustainable adoption. Let’s look into core and additional data catalog capabilities and define what is necessary for each level.

Data Catalog capabilities

Data Management capabilities provided by a Data Catalog can be divided into the following major categories, each containing capabilities that might be required at different maturity levels.

  1. Data Inventory (L1+) allows you to register data sources and to organise and describe data by ingesting and curating business, technical and operational metadata. This capability includes data source connectivity, data sampling, Business Glossary, Data Dictionary, Metadata Management and Data Lineage.
  2. Data Assessment (L1+) evaluates data for fitness for use. It includes data profiling, measuring data risk via classification, PII detection and tracking data usage to understand how popular datasets are or to perform audits. Data Quality assessment also falls under this capability, though it is likely to be provided either by an additional module of a platform-type catalog (e.g. Collibra, Informatica) or by a third-party tool via API integration. Either way, it is critical to have Data Quality information in the Data Catalog to complete the fitness-for-use assessment.
  3. Data Discovery (L1+) enables users to locate the data assets they need via Google-like search, exploration and recommendations. This capability is key to the success of Data Catalog adoption and the sustainable growth of the user community. It is important to highlight that some Data Catalog solutions separate this capability into a Marketplace add-on, which allows not only combining external and internal datasets, but also an online-shop experience with the option of requesting access via a shopping cart.
  4. Data Governance (L3+) enables data curation activities by defining roles and responsibilities, rules (e.g. fullness of asset curation), policies (e.g. data retention or archiving), task automation and standardisation via workflows (e.g. changing asset metadata or requesting access to a dataset), and manual or automated tagging, including sensitive data definition.
  5. Data Collaboration (L2+) enables communication and metadata crowdsourcing via tagging, rating, reviewing, sharing and messaging. This is a key capability for facilitating data curation; combined with a reasonable amount of non-invasive governance, it can boost tool adoption and metadata quality.
  6. AI automation and assistance (L2+) facilitates data curation by supporting users and taking over manual tasks, enabling data catalogs to scale. Most capabilities can potentially be supported by AI functions to some extent, e.g. in the areas of data ingestion, data labelling, classification and search.
  7. Adoption tracking and Audit (L3+) allows you to monitor and measure data catalog performance, analyse user behaviour to track changes and log user activity to analyse tool adoption progress. Some solutions have embedded, customisable dashboards to make this task a pleasant experience.
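As a concrete (and deliberately simplified) illustration of the Data Assessment capability, sensitive data detection can start as rule-based matching over sampled column values. Real catalogs use much richer classifiers; the patterns, threshold and tag names below are assumptions made for the sketch:

```python
import re

# Illustrative patterns only; production PII detection needs far broader rules.
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "phone": re.compile(r"^\+?\d[\d\s()-]{7,}\d$"),
}

def classify_column(sample_values: list[str], threshold: float = 0.8) -> list[str]:
    """Tag a column with PII categories if enough sampled values match a pattern."""
    tags = []
    non_empty = [v for v in sample_values if v]
    if not non_empty:
        return tags
    for tag, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= threshold:
            tags.append(tag)
    return tags

# Empty strings are ignored; 2 of 2 non-empty values match the email pattern
tags = classify_column(["alice@example.com", "bob@example.org", ""])
```

The resulting tags would then feed the Data Governance capability, e.g. triggering an access-restriction workflow for anything classified as sensitive.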

The maturity indication above is not strict, and some features might be relevant at different levels. What is important to understand is that moving up the maturity levels means scaling up: growth of the user community and of curation demand, which in turn will require more automation and AI augmentation.
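Likewise, the “Google-like search” behind Data Discovery can be pictured, in toy form, as ranked keyword matching over curated metadata descriptions (purely illustrative, no specific product implied):

```python
from collections import defaultdict

def build_index(assets: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased token in an asset's description to the assets containing it."""
    index = defaultdict(set)
    for name, description in assets.items():
        for token in description.lower().split():
            index[token].add(name)
    return index

def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Rank assets by how many query tokens their descriptions contain."""
    scores = defaultdict(int)
    for token in query.lower().split():
        for name in index.get(token, set()):
            scores[name] += 1
    return sorted(scores, key=scores.get, reverse=True)

index = build_index({
    "sales.orders": "daily customer orders with revenue",
    "hr.employees": "employee records with salary data",
})
results = search(index, "customer revenue")
```

This also hints at why curation matters for discovery: the search is only as good as the descriptions the stewards and the crowd have provided.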

Data Catalog Implementation

As mentioned above, a Data Catalog can be implemented at different stages of a Data Governance program and play various roles. Three approaches have been observed in practice, each with its own advantages and risks.

The iterative governed approach, based on data sources/data domains with planned governance enhancements, starts with an awareness creation plan, prioritised data domains and key roles available from the start. It enables fast and safe business user onboarding, thus maximising business value.

What to consider:

  • High upfront planning and alignment efforts
  • Minimum viable training should be provided to key roles
  • Data Catalog tool should be carefully selected based on detailed requirements
  • Limited collaboration at the start and more centralised control

When it might not work:

  • An agile end-user community of advanced data professionals might not need an upfront, highly governed data catalog and can do curation via crowdsourcing and organic stewardship efforts
  • Open-source or cloud data catalog with limited capabilities and unfriendly UI

From chaos to structure aims to bring all the metadata in and let users collaborate on curation while data governance evolves gradually. An agile end-user community of advanced data professionals doesn’t need an upfront, highly governed data catalog and can do curation via crowdsourcing and organic stewardship efforts. Bringing all metadata in at once can help reveal duplicate datasets and provide a comprehensive picture of the initial Data Quality state via profiling.

What to consider:

  • Training should be provided to all advanced catalog users
  • Data Catalog tool should be carefully selected based on detailed requirements
  • License/usage costs should be carefully considered, as some data catalog solutions charge by the number of datasets profiled and the volume of metadata loaded

When it might not work:

  • Open-source or cloud data catalog with limited collaboration, profiling and sharing capabilities
  • Highly regulated data environment with sensitive data
  • Governance-first approach to data management

The mixed approach has different parts of the catalog following their own approach, with view permissions applied to restrict access. It fits mixed-skill user communities and prioritised data domains. It is possible to start adding business value immediately for some of the domains and grow the others organically via crowdsourced curation. Some key roles should be available from the start, while others emerge organically. Advanced users are not limited to highly curated datasets.

What to consider:

  • High user access security set-up effort
  • Minimum viable training should be provided to all catalog users
  • Data Catalog tool should be carefully selected based on detailed requirements (especially security)
  • Highly depends on the DG operating model type (centralised vs federated)

When it might not work:

  • Open-source or cloud data catalog with limited security capabilities
  • Centralised DG Operating model with limited representation within data domains

Which approach to take depends on multiple things, including but not limited to Data Governance strategy, business goals, company culture, DataOps practices and the user community.

Most likely, with any approach, the following high-level steps should be taken to enable a successful data catalog implementation and adoption:

  1. Assess your needs and goals to map them to Data Catalog capabilities and create an efficient enablement plan
  2. Review your data processes and tech landscape to define the required integrations and customisations
  3. Review your Data Governance model, or create one, to enable Data Catalog adoption and operational efficiency
  4. Create a thorough implementation plan, including an MVP phase, and ensure smooth execution to streamline value generation

Before starting the MVP, take some time to prepare and think through the following aspects of the future solution:

  • What would be the initial Critical Data Elements, data domains and data sources?
  • Who will be your data domain champions and data stewards? Can these key people allocate time to support the initiative?
  • What level of Data Catalog are you planning to build during MVP (see above)?
  • What would be key Data Catalog capabilities you would like to start with?

When preparation is done, start by setting up a security model with roles and permissions, as well as a metadata asset model that fits your Data Governance requirements. Some catalogs also have on-premise components for metadata harvesting that require set-up, and that can take some time. After this, you can most probably go ahead, connect your first data source and see metadata flowing in!
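The security and asset model mentioned above can be prototyped in a few lines of code before any tool configuration. A hedged sketch of how domain-scoped roles, view permissions and a minimal asset model might fit together; the role names, fields and the sensitivity gate are illustrative assumptions, not any catalog’s actual model:

```python
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    domain: str
    classification: str = "internal"  # e.g. public / internal / sensitive

@dataclass
class Role:
    name: str
    domains: set[str]  # data domains this role is allowed to see
    can_view_sensitive: bool = False

def can_view(role: Role, asset: Asset) -> bool:
    """Domain-scoped view permission with a sensitivity gate."""
    if asset.domain not in role.domains:
        return False
    if asset.classification == "sensitive" and not role.can_view_sensitive:
        return False
    return True

steward = Role("finance_steward", domains={"finance"}, can_view_sensitive=True)
analyst = Role("analyst", domains={"finance", "sales"})
payroll = Asset("payroll", domain="finance", classification="sensitive")
```

Walking through a few role/asset pairs like this with your stewards is a cheap way to surface the security requirements that should drive tool selection, especially for the mixed approach.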
