What is a data catalog?

A data catalog is a relatively new concept in the Big Data space.* Data catalog users fall into three groups: data consumers (such as data and business analysts), data creators (such as data architects and database engineers), and data curators (such as data stewards and data governors).

A data catalog serves several purposes. At its core, it centralizes metadata. To be considered effective, a data catalog must:

  1. Centralize all information about the data in one location: the structure, quality, definitions, and usage of the data should be easily accessible from a single place.
  2. Allow end users to self-serve: context for the data is already provided through conversations and articles. If a user still does not understand a data set, the expert behind the data should be visible (and reachable via an integrated messaging tool).
  3. Auto-populate itself to ensure consistency and accuracy (in Alation’s case we use machine learning to achieve this).
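
To make the first two requirements concrete, here is a minimal sketch of what a single centralized catalog record might look like. Every name in it (CatalogEntry, Catalog, register, lookup, and all field names) is hypothetical and chosen for illustration only; it does not describe how Alation or any other product stores metadata, and the machine-learning-based auto-population from the third requirement is not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """Hypothetical record: everything known about one data set, in one place."""
    name: str
    columns: Dict[str, str]      # structure: column name -> type
    definition: str              # business definition of the data set
    quality_notes: str           # known quality caveats
    query_count: int = 0         # usage: how often the data set is queried
    expert: str = ""             # who to contact when the context is not enough
    articles: List[str] = field(default_factory=list)  # supporting write-ups


class Catalog:
    """Hypothetical in-memory catalog keyed by data set name."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def lookup(self, name: str) -> CatalogEntry:
        return self._entries[name]


catalog = Catalog()
catalog.register(CatalogEntry(
    name="orders",
    columns={"order_id": "int", "placed_at": "timestamp"},
    definition="One row per customer order.",
    quality_notes="placed_at is missing for some legacy rows.",
    query_count=42,
    expert="jane.doe",  # made-up contact
    articles=["How we define an order"],
))

entry = catalog.lookup("orders")
print(entry.definition, "| ask:", entry.expert)
```

A single lookup returns structure, definition, quality, usage, and the responsible expert together, which is the self-service property described above.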

Data Catalog vs. Data Inventory

A data inventory lists the data assets that exist. A data catalog goes further: it curates the metadata based on how the data is actually used.
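
As a rough illustration of usage-based curation, the sketch below contrasts a flat listing with a listing ordered by how often each table appears in a query log. The function names (inventory, catalog_view), the table names, and the query log are all invented for this example, and simple counting stands in for whatever curation logic a real catalog applies.

```python
from collections import Counter


def inventory(tables):
    """Inventory view: a plain alphabetical listing of what exists."""
    return sorted(tables)


def catalog_view(tables, query_log):
    """Catalog view: the same tables, most-queried first (ties broken by name)."""
    usage = Counter(t for t in query_log if t in tables)
    return sorted(tables, key=lambda t: (-usage[t], t))


tables = ["orders", "customers", "archive_2009"]
query_log = ["orders", "orders", "customers"]  # made-up usage data

print(inventory(tables))
print(catalog_view(tables, query_log))
```

Both views contain the same tables. Only the second one reflects that "orders" is queried most often and that "archive_2009" is not queried at all.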

Purpose

The intent of a data catalog is to minimize the number of data silos in the data environment, reduce time-to-insight, and serve as a single source of truth for more accurate analytics.

*pioneered by Alation, Inc. in 2012