Essential Features of Data Catalogs

Martin Zahumensky
Published in Ataccama
7 min read · Jul 29, 2020

Modern enterprises are data-driven, making effective data management one of the top priorities for companies. A data catalog is an essential part of a data management strategy, and enables users to easily find, understand, and trust their organization’s data.

Data catalog essentials

Below I summarize the top 6 essential features you should look for in a data catalog solution. In the second half of the article, I’ll shed some light on “advanced” functionalities, which are a must if you want to ensure the solution is used (and loved) by all company users long-term.

Data ingestion & data discovery

To implement an effective data catalog solution, you need to be able to connect it to all (or at least the majority) of company systems — applications, databases, files, and even external APIs. Good data catalogs contain a number of pre-built adapters to allow for easy connectivity. They automatically discover all metadata from systems, such as table names, names of attributes, constraints, etc.

It’s essential that data discovery is not a one-off activity; instead, the data catalog should scan sources continuously to discover new data sets and keep a history of data as well.
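To make continuous discovery a bit more concrete, here is a minimal sketch of a metadata scan that is run twice and compared. It uses an in-memory SQLite database as a stand-in for a real source system; the function names `scan_metadata` and `new_tables` and the sample tables are my own illustrative assumptions, not part of any particular catalog product.

```python
import sqlite3

def scan_metadata(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite database."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    catalog = {}
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [(col[1], col[2]) for col in columns]
    return catalog

def new_tables(previous, current):
    """Tables present in the current scan but missing from the previous one."""
    return sorted(set(current) - set(previous))

# Hypothetical source system: one table at first, a second one added later.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
first_scan = scan_metadata(conn)

conn.execute(
    "CREATE TABLE loans (id INTEGER PRIMARY KEY, customer_id INTEGER, days_past_due INTEGER)"
)
second_scan = scan_metadata(conn)
added = new_tables(first_scan, second_scan)
```

Keeping each scan result (rather than overwriting it) is what gives you the history mentioned above; a real catalog would schedule the scan and store the snapshots.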

Search—to let people find the data

One of the most important features of a data catalog is its search-and-find functionality. A data catalog should be the “Google” for all of your company data and metadata. It should be smart, and quickly find relevant data for users, even if they don’t know exactly what they are searching for. It should help users discover both new data sets and the most trusted ones with a single click.
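As a rough illustration of search that tolerates users not knowing exactly what they are looking for, the sketch below ranks assets by how many query words fuzzily match their name or description. It relies on `difflib` from the Python standard library; the asset names, descriptions, and the `search` function are hypothetical, and a production catalog would use a proper search engine instead.

```python
import difflib

# Hypothetical catalog entries: asset name -> short description.
ASSETS = {
    "marketing_consents": "Customer opt-in flags for marketing campaigns",
    "customer_master": "Golden record of every customer",
    "loan_balances": "Outstanding principal per loan",
}

def search(query, assets, cutoff=0.6):
    """Rank assets by the number of query words that approximately match their text."""
    query_tokens = query.lower().split()
    scored = []
    for name, description in assets.items():
        words = (name.replace("_", " ") + " " + description).lower().split()
        hits = sum(
            1
            for token in query_tokens
            if difflib.get_close_matches(token, words, n=1, cutoff=cutoff)
        )
        if hits:
            scored.append((hits, name))
    # Most matching words first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]

# A misspelled query still finds the relevant data sets.
results = search("custmer consent", ASSETS)
```

Ranking could also take trust signals into account (ratings, usage, data quality), which is one way to surface the most trusted data sets first.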

Business glossary

Knowing what tables or fields are in which systems is not enough—you have to be able to link them to business terms in order to explain to end users what specific data means. This is why business glossary functionality, even if basic, is essential.

A business glossary is the “FAQ” of your company, and explains the meaning of the data, e.g. what “Days Past Due” means and how it is calculated. Even seemingly simple terms like “active customer” can be defined inconsistently: is it a customer who took a loan 5 years ago and already repaid it, or is it a customer who actively deposits money each month? Can an employee be an active customer?

A business glossary should be used across the whole data catalog but should also be integrated with external applications such as business intelligence (BI) tools to enhance reports. This is an essential feature, as it will help you decrease the number of questions and amount of back-and-forth in your organization, whether about definitions of business terms used on a regular basis in different departments, the meaning of data in unknown attributes, or how a particular report was filtered.
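A minimal sketch of the underlying idea, linking business terms to the physical columns they describe, might look like this. The class names, column paths, and definitions are illustrative assumptions only; your own glossary definitions will differ.

```python
from dataclasses import dataclass, field

@dataclass
class GlossaryTerm:
    name: str
    definition: str
    linked_columns: set = field(default_factory=set)

class BusinessGlossary:
    def __init__(self):
        self._terms = {}

    def define(self, name, definition):
        self._terms[name] = GlossaryTerm(name, definition)

    def link(self, name, column):
        """Attach a physical column (system.table.column) to a business term."""
        self._terms[name].linked_columns.add(column)

    def terms_for(self, column):
        """Business terms that explain what a given column means."""
        return sorted(
            term.name for term in self._terms.values() if column in term.linked_columns
        )

glossary = BusinessGlossary()
# Illustrative definitions, not authoritative ones.
glossary.define("Days Past Due", "Days elapsed since the oldest unpaid installment was due.")
glossary.define("Active Customer", "Customer with at least one open product.")
glossary.link("Days Past Due", "core_banking.loans.days_past_due")
glossary.link("Active Customer", "crm.customers.status")

explained = glossary.terms_for("core_banking.loans.days_past_due")
```

The same lookup is what an integration with a BI tool would call to show a definition next to a report field.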

Metadata management & templates

Good data catalogs allow you to freely add additional metadata: you can tag your terms with a data category (e.g. sensitive, GDPR, or PII-related), track business owners, and record any other important information. They also allow you to manage any kind of metadata, not only about data but also about things like reports, APIs, servers, or anything else in your landscape.

Data lineage

Data lineage helps users understand the origin and destination of any data asset in a data catalog, how the data was transformed or enriched on the way to obtaining the final result, how different pieces of data are related to one another, and so on. Data lineage is essential for meeting regulatory requirements for the traceability of calculations and data preparation. As such, it should be considered an essential part of any data catalog solution.
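Conceptually, lineage is a directed graph of "derived from" relationships. The sketch below walks such a graph to find everything a report depends on and everything affected by a source; the asset names and the `DERIVED_FROM` structure are hypothetical and only meant to show the traversal.

```python
from collections import deque

# Hypothetical lineage edges: each key is derived from the listed sources.
DERIVED_FROM = {
    "risk_report": ["loan_features"],
    "loan_features": ["loans_clean", "customers_clean"],
    "loans_clean": ["core_banking.loans"],
    "customers_clean": ["crm.customers"],
}

def upstream(asset, derived_from):
    """All assets that the given asset depends on, directly or indirectly."""
    seen = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for source in derived_from.get(current, []):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return seen

def downstream(asset, derived_from):
    """All assets that are built, directly or indirectly, from the given asset."""
    # Invert the edges, then reuse the same traversal.
    feeds = {}
    for target, sources in derived_from.items():
        for source in sources:
            feeds.setdefault(source, []).append(target)
    return upstream(asset, feeds)

report_inputs = upstream("risk_report", DERIVED_FROM)
# Origins are upstream assets that are not themselves derived from anything.
origins = {a for a in report_inputs if a not in DERIVED_FROM}
impacted = downstream("crm.customers", DERIVED_FROM)
```

The upstream view supports traceability questions ("where did this number come from?"), while the downstream view supports impact analysis before a source changes.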

Data marketplace

This is a more recent trend in metadata management solutions. Because a data catalog is the central place for users to find data, it’s only logical that users would also want to access and use the data from there. Essentially, if the data catalog tool allows users to download a data set or connect it to their preferred BI tool or other applications, while ensuring that access policies and restrictions are applied according to the data domain and the person’s role in the organization, it becomes a kind of marketplace where employees can “buy” or go shopping for company data.
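The access-control part of that idea can be sketched as a simple check of role against data domain. The roles, domains, and the `POLICY` mapping below are assumptions for illustration; real policies are usually richer (row-level rules, masking, approvals).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataSet:
    name: str
    domain: str

# Hypothetical policy: which roles may access which data domains.
POLICY = {
    "marketing": {"marketing_analyst", "data_steward"},
    "finance": {"finance_analyst", "data_steward"},
}

def can_access(role, dataset, policy):
    """True if the role is allowed to use data sets from the data set's domain."""
    return role in policy.get(dataset.domain, set())

def shop(role, datasets, policy):
    """Names of the data sets a user with this role can 'buy' in the marketplace."""
    return [d.name for d in datasets if can_access(role, d, policy)]

DATASETS = [
    DataSet("marketing_consents", "marketing"),
    DataSet("loan_balances", "finance"),
]

analyst_view = shop("marketing_analyst", DATASETS, POLICY)
steward_view = shop("data_steward", DATASETS, POLICY)
```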

Less obvious features ensuring the long-term use and life of a data catalog

It’s one thing to have a data catalog in a company. Whether or not users successfully adopt it and start to use it is another. I’ll now share some of my takeaways from 15 years of experience with data governance projects, and what I see as a “must” when it comes to ensuring a modern data catalog is adopted and sustained.

Always up-to-date: AI does the manual work for you

A lot of the things mentioned above are done manually by users of the data catalog solution. This is usually a time-consuming process, requiring a great deal of effort by company employees, especially when the solution is rolled out. Over time, however, the data tends to become obsolete. Users then stop using the solution because the catalog is incomplete: data is missing or outdated. Imagine going to your catalog to look for the term “marketing consents” and finding out that the listed owner, your colleague Jane, no longer works at your company. Or you might find a data set that’s a few years old. You’re unlikely to ever go back to the catalog and you might even start to discourage your coworkers from using it.

This is precisely why you need automation. AI and machine learning can be applied in many areas to help users:

  • Scanning source systems for new data; detecting and documenting new data items
  • Automatically profiling data to give users info about what’s inside the data
  • Automatic domain detection (finding out what’s inside the data) so that things like GDPR attributes are discovered, kept up-to-date, and assigned a business owner according to the domain or system where the data comes from
  • Detecting similarities in data, and trying to guess the relationship between data points in different data sources. This also includes detecting duplicate data, and allowing users to join or merge data from different source systems.
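To show what profiling and domain detection produce, here is a deliberately simple sketch. It uses hand-written regular expressions as a stand-in for the AI and machine learning mentioned above, so it is a rule-based approximation, not how any specific product works; the patterns, threshold, and function names are assumptions.

```python
import re

# Illustrative patterns only; real detectors are far more robust.
DOMAIN_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s-]{6,14}\d$"),
}

def profile(values):
    """Basic profile of a column: row count, share of nulls, distinct non-null values."""
    non_null = [v for v in values if v not in (None, "")]
    null_ratio = (len(values) - len(non_null)) / len(values) if values else 0.0
    return {"count": len(values), "null_ratio": null_ratio, "distinct": len(set(non_null))}

def detect_domain(values, threshold=0.8):
    """Return the first domain whose pattern matches at least `threshold` of non-null values."""
    non_null = [str(v) for v in values if v not in (None, "")]
    if not non_null:
        return None
    for domain, pattern in DOMAIN_PATTERNS.items():
        matches = sum(1 for v in non_null if pattern.match(v))
        if matches / len(non_null) >= threshold:
            return domain
    return None

emails = ["a@example.com", "b@example.org", None, "c@example.net", "d@example.com"]
email_profile = profile(emails)
email_domain = detect_domain(emails)
phone_domain = detect_domain(["+420 777 123 456", "777-123-456"])
name_domain = detect_domain(["Jane", "John"])
```

A detected domain such as "email" is what would then drive tagging the attribute as PII or GDPR-relevant and assigning it an owner.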

Data quality monitoring & anomaly detection

Users may be wary of using data, especially where they’re unsure if they have the right source or if the quality is dubious. The ability to monitor data quality and how it changes over time can be embedded directly in the data catalog, helping users understand if and how they can trust or use a particular data set. Detecting anomalies or sudden changes in data and notifying users about such events is important, allowing errors to be corrected continuously.
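One simple way to flag a sudden change is to compare the latest value of a quality metric with its recent history. The sketch below uses a z-score on daily row counts; the threshold of 3 standard deviations, the sample numbers, and the function name are my assumptions, and real monitoring often uses more sophisticated, seasonality-aware methods.

```python
from statistics import mean, stdev

def is_anomaly(history, latest, z_threshold=3.0):
    """Flag `latest` if it lies more than `z_threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # Not enough history to judge.
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Hypothetical daily row counts for one data set.
daily_row_counts = [1000, 1010, 990, 1005, 995]

normal_day = is_anomaly(daily_row_counts, 1003)
broken_load = is_anomaly(daily_row_counts, 400)
```

The same check works for other metrics shown in the catalog, such as the share of null values or the number of failed validation rules.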

A catalog is for every user, and the user experience must be part of the product strategy

It’s possible to use Excel as a data catalog. However, the key to ensuring long-term use by users is usability. The tool you choose has to have this as part of its DNA.

A catalog is a tool for both business and technical users. The catalog has to be accessible to everyone. Advanced features should be reserved for data stewards and more advanced users.

Wrapping up with “social features”

The user experience is created through subtle and simple things like the ability to rate a data set, comment on it, share it with coworkers, etc. While simple, these features are key to data catalog adoption.

It’s critical to understand that while just 1% of your company will create and update the content of your catalog, 99% of users will consume it.

The more “likes” a content producer sees, the more they will see value in keeping that content alive. The more likes a user sees, the more they will understand that they’re looking at something useful.

Don’t just rely on crowdsourcing. Automation keeps a catalog up-to-date, and is a must for the long-term survival of your data governance initiatives. Any tool you select for the job should help you achieve this.

My name is Martin Zahumensky and I’m the VP of Product Strategy at Ataccama, a leader in the data management and governance spaces. Check out our website to learn more about what we call “self-driving” data management and governance.

Or stay in touch on social:

LinkedIn

Twitter

Martin Zahumensky
Ataccama

Data Management & Visualization enthusiast, working as Head of Product & Engineering @ Ataccama, co-founder & ex-CEO of Instarea, and endurance triathlete.