Speed up Data Catalog Implementation with Automation and AI

Crowdsourcing is undoubtedly a valuable data catalog capability. After all, it enables teams and departments to make their tribal knowledge of particular data available to everyone. However, crowdsourcing is more efficient as a second step, after a catalog has been populated and enriched with as much business metadata as possible.

With that in mind, what businesses really need is to automate the generation of business metadata. Why? Because the traditional process of populating a data catalog, which relies on crowdsourcing business metadata, is long and tedious. It’s worth a closer look.

The Challenging Task of Crowdsourcing Metadata

Organizations that want to introduce a data catalog generally assign the task of populating it to a small team of data stewards, who are then responsible for launching a ready-to-use solution for the rest of the organization or department. As anyone who has tried it will know, this is far easier said than done.

Most data catalog solutions connect to data sources and import technical metadata. In practice, that means thousands of tables and columns with names like XDS_E2121_A_32. Such naming conventions make the data stewards’ task anything but easy. And while end users and data owners sometimes know what is really inside those data sets, they are often reluctant to create and maintain metadata for them.
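To make this concrete, here is a minimal sketch of the kind of technical-metadata harvest described above. It uses an in-memory SQLite database as a stand-in for a real warehouse connection, and the table and column names are invented to mimic the cryptic conventions a catalog import typically surfaces:

```python
import sqlite3

# Stand-in for a production data source: a table with the kind of
# opaque names a catalog import typically encounters.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE XDS_E2121_A_32 (C_01 TEXT, C_02 REAL, DT_LOAD TEXT)")

# Harvest technical metadata: every table name and its column names.
technical_metadata = {}
for (table,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
):
    # PRAGMA table_info returns one row per column; index 1 is the name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    technical_metadata[table] = columns

print(technical_metadata)
```

The output is exactly the problem the article describes: a structurally complete inventory that tells a steward nothing about what the data means.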

At the same time, existing metadata documentation in the form of, say, Excel spreadsheets is often not easily available. In other cases, it is outdated, incomplete, or inconsistent with other departments’ documentation for the same data source. Database information schemas are no better: they are unlikely to have been updated since the day they were created.

Therefore, it is up to the data stewards to make inquiries and populate the catalog with relevant, up-to-date business metadata, such as business domains, classifications, and data quality indicators. A task such as this takes anywhere from months to years to complete. Effectively, data stewards face what writers call “the fear of the blank page.” Where to start, and how to get from 0 to 1?
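As a taste of the automation the article argues for, here is a toy sketch of how naming-pattern rules could pre-draft business domains for stewards to review rather than create from scratch. The patterns and domain labels are invented for illustration; real tools derive such suggestions from data profiling, not hand-written rules:

```python
import re

# Hypothetical naming-pattern rules mapping cryptic column names to
# draft business domains (illustrative only).
DOMAIN_RULES = [
    (re.compile(r"^DT_|_DATE$"), "Date/Time"),
    (re.compile(r"^AMT_|_AMOUNT$"), "Monetary Amount"),
    (re.compile(r"^CUST_|^C_"), "Customer Attribute"),
]

def draft_domain(column_name: str) -> str:
    """Return a draft business domain for a column, or flag it for review."""
    for pattern, domain in DOMAIN_RULES:
        if pattern.search(column_name):
            return domain
    return "Needs steward review"

print(draft_domain("DT_LOAD"))    # Date/Time
print(draft_domain("C_01"))       # Customer Attribute
print(draft_domain("XDS_E2121"))  # Needs steward review
```

Even a rough first draft like this changes the stewards’ job from facing the blank page to correcting suggestions, which is a far faster path from 0 to 1.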

Even if we imagine that over the course of a year of painstaking manual population, the data catalog is finally “complete,” let’s not forget about the crucial element of keeping metadata up to date. After all, data is dynamic: schemas change, data comes and goes, human error is, to some extent, unavoidable. How, then, to keep up with the pace of change in data and metadata?
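Keeping up with schema change is, at its core, a comparison of snapshots. A minimal sketch, with invented column names, of how an automated process could flag drift between two harvests of the same table:

```python
# Two snapshots of a table's columns, taken on consecutive harvests
# (names are illustrative).
yesterday = {"C_01", "C_02", "DT_LOAD"}
today = {"C_01", "DT_LOAD", "C_03_NEW"}

# Set differences surface exactly what changed.
added = today - yesterday
removed = yesterday - today

print("added:", sorted(added))      # added: ['C_03_NEW']
print("removed:", sorted(removed))  # removed: ['C_02']
```

Run on a schedule, a check like this turns metadata maintenance from a manual audit into a stream of targeted review tasks.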

If crowdsourcing alone were the answer, it would require an unprecedented data culture, processes, and accountability. It’s clear that this is not the case in the world today — and perhaps not ever.

So, is there a better, simpler way to introduce a data catalog solution into an organization and keep it useful afterward? Fortunately, yes. The answer lies in a new realm that will soon revolutionize data cataloging: automation.

Learn which nine data governance, data quality, and data catalog population processes can be automated by reading the full article at





Nazar Labunets

Effective communication: images and words at Ataccama.
