Speed up Data Catalog Implementation with Automation and AI

Nazar Labunets
Jun 29 · 3 min read

Crowdsourcing is undoubtedly a great data catalog capability. After all, it lets teams and departments make their tribal knowledge of particular data available to everyone. However, crowdsourcing is more efficient as a second step, after a catalog has been populated and enriched with as much business metadata as possible.

With that in mind, what businesses really need is to automate the generation of business metadata. Why? Because the traditional process of populating a data catalog, which relies on crowdsourcing business metadata, is long and tedious. It’s worth a closer look.

The Challenging Task of Crowdsourcing Metadata

Organizations that want to introduce a data catalog generally assign the task of populating it to a small team of data stewards, who are then responsible for launching a ready-to-use solution for the rest of the organization or department. As anyone who has tried it will know, this is far easier said than done.

Most data catalog solutions connect to data sources and import technical metadata. What that usually means is thousands of tables and columns named something like XDS_E2121_A_32. These naming conventions don’t make the task easy for the data stewards. End users and data owners sometimes know what is really inside those data sets, but they are often reluctant to create and maintain metadata for them.
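To make "technical metadata" concrete, here is a minimal sketch of what such an import looks like against a single source. It assumes a SQLite database purely for illustration; the function name import_technical_metadata and the column names C_01 and C_02 are hypothetical, and a real catalog connector would do considerably more.

```python
import sqlite3


def import_technical_metadata(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    metadata = {}
    for table in tables:
        # Escape double quotes so the table name is safe as a quoted identifier.
        quoted = table.replace('"', '""')
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
        metadata[table] = [(col[1], col[2]) for col in columns]
    return metadata


# Illustrative source: one cryptically named table with hypothetical columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE XDS_E2121_A_32 (C_01 TEXT, C_02 INTEGER)")
catalog = import_technical_metadata(conn)
print(catalog)
```

The result contains names and types only. Nothing in it says what the table is for, which is exactly the gap the data stewards are left to fill.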

At the same time, existing metadata documentation in the form of, say, Excel spreadsheets is often not easily available. In other cases, it is outdated, incomplete, or inconsistent with other departments’ documentation for the same data source. Database information schemas are no better: they are unlikely to have been updated since the day they were created.

Therefore, it is up to the data stewards to make inquiries and populate the catalog with relevant, up-to-date business metadata, such as business domains, classifications, and data quality indicators. A task such as this takes anywhere from months to years to complete. Effectively, data stewards face what writers call “the fear of the blank page.” Where to start, and how to get from 0 to 1?

Even if we imagine that over the course of a year of painstaking manual population, the data catalog is finally “complete,” let’s not forget about the crucial element of keeping metadata up to date. After all, data is dynamic: schemas change, data comes and goes, and human error is, to some extent, unavoidable. How, then, to keep up with the pace of change in data and metadata?
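One small piece of keeping up with change can be automated quite simply: comparing two snapshots of a source's schema and flagging what moved. The sketch below assumes snapshots are plain dictionaries of the form {table: {column: type}}; the function name diff_schemas, the snapshot contents, and the change labels are all hypothetical and not taken from any particular product.

```python
def diff_schemas(old, new):
    """Compare two {table: {column: type}} snapshots and list the changes."""
    changes = []
    for table in sorted(set(old) | set(new)):
        if table not in new:
            changes.append(("table_removed", table, None))
            continue
        if table not in old:
            changes.append(("table_added", table, None))
            continue
        old_cols, new_cols = old[table], new[table]
        for col in sorted(set(old_cols) | set(new_cols)):
            if col not in new_cols:
                changes.append(("column_removed", table, col))
            elif col not in old_cols:
                changes.append(("column_added", table, col))
            elif old_cols[col] != new_cols[col]:
                changes.append(("type_changed", table, col))
    return changes


# Hypothetical snapshots taken on two consecutive days.
yesterday = {"XDS_E2121_A_32": {"C_01": "TEXT", "C_02": "INTEGER"}}
today = {"XDS_E2121_A_32": {"C_01": "TEXT", "C_02": "REAL", "C_03": "TEXT"}}

drift = diff_schemas(yesterday, today)
for change in drift:
    print(change)
```

A check like this only says that something changed. Deciding whether the business metadata attached to a changed column is still valid remains a separate question.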

If crowdsourcing alone were the answer, it would require an unprecedented data culture, processes, and accountability. It’s clear that this is not the case in the world today, and perhaps it never will be.

So, is there a better, simpler way to introduce a data catalog solution into an organization and keep it useful afterward? Fortunately, yes. The answer lies in a new realm that will soon revolutionize data cataloging: automation.


Learn which nine data governance, data quality, and data catalog population processes can be automated by reading the full article at ataccama.com.

Ataccama

Self-Driving Data Management
