Why are data catalogs data governance rock stars?

Vincent Rejany
Jun 8, 2020

Data catalogs smell like teen spirit!

Data catalogs, data catalogs, data catalogs! A new buzzword? What happened over the last few years to put metadata management back in the headlines, and to send it up the famous Gartner Hype Cycle for Data Management to become “the new black,” according to the advisory firm? 451 Research has even stated that “There is a case to be made that the data catalog is the most important data management breakthrough to have emerged in the last decade” (Aslett 2018). Why such enthusiasm? Why are data catalogs the new data management rock stars? Are they any different from the traditional metadata management and data dictionaries that millennial data scientists may never have known?

Let’s break the myth right away: technically speaking, they are not. The difference is mostly one of exposure to a new audience and, therefore, a less technical user experience. However, I assume that the major driver is the growing awareness that accumulating data without proper governance can only have one obvious consequence: a B I G mess. A mess that brings ever-increasing risks for security and decision making and, in the end, a lack of trust in data. Consider the early 2000s: as the internet was growing exponentially, there was an empty space that Google filled by becoming the reference tool for searching, finding, and ranking content by relevance and, in the end, for telling us what we should look at. Similarly, as organizations today struggle to maximize the value derived from their ever-growing volumes of data, the focus is no longer on “having” data but on “knowing” your data, in order to break the 80–20 ratio between the (lost) time spent searching for and preparing data and the time spent on real analytics and decision making.

Let’s explore data catalogs further, so we can find out why they are such data governance rock stars!

Don’t look back in anger, chief data officer!

Yes, Chief, don’t look back in anger, and don’t look ahead in anger either. The difficulties of data management have intensified at a steady pace over the past several years. Your organization is struggling to extract and maximize the value of its data, and three main reasons explain both this struggle and the emergence of data catalogs:

Data Catalog Adoption Drivers
  • Data proliferation: Your organization has never managed so much data, and that data is spread over more and more locations.
  • Regulatory pressure: Your organization is now heavily scrutinized under industry, state, and national regulations (GDPR, CCPA, PIPA, PIPEDA, KVKK, and so on) that demand transparency and accountability.
  • Data democratization: Your data consumers are requesting more and more data, but at the same time they want to know where it comes from, and how reliable it is. They ask for the end of tribal knowledge and the advent of data democracy.

These three drivers explain why data catalogs have become so popular compared with the former metadata management approach. End users can no longer afford to spend more time looking for relevant, adequate, up-to-date, high-quality, and reliable data than they spend analyzing it. Data catalogs are key to self-service data strategies because they are the entry point for the next valuable actions with data. In addition, by identifying sensitive data before it is used in business analytics, data catalogs reduce the impact of potential breaches and help organizations comply with industry and government regulations.

Do I wanna know Data Catalogs?

In its report, “Data Catalogs are the New Black in Data Management and Analytics,” Gartner gives the following definition: “A data catalog maintains an inventory of data assets through the discovery, description, and organization of datasets. The catalog provides context to enable data analysts, data scientists, data stewards, and other data consumers to find and understand a relevant dataset for the purpose of extracting business value” (Gartner, Inc. 2017).

Gartner’s definition does not differ from historical metadata management, as it does not focus on what makes data catalogs so trendy today: automation and collaboration. The days of Excel-based or IT-driven data dictionaries are over: the amount of data is now too large to manage without automation. Data consumers want to access data and also to enrich, comment on, and challenge its use and quality.

Let’s dare to give a definition: “A data catalog is an automated collection of metadata, combined with data management and search tools, that helps analysts and other data users to find the data that they need. It also serves as an inventory of available data and provides information to evaluate the fitness of data for intended uses.” In a few words, a data catalog is your organization’s metadata social network!
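To make this definition a little more concrete, here is a minimal Python sketch of the idea: an inventory of dataset metadata that users can search and comment on. It is an illustration only, not how any particular catalog product works. The class names, field names, and sample datasets are all hypothetical, and the metadata is registered by hand here, whereas a real catalog would harvest it automatically.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one dataset (all field names are illustrative)."""
    name: str
    description: str
    owner: str
    tags: set[str] = field(default_factory=set)
    comments: list[str] = field(default_factory=list)  # the "social network" part


class DataCatalog:
    """A toy in-memory inventory of dataset metadata with keyword search."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Assumption: entries are added manually; real catalogs automate this step.
        self._entries[entry.name] = entry

    def comment(self, name: str, text: str) -> None:
        # Let data consumers enrich or challenge a dataset's metadata.
        self._entries[name].comments.append(text)

    def search(self, keyword: str) -> list[str]:
        """Return sorted names of datasets whose name, description, or tags contain keyword."""
        needle = keyword.lower()
        hits = []
        for entry in self._entries.values():
            haystack = " ".join([entry.name, entry.description, *entry.tags]).lower()
            if needle in haystack:
                hits.append(entry.name)
        return sorted(hits)


# Hypothetical sample data, for illustration only.
catalog = DataCatalog()
catalog.register(CatalogEntry(
    "crm_customers", "Customer master data from the CRM", "sales-ops", {"customer", "pii"}))
catalog.register(CatalogEntry(
    "web_orders", "Online orders, one row per order line", "e-commerce", {"orders"}))
catalog.comment("crm_customers", "Email column is often empty before 2018.")

print(catalog.search("customer"))
```

Searching on a tag such as "pii" works the same way, which hints at how a catalog can help locate sensitive data as well as useful data.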

Data catalogs are one of the main pillars of agile data governance, as they allow organizations to create a snapshot of their entire information ecosystem and make it available to a non-technical audience. Data cataloging accelerates analysis by minimizing the time and effort that analysts spend finding and preparing data. Anecdotally, it is said that without a data catalog, 80% of the time spent on self-service analysis goes into getting the data ready. Using a data catalog can cut that share from 80% to 20%. By providing a good understanding of the information present in an organization, data catalogs support digital transformation strategies.
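To get a feel for what moving from 80% to 20% could mean in practice, here is a small back-of-the-envelope calculation in Python. It rests on an assumption that is mine, not part of the anecdote: the hours of real analysis stay the same and only the preparation share changes. The 8-hour figure and the function name are purely illustrative.

```python
def total_hours(analysis_hours: float, prep_share: float) -> float:
    """Total project time when preparation takes `prep_share` of that total."""
    if not 0 <= prep_share < 1:
        raise ValueError("prep_share must be in the range [0, 1)")
    # If preparation is prep_share of the total, analysis is the remaining share.
    return analysis_hours / (1 - prep_share)


analysis = 8.0  # hypothetical hours of real analysis, assumed unchanged
before = total_hours(analysis, 0.80)  # preparation takes 80% of the total
after = total_hours(analysis, 0.20)   # preparation takes 20% of the total
saving = 1 - after / before           # fraction of total time saved

print(f"before: {before:.0f} h, after: {after:.0f} h, saving: {saving:.0%}")
```

Under that assumption, the same 8 hours of analysis would sit inside a project of roughly 40 hours before and roughly 10 hours after, so the total effort would shrink by about three quarters.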

Data Catalogs: I Promise

Analytics can get you answers from data. However, only a data catalog will tell you where to find that data and everything you should know about it!

In businesses where information is still too siloed, users struggle to find and identify data they can trust, and they usually come with the following questions: Where can I find data? Who uses it? What are my goals? Is the data of good quality? By centralizing data knowledge behind a simple UI that lets you search for data sets for reporting, analysis, integration, and data migration projects, data catalogs aim to do the following:

  • allow data citizens to find the data they need in an efficient way
  • empower organizations to quickly inventory, discover, manage, and understand all their data
  • move from tribal knowledge to centralized, crowdsourced knowledge
  • speed up the ingestion of new data sets and the use of new data
  • become the foundational layer for driving data governance, quality, and information security policies
  • foster collaboration between business users and IT and contribute to a shared understanding of the information

In terms of benefits, data catalogs increase efficiency, as they allow analysts to shorten the time they need to identify the right data. They also support data governance and risk mitigation by identifying personal and sensitive data and by allowing you to establish and spread best practices in data management and data quality. Finally, data management is simplified: new data sources can be onboarded more quickly, key assets can be easily identified and monitored, and redundant or untapped data can be detected and remediated. In the end, the data ecosystem becomes rationalized and more agile.

In my next article, we will look at the key features of data catalogs that make them the new kids on the block of metadata management!
