Enterprise Data Catalog — A modern approach to find, understand and trust your data

Sachin Prasad
Dec 30, 2019 · 10 min read

Let’s face it: we are in an era where a phone alone can hold a whopping 100+ GB of data, not to mention the cloud storage subscriptions that extend our capacity to terabytes for a mere $10 a month. If a single phone can generate so much data, imagine how much data an enterprise needs to deal with on a daily basis. Businesses across the globe are increasingly leaning on their data to power their everyday operations, and with the Artificial Intelligence wave it is apparent that the more data you have, the more value you can extract from it, provided you have well-organized data management practices.

You have the best minds (read: data scientists) with best-of-breed ML models waiting to be trained, but alas, you can’t seem to find the dataset that would get you going on the next big idea to promote your products, analyze customer buying patterns, or even cut costs in your business operations.

Per Gartner, one of the biggest challenges in data management is simply being able to FIND THAT DATA.

Still not sure whether you are facing the same challenges as others? If so, can you relate to any of the comments below, which you may have heard in strategy and planning meetings?

“I spend more time looking for data, than I do actually analyzing it!”

“Our data is sitting in multiple sources, but I don’t know which data sits where!”

“We have many different data ecosystems across the enterprise, but we have no way to share data artifacts across them! Our users are busy reproducing data assets that they don’t know already exist.”

If you can relate to any of the above, chances are that you need to start looking into operationalizing an Enterprise Data Catalog, a modern way to find, catalog and trust data. Think of it as the index of a book with thousands of pages: getting to a chapter is as easy as looking up the page where that chapter starts.

Let’s look at the key pillars of a Data Catalog platform, which are becoming must-haves for reducing the overall time to value from your heaps and heaps of data:

  • Discover: how quickly data sources can be connected to and analyzed
  • Enrich: how well and how quickly the discovery outcomes are categorized in meaningful ways using tools like business glossaries
  • Contribute and Govern: whether the platform provides ways for users to collaborate on authoring and managing entities in the data catalog
  • Consume: once data is cataloged, how easy it is to deliver it to the place where it’s needed while keeping it secure

IBM Cloud Pak™ for Data (CPD) is a fully integrated data and AI platform that modernizes how businesses collect, organize and analyze data and infuse AI throughout their organizations. Among other features for building AI solutions, the platform provides best-of-breed enterprise catalog management. In this article, we explore the four pillars of data catalogs mentioned above to see how they help accelerate that next innovative data science project you have been losing sleep over.

Discover

How Many

The first step in building a data catalog is to connect to the data where it resides and extract the valuable information for indexing. CPD provides 50+ out-of-the-box connectors to structured and unstructured data sources, on premises or in the cloud. If you have a standard data store, chances are that CPD provides a wizard-based way to connect to it. This is one of the cornerstones of any discovery project: being able to connect to a wide variety of data sources.

How Much

Scanning or discovery is not only about discovering metadata such as table names and column data types. Yes, they are foundational, but wouldn’t it be nice if the system could tell you more than just that? CPD discovery can detect data patterns during scans to provide more meaningful results. For example, it can detect data classes such as zip codes and credit card numbers, or distinguish between names and addresses. The platform also runs data quality rules to assess how good or bad the data quality is, giving insight into how fit the dataset is for an ML project. Combined with a proper sampling strategy, these options help build a very good picture of the dataset being discovered.
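To make the idea of data classes concrete, here is a minimal Python sketch of pattern-based detection on a column sample. It is not how CPD implements discovery; the class names, the simplified regular expressions, the Luhn checksum step and the 80% threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical data classes with deliberately simplified patterns.
DATA_CLASSES = {
    "us_zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "credit_card_number": re.compile(r"\d{4}([ -]?\d{4}){3}"),
}


def luhn_valid(value):
    """Return True if the digits in value pass the Luhn checksum."""
    digits = [int(ch) for ch in value if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) > 0 and total % 10 == 0


def matches_class(name, value):
    """Check one value against one data class."""
    if not DATA_CLASSES[name].fullmatch(value):
        return False
    if name == "credit_card_number":
        return luhn_valid(value)  # the pattern alone would accept any 16 digits
    return True


def classify_column(sample, threshold=0.8):
    """Return the class matching at least `threshold` of the non-empty values, else None."""
    values = [v.strip() for v in sample if v and v.strip()]
    if not values:
        return None
    best_name, best_ratio = None, 0.0
    for name in DATA_CLASSES:
        ratio = sum(1 for v in values if matches_class(name, v)) / len(values)
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name if best_ratio >= threshold else None


print(classify_column(["10001", "94105-1234", "60614"]))
print(classify_column(["4111 1111 1111 1111", "5500000000000004"]))
```

The threshold is what makes sampling matter: a column is only tagged when most of the sampled values agree, so a few dirty values do not hide the class and a few lucky matches do not create one.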

How Fast

Enterprises deal with thousands of data sources, each with thousands of schemas and, in turn, tens of thousands of tables. It’s critical to scan them well but, more importantly, to scan them fast. Some of these sources change rapidly, and you should be able to re-scan them within a certain time window to keep your metadata repository up to date. With this organizational need in mind, IBM leverages Spark clusters to run these discovery jobs at scale, aptly named Fast Scans. These scans are at least 10 times faster than a traditional scan and give enterprises an opportunity to run multiple scans very quickly for an accurate picture of what these sources hold.

Enrich

As discussed, discovery jobs work hard to detect various data classes by looking at sampled data. One nice feature we have not yet discussed is their ability to link or tag discovered assets to the business terms and lingo your organization uses, with the help of machine learning models and natural language processing.

The IBM Cloud Pak for Data business glossary provides a framework to capture and manage the enterprise’s common business vocabulary. In a nutshell, a business glossary not only defines the data vocabulary across an entire enterprise but also ensures consistency of business terms. It synthesizes the details about an enterprise’s data assets across a multitude of data dictionaries and organizes them into a simple, easy-to-understand format. Glossaries bridge the business and technical divide by providing transparency into definitions, synonyms and important business attributes while tying these attributes to the more technical definitions stored within the various critical systems, reports or processes. A glossary also identifies the owners of data and subject matter experts while enabling collaboration between different departments.

A good business glossary is also considered foundational for a meaningful discovery process, since it lets assets be categorized in business-friendly terms. The Cloud Pak for Data business glossary comes with an out-of-the-box load/import capability to quickly transform and load your existing business glossaries. With the built-in core data model and the numerous attributes provided out of the box, it’s quite possible to map your current business terms seamlessly into the Cloud Pak glossary. In case they don’t map, the glossary provides extension attributes that let you define new attributes with their own data types.

Imagine a term “Mobile Telephone Number” that describes the number a customer uses as the primary phone number for an account with your business. The term has a description for enterprise users, and it can be organized into one or more categories, which could be at a LOB level, process level or operations level depending on how you organize your categories. In this case, we categorize it under Marketing since the enterprise uses this number to send mobile promotions from time to time. It can also be assigned to additional categories such as Supply Chain and CRM, simply because these processes use the telephone number for their interactions with customers.

The glossary also lets you capture information such as type relations, synonyms, aggregations (part relations) and much more. In our example, a telephone number could be a type of “Personally Identifiable Information” and “Sensitive Information”, is a part of “Customer Information”, and could have a related business term such as “Address”.

Finally, a graphical representation of the term lets one quickly understand the overall relationships between all these entities.
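The “Mobile Telephone Number” example can be sketched as a small data structure. This is only an illustration of the relationships described above, not the Cloud Pak for Data glossary schema; the class `BusinessTerm`, its field names and the `edges` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BusinessTerm:
    """Hypothetical in-memory model of a glossary term (not the CPD schema)."""
    name: str
    description: str = ""
    categories: List[str] = field(default_factory=list)
    is_a_type_of: List[str] = field(default_factory=list)
    is_part_of: List[str] = field(default_factory=list)
    related_terms: List[str] = field(default_factory=list)

    def edges(self) -> List[Tuple[str, str, str]]:
        """Return (source, relation, target) triples that a graph view could draw."""
        triples = []
        for relation in ("categories", "is_a_type_of", "is_part_of", "related_terms"):
            for target in getattr(self, relation):
                triples.append((self.name, relation, target))
        return triples


mobile = BusinessTerm(
    name="Mobile Telephone Number",
    description="Primary phone number a customer uses for an account",
    categories=["Marketing", "Supply Chain", "CRM"],
    is_a_type_of=["Personally Identifiable Information", "Sensitive Information"],
    is_part_of=["Customer Information"],
    related_terms=["Address"],
)

for source, relation, target in mobile.edges():
    print(f"{source} --{relation}--> {target}")
```

Storing each relation as a list of term names keeps the sketch short; a real glossary would reference other term objects so that the graph can be traversed in both directions.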

Governance and Contributions

Cloud Pak for Data lets data stewards collaborate with each other to build artifacts such as policies, data classes and business glossaries. Out-of-the-box workflows provide a basic choreography of object edits and approvals, and additional workflows can be built to match the enterprise’s own processes. The workflows ensure that edits are made on a draft version, so the production version stays live for users until the draft is approved and published to replace it.
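The draft-and-publish idea can be sketched as a tiny state machine. This is an assumption-laden illustration, not the platform’s workflow engine: the class `GovernedArtifact`, its method names and the `DRAFT` / `IN_REVIEW` states are hypothetical.

```python
class GovernedArtifact:
    """Hypothetical artifact whose edits go through a draft before publication."""

    def __init__(self, name, published):
        self.name = name
        self.published = published  # the version users currently see
        self.draft = None           # pending edit, if any
        self.draft_state = None     # None, "DRAFT" or "IN_REVIEW"

    def edit(self, new_definition):
        """Start or update a draft; the published version is untouched."""
        self.draft = new_definition
        self.draft_state = "DRAFT"

    def submit(self):
        """Send the draft to an approver."""
        if self.draft_state != "DRAFT":
            raise ValueError("nothing to submit")
        self.draft_state = "IN_REVIEW"

    def approve(self):
        """Publish the reviewed draft, replacing the live version."""
        if self.draft_state != "IN_REVIEW":
            raise ValueError("draft is not in review")
        self.published = self.draft
        self.draft = None
        self.draft_state = None

    def reject(self):
        """Send the draft back to its author for rework."""
        if self.draft_state != "IN_REVIEW":
            raise ValueError("draft is not in review")
        self.draft_state = "DRAFT"


term = GovernedArtifact("Mobile Telephone Number", "Customer phone number")
term.edit("Primary mobile number a customer uses for an account")
print(term.published)  # still the old definition while the draft is pending
term.submit()
term.approve()
print(term.published)  # now the approved definition
```

Keeping the published value and the draft in separate fields is what guarantees that users never see a half-edited definition.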

With data exploding while a governance project is trying to contain it, it’s critical to have all hands on deck to make these projects a success, and more often than not it’s not easy to categorize, tag and enrich all the data an enterprise has. It’s not humanly possible for a small group of data stewards to know all the data lying around with various teams across lines of business.

The knowledge about this data is locked with the data owners themselves: the teams who work with this data day in, day out know each field, what it means and what its content should be. These knowledge workers are key to augmenting data catalogs in a much more meaningful way, and a data catalog must let such users provide feedback and input on assets as they use and browse the catalog.

Cloud Pak for Data lets end users provide ratings and comments, which become critical for cleaning and improving the data catalog as more and more users start using it. The platform lets data stewards monitor such comments, feedback and ratings to fix assets or augment them with additional information.

Deliver Data

Delivering data you can trust is a discipline that allows businesses to centrally collect data, maintain its accuracy, and publish it under specific rules and policies. The beauty of the approach is that it not only controls data but liberates it for consumption as well. It allows data professionals to find, understand, and share data ten times faster. Data engineers, scientists, analysts, or even developers can spend their time extracting value from those datasets rather than searching for them or recreating them, removing the risk of your data lake turning into a data swamp.

Cloud Pak for Data’s Data Catalog is not only the place for data owners to curate and govern data. It also makes data more meaningful for data consumers, thanks to its ability to profile, sample and categorize the data, document data relationships, and crowdsource comments, tags, likes and annotations. All this metadata is then made available through full-text or faceted search, where search results can be filtered based on data types.
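As a rough illustration of full-text plus faceted search over catalog metadata, consider the Python sketch below. The sample assets, the field names and the `search` and `facet_counts` helpers are made up for this example and do not reflect the Cloud Pak for Data search API.

```python
from collections import Counter

# Hypothetical catalog entries; field names are assumptions for illustration.
ASSETS = [
    {"name": "customer_contacts", "type": "table",
     "description": "Customer phone numbers and addresses"},
    {"name": "promo_results_2019", "type": "csv",
     "description": "Mobile promotion response rates"},
    {"name": "customer_orders", "type": "table",
     "description": "Orders placed by each customer"},
]


def search(assets, query="", **facets):
    """Substring match on name/description, then filter by exact facet values."""
    q = query.lower()
    hits = [a for a in assets
            if q in a["name"].lower() or q in a["description"].lower()]
    for key, wanted in facets.items():
        hits = [a for a in hits if a.get(key) == wanted]
    return hits


def facet_counts(hits, key):
    """Count how many hits fall under each value of a facet, e.g. per data type."""
    return Counter(a[key] for a in hits)


hits = search(ASSETS, "customer")
print([a["name"] for a in hits])
print(facet_counts(hits, "type"))
print([a["name"] for a in search(ASSETS, "promotion", type="csv")])
```

The facet counts are what a search page would show next to each filter, so a user can see how many tables or files match before narrowing the results.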

Once a knowledge worker has searched the repository for a certain dataset, reviewed its quality scores and ascertained its fit with the current project by looking at the metadata, columns and perhaps comments from peers, they can go ahead and request access. The data access request form provides an easy interface to supply as much detail as needed to let data engineers provision that dataset and deliver it either into a specific data science project or simply make it available globally, depending on the security constraints of the dataset being requested.

As briefly touched upon at the start of this article, Cloud Pak for Data is an AI platform that helps businesses streamline how they collect, organize and analyze data. We briefly looked at some aspects of collecting and organizing data, but the end goal is to integrate it easily into data science projects, where the value of data is unlocked. Cloud Pak for Data provides the concept of Projects, a workspace where data engineers and scientists can collaborate and author assets in Jupyter Notebooks, RStudio, Scala, etc., or bring in connected assets such as database tables and disconnected assets such as CSV, Excel and Parquet files needed for the project. In our case, the data that was requested and eventually delivered into a project can be easily imported into, for example, a Jupyter Notebook by clicking an import button and inserting the asset as a DataFrame.

Summary

Today, the sheer quantity of data can present a formidable challenge, and an Enterprise Data Catalog becomes the heart of any data management strategy. Cloud Pak for Data provides a robust, built-in data catalog to kick-start your journey into the world of AI, and strives to deliver value to your organization by equipping you with the tools and technology to become agile and disciplined about data management: discovery, cataloging, ongoing collaborative curation and, most importantly, delivering data quickly but securely at the point of impact.

Know more -

Cloud Pak for Data https://www.ibm.com/analytics/cloud-pak-for-data

Enterprise Data Catalog https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_current/wsj/catalog/get-started-op.html

Test drive Cloud Pak for Data today — https://www.ibm.com/products/cloud-pak-for-data

Sachin Prasad

Written by

Sachin’s day job includes helping customers build smart apps infused with AI to solve complex problems in a more sustainable way.

Cloud Pak for Data

IBM Cloud Pak for Data is an end-to-end Data & AI platform enabling organizations to collect, organize and analyze all of their enterprise data. It’s a well-integrated collection of microservices built on a cloud-native architecture that can be deployed on premises or on any public cloud.
