Unlock Your Data With IBM Watson Knowledge Catalog

Yannick Saillet
Oct 26

How to ingest data sources into IBM Watson Knowledge Catalog while complying with the governance rules

Picture by Bo Mei on https://pixabay.com/users/bomei615-2623913/

AI projects, and data analytics in general, require good data in order to be successful. On the other hand, the sheer amount of data that exists in any organization makes it difficult to find, and governance policies and regulations make it difficult to share.

In this article, I am going to show how you can use IBM Cloud Pak for Data to solve that problem by cataloging a large number of data sets in a short amount of time and making them available to users, while automatically ensuring that data protection policies are enforced.

To set the context, let's have a look at the AI Ladder as defined by IBM:

The IBM AI Ladder: Collect, Organize, Analyze, Infuse

If you are not yet familiar with IBM's AI Ladder, you can have a look at this article by Hemanth Manda, which explains it in detail. In short, what this picture says is that there are four major steps, Collect, Organize, Analyze, and Infuse, which you need to follow in that order if you want to be successful with AI.

This article is primarily about the Collect and Organize steps of the ladder. We are going to use Watson Knowledge Catalog and the rich connectivity provided by the platform.

Watson Knowledge Catalog is the component of Cloud Pak for Data where you can manage the metadata of the data ingested by the platform, as well as all the assets that play a role in organizing that data. Examples of such assets are data classes, business terms, rules, and policies.

Importing data sets into a catalog with the objective of making them available to business users is a process in itself within the main AI ladder. The main steps of that process are shown below:

The Watson Knowledge Catalog process: discovering and ingesting new data sets into the catalog

So far the process may still look abstract. Let's see what it looks like concretely in Watson Knowledge Catalog by implementing a simple example.

In this example, let's assume that we have identified a new database that needs to be added to a new catalog. Let's also assume that we have governance policies to implement to ensure that we don't violate any data protection rules. To make it easy to follow, we'll keep this example very simple.

Let's go through that process step by step and see what it looks like in Cloud Pak for Data:

1. Define the scope and governance assets

In this simple exercise, we are going to import all data sets from a relational database into a new catalog and ensure that simple data protection rules are properly applied to them.

Identifying the data source to import and getting its connection details is the simplest part of the problem. A far more complicated task is to understand which business policies need to be enforced when making the data sets from that source available to business users. In a real-life scenario, there may be many different policies to implement in order to comply with regulations such as the GDPR. In our example, we'll keep things simple and assume a single policy stating that sensitive data must be masked when accessed by a business user.

1.1. Define terms and policies

The policy we have chosen for this example sounds simple, but in order to implement it, we need to define what we mean by sensitive data and how to detect it. You capture these concepts in the catalog by creating business terms and business rules that provide a clear definition.

In a real-life scenario, you may start with an industry model that provides the terminology and definitions common to your industry and customize it with what is specific to your company. In this example, we will start from scratch and simply assume that sensitive data comprise two categories: personal data (phone numbers, email addresses, and social security numbers) and financial data (credit card numbers and routing transit numbers).

We'll create business terms to capture each of these definitions.

Creating new business terms to reflect the identified concepts

Next, let's create a new policy to capture the fact that this sensitive data should be masked when accessed by business users.

Creating a new policy

So far we have only created assets in the catalog that provide a clear vocabulary and definitions that users can understand; these definitions are plain English. If we want to automate the process of identifying sensitive data, we need to connect those terms to a more technical definition of the sensitive data.

1.2. Define data classes

The next step consists in identifying how to automatically detect sensitive data. This is something we do with data classes. A data class can be seen as the algorithm used by the system to determine that a particular column, based on the data it contains, represents a certain type of information that we may need to govern.

I won't go into the details of data classes in this article. For the moment, we only need to know that the logic of a data class can be specified as a regular expression, a list of values, or a more complex heuristic that tests whether an individual value, or a column as a whole, matches the data class.
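To make this idea concrete, here is a minimal Python sketch of how a regex-based data class could classify a column. The patterns and the matching threshold are simplified assumptions for illustration, not the expressions that Watson Knowledge Catalog actually ships with.

```python
import re

# Simplified regular expressions for a few data classes. These are
# illustrative assumptions, not the product's actual definitions.
DATA_CLASSES = {
    "Email Address": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "US Social Security Number": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "US Phone Number": re.compile(r"^\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}$"),
}

def classify_column(values, threshold=0.8):
    """Return the data classes whose pattern matches at least
    `threshold` of the non-empty values in the column."""
    values = [v for v in values if v]
    result = {}
    for name, pattern in DATA_CLASSES.items():
        hits = sum(1 for v in values if pattern.match(v))
        if values and hits / len(values) >= threshold:
            result[name] = hits / len(values)
    return result

# A column of email addresses is detected as "Email Address"
print(classify_column(["alice@example.com", "bob@example.org", ""]))
```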

While defining the notions of personal and financial data earlier, we already identified a list of data classes that should be mapped to each term. Watson Knowledge Catalog ships with a list of predefined data classes, so we should first check whether the type of data we need to detect is covered by them.
In a real-life scenario, you may have to create new data classes in this step or modify existing ones. In our simple example, all the data classes we need are already available in the platform: US Phone Number, Email Address, US Social Security Number, Credit Card Number, and Routing Transit Number.

1.3. Associate the data classes with the terms

In order to have the terms Personal Data and Financial Data automatically assigned to the columns containing this kind of information, we need to associate the data classes with their respective terms. When a data class is associated with a term, that term is automatically assigned to any column detected as containing data matching the data class.

In our example, we will add the data classes US Phone Number, Email Address, and US Social Security Number to the term Personal Data, and the data classes Credit Card Number and Routing Transit Number to the term Financial Data.
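Conceptually, the effect of these associations is a simple mapping: once the data classes of a column are detected, the mapped terms follow mechanically. The hypothetical snippet below mirrors the associations configured in the UI; it is an illustration, not product code.

```python
# Mapping from data class to business term, mirroring the
# associations configured in the catalog UI.
CLASS_TO_TERM = {
    "US Phone Number": "Personal Data",
    "Email Address": "Personal Data",
    "US Social Security Number": "Personal Data",
    "Credit Card Number": "Financial Data",
    "Routing Transit Number": "Financial Data",
}

def assign_terms(detected_classes):
    """Derive the business terms of a column from its detected data classes."""
    return sorted({CLASS_TO_TERM[c] for c in detected_classes if c in CLASS_TO_TERM})

print(assign_terms(["Email Address", "US Phone Number"]))  # ['Personal Data']
```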

Associating data classes with business terms

Note that there are other algorithms in Watson Knowledge Catalog that can suggest a term for an analyzed column. These algorithms may base their suggestions on the metadata of the columns and/or on previous assignments made by users. Using a data class to detect the term is therefore not the only way, but if you have a clear definition of how to detect the data that should be associated with a term, a data class will give you the most accurate term assignment, especially if the column has no meaningful name.

1.4. Implement the policy using data protection rules

Next, we'll create data protection rules to enforce our policy: two different rules defining how personal data and financial data should be masked. In this example, I will define the rules so that columns containing sensitive data are masked by replacing the data with Xs. Data protection rules also allow you to use different kinds of masking or to restrict access to the complete data set.
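As an illustration, replacing values with Xs amounts to a simple character substitution. The function below is a minimal sketch of that idea, not the actual masking implementation of Watson Knowledge Catalog:

```python
import re

def mask_value(value):
    """Replace every letter and digit with 'X', keeping separators so the
    shape of the value stays recognizable. A simplified illustration of
    substitution masking, not the product's actual algorithm."""
    return re.sub(r"[A-Za-z0-9]", "X", value)

print(mask_value("alice@example.com"))  # XXXXX@XXXXXXX.XXX
print(mask_value("555-123-4567"))       # XXX-XXX-XXXX
```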

Creating a new data protection rule

For documentation purposes, we'll add the newly created data protection rules to the policy that we created at the beginning, so that it is clear that these rules implement this policy.

Add data protection rules to the policy

1.5. Review the scope

The following diagram summarizes what we have just done:


At this point, we have defined all the metadata necessary to enable automatic discovery and governance of structured data sets. We can now start the data discovery process itself.

2. Set up the connectivity

Now that the scope is clear and the governance artifacts are in place, we need to create a connection to the source database we want to ingest. Cloud Pak for Data provides a rich list of connectors to various types of sources.


In this example, we will connect to a DB2 database. We need to retrieve the source details and credentials and enter them when defining our new connection.
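As a side note, you can sanity-check the same connection details outside the platform. Here is a minimal sketch using the ibm_db Python driver, with placeholder host, database, and credentials:

```python
import ibm_db  # IBM Db2 driver, installable with: pip install ibm_db

# Placeholder connection details; substitute your own host, database,
# port, and credentials. These values are illustrative, not real.
dsn = (
    "DATABASE=SAMPLEDB;"
    "HOSTNAME=db2host.example.com;"
    "PORT=50000;"
    "PROTOCOL=TCPIP;"
    "UID=db2user;"
    "PWD=secret;"
)

conn = ibm_db.connect(dsn, "", "")  # raises an exception if the details are wrong
print("Connection successful")
ibm_db.close(conn)
```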

Create a new connection

3. Data Discovery

Once the connection is defined, we can start a data discovery job. We need to select the connection to the source to discover, optionally specify which schema should be analyzed, and specify a data quality project to be used as a staging area for reviewing the results before publishing them to the catalog, as well as what to do during the analysis. In this example, we will keep the default sampling of 1,000 rows per data set and run term assignment as well as data quality analysis on the discovered data sets.
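Discovery jobs can also be launched programmatically. The sketch below expresses the same options as a REST payload; the endpoint path and field names are my assumptions purely for illustration, so check the Cloud Pak for Data API documentation for the actual routes and schema.

```python
import requests

CPD_URL = "https://cpd.example.com"  # your Cloud Pak for Data instance
TOKEN = "..."                        # bearer token obtained from the platform

# Hypothetical payload mirroring the options chosen in the UI.
payload = {
    "connection": "db2-sample-connection",
    "schema": "SALES",
    "stagingProject": "discovery-staging",  # data quality project used for review
    "sampleSize": 1000,                     # default sampling of 1,000 rows
    "analyses": ["term_assignment", "data_quality"],
}

response = requests.post(
    f"{CPD_URL}/api/discovery/jobs",        # hypothetical route
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
response.raise_for_status()
print(response.json())
```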

Run a data discovery

The time needed for the analysis will depend on how many data sets need to be analyzed.

4. Review the discovery results

After a few minutes, the results of the analysis can be reviewed. The following screenshot shows the analysis details for one of the discovered tables. We can see that the columns EMAIL, PHONE1, and PHONE2 have automatically been assigned the term Personal Data because of the data class associations we made earlier.

Review the data discovery results

At this point we can review the results in detail and manually override the suggested terms where they are not as expected, or go back to defining new data classes and terms and repeat the analysis if we notice that something is missing.

5. Publish to a catalog

Once the review of the results is complete, we are ready to publish them to a catalog where they can be searched, found, and used by consumers.

Let us create a new catalog for the purpose of this exercise:

Create a new catalog enforcing data protection rules

Note that in order to protect data access with data protection rules, the option Enforce data protection rules must be enabled when the catalog is created.
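Catalog creation can also be scripted against the Watson Data API. In the hedged sketch below, the field names (in particular is_governed, which I take to correspond to the Enforce data protection rules option) follow my reading of that API and should be verified against the official documentation:

```python
import requests

CPD_URL = "https://cpd.example.com"  # your Cloud Pak for Data instance
TOKEN = "..."                        # bearer token obtained from the platform

# "is_governed" is assumed to correspond to the
# 'Enforce data protection rules' option in the UI.
response = requests.post(
    f"{CPD_URL}/v2/catalogs",
    json={
        "name": "Governed catalog",
        "description": "Catalog with data protection rules enforced",
        "is_governed": True,
    },
    headers={"Authorization": f"Bearer {TOKEN}"},
)
response.raise_for_status()
print(response.json())
```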

Once the catalog has been created and the users who need to access those data assets have been added to its access list, we can go back to our discovery results and publish all or some of the discovered data sets to it.

Publish the discovered assets to the catalog

6. Use the catalogued data sets

Now that the data sets have been published, other users can access the catalog and search for and use any of the published data sets. No matter how those users access the data (via the asset preview in the catalog, or after adding the data set to a project and working with it in a notebook), the platform ensures that the data protection rules are enforced and masks all data coming from columns identified as containing sensitive information.
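To picture the effect in a notebook, the enforcement behaves as if a masking transform were applied to every governed column before the data reaches the user. The simplified pandas illustration below shows the outcome; the actual enforcement happens inside the platform, not in user code.

```python
import pandas as pd

# A small sample data set; EMAIL and CCN were assigned governed terms
# during discovery, NAME was not.
df = pd.DataFrame({
    "NAME":  ["Alice Smith", "Bob Jones"],
    "EMAIL": ["alice@example.com", "bob@example.org"],
    "CCN":   ["4111-1111-1111-1111", "5500-0000-0000-0004"],
})

# Columns whose terms fall under a data protection rule
protected = ["EMAIL", "CCN"]

for col in protected:
    df[col] = df[col].str.replace(r"[A-Za-z0-9]", "X", regex=True)

print(df)  # EMAIL and CCN are masked, NAME stays in clear form
```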

This can be seen in the following screenshot, where the data in the columns EMAIL, CCN, PHONE1, and PHONE2 is masked, while the other columns are shown in clear form.

A user accesses a protected data asset from the catalog

Please note that the owner of the data set (the user who published it to the catalog) will still see the data in its original form. So if you want to see the effect of the data protection rules, you have to log in to the catalog as a different user.

Summary

In this article, we have seen how you can ingest data sets from a new data source into a catalog in an automated way. We have also seen how you can define policies and governance rules in the catalog so that they are automatically applied during data discovery. The process is repeatable: any further discovered data sources are automatically protected by the rules we have defined.
We have also seen that once the policies and rules are identified, the process of implementing them and ensuring that they are enforced is straightforward and can be done from a single UI. Predefined industry models can be used to accelerate the process of defining the assets.

This article has been focused on the analysis of structured data sets. A similar process can be implemented for unstructured documents as well.


Written by Yannick Saillet
Software Architect, Master Inventor @IBM — Architect for Data Profiling and Data Quality in Watson Knowledge Catalog on IBM Cloud Pak for Data, IBM Cloud.
