Data discovery at scale with IBM Watson Knowledge Catalog

Michał Szylar
Jun 3, 2022


How to catalog data in IBM Watson Knowledge Catalog while complying with governance rules

Data Fabric is an architectural approach to simplify data access in an organization to facilitate self-service data consumption. This architecture is agnostic to data environments, processes, utility and geography, all while integrating end-to-end data-management capabilities. A data fabric automates data discovery, governance and consumption, enabling enterprises to use data to maximize their value chain. With a data fabric, enterprises elevate the value of their data by providing the right data, at the right time, regardless of where it resides.

Watson Knowledge Catalog is the heart of IBM’s data fabric solution. It brings everything together from common workflows to applying the governance and policies needed to help users know their data, trust their data, protect their data and consume their data.

In this article, we will discuss and show with an example how Watson Knowledge Catalog can be used to achieve some of the key Data Fabric capabilities.

Watson Knowledge Catalog is the component of Cloud Pak for Data where you can manage the metadata of the data ingested by the platform, as well as all the assets that play a role in organizing these data (governance artifacts). Examples of such artifacts are data classes, business terms, data protection rules, reference data and policies.

Importing data sets into a catalog with the objective of making them available to business users is a process in itself. The main steps of that process are:

1. Define the scope of what needs to be done: identify the data sources to ingest; define the policies and rules that should govern the catalogued data assets; define and implement the data classes, business classifications, business terms and rules necessary to implement those policies.

2. Create a project, add a connection and set up a metadata import to access the data sources and metadata.

3. Run metadata enrichment and analysis of the data sources, where each discovered data set is classified and associated with the right terms and governance rules. Optionally this process can do a preliminary assessment of the data quality. You can find more information about data quality analysis here.

4. Review the results of the discovery and make any manual corrections to the identified data classes and suggested terms.

5. Publish the data assets to the catalog, where data analysts will be able to find and use them in their analytics projects.

6. Data consumers can find and access data in the catalog.

So far, the process may still look abstract. Let's see how this concretely looks in Watson Knowledge Catalog when we implement a simple example.

In the example, let's assume that we have identified a new database that needs to be added to a new catalog where data consumers can access data. Let's also assume that we have some governance policies to implement so that we don't violate any data protection rules. To make it easier to follow, we'll keep this example very simple.

Let's go through that process step by step and see what it looks like in Cloud Pak for Data:

1. Define the scope and governance assets

In this simple exercise, we are going to import all data sets from a relational database (it could just as well be AWS S3, Snowflake or a CSV file; a wide variety of connection types and file formats is supported) into a new catalog and ensure that simple data protection rules are properly applied to them.

Identifying the data source to import and getting its connection details is the simplest part of the process. A far more complicated task is to understand which business policies need to be enforced when making the data sets from that source available to business users. In a real-life scenario, there may be many different policies to implement in order to comply with regulations such as GDPR. In our example, we'll keep it simple and assume that we have a single policy stating that sensitive data must be masked when accessed by a business user.

1.1. Define policies

The policy we have chosen for this example sounds simple, but in order to implement it, we need to further define what we mean by sensitive data and how to detect it. You define these concepts in the catalog by creating business terms and business rules that provide a clear definition.


In a real-life scenario, you may start with an industry model that provides the terminology and definitions common to your industry and customize it with what is specific to your company. In this example, we will start from scratch and simply assume that sensitive data comprise:

· Personal data: including phone numbers, email addresses and social security numbers

Let's create a new policy to capture the fact that this sensitive data should be masked when accessed by business users.

Creating a new policy
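If you prefer to script this step instead of using the UI, governance artifacts such as policies can also be created through the Watson Data API. The sketch below is only an assumption of what such a call could look like: the endpoint path, payload shape and the draft/publish workflow should be verified against the Watson Data API reference for your release, and the host and token are placeholders.

```python
import requests

# Sketch only: creating a policy draft through the Watson Data API.
# Endpoint path and payload are assumptions -- verify them in the
# Watson Data API reference for your Cloud Pak for Data version.
CPD_HOST = "https://cpd.example.com"   # hypothetical cluster URL
TOKEN = "<bearer-token>"               # obtained from the platform

resp = requests.post(
    f"{CPD_HOST}/v3/policies",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=[{
        "name": "Mask sensitive personal data",
        "short_description": "Sensitive personal data must be masked "
                             "when accessed by business users.",
    }],
    verify=False,  # CPD clusters often use self-signed certificates
)
resp.raise_for_status()
print(resp.json())  # the created artifact typically still needs to be published
```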

So far we have only created governance assets (artifacts) that provide a clear vocabulary and definitions that users can understand; we have only provided plain-English definitions. If we want to automate the process of identifying sensitive data, we will need to connect the policy with a more technical definition of the sensitive data.

1.2. Define data classes

The next step consists in identifying how to automatically detect the sensitive data. This is something that we will do with data classes. A data class can be seen as the algorithm used by the system to determine that a particular column, based on the data it contains, represents a certain type of information that we may need to govern.

We won't go into the details of data classes in this article, but for the moment we need to understand that the logic of a data class can be specified as a regular expression, a list of values, or a more complex heuristic, and that it is used to test whether an individual value, or a column as a whole, matches the data class.
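To make this concrete, here is a minimal, purely illustrative Python sketch of how a regex-based data class could score a column of values. This is not how Watson Knowledge Catalog implements data classes internally; the patterns, the 80% match threshold and the function names are assumptions for illustration only.

```python
import re

# Toy regex-based "data classes" for the three types of sensitive data
# used in this example (illustrative patterns, not the platform's own).
DATA_CLASSES = {
    "Email Address": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US Phone Number": re.compile(r"^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$"),
    "Social Security Number": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def classify_column(values, threshold=0.8):
    """Return the data class whose pattern matches at least `threshold`
    of the non-empty values in the column, or None if nothing matches."""
    non_empty = [v for v in values if v]
    for name, pattern in DATA_CLASSES.items():
        matches = sum(1 for v in non_empty if pattern.match(v))
        if non_empty and matches / len(non_empty) >= threshold:
            return name
    return None

print(classify_column(["john@example.com", "jane@example.org"]))  # Email Address
print(classify_column(["123-45-6789", "987-65-4321"]))            # Social Security Number
```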

While defining the notion of personal data previously, we already identified a list of data classes that should be mapped to each term. Watson Knowledge Catalog ships with a list of predefined data classes, so we should first check whether the type of data we need to detect is covered by them.
In a real-life scenario, you may have to create new data classes in this step, or modify existing ones. In our simple example, all the data classes that we need are already available in the platform:

· Personal Data => US Phone Number, Email Address, Social Security Number

With data location rules (experimental at the time of writing), you can also mask data based on its physical location. To learn more about this functionality, see the documentation page.

1.3. Implement the policy using data protection rules

Next, we'll create data protection rules to enforce our policy, defining how personal data should be masked. In this example, I will define the rule so that columns containing sensitive data are masked by replacing the data with Xs. Data protection rules also let you use different kinds of masking or restrict access to the complete data set.

Creating a new data protection rule
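To illustrate the visible effect of this kind of masking, here is a small Python sketch that redacts letters and digits with Xs while keeping separators. It only mimics the outcome described above; the actual masking is performed by the platform when the rule is enforced, and this helper is purely hypothetical.

```python
import re

def mask_with_x(value: str) -> str:
    """Replace every letter and digit with 'X', keeping separators so the
    masked value retains its original shape (illustrative only)."""
    return re.sub(r"[A-Za-z0-9]", "X", value)

print(mask_with_x("john.doe@example.com"))  # XXXX.XXX@XXXXXXX.XXX
print(mask_with_x("123-45-6789"))           # XXX-XX-XXXX
```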

For documentation purposes, we'll add the newly created data protection rules to the policy that we created at the beginning, so that it becomes clear that these rules were defined in order to implement this policy.

Add data protection rules to policy

1.4. Reviewing the scope

The following diagram summarises what we have just done:

· We have identified the policy that needs to be enforced,

· That policy is implemented by data protection rules that are triggered by data classes.

Enforcing governance policy diagram

At this point, we have defined all the metadata necessary to enable automatic discovery and governance of structured data sets. We can now start with the data discovery process itself.
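The relationship summarised in the diagram can also be expressed as a tiny, purely illustrative data structure: the policy is documented by rules, and each rule lists the data classes that trigger it. This is a sketch of the logical model only, not of how Watson Knowledge Catalog stores or evaluates these artifacts.

```python
# Illustrative model of the governance chain defined in this example.
policy = {
    "name": "Mask sensitive personal data",
    "rules": [
        {
            "name": "Mask personal data",
            "trigger_data_classes": ["Email Address", "US Phone Number",
                                     "Social Security Number"],
            "action": "mask_with_x",   # see the masking sketch above
        }
    ],
}

def action_for_column(detected_data_class):
    """Return the masking action that applies to a column, if any."""
    for rule in policy["rules"]:
        if detected_data_class in rule["trigger_data_classes"]:
            return rule["action"]
    return None

print(action_for_column("Email Address"))   # mask_with_x
print(action_for_column("Postal Code"))     # None (not a protected class here)
```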

2. Create project

Now that the scope is clear and the governance artifacts are in place, we need to create a project, which will be the workspace where we set up the connectivity, import metadata and run metadata enrichment.

Create a new project

2.1 Set up a connection

Next, we need to create a connection to the source database to ingest. Cloud Pak for Data provides a rich list of connectors to various types of sources.

In this example, we will connect to a DB2 database. We need to retrieve the source details and credentials and enter them when defining our new connection.
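Optionally, before entering those details in the connection form, you can verify them outside the platform. The sketch below uses the ibm_db Python driver with a placeholder hostname and credentials; adapt it to your own database.

```python
import ibm_db  # pip install ibm_db

# Hypothetical connection details -- replace with your own DB2 instance.
# This only checks that the host and credentials work before you enter
# them in the Cloud Pak for Data connection form.
conn_str = (
    "DATABASE=SAMPLEDB;"
    "HOSTNAME=db2.example.com;"
    "PORT=50000;"
    "PROTOCOL=TCPIP;"
    "UID=db2user;"
    "PWD=change-me;"
)

try:
    conn = ibm_db.connect(conn_str, "", "")
    print("Connection OK, server version:", ibm_db.server_info(conn).DBMS_VER)
    ibm_db.close(conn)
except Exception as err:
    print("Connection failed:", err)
```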

Create a new connection

2.2 Metadata import

Once the connection is defined, we can create a metadata import.

During metadata import, we can select the target (a project or a catalog), define the scope of the import and, optionally, schedule the import. In this example, we will run it once and select a project as the target.

Create a metadata import

3. Run metadata enrichment

First, we need to select the data scope; it can be either data assets from the project or the metadata import asset itself.

Create a metadata enrichment

In the next step, we need to define the metadata enrichment objective (for the purpose of this example, profiling is enough), select the categories that contain the relevant governance artifacts (in my case the [uncategorized] category, where the out-of-the-box data classes live, and the Banking category, where I store my business terms) and select the sampling.

Category selection is an important step in the process, as it impacts which terms and data classes will be used by the auto-assignment algorithms.

In this example, we will keep the default sampling of 1000 rows per data set and do term assignment as well as data quality analysis of the discovered data sets.

The time needed for the analysis depends on how many data sets need to be analysed. In this example, it shouldn't take more than a few minutes.
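To give a feel for the kind of column statistics that profiling collects on such a sample, here is a small, purely illustrative pandas sketch; the table, column values and metrics are hypothetical stand-ins and only approximate what the enrichment actually produces.

```python
import pandas as pd

def profile(df: pd.DataFrame, sample_size: int = 1000) -> pd.DataFrame:
    """Compute a few per-column statistics on a sample of the data,
    roughly in the spirit of what profiling reports (illustrative only)."""
    sample = df.head(sample_size)
    return pd.DataFrame({
        "non_null": sample.notna().sum(),
        "null_pct": (sample.isna().mean() * 100).round(1),
        "distinct": sample.nunique(),
        "inferred_type": sample.dtypes.astype(str),
    })

# Hypothetical data standing in for an imported table.
df = pd.DataFrame({
    "EMAIL_ADDRESS": ["john@example.com", "jane@example.org", None],
    "PHONE_NUMBER": ["212-555-0101", "212-555-0102", "212-555-0103"],
})
print(profile(df))
```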

4. Review the metadata enrichment results

Now the results of the analysis can be reviewed. The following screenshots show the analysis details for all the analyzed assets and the detailed view for one of the discovered tables.

Review the metadata enrichment (discovery) results — asset list view

We can see that the columns EMAIL_ADDRESS and PHONE_NUMBER have automatically been assigned to the relevant data classes.

Review the metadata enrichment (discovery) results — column details view

At this point, we can review the results in detail and manually override the suggested terms or data classes if they are not as expected, or go back to define new data classes and terms and repeat the analysis if we notice that something is missing.

5. Publish to a catalog

Once the review of the results is complete, we are ready to publish them to a catalog where they can be searched, found and used by the consumers.

Let us create a new catalog for the purpose of this exercise:

Create a new catalog enforcing data protection rules

Note that in order to protect data access with data protection rules, the option Enforce data protection rules needs to be enabled when the catalog is created.
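For completeness, a catalog with that option enabled can also be created programmatically through the Watson Data API. The sketch below is a hedged example: the host and token are placeholders, and the exact payload fields (in particular the flag that corresponds to Enforce data protection rules) should be verified against the Watson Data API documentation for your release.

```python
import requests

# Sketch only: create a governed catalog via the Watson Data API.
CPD_HOST = "https://cpd.example.com"   # hypothetical cluster URL
TOKEN = "<bearer-token>"               # obtained from the platform

resp = requests.post(
    f"{CPD_HOST}/v2/catalogs",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "Governed customer data",
        "generator": "data-discovery-example",
        "is_governed": True,  # assumed equivalent of 'Enforce data protection rules'
    },
    verify=False,  # CPD clusters often use self-signed certificates
)
resp.raise_for_status()
print(resp.json())
```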

Once the catalog has been created and the users who need to access those data assets have been added to its access list, we can go back to our metadata enrichment results and publish all or some of the discovered data sets to it.

Publish the discovered assets to the catalog

6. Use the catalogued data sets

Now that the data sets have been published, other users can access the catalog and search for and use any of the published data sets. No matter how those users try to access the data (via the asset preview in the catalog, or after adding the data set to a project and working on it in a notebook), the platform will ensure that the data protection rules are enforced and will mask all data coming from columns identified as containing sensitive information.

This can be seen in the following screenshot, where the data in the columns EMAIL_ADDRESS and PHONE_NUMBER is masked while the other columns are shown in clear form.

Please note that the owner of the data set (the user who published those data sets to the catalog) will still be able to see the data in its original form. So if you want to see the effect of the data protection rules, you have to log in to the catalog as a different user or transfer ownership to another user.

Furthermore, data consumers can use the business terms assigned to data assets and columns during metadata enrichment to more easily find the relevant data assets and understand the data in its business context.

Summary

We have seen in this article how you can ingest data sets from a new data source into a catalog in an automated way. We have also seen how you can define the policies and governance rules in the catalog so that they are automatically applied during the data discovery. The process is repeatable and any further discovered data sources would be automatically protected by the rules that we have defined.

We have seen that once the policies and rules are identified, the process of implementing them and ensuring that they are enforced is straightforward and can be done through a single UI. Predefined industry models can be used to accelerate the process of defining the assets.

If you want to try to recreate these steps, you can start from one of our Data Governance and Privacy tutorials.

This article has focused on the analysis of structured data sets and is an updated version of a previous article by Yannick Saillet.
