How to ingest data sources into IBM Watson Knowledge Catalog while complying with the governance rules
AI projects, and data analytics in general, require good data in order to be successful. On the other hand, the sheer amount of data that exists in any organization makes it difficult to find, and governance policies and regulations make it difficult to share.
In this article, I am going to show how you can use IBM Cloud Pak for Data to solve that problem by cataloging a large number of data sets in a short amount of time and making them available to users, while automatically ensuring that data protection policies are enforced.
In order to set the context, let’s have a look again at the AI Ladder as defined by IBM:
If you are not yet familiar with IBM's AI Ladder, you can have a look at this article by Hemanth Manda, which explains it in detail. In short, the picture says that there are four major steps you need to follow, in a particular order, to be successful with AI:
- Collect: Get access to the data, wherever they are.
- Organize: Ensure that the data can be found; that they are properly described, tagged, classified and of sufficient quality; and that the policies regulating how the data can be accessed or used are in place.
- Analyze: Build your AI models (or any kind of other analysis) using those data.
- Infuse: Use the built models or analytical results in your business.
This article is primarily about the Collect and Organize steps of the ladder. We are going to use Watson Knowledge Catalog and the rich connectivity provided by the platform.
Watson Knowledge Catalog is the component of Cloud Pak for Data where you can manage the metadata of the data ingested by the platform, as well as all the assets playing a role in the organization of these data. Examples of such assets are data classes, business terms, rules and policies.
Importing data sets into a catalog with the objective of making them available to business users is a process by itself within the main AI ladder. The main steps of that process are:
- Define the exact scope of what needs to be done: identify the data sources to ingest; define the policies and rules which should govern the catalogued data asset; define and implement the data classes, business classifications, business terms and rules necessary to implement those policies.
- Set up the connectivity to access the data sources.
- Run an automatic discovery and analysis of the data sources, where each discovered data set is classified and associated with the right terms and governance rules. Optionally this process can do a preliminary assessment of the data quality.
- Review the result of the discovery and do any manual correction in the identified data classes and suggested terms.
- Publish the data assets to the catalog, where data analysts will be able to find and use them in their analytics projects.
So far the process may still look abstract. Let's see what it looks like concretely in Watson Knowledge Catalog by implementing a simple example.
In this example, let’s assume that we have identified a new database that needs to be added to a new catalog. Let’s also assume that we have some governance policies that we need to implement to ensure that we don’t violate some data protection rules. To make it easier to follow, we’ll keep this example very simple.
Let's go through that process step by step and see how it looks in Cloud Pak for Data:
1. Define the scope and governance assets
In this simple exercise, we are going to import all data sets from a relational database into a new catalog and ensure that simple data protection rules are properly applied to them.
Identifying the data source to import and getting the connection details for it is the simplest part of the problem. A far more complicated task is to understand which business policies need to be enforced when making the data sets from that source available to business users. In a real-life scenario, there may be many different policies to implement in order to comply with regulations such as GDPR. In our example, we'll keep it simple and assume that we have a single policy indicating that sensitive data need to be masked when the data are accessed by a business user.
1.1. Define terms and policies
The policy we have chosen for this example sounds simple, but in order to implement it, we need to further define what we mean by sensitive data and how to detect them. You define these concepts in the catalog by creating business terms and business rules providing a clear definition.
In a real life scenario, you may start with an industry model providing the terminology and definitions which are common for your industry and customize it with what is specific to your company. In this example, we will start from scratch and simply assume that sensitive data comprise:
- Personal data: including phone numbers, email addresses and social security numbers
- Financial data: including credit card numbers, or bank details like Routing Transit Number (RTN)
We’ll create business terms to capture each of these definitions.
Next, let's create a new policy to capture the fact that these sensitive data should be masked when accessed by business users.
So far we have only created assets in the catalog which provide a clear vocabulary and definitions that users can understand; we have only provided plain English definitions. If we want to automate the process of identifying sensitive data, we need to connect those terms with a more technical definition of the sensitive data.
1.2. Define data classes
The next step consists in identifying how to automatically detect sensitive data. This is something we do with data classes. A data class can be seen as the algorithm used by the system to determine that a particular column, based on the data it contains, represents a certain type of information that we may need to govern.
I won't go into the details of data classes in this article; for the moment, we only need to know that the logic of a data class can be specified as a regular expression, a list of values, or a more complex heuristic that tests whether an individual value, or a column as a whole, matches the data class.
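To make the idea concrete, here is a minimal Python sketch of how a regex-based data class could classify a column. The pattern and the match threshold are illustrative assumptions for this sketch, not Watson Knowledge Catalog's internal definitions.

```python
import re

# Illustrative pattern for a "US Social Security Number" data class.
# This regex is an assumption for the sketch, not the product's built-in logic.
US_SSN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def column_matches(values, pattern, threshold=0.8):
    """A column matches a data class when enough of its values match."""
    hits = sum(1 for v in values if pattern.match(v))
    return hits / len(values) >= threshold

column = ["123-45-6789", "987-65-4321", "n/a", "111-22-3333"]
print(column_matches(column, US_SSN))                 # 3 of 4 match: below 0.8
print(column_matches(column, US_SSN, threshold=0.7))  # 0.75 >= 0.7
```

Real data classes are usually more tolerant (handling formatting variants, checksums, and so on), but the principle of matching a sample of column values against a definition is the same.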
While defining the notion of personal and financial data previously, we have already identified a list of data classes which should be mapped to each term. Watson Knowledge Catalog is shipped with a list of predefined classes. We should first check if the type of data we need to detect is covered by those predefined data classes.
In a real-life scenario, you may have to create new data classes in this step, or modify existing ones. In our simple examples, all the data classes that we need are already available in the platform:
- Personal Data => US Phone Number; Email Address; US Social Security Number
- Financial Data => Credit Card Number; Routing Transit Number
1.3. Associate the data classes to the terms
In order to have the terms Personal Data and Financial Data automatically assigned to the columns containing this information, we need to associate these data classes with their respective terms. When a data class is associated with a term, that term will be automatically assigned to any column detected as containing data matching that data class.
In our example, we will add the data classes US Phone Number, Email Address, and US Social Security Number to the term Personal Data, and the data classes Credit Card Number and Routing Transit Number to the term Financial Data.
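Conceptually, the associations we have just configured form a simple lookup from data class to term. The following sketch mirrors that mapping in Python, using the names from this article; the function is a hypothetical illustration, not a catalog API.

```python
# Mapping of data classes to business terms, mirroring the associations
# configured in the catalog UI (names as used in this article).
TERM_FOR_DATA_CLASS = {
    "US Phone Number": "Personal Data",
    "Email Address": "Personal Data",
    "US Social Security Number": "Personal Data",
    "Credit Card Number": "Financial Data",
    "Routing Transit Number": "Financial Data",
}

def suggest_term(detected_data_class):
    """Return the business term to assign for a detected data class, if any."""
    return TERM_FOR_DATA_CLASS.get(detected_data_class)

print(suggest_term("Email Address"))          # Personal Data
print(suggest_term("Routing Transit Number")) # Financial Data
```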
Note that there are other algorithms in Watson Knowledge Catalog which can suggest a term for an analyzed column. These algorithms may base their suggestion on the metadata of the columns and/or previous assignments done by the user. So using a data class for detecting the term is not the only way to do this, but if you have a clear definition of how to detect data that should be associated with a term, using a data class will give you the most accurate term assignment, especially if the column has no meaningful name.
1.4. Implement policy using data protection rules
Next, we'll create data protection rules to enforce our policy: two different rules defining how personal data and financial data should be masked. In this example I will define the rules so that columns containing sensitive data are masked by replacing the data with Xs. Data protection rules also allow you to use different kinds of masking, or to restrict access to the complete data set.
For documentation purposes, we'll add the newly created data protection rules to the policy that we created at the beginning, so that it becomes clear that these rules are defined in order to implement this policy.
1.5. Reviewing the scope
The following diagram summarizes what we have just done:
- We have identified the policy that needs to be enforced.
- That policy is implemented by data protection rules that will react on business terms.
- The terms will be automatically assigned to ingested data with the help of data classes.
From here, we have defined all the metadata which are necessary to enable an automatic discovery and governance of structured data sets. We can now start with the data discovery process itself.
2. Setup the connectivity
Now that the scope is clear and the governance artifacts are in place, we need to create a connection to the source database to ingest. Cloud Pak for Data provides a rich list of connectors to various types of sources.
In this example, we will connect to a DB2 database. We need to retrieve the source details and credentials and enter them when defining our new connection.
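The connection details you gather are essentially the fields of a DB2 connection string. The sketch below assembles one from placeholder values (host, database and credentials are hypothetical); with the `ibm_db` driver installed, the same string could be passed to `ibm_db.connect`.

```python
# Assemble a DB2 connection string from the details collected for the source.
# All values below are placeholders for illustration only.
def build_db2_dsn(database, hostname, port, user, password):
    return (
        f"DATABASE={database};HOSTNAME={hostname};PORT={port};"
        f"PROTOCOL=TCPIP;UID={user};PWD={password};"
    )

dsn = build_db2_dsn("SAMPLEDB", "db2host.example.com", 50000, "dbuser", "secret")
print(dsn)
# With the ibm_db driver installed, the connection would then be opened with:
# conn = ibm_db.connect(dsn, "", "")
```

In Cloud Pak for Data you enter these same fields in the connection form rather than building the string yourself.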
3. Data Discovery
Once the connection is defined, we can start a data discovery job. We need to select the connection to the source to discover, optionally specify which schemas should be analyzed, choose a data quality project to be used as a staging area for reviewing the results before publishing to the catalog, and specify which analyses to run. In this example, we will keep the default sampling of 1,000 rows per data set and run term assignment as well as a data quality analysis on the discovered data sets.
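The discovery step can be pictured as: sample up to N rows per data set, then classify each column against the configured data classes. The following is a toy sketch under that assumption; the patterns are simplified illustrations, not the product's built-in definitions.

```python
import itertools
import re

# Simplified stand-ins for two of the predefined data classes.
DATA_CLASSES = {
    "Email Address": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US Phone Number": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
}

def discover(rows, sample_size=1000, threshold=0.8):
    """Sample rows and assign a data class to each column that matches."""
    sample = list(itertools.islice(rows, sample_size))
    assignments = {}
    for col in sample[0].keys():
        values = [row[col] for row in sample]
        for name, pattern in DATA_CLASSES.items():
            hits = sum(bool(pattern.match(v)) for v in values)
            if hits / len(values) >= threshold:
                assignments[col] = name
    return assignments

rows = iter([{"EMAIL": "a@b.co", "PHONE1": "555-123-4567"}] * 5)
print(discover(rows))  # {'EMAIL': 'Email Address', 'PHONE1': 'US Phone Number'}
```

Sampling keeps the analysis time roughly proportional to the number of data sets rather than their total size, which is why the job scales to large sources.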
The time needed for the analysis will depend on how many data sets need to be analyzed.
4. Review the discovery results
After a few minutes, the results of the analysis can be reviewed. On the following screenshot we see the analysis details for one of the discovered tables. We can see that the columns EMAIL, PHONE1 and PHONE2 have automatically been assigned the term Personal Data because of the data class associations that we defined before.
At this point we can review the results in detail and manually override the suggested terms where they are not as expected, or go back to define new data classes and terms and repeat the analysis if we notice that something is missing.
5. Publish to a catalog
Once the review of the results is complete, we are ready to publish them to a catalog where they can be searched, found and used by the consumers.
Let us create a new catalog for the purpose of this exercise:
Note that in order to protect data access with data protection rules, the option Enforce data protection rules needs to be enabled at the time of the creation of the catalog.
Once the catalog has been created and the users who need to access those data assets have been added to its access list, we can go back to our discovery results and publish all or some discovered data sets to it.
6. Use the catalogued data sets
Now that the data sets have been published, other users can access the catalog and search and use any of the published data sets. No matter how those users try to access the data (via asset preview in the catalog, or after having added the data set to their project and working on the data set in a notebook), the platform will ensure that the data protection rules are enforced and will mask all data coming from columns identified as containing sensitive information.
This can be seen in the following screenshot, where the data of the columns EMAIL, CCN, PHONE1 and PHONE2 are masked while the other columns are shown in clear form.
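The enforcement behavior can be sketched as a transformation applied to every row before it reaches the consumer. This is a hypothetical illustration of the principle, not how the platform implements it internally.

```python
def enforce_rules(row, sensitive_columns):
    """Return a copy of the row with sensitive columns masked with Xs."""
    return {
        col: ("X" * len(str(val)) if col in sensitive_columns else val)
        for col, val in row.items()
    }

row = {"NAME": "Jane", "EMAIL": "jane@example.com", "CCN": "4111111111111111"}
print(enforce_rules(row, {"EMAIL", "CCN"}))
```

The key point is that the masking happens in the access path itself, so it applies uniformly whether the consumer uses a preview, a notebook, or any other tool on the platform.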
Please note that the owner of the data set (the user who published those data sets to the catalog) will still be able to see the data in their original form. So if you want to see the effect of the data protection rules, you have to log in to the catalog as a different user.
We have seen in this article how you can ingest data sets from a new data source into a catalog in an automated way. We have also seen how you can define the policies and governance rules in the catalog so that they are automatically applied during the data discovery. The process is repeatable and any further discovered data sources would be automatically protected by the rules that we have defined.
We have seen that once the policies and rules are identified, the process of implementing and enforcing them is straightforward and can be done from a single UI. Predefined industry models can be used to accelerate the process of defining the assets.
This article has been focused on the analysis of structured data sets. A similar process can be implemented for unstructured documents as well.