Data Governance : Exploring the paradigm with Watson Knowledge Catalog — Chapter 3

Praveen Devarao
IBM Data Science in Practice
7 min read · Apr 14, 2021

Chapter 3: Data Classes

a lot of professional grade camera flashes laid out on a surface
Photo by Taylor R on Unsplash

In Chapter 1, we learnt what a data catalog is and how it plays a pivotal role in data governance.

In Chapter 2, we learnt and explored the data governance constructs Business Terms and Classifications using the Watson Knowledge Catalog capabilities.

In this chapter, we will learn about Data Classes and explore how to use them in Watson Knowledge Catalog.

Data Classes

One of the key features that a data catalog should provide is self-service data governance. To achieve this, the catalog must be able to understand the data housed within its assets. Using this understanding, the catalog can automatically assign classifications to an asset. Based on these classifications, one can define policies that determine who has access to the asset and who does not.

Data Classes are labels with associated logic. This logic lets the system apply a label to a data asset depending on whether the logic evaluates to true or false. When a data asset is added to a catalog, an automated background process is kicked off to analyze the data within the asset, and it then enriches the asset's metadata with these logic-backed labels. We will delve further into defining these labels and seeing them in action in Watson Knowledge Catalog.
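
To make the idea of "a label with logic" concrete, here is a minimal Python sketch. It is not how Watson Knowledge Catalog is implemented; the `DataClass` type, the `classify` function and the toy airport-code rule are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DataClass:
    """A label plus the logic that decides whether the label applies."""
    name: str
    matches: Callable[[str], bool]  # hypothetical predicate, not a WKC API


def classify(values: List[str], data_classes: List[DataClass]) -> List[str]:
    """Return the names of the data classes whose logic holds for every value."""
    return [
        dc.name
        for dc in data_classes
        if values and all(dc.matches(v) for v in values)
    ]


# Toy rule for illustration only: three uppercase letters.
airport_code = DataClass(
    "Airport Code",
    lambda v: len(v) == 3 and v.isalpha() and v.isupper(),
)

labels = classify(["JFK", "LHR", "BLR"], [airport_code])
```

Here the label is attached only when the logic evaluates to true for the data, which is the behaviour described above in its simplest form.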

To explore Data Classes in Watson Knowledge Catalog, access the Data Classes menu from the left side of the landing page of the Cloud Pak for Data platform. The image below shows the Cloud Pak for Data dashboard with expanded Navigation Menu. Under the Governance section of the menu, the link to the landing page of Data Classes is highlighted.

screenshot of the Cloud Pak for Data admin page showing the left hand navigation menu with “data classes” highlighted in the menu under the “Governance” hierarchy
Cloud Pak for Data dashboard with expanded Navigation Menu

On entering the landing page, you will find a list of pre-defined data classes. These are provided out-of-the-box by Watson Knowledge Catalog. The data class names are self-descriptive: each one indicates the type of data that the class identifies and classifies.

The image below shows the data class landing page with some of the out-of-the-box data classes, such as Account Number, Address Line, Airport Code etc.

a screenshot of pre-defined data classes in Watson Knowledge Catalog such as Account Number, Address Line 3, Address Line 2, Address Line 1, Airport Code, Alabama State Driver’s License, and Alaska State Driver’s License. There is also an “add data class” button in the upper right corner.
List of Data Classes

While these out-of-the-box data classes cater to generic requirements, you will often need to define your own data classes. Custom data classes let the catalog recognize organization-specific data and classify an asset accordingly when the data within it matches the specified evaluation logic.

Let’s explore the creation of a custom data class.

To create a data class, click the Add Data Class button in the top right corner [see image List of Data Classes] and select New data class. Enter the new data class name and choose the category in which the data class should be created. Click Save as draft to navigate to the draft page of the newly created data class.

The image below shows the draft page of a newly created data class. The image shows sections on how to update the description, the name, and other general properties of the data class. You can add an example of data that matches what should be included in this data class in the Examples section. The image also contains a section titled Data matching which we will discuss next.

screenshot of a draft page for a new data class called “Myorg_id_data”. It has two buttons “delete draft” and “publish”, and two tabs “overview” and “related content”, with the overview tab being shown. This tab contains the section “general” with fields “description”, “examples”, “primary category”, and “secondary categories”, the section “details” with field “ca text 1” and the section “data matching” with fields “parent data class”, “dependent data classes” and “matching method”
Data Class draft page

As noted above, Data Classes are labels with logic. In the newly created data class, we can specify the attributes related to the logic in the Data matching section. The two main attributes we will focus on are the Parent data class and the Matching method.

  • Parent Data Class: Here you select a data class that will be the parent of the newly created data class. The hierarchy enforces that data must first pass the logic of the parent data class, and only then is the child data class logic evaluated. For example, the out-of-the-box data classes include a Credit card number data class and an Amex credit card number data class. These two data classes are organized in a hierarchy, with the Credit card number data class being the parent of the Amex credit card number data class. A value can match an Amex credit card number only if it is first validated as an acceptable credit card number.
  • Matching method: This is the portion where we get to select the logic for the data class. The image below shows the dialog box from which we can select the matching method. The different matching methods in the image are described below.
a screenshot of the “data matching” dialog box. There are three options with radio buttons, “select matching method”, “define data matching” and “other matching criteria”. “select matching method” is selected and there are a choice of matching methods shown: “no automatic matching”, “match to list of valid values”, “Match to reference data”, “match to criteria in regular expression”, “match to criteria in deployed Java class” and “other matching criteria”. “no automated matching” is selected
Different Data Class Data Matching methods
  • Match to list of valid values: With this method, you provide a list of values and the algorithm checks whether the data matches one of the values in the list. If a match is found, the asset is labelled as containing data of the type this data class represents.
  • Match to reference data: This option is similar to the list of valid values above, except that the list is derived from one of the Reference Data Sets defined in the WKC platform. The advantage is that the list can be updated independently as the organization's requirements change, and the updates are reflected in the data class matching logic as well.
  • Match to criteria in regular expression: With this option, you specify a regular expression that matches the intended data. At run time, the platform checks whether the data in a given asset matches this regular expression and, if it does, marks the asset with the data class.
  • Match to criteria in deployed Java class: In this case, you choose a Java class that contains the algorithm needed to match the data. This Java routine is executed at run time, and whether it returns true or false determines if the label is applied.
  • Other matching criteria: In this case, you can choose other factors, such as the name of a column in a data asset, to determine a match instead of the data itself. That is, the label logic is based purely on column metadata, such as the name and type of the column, rather than on the actual data.
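
The parent/child rule and two of the matching methods in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `DataClass`, `applies_to`, `match_valid_values` and `match_regex` are hypothetical names, the card patterns are deliberately simplistic stand-ins for the real out-of-the-box logic, and "Myorg region" is an invented custom class.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


def match_valid_values(valid: Iterable[str]) -> Callable[[str], bool]:
    """Matching method: the value must be one of a fixed list of values."""
    allowed = set(valid)
    return lambda value: value in allowed


def match_regex(pattern: str) -> Callable[[str], bool]:
    """Matching method: the whole value must match a regular expression."""
    compiled = re.compile(pattern)
    return lambda value: compiled.fullmatch(value) is not None


@dataclass
class DataClass:
    name: str
    matches: Callable[[str], bool]
    parent: Optional["DataClass"] = None

    def applies_to(self, value: str) -> bool:
        # The parent's logic must pass before the child's logic is evaluated.
        if self.parent is not None and not self.parent.applies_to(value):
            return False
        return self.matches(value)


# Toy patterns for illustration; real card validation is more involved.
credit_card = DataClass("Credit card number", match_regex(r"\d{15,16}"))
amex = DataClass("Amex credit card number", match_regex(r"3[47]\d+"), parent=credit_card)

# Hypothetical organization-specific class backed by a list of valid values.
region = DataClass("Myorg region", match_valid_values(["APAC", "EMEA", "AMER"]))
```

With these definitions, `amex.applies_to("3712")` is false even though the value starts with 37, because it never passes the parent check. A reference data set would behave like `match_valid_values`, with the list maintained elsewhere.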

For demonstration purposes, in this post, let's choose the Match to criteria in regular expression method and see how it works in the following images. First, type in the regular expression. Next, choose any other criteria that need to be matched, such as column name and column data type. Then, click Save to complete the choice of matching method.

The image below shows the dialog for the Match to criteria in regular expression method. The regular expression to be matched, abc1*23+, is in the text box, and the Percentage match threshold field specifies that the match must be 100%.

a screenshot of the data matching dialog box with “define data matching” highlighted. The “match to criteria in regular expression” dialog is displayed, with the regular expression “abc1*23+” shown in the text box with the percentage match threshold set at 100%
Regular expression specification dialog box
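
To get a feel for what abc1*23+ accepts, and for how a percentage threshold might be applied to a column, here is a small Python sketch. Two assumptions are made that the post does not confirm: that a value must match the expression in full, and that the threshold is the share of values in a column that must match. The function names are hypothetical.

```python
import re

# "abc", then zero or more "1", then "2", then one or more "3".
PATTERN = re.compile(r"abc1*23+")


def match_percentage(values):
    """Percentage of values that match the pattern in full (assumed semantics)."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if PATTERN.fullmatch(v))
    return 100.0 * hits / len(values)


def column_matches(values, threshold=100.0):
    """True when the share of matching values reaches the threshold."""
    return match_percentage(values) >= threshold
```

Under these assumptions, values such as abc23, abc123 and abc111233 match, while abc12 does not because at least one 3 is required. A column containing abc23 and abc12 would meet a 50% threshold but not the 100% threshold used in the dialog.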

Move the draft data class through the workflow process, as discussed in the previous chapter for other constructs, and then publish it.

Once the data class is published, try adding to a catalog an asset that contains a column with data matching the above regular expression. You will see that, after the asset is profiled, the data class is assigned to the column whose data matches the regular expression.
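
If you do not have suitable data at hand, a small script can generate a test asset. The sketch below writes a CSV file whose second column contains only values matching abc1*23+. The file layout, the column names `row_id` and `myorg_id`, and the function names are assumptions made for this example.

```python
import csv
import random
import re

PATTERN = re.compile(r"abc1*23+")


def sample_value(rng):
    """Build one value of the form abc, 0-3 ones, a two, 1-4 threes."""
    return "abc" + "1" * rng.randint(0, 3) + "2" + "3" * rng.randint(1, 4)


def write_sample_asset(path, rows=50, seed=0):
    """Write a CSV with a hypothetical myorg_id column of matching values."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["row_id", "myorg_id"])
        for i in range(rows):
            writer.writerow([i, sample_value(rng)])
```

Calling `write_sample_asset("myorg_ids.csv")` produces a file that you can upload to a catalog as a data asset to try out the custom data class.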

Conclusion

In this chapter, we learnt what data classes are and explored how to create a custom data class in Watson Knowledge Catalog.

With this chapter we also conclude the series.

In this series we learnt:

  • What data governance is
  • What a data catalog is and how it plays a pivotal role in data governance
  • What Business Terms and Classifications are
  • What Data Classes are

We also explored other posts that cover other governance constructs, namely Reference Data, Data Protection Rules, and Workflow, to get a complete overview of data governance.

We explored all these features first hand using the Watson Knowledge Catalog capabilities.

P.S.: Give the data governance capabilities of Watson Knowledge Catalog a try to get hands-on experience, and leave your feedback or questions in the comments section of the post. We will be glad to answer your queries.

Praveen Devarao
IBM Data Science in Practice

CMTS @ Oracle Cloud, previously Software Architect @ IBM India Software Labs