Creating new Data Classes for IBM Knowledge Catalog

Mary O'Neill
Dec 18, 2023

Authors: Mary O'Neill, Julie Forgo

The IBM Knowledge Accelerators provide a pre-defined set of data classes, organized in a category hierarchy, to help organisations that deploy IBM Knowledge Catalog analyse and classify their data and assign it to appropriate Business Terms. These data classes can also be used with Data Protection Rules, for example to control access to personal data for data privacy purposes.

Within the data privacy area, the data classes provided include a set of government-issued identifiers for individuals (such as tax, social security, and other national identifiers) across multiple jurisdictions.

This blog describes how to create new Data Classes in IBM Knowledge Catalog to supplement the data classes provided with the Knowledge Accelerators. For example, if your organization has business operations spanning multiple geographies, you can create new data classes for personal identifiers for your customers based in each jurisdiction.

Creating a new data class involves these steps:

1. Define the data class and specify attributes such as name, description, and primary category.

2. Add secondary attributes, such as a secondary category, examples and classifications, as needed.

3. Choose a data matching method for automatically assigning values to the data class. The choice of method depends on whether there is a distinct set of values to match to, or recognizable patterns expected in the data being analysed. This blog describes how to create new data classes using each of these matching methods:

· List of valid values

· Reference data

· Criteria in regular expression.

Create a new Data Class

1. Navigate to Governance > Data classes, click Add data class, and select New data class.

2. Populate the Name, Description and Primary Category for the new data class:

· Name: provide a unique name which summarises the type of data to be classified by the data class.

· Description: in addition to providing an explanation of the nature of the data class, it is useful to describe the data that it will be used to classify. For example, any obvious patterns or formats that should exist in the data can be described.

· Primary Category: determines where the data class sits in the Data Classes category hierarchy, which makes it easier to select the appropriate data classes for Metadata Enrichment. When you import and enrich the technical metadata for your data assets, you can choose which data classes to apply or ignore. This gives you a high degree of control over how your data is processed and classified, with minimal setup.

3. Click Save as draft, and begin adding other properties to the draft Data Class:

· Secondary categories provide alternative groupings for data classes. Note that these are not currently used by the metadata enrichment process.

· Examples of the expected data values can be added to provide extra clarity to the data class definition.

· Data class hierarchy: Parent / Dependent data classes can be populated where appropriate. If the parent data class has a data matching method defined, the matching method of a dependent data class is checked only when the parent data class returned a positive match. For example, a data class for a specific date, such as Date of Birth, must have a parent class of Date.

· Classifications can be used to apply very specific classifications to data, such as in the data privacy context of Personal Information (PI) and Sensitive Personal Information (SPI). These classifications should be reviewed to ensure that they align with recommended classification in accordance with the relevant data privacy regulations, such as GDPR (EU General Data Protection Regulation) and CCPA (California Consumer Privacy Act).

· Business Term(s) should be related to the data class as part of a comprehensive data governance solution. This improves the metadata enrichment process, where the business terms will be assigned to the data assets.

· Matching Method: add one by clicking the + sign in the Data matching section.
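The parent / dependent behaviour described in the list above can be pictured with a small Python sketch. This is only an illustrative model, not how IBM Knowledge Catalog is implemented: the `DataClassNode` type, the `matches_in_hierarchy` function, the ISO-style date pattern and the year range for Date of Birth are all hypothetical, and the check is simplified to a single value rather than a whole column.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DataClassNode:
    """Hypothetical stand-in for a data class with an optional parent."""
    name: str
    matches: Callable[[str], bool]
    parent: Optional["DataClassNode"] = None


def matches_in_hierarchy(data_class: DataClassNode, value: str) -> bool:
    """Check the parent first; evaluate the dependent class only on a positive parent match."""
    if data_class.parent is not None and not matches_in_hierarchy(data_class.parent, value):
        return False
    return data_class.matches(value)


# Assumed parent: any value shaped like YYYY-MM-DD.
date = DataClassNode(
    "Date",
    lambda v: re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", v) is not None,
)

# Assumed dependent: a plausible birth year. This check would raise on
# non-date text, so it relies on the parent having matched first.
date_of_birth = DataClassNode(
    "Date of Birth",
    lambda v: 1900 <= int(v[:4]) <= 2023,
    parent=date,
)

for sample in ["1985-04-12", "1850-01-01", "not a date"]:
    print(sample, matches_in_hierarchy(date_of_birth, sample))
```

Because the dependent check runs only after the parent has matched, it can assume the value is already date-shaped and concentrate on the narrower rule.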

Choosing a matching method

To choose a matching method, consider the type of values the data class will contain.

When there is a distinct set of values that the data should contain, choose a Valid Values List or Reference Data Set as the matching method. For a small set of valid values (such as yes/no/unknown), the Valid Values based approach is the best method. For a larger set of values that may be managed separately as reference data, the Reference Data matching method is more appropriate.

When there is no distinct set of values to work from, but there is an obvious pattern to the data being analysed, then select Match to criteria in regular expression and create a regex-based data class. For example, use this method when the data has a fixed number of characters in length, characters in a specific sequence are always alphabetic or numeric, or there is a range of predefined values contained within the elements of the data.

All three of the matching methods allow you to specify these common data class properties:

- Percentage match threshold: the percentage of data values in the data that must match for this data class to be assigned automatically. This property is optional and can be adjusted from the default setting if required. If it is not set for the data class, the threshold set at project level is used.

- Column name match criteria: use a regular expression to match to the expected column name in the data asset, in cases where the valid values or data pattern are not distinct enough on their own to prevent false positive matches during metadata enrichment.

- Column data type: specify the expected type of data to be analysed (date, number, text, etc.).

- Maximum & Minimum length of data value: further refine the match by setting the maximum and minimum length of data expected in the column.

- Matching priority: adjust the default value to set how this data class is prioritised relative to other data classes. When data matches multiple data classes, the one with the highest priority is assigned.
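To see how the threshold, length limits and priority might interact, here is a minimal Python sketch. It is a simplified model under stated assumptions, not the metadata enrichment algorithm: `DataClassSketch`, `match_percentage` and `assign_data_class` are hypothetical names, a larger priority number is assumed to mean higher priority, and the handling of empty columns is a guess.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class DataClassSketch:
    """Hypothetical container for the common matching properties."""
    name: str
    matches: Callable[[str], bool]  # value-level matching method
    threshold: float                # percentage match threshold (0-100)
    priority: int                   # assumption: larger number wins
    min_length: int = 0
    max_length: int = 10_000


def match_percentage(dc: DataClassSketch, values: Sequence[str]) -> float:
    """Percentage of values that satisfy the length limits and the matcher."""
    if not values:
        return 0.0
    hits = sum(
        1 for v in values
        if dc.min_length <= len(v) <= dc.max_length and dc.matches(v)
    )
    return 100.0 * hits / len(values)


def assign_data_class(values: Sequence[str],
                      candidates: List[DataClassSketch]) -> Optional[str]:
    """Return the highest-priority data class that reaches its threshold."""
    qualifying = [dc for dc in candidates
                  if match_percentage(dc, values) >= dc.threshold]
    if not qualifying:
        return None
    return max(qualifying, key=lambda dc: dc.priority).name


column = ["yes", "no", "no", "unknown", "n/a"]
indicator = DataClassSketch(
    "Indicator", lambda v: v.lower() in {"yes", "no", "unknown"},
    threshold=75.0, priority=10)
short_text = DataClassSketch(
    "Short Text", lambda v: len(v) > 0, threshold=75.0, priority=1)

print(match_percentage(indicator, column))
print(assign_data_class(column, [short_text, indicator]))
```

In this model, four of the five sample values match the indicator class, so it qualifies at a 75 percent threshold and wins on priority. Raising its threshold to 90 would leave only the lower-priority class as a candidate.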

Details on matching methods

The following sections describe the specifics of using the various matching methods.

1. Regular Expression (regex) based matching method

When there is an obvious pattern to the data being analysed, but the full set of potential values is not available, a regex-based data class might be the best option. For example, use this method for an identifier assigned to an individual which includes elements such as birth date, coded location, gender, and checksum. For these data classes, use a regular expression to define the valid pattern of characters expected, including data length, value ranges or a specific set of characters in sequence, case sensitivity, etc.

Before you draft the regular expression, determine how the identifier is constructed. For example, use a web search to find the format of a specific identifier, and cross-check different sources, if possible. In the case of government or regulatory body issued IDs, the issuing body’s own website will usually provide detailed information about the format and pattern which can be used to develop the expression.

For example, this government-issued national identifier for individuals consists of:

- 6-digit date of birth in the format YYMMDD: sample regex [0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])

- an optional hyphen: sample regex [-]?

- 2 alpha character region code (upper case): sample regex [A-Z]{2}

- 4-digit unique identification number: sample regex [0-9]{4}

The regular expression can be written as: ^[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])[-]?[A-Z]{2}[0-9]{4}$

Examples: 930625-AB0123, 891231XY9999
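Before entering the expression in the data class, it can help to try it out locally. The following Python sketch assembles the expression from the components listed above and checks a few values. Python's `re` module is used purely for illustration, and the regular expression dialect used by IBM Knowledge Catalog may differ in some details. The constant names and the `is_national_id` helper are hypothetical.

```python
import re

# Components of the sample identifier, as listed above.
DATE_OF_BIRTH = r"[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])"  # YYMMDD
HYPHEN = r"[-]?"              # optional hyphen
REGION_CODE = r"[A-Z]{2}"     # two upper-case letters
UNIQUE_NUMBER = r"[0-9]{4}"   # four digits

NATIONAL_ID = re.compile(
    "^" + DATE_OF_BIRTH + HYPHEN + REGION_CODE + UNIQUE_NUMBER + "$"
)


def is_national_id(value: str) -> bool:
    """Return True if the whole value matches the sample identifier pattern."""
    return NATIONAL_ID.fullmatch(value) is not None


samples = [
    "930625-AB0123",  # valid, with hyphen
    "930625AB0123",   # valid, without hyphen
    "931325-AB0123",  # month 13 is rejected
    "930625-ab0123",  # lower-case region code is rejected
    "930625-AB123",   # only three digits in the unique number
]
for sample in samples:
    print(sample, is_national_id(sample))
```

Note that the date portion only constrains the range of each part: a value such as 930231-AB0123 (31 February) still matches. A pattern like this narrows the candidates but does not fully validate the date.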

For information on writing and testing regular expressions, check out these sites:

https://en.wikipedia.org/wiki/Regular_expression

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html

https://regex101.com/

Use Test value to match criteria of column value to make sure the regex works as expected. In the following example an error indicates the expression is not properly constructed for the data.

Column name match: For less complex data patterns, use a column name match to help prevent false positive matches during metadata enrichment. For example, for the identity card number, it can be useful to restrict the matches to data where the column name in the data asset matches a specified regex, such as:

(?i)^id(entity)?([ \-_])?card([ \-_])?(num(ber)?|no)$

Test column name criteria: test column name match regex by entering sample column names (see screenshot below).
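The same kind of local check works for the column name expression. This Python sketch runs the regex above against some made-up column names. The names are hypothetical samples, and Python's `re` module is again only a stand-in for whichever regex engine the product uses.

```python
import re

# Column name expression from the text; (?i) makes it case-insensitive.
COLUMN_NAME = re.compile(r"(?i)^id(entity)?([ \-_])?card([ \-_])?(num(ber)?|no)$")


def column_name_matches(name: str) -> bool:
    """Return True if the whole column name matches the expression."""
    return COLUMN_NAME.fullmatch(name) is not None


candidates = [
    "ID_Card_Number",
    "IdentityCardNo",
    "id card num",
    "identity-card-no",
    "card_id",                  # wrong order of elements
    "customer_id_card_number",  # extra prefix, rejected by the ^ anchor
    "ID__CARD_NO",              # two separators in a row
]
for name in candidates:
    print(name, column_name_matches(name))
```

The first four names match and the last three do not. If columns with prefixes such as customer_ are expected in your data assets, the expression would need to be loosened accordingly.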

Column data type and Maximum & Minimum length of data value: set these properties to make the match more effective.

After completing the required and optional matching details, your data class looks like this:

2. Valid Values based matching method

For a small list of valid values (such as yes/no/unknown), the Match to list of valid values approach works well.

Define Data matching:

· Enter the first value under List of valid values

· Click Add valid value to add more values

· Set Text matching criteria options depending on whether data matching should be case sensitive, require exact spacing, etc.
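The effect of the text matching criteria can be illustrated with a short Python sketch. The option names `case_sensitive` and `exact_spacing` are hypothetical labels for the kinds of settings mentioned above, and the normalisation rules shown (case folding, trimming and collapsing whitespace) are assumptions rather than the product's documented behaviour.

```python
from typing import Iterable


def normalize(value: str, case_sensitive: bool, exact_spacing: bool) -> str:
    """Apply the assumed text matching options to a single value."""
    if not exact_spacing:
        # Trim the ends and collapse runs of whitespace to one space.
        value = " ".join(value.split())
    if not case_sensitive:
        value = value.casefold()
    return value


def in_valid_values(value: str,
                    valid_values: Iterable[str],
                    case_sensitive: bool = False,
                    exact_spacing: bool = False) -> bool:
    """Return True if the value is in the list under the chosen options."""
    allowed = {normalize(v, case_sensitive, exact_spacing) for v in valid_values}
    return normalize(value, case_sensitive, exact_spacing) in allowed


VALID = ["yes", "no", "unknown"]

print(in_valid_values(" Yes ", VALID))                       # lenient matching
print(in_valid_values("Yes", VALID, case_sensitive=True))    # case must agree
print(in_valid_values(" yes", VALID, exact_spacing=True))    # spacing must agree
```

Looser settings catch more variants of the same value in real data, at the cost of a slightly higher chance of false positives on short lists.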

3. Reference Data based matching method

For a large set of values that might be managed separately as reference data, use the Reference Data matching method. For example, use this matching method for a prescribed list of codes from a messaging standard or specific application, or a universally agreed set of codes managed by an international standards authority.

This type of data class is dependent on the existence of a Reference Data Set (RDS), with the valid Reference Data Values (RDVs) defined.

Many of the properties to define for the RDS-based data class are similar to those for a Valid Values based data class, with these key differences:

· Set Matching method to Match to reference data

· Select the RDS from the list (use the search bar to locate it)

After you save the details of Data matching, the first 5 RDV codes, their corresponding values and a hyperlink to the RDS are displayed.
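Conceptually, reference data matching checks column values against the codes held in the RDS. The Python sketch below models that idea with a tiny CSV extract. The CSV layout, the `load_reference_codes` and `reference_match_percentage` helpers, and the five country codes are illustrative assumptions. They do not represent an export format or API of IBM Knowledge Catalog.

```python
import csv
import io
from typing import Dict, Sequence

# Hypothetical extract of a reference data set: one code and value per row.
RDS_CSV = """code,value
DE,Germany
FR,France
IE,Ireland
JP,Japan
US,United States
"""


def load_reference_codes(csv_text: str) -> Dict[str, str]:
    """Read code/value pairs from CSV text into a lookup dictionary."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["code"]: row["value"] for row in reader}


def reference_match_percentage(values: Sequence[str],
                               codes: Dict[str, str]) -> float:
    """Percentage of column values that appear as codes in the reference data."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if v in codes)
    return 100.0 * hits / len(values)


codes = load_reference_codes(RDS_CSV)
column = ["IE", "US", "XX", "DE"]
print(reference_match_percentage(column, codes))
```

Because the codes live in the reference data set rather than in the data class itself, updating the RDS changes what the data class matches without editing the data class definition.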

After you publish the draft of the data class, you can use the Relationship Explorer to review its relationships to the other governance artifacts, such as Business Terms, Classifications, Reference Data Sets and Categories.

Summary

Starting with a set of pre-defined data classes provided with the IBM Knowledge Accelerators, define new data classes to meet the data governance needs of your organization. Follow the guidelines and templates in this post to define new data classes that correspond to the data formats and patterns for your use case.

For more information, check out IBM Knowledge Accelerators and IBM Knowledge Catalog.

Tags: Data Governance, Data Classes

Mary O'Neill
Business Glossary & Data Governance developer at IBM.