Reference Data Management in Watson Knowledge Catalog — Chapter 2

Published in

IBM Data Science in Practice

5 min read · Mar 1, 2021

Chapter 2: Hierarchical Reference data sets and cross walks

a person walking across a bridge in fog — Photo by Matthew Henry on burst.shopify

In the previous chapter, we learnt how a reference data repository that everyone can access is a key tool for achieving standardization across the organization. We learnt how to use Watson Knowledge Catalog’s reference data management capability to store the Code, Value and Description of a reference data value. In this chapter we will explore this capability further and learn about hierarchies in reference data sets. We will also learn about relationships between values of different reference data sets, known as value mappings or cross walks.

Hierarchical Reference Data:

As we learnt in the previous chapter, the values of a reference data set form a list. These values can be further organized into a hierarchy to give them a clearer structure. Consider, for instance, the NAICS codes, in which each industry has a general code that narrows down into more specific types within the industry. The image below shows a sample list of codes and values representing the NAICS industry classification. In the image, code 11 represents the Agriculture, Forestry, Fishing and Hunting industry. Code 1111, shown under 11, represents a specific sub-industry, in this case Oilseed and Grain Farming. Likewise, 111110 represents a sub-industry within 1111, in this case Soybean Farming.

All of these values are valid NAICS codes. Each code follows a structure in which a longer code refines a shorter one, which makes NAICS codes hierarchical in nature.
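To make the prefix structure concrete, here is a minimal Python sketch that derives each code's parent from the three sample codes above. The helper name `parent_code` is hypothetical and is not part of Watson Knowledge Catalog. It simply assumes that a parent is the longest shorter prefix present in the same list, which holds for these samples but not for every NAICS sector (some sectors, such as Manufacturing, span a range of two-digit codes).

```python
def parent_code(code, all_codes):
    """Return the longest proper prefix of `code` found in `all_codes`, or None."""
    for length in range(len(code) - 1, 0, -1):
        prefix = code[:length]
        if prefix in all_codes:
            return prefix
    return None


# Sample codes and values taken from the NAICS example in the text.
naics = {
    "11": "Agriculture, Forestry, Fishing and Hunting",
    "1111": "Oilseed and Grain Farming",
    "111110": "Soybean Farming",
}

# Map every code to its parent code (None marks a top-level value).
parents = {code: parent_code(code, naics) for code in naics}

for code, parent in parents.items():
    print(code, "->", parent)
```

A mapping like `parents` is the kind of information a parent column carries when the values are loaded from a file.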

Let’s import these values into Watson Knowledge Catalog and see how they look.

1. Create a reference dataset named NAICS

2. Import the NAICS values into the set from a CSV file

screenshot of importing NAICS codes via CSV into Watson Knowledge Catalog — Import values from CSV file

After selecting the file to upload, specify the code, value, and description columns. One important thing to note here is that we must also specify a parent column. The parent column identifies, for each value, which other value within the set is its parent, and this is what arranges the values into a hierarchy. The image below shows the selection of the parent column while importing the CSV file.

screenshot of designating columns to which roles they play in the dataset — Choose columns during import
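As an illustration of what such a file could contain, the following Python sketch builds a small CSV with a parent column for the three sample NAICS values. The header names (`code`, `value`, `description`, `parent`) are assumptions made for this example, since the import dialog lets you choose which column plays which role, and the descriptions simply state the NAICS level of each code.

```python
import csv
import io

# (code, value, description, parent); an empty parent marks a top-level value.
ROWS = [
    ("11", "Agriculture, Forestry, Fishing and Hunting", "2-digit sector", ""),
    ("1111", "Oilseed and Grain Farming", "4-digit industry group", "11"),
    ("111110", "Soybean Farming", "6-digit national industry", "1111"),
]


def build_csv(rows):
    """Serialize the rows to CSV text with a header line (column names are assumed)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["code", "value", "description", "parent"])
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = build_csv(ROWS)
print(csv_text)
```

The `csv` module quotes the first value automatically because it contains commas, so the file stays parseable.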

After the import, the reference data set looks as it does in the image below. The values are displayed hierarchically in the left panel of the screen, with the hierarchy expanding to the right as you open each level.

screenshot of hierarchical values within a reference data set — Hierarchical values in Reference Data Set on WKC Platform

As you can see, the values are arranged hierarchically, which makes it easy to navigate through the industry information they represent.
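The way a parent column turns a flat list into a tree can be sketched in a few lines of Python. This is only an illustration of the idea, not how Watson Knowledge Catalog renders its panel, and the row layout and function names are assumptions made for this example.

```python
from collections import defaultdict

# Flat rows as they might come out of an imported file (layout is assumed).
rows = [
    {"code": "11", "value": "Agriculture, Forestry, Fishing and Hunting", "parent": ""},
    {"code": "1111", "value": "Oilseed and Grain Farming", "parent": "11"},
    {"code": "111110", "value": "Soybean Farming", "parent": "1111"},
]


def render_tree(rows):
    """Return indented lines, one per value, with children nested under parents."""
    labels = {row["code"]: row["value"] for row in rows}
    children = defaultdict(list)
    for row in rows:
        # An empty parent cell means the value sits at the top of the hierarchy.
        children[row["parent"] or None].append(row["code"])

    lines = []

    def walk(code, depth):
        lines.append("  " * depth + f"{code} {labels[code]}")
        for child in sorted(children.get(code, [])):
            walk(child, depth + 1)

    for root in sorted(children.get(None, [])):
        walk(root, 0)
    return lines


for line in render_tree(rows):
    print(line)
```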

Just as values can be structured hierarchically, the reference data sets themselves can be arranged in a hierarchy. To build a hierarchy of reference data sets, open the Set-Level Hierarchy tab and start adding parent and child reference data sets.

The image below shows the Countries reference data set having two child reference data sets, namely Indian states and Currency codes.

screenshot of a reference data set parent and children — Set Level Hierarchy

Reference Data Set Cross Walks:

While reference data sets can be represented hierarchically, another interesting and useful representation is to associate values in one reference data set with values in another set. This type of association is called a value mapping, also known as a cross walk.

These mappings are typically used to represent relationships between reference data values.

For example, we can map the value INDIA from the countries data set to each of the states in the Indian states reference data set. Likewise, we can map INDIA to the currency code INR from the currencies reference data set. In this example, the country INDIA and the Indian states form the nodes of a relationship graph, and the relations between them form its edges.
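The relationship graph described here can be sketched in Python as follows. Each node is a pair of reference data set name and code, and each edge is one value mapping. The state codes (IN-KA for Karnataka, IN-KL for Kerala) are ISO 3166-2 codes chosen for illustration, and storing every edge in both directions is a design choice of this sketch rather than a statement about how Watson Knowledge Catalog stores related values.

```python
from collections import defaultdict

# Each edge links a (reference data set, code) node to another such node.
EDGES = [
    (("Countries", "IND"), ("Indian states", "IN-KA")),
    (("Countries", "IND"), ("Indian states", "IN-KL")),
    (("Countries", "IND"), ("Currency codes", "INR")),
]


def build_graph(edges):
    """Build an adjacency map; edges are stored both ways so lookups work from either end."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def related(graph, node, target_set):
    """Return the sorted codes in `target_set` that are mapped to `node`."""
    return sorted(code for set_name, code in graph.get(node, ()) if set_name == target_set)


graph = build_graph(EDGES)
print(related(graph, ("Countries", "IND"), "Indian states"))
print(related(graph, ("Currency codes", "INR"), "Countries"))
```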

Let’s try this out in the Watson Knowledge Catalog.

To associate one value with another, access the Related Values section of the selected value and try adding a relation.

screenshot of adding a relation to a selected value — Adding Related Values

In the dialog that opens, choose the reference data set from which to associate a value.

screenshot of choosing a reference data set — Choosing target Data Set while adding Related Values

On the next page, choose the type of relationship to be used:

One-to-one: Choosing this type of relationship ensures that one value from one data set is associated with only one value from the other data set.
One-to-many: This option maps one value from the chosen reference data set to multiple values from another reference data set, creating a 1:n relationship between the two data sets.

screenshot of choosing a one-to-many data mapping — One-to-many relationship value chooser dialog
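The difference between the two relationship types can be expressed as a simple check over pairs of source and target codes. The Python sketch below uses a hypothetical helper, `is_one_to_one`, which is not a Watson Knowledge Catalog function. It only illustrates the constraint that a one-to-one mapping places on the pairs.

```python
from collections import Counter


def is_one_to_one(pairs):
    """Return True if no source code and no target code appears in more than one pair."""
    unique_pairs = set(pairs)
    per_source = Counter(source for source, _ in unique_pairs)
    per_target = Counter(target for _, target in unique_pairs)
    return all(n == 1 for n in per_source.values()) and all(
        n == 1 for n in per_target.values()
    )


# Country to currency: each country code maps to exactly one currency code.
country_to_currency = [("IND", "INR")]

# Country to states: one country code maps to several state codes.
country_to_states = [("IND", "IN-KA"), ("IND", "IN-KL")]

print(is_one_to_one(country_to_currency))
print(is_one_to_one(country_to_states))
```

The country to states mapping fails the check because IND appears as a source more than once, which is exactly what the one-to-many option is meant to allow.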

The related values show up as in the image below: along with the values and other attributes of the reference data set, the related values appear in the middle panel.

screenshot of reference data sets with related values to value IND — Reference Data Set with Related Values for code IND

You can also import these associations from a CSV file using the Upload Related Values menu. In the dialog box that opens, choose the appropriate columns and the target reference data set to map the respective values.

screenshot of importing related values from a CSV file — Importing Related Values from csv file
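Before uploading such a file, it can help to check that every code in it actually exists in the source and target reference data sets. The Python sketch below shows one way to do that locally. The column names `source_code` and `target_code`, the sample rows, and the helper `find_unknown_codes` are all assumptions made for this example, and the file layout the upload dialog accepts may differ.

```python
import csv
import io

# Sample related-values file; the last row contains a deliberately invalid state code.
CSV_TEXT = """source_code,target_code
IND,IN-KA
IND,IN-KL
IND,IN-XX
"""


def find_unknown_codes(text, source_codes, target_codes):
    """Return (line number, code) pairs for codes missing from the given sets."""
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        if row["source_code"] not in source_codes:
            problems.append((line_no, row["source_code"]))
        if row["target_code"] not in target_codes:
            problems.append((line_no, row["target_code"]))
    return problems


problems = find_unknown_codes(CSV_TEXT, {"IND"}, {"IN-KA", "IN-KL"})
print(problems)
```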

In this chapter, we learnt about hierarchies and cross walks in reference data sets, and walked through how to represent them in Watson Knowledge Catalog’s reference data management system.

In the next chapter, we will look at support for custom columns, which help represent additional information about a value. We will also learn how to access the reference data management system via the API.

Written by Praveen Devarao