Reference Data Management in Watson Knowledge Catalog — Chapter 1

Praveen Devarao
IBM Data Science in Practice
5 min read · Feb 26, 2021

Chapter 1: Introduction to reference data in Watson Knowledge Catalog

list of country names with numbers next to them
Photo by Martin Sanchez on Unsplash

One of the challenges organizations face is standardizing data across departments and sub-systems. Multiple departments use data that refers to the same entity but in different forms. One of the main reasons is the lack of global reference data that everyone can use. A reference data repository that everyone can access is a key tool for achieving standardization: all modifications occur in a central repository, go through the right level of review, and become available to all departments (sub-systems) in the same format at the same time.

In this chapter, let’s look at what reference data is and how to manage it in Watson Knowledge Catalog (WKC). WKC is a data catalog integrated with data governance capabilities. It provides tools and constructs for a self-service data governance model. WKC helps users discover, curate, categorize and share data assets, data sets, analytical models and their relationships with other members of their organization. Managing reference data in this platform opens up further possibilities, such as defining data quality constructs to ensure data sets use the right data, making reference data sets easy to find, and moving them through the same level of review as any other governance artifact.

To clarify, reference data is a collection of values used to categorize or classify other assets within an organization. This collection of values is static in nature, i.e., it does not change frequently. Managing this collection in a global repository helps achieve standardization across the organization, so that all sub-systems follow the same terminology. This in turn leads to a shared understanding of all data across the organization. Familiar examples of reference data sets include ISO country codes, ICD-10 health codes and NAICS codes.
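As an illustration of the idea, and not anything specific to WKC, the short Python sketch below checks values coming from two departments against a small shared reference set. The reference set holds a few ISO 3166-1 alpha-2 country codes; the department records and the function name `find_nonconforming` are hypothetical and exist only for this example.

```python
# A tiny reference set: ISO 3166-1 alpha-2 code -> country name.
REFERENCE_COUNTRIES = {
    "IN": "India",
    "US": "United States of America",
    "DE": "Germany",
}


def find_nonconforming(values, reference):
    """Return the values that are not codes in the reference set, in order."""
    return [v for v in values if v not in reference]


# Hypothetical records from two departments referring to the same countries.
sales_countries = ["IN", "US", "DE"]
support_countries = ["India", "USA", "DE"]

print(find_nonconforming(sales_countries, REFERENCE_COUNTRIES))    # []
print(find_nonconforming(support_countries, REFERENCE_COUNTRIES))  # ['India', 'USA']
```

The second department uses free-form names instead of the agreed codes, which is exactly the kind of drift a shared reference data set is meant to surface.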

To access the reference data management feature in WKC, log in to your IBM Cloud Pak for Data instance and, from the left-hand navigation bar, open Reference Data under the Governance section.

screenshot of IBM Cloud Pak for Data dashboard
IBM Cloud Pak for Data dashboard

On the Reference Data page, you will see a list of all published reference data sets and a list of draft reference data sets defined in the system. To start with, these lists will be empty; you can create a new reference data set with the button `Add Reference Data set` -> `New Reference Data Set`.

screenshots of lists of reference data sets
Reference Data Sets Listing page showing publish and draft artifacts

At a minimum, you must enter a name for the new reference data set and select its primary category.

Extras: A category is like an operating system folder under which you can organize different WKC governance artifacts. Besides organizing artifacts, you can grant a user or group of users permissions on the category, which implicitly apply to all artifacts within it. If you are interested in learning more, please read this post on using categories to manage governance artifacts.

Once the data set is created, you are taken to it and can start adding values. The new data set is in draft state; you can publish it once it is ready for consumption by other users of the system.

screenshot of a reference data set draft copy
Reference Data Set Draft copy

Extras: WKC by default provides workflow support on all governance artifacts. Using workflow, you can work on a draft copy and publish it for consumption by other users.

You can add values individually via the edit menu available on the page, or import the list of values from a CSV file.

Each value within the set has three primary fields: code, value and description. The code identifies the value uniquely, e.g. the unique ISO code of an Indian state in the Indian-states data set. The value gives a quick gist of what the code represents, e.g. the name of the state. The description captures any more detailed information related to the value.
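To make the three fields concrete, here is a minimal Python sketch of how such a value could be modeled outside WKC. The class `ReferenceValue` and the helper `add_value` are hypothetical names for this illustration, not part of any WKC API; the sample rows use ISO 3166-2:IN codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceValue:
    code: str              # unique identifier within the set
    value: str             # short gist of what the code represents
    description: str = ""  # optional, more detailed information


def add_value(values, new_value):
    """Add new_value to the dict keyed by code, rejecting duplicate codes."""
    if new_value.code in values:
        raise ValueError(f"duplicate code: {new_value.code}")
    values[new_value.code] = new_value


indian_states = {}
add_value(indian_states, ReferenceValue("IN-KA", "Karnataka", "State"))
add_value(indian_states, ReferenceValue("IN-DL", "Delhi", "Union territory"))

print(indian_states["IN-KA"].value)  # Karnataka
```

Keying the collection by code mirrors the rule that the code identifies each value uniquely: a second entry with the same code is rejected rather than silently overwritten.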

If importing from a CSV file, select which CSV columns map to the code, value and description fields of the values in the set, and click Save.

screenshot of upload a csv
Uploading a csv for import into Reference Data set

The CSV file should be formatted as in the image below. The first row is a header and will not be imported.

csv with a field titled code that are ISO values for Indian geopolitical bodies, a field titled name that are names of Indian geopolitical bodies and a field titled description that describes if they are a state or union territory.
csv file format to import data from
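For readers who cannot see the image, the following Python sketch builds a small file in that shape and applies the same kind of column mapping locally. It uses only the standard library and does not call WKC; the three sample rows, `COLUMN_MAPPING` and `read_reference_values` are illustrative assumptions, not the contents of the file in the screenshot.

```python
import csv
import io

# Illustrative file contents: a header row followed by the values.
SAMPLE_CSV = """code,name,description
IN-KA,Karnataka,State
IN-MH,Maharashtra,State
IN-DL,Delhi,Union territory
"""

# Which CSV column feeds which field of a reference value.
COLUMN_MAPPING = {"code": "code", "value": "name", "description": "description"}


def read_reference_values(text, mapping):
    """Parse CSV text into dicts with code/value/description keys."""
    # DictReader consumes the first row as the header, so it is not
    # returned as a data row.
    reader = csv.DictReader(io.StringIO(text))
    return [
        {field: row[column] for field, column in mapping.items()}
        for row in reader
    ]


values = read_reference_values(SAMPLE_CSV, COLUMN_MAPPING)
print(len(values))  # 3
print(values[0])    # {'code': 'IN-KA', 'value': 'Karnataka', 'description': 'State'}
```

Here the `name` column is mapped to the value field, which corresponds to the column selection made in the import dialog.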

After importing from the file or adding values individually, the reference data set will look like the image below. The left-hand panel lists the reference data values and has a search bar for finding values within the set. The right panel contains information such as the user who created the data set and the effective dates for the set, and the middle panel shows details of the value selected in the left panel.

Reference Data Set with value imported from a CSV

The image above shows the workflow buttons Delete Draft and Publish. Using these buttons, one can delete the draft set or publish the reference data set so as to make it available for other users of the platform.

You can associate reference data values, or the entire data set, with other governance constructs such as business terms and classifications. Similarly, you can assign stewards, set effective dates and assign tags to make the set easier to find.

In this chapter, we learned what reference data is and how to create, modify and publish it on the Watson Knowledge Catalog platform. You are now ready to create your own reference data set as a step towards standardizing data usage in your organization.

Building on this foundation, the next chapter looks at creating hierarchical reference data sets and crosswalks for better organization and representation of reference values.

Praveen Devarao
IBM Data Science in Practice

CMTS @ Oracle Cloud, previously Software Architect @ IBM India Software Labs