Use custom ML Model for automated Term Assignment on Cloud Pak for Data — Part I

Steven Huang
Nov 20, 2023

AI and ML

Introduction

In this blog series, we walk through how to prepare enriched training data in Watson Studio and how to use the Watson Machine Learning (WML) and Watson Knowledge Catalog (WKC) APIs to create a scikit-learn model and scoring function in Cloud Pak for Data. Metadata enrichment can then be configured to use this custom model, instead of the built-in machine learning model, for automatic term assignment.

This series is divided into three parts:

  1. Part I: Prerequisites and Preparation for training data
  2. Part II: Development and deployment of ML model
  3. Part III: Prediction of terms by ML model

This post covers only Part I, which prepares the environment and the training data. See Part II and Part III for the rest of the series.

What are business terms?

Business terms standardize the definitions of business concepts so that data is described in a uniform manner across the organization. A single business term can annotate columns that have different names but hold the same type of data, as defined by that term.
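As a rough illustration of that last point, the sketch below models term assignments as a plain Python mapping. The asset names, column names, and terms are made-up examples, not values taken from this tutorial or from any Knowledge Accelerator, and `columns_for_term` is a hypothetical helper.

```python
# Hypothetical example data: columns with different names in different
# assets, several of which hold the same kind of data.
column_terms = {
    ("CRM.CUSTOMERS", "EMAIL_ADDR"): "Email Address",
    ("BILLING.ACCOUNTS", "contact_email"): "Email Address",
    ("WEB.SIGNUPS", "e_mail"): "Email Address",
    ("CRM.CUSTOMERS", "DOB"): "Date of Birth",
}


def columns_for_term(assignments, term):
    """Return the qualified names of all columns annotated with `term`."""
    return sorted(
        f"{asset}.{column}"
        for (asset, column), assigned in assignments.items()
        if assigned == term
    )


print(columns_for_term(column_terms, "Email Address"))
# ['BILLING.ACCOUNTS.contact_email', 'CRM.CUSTOMERS.EMAIL_ADDR', 'WEB.SIGNUPS.e_mail']
```

Three differently named columns resolve to one definition, which is what lets a search for the term find all of them.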

Automatic term assignment

You may assign business terms manually by editing the data asset properties in a project or a catalog, or when you work with enrichment results.

Demonstrate how to manually add business terms

Automatic term assignment is the process of automatically mapping business terms to data assets and asset columns. Terms can automatically be assigned to data assets and asset columns as part of metadata enrichment.

There are four term assignment methods:

  1. Linguistic name matching
  2. Data-class-based assignments
  3. Built-in machine learning
  4. Custom ML model (or Custom service)
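To give a feel for the first method, here is a toy sketch of name-based matching built on Python's standard `difflib`. It is not the algorithm Cloud Pak for Data uses; the `normalize` and `suggest_term` helpers, the 0.6 threshold, and the term list are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name and treat underscores and hyphens as spaces."""
    return " ".join(name.lower().replace("_", " ").replace("-", " ").split())


def suggest_term(column_name, terms, threshold=0.6):
    """Return the most similar term name, or None if nothing reaches the threshold."""
    best_term, best_score = None, 0.0
    for term in terms:
        score = SequenceMatcher(None, normalize(column_name), normalize(term)).ratio()
        if score > best_score:
            best_term, best_score = term, score
    return best_term if best_score >= threshold else None


# Hypothetical published terms, for illustration only.
terms = ["Customer Email", "Date of Birth", "Postal Code"]
print(suggest_term("customer_email", terms))  # Customer Email
print(suggest_term("xyz123", terms))          # None
```

Name matching of this kind fails when column names are cryptic, which is one motivation for training a model on reviewed assignments instead.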

Prepare environment and training data

Before building a custom model, you need to complete some prerequisites: a deployment space, published business terms, manually reviewed term assignments, and training data. After you complete the steps below, you are ready to develop a custom model. Use the same user ID for all of these steps that you plan to use later for the notebook that creates the custom model and for the Metadata Enrichment that assigns terms automatically.

Step 1: Create a deployment space

A deployment space is where models and deployments reside. You need one to save and deploy the custom model, which is done in Part II.

  1. From the main navigation menu, select Deployments
  2. Go to the Spaces tab page
  3. Create a deployment space by clicking New deployment space. Follow the on-screen instructions, select the “Development” stage, and click View new space at the last step.

In this tutorial, a space named “Custom model space” was created, as shown below. At this point it contains no assets or deployments.

New deployment space
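If you would rather script this step than use the UI, the following is a minimal sketch. It assumes the `ibm_watson_machine_learning` Python client is installed and that `client` is an already authenticated `APIClient`. The `create_deployment_space` helper is hypothetical, the sketch does not set the deployment stage, and the accepted meta names can differ between client versions, so check them against your installation.

```python
def create_deployment_space(client, name, description=""):
    """Create a deployment space, make it the default, and return its ID.

    `client` is assumed to be an authenticated
    ibm_watson_machine_learning.APIClient (or an object with the same interface).
    """
    meta_props = {
        client.spaces.ConfigurationMetaNames.NAME: name,
        client.spaces.ConfigurationMetaNames.DESCRIPTION: description,
    }
    space_details = client.spaces.store(meta_props=meta_props)
    space_id = client.spaces.get_id(space_details)
    # Later calls that save or deploy models will target this space.
    client.set.default_space(space_id)
    return space_id


# Example usage (needs a reachable Cloud Pak for Data instance):
# from ibm_watson_machine_learning import APIClient
# client = APIClient(wml_credentials)  # wml_credentials: your own connection details
# space_id = create_deployment_space(client, "Custom model space")
```

Passing the client in as a parameter keeps the helper independent of how you authenticate.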

Step 2: Review published business terms

Verify that your user ID has access to published business terms in Watson Knowledge Catalog (WKC). Only published business terms can be assigned.

  1. From the Main navigation menu, select Governance > Business terms.
  2. Verify that terms are listed on the Published tab.

Normally, administrators or data stewards import predefined business terms and other governance artifacts once the Cloud Pak for Data instance is deployed. Ask an administrator or data steward to import IBM Knowledge Accelerators for you, or create the business terms you need yourself.

Step 3: Create a project or catalog

You need a catalog or project that contains data assets with terms assigned. If you already have catalogs or projects whose data assets are assigned the correct business terms, you can skip this step.

This blog uses a project for the demonstration.

From the Main navigation menu, select Projects > All projects. On the page that opens, click New Project, select Create an empty project, and follow the on-screen instructions.

New project

Step 4: Data assets

Add training data assets to the project (see Adding data to a project). The assets can be local files, connected data assets added from connections or Metadata Import, assets imported from catalogs, and so on. Choose data that is as close as possible to the data the model will eventually be tested and used on.

Data assets are added into the project

Step 5: Enriched data assets

Create a Metadata Enrichment asset. It uses the built-in models to pre-assign some business terms.

  1. In the project, click New asset, then select the Metadata Enrichment tool.
  2. On the “Define details” page, enter a name, such as “Pre-assign business term”, then click Next.
  3. On the “Data Scope” page, select all data assets in the project.
  4. On the “Enrichment objective” page, ensure that “Assign terms” is checked, and also check the “Profile data” tile. Keep this page open.
  5. Click “Select categories” and select all categories that contain your business terms, such as “[uncategorized]” and “Industry Accelerators” (if present). Click Next.
  6. Click Next, then click Create.

The newly created Metadata Enrichment asset creates a job under the hood and starts it automatically. When the job completes, a notification appears in the upper-right corner.

Notification of job completion

Also, some business terms are assigned to data assets

Enriched data assets

and/or columns

Enriched columns

Note that “Review status” is unchecked by default after Metadata Enrichment completes.

Step 6: Review business terms and mark them as reviewed

In many cases, Metadata Enrichment leaves business terms unassigned or assigns terms that are incorrect or inappropriate. Review the results to make sure the proper business terms are assigned.

  1. Click the “three dot menu” on the right side > View (asset or) column details, then open Governance in the right panel.
  2. Edit the business terms: add the correct terms and remove the incorrect ones.
  3. After you finish reviewing and editing the terms, click the “three dot menu” > Mark as reviewed.

The “Review status” is now checked for the reviewed items.
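To make the role of the review flag concrete, the sketch below shows one way reviewed assignments could be turned into labeled training pairs. It rests on an assumption, namely that only reviewed assignments are trusted as labels, and the `ColumnRecord` structure, the `reviewed_training_pairs` helper, and the sample rows are hypothetical. Part II covers how the training data is actually retrieved.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnRecord:
    """Hypothetical, simplified view of one enriched column."""
    asset: str
    column: str
    terms: list = field(default_factory=list)
    reviewed: bool = False


def reviewed_training_pairs(records):
    """Return (qualified column name, term) pairs for reviewed columns only."""
    return [
        (f"{r.asset}.{r.column}", term)
        for r in records
        if r.reviewed
        for term in r.terms
    ]


# Made-up sample rows: one reviewed, one not yet reviewed, one reviewed
# but left without any term.
records = [
    ColumnRecord("CUSTOMERS", "EMAIL_ADDR", ["Email Address"], reviewed=True),
    ColumnRecord("CUSTOMERS", "DOB", ["Date of Birth"], reviewed=False),
    ColumnRecord("ORDERS", "ZIP", [], reviewed=True),
]
print(reviewed_training_pairs(records))
# [('CUSTOMERS.EMAIL_ADDR', 'Email Address')]
```

Under this assumption, unreviewed suggestions from the built-in models never become labels, so mistakes in pre-assignment are not learned by the custom model.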

Conclusion and next steps

In this part, we:

  1. Introduced business terms and automated term assignment.
  2. Created a deployment space.
  3. Created a project and added data assets.
  4. Ran Metadata Enrichment to enrich the data assets.
  5. Reviewed the assets and marked them as reviewed.

You are now ready to develop an ML model with the enriched data assets as training data. Move on to Part II for the next step.

Find more information at https://www.ibm.com/products/cloud-pak-for-data
