Use custom ML Model for automated Term Assignment on Cloud Pak for Data — Part II

Steven Huang
5 min read · Nov 20, 2023


AI and ML

This blog shows how to use the Watson Machine Learning and Watson Knowledge Catalog APIs to create a scikit-learn model and a scoring function that support automatic term assignment.

In Part I, the deployment space, published business terms, and enriched data assets were prepared; they will be reused in this part.

The tasks in this blog are split across the following four steps:

  1. Create a dedicated (empty) project
  2. Create a Jupyter notebook from a URL
  3. Run the sample notebook, providing the necessary input
  4. Check that the custom model and scoring function are deployed successfully

Let’s dive into the detailed steps.

Step 1: Create a dedicated (empty) project

The project created in Part I can be reused. However, it’s recommended to create a dedicated (empty) project for the custom model.

From the main navigation menu, select Projects > All projects. Click New project on that page, select Create an empty project, and follow the instructions on the screen.

New project

Step 2: Create a Jupyter notebook from a URL

In the new project, click New asset, then click the “Jupyter notebook editor” tile.

Select URL in the left panel of the New notebook page, give the notebook a meaningful name, then copy the URL string below and paste it into the input box under Notebook URL:

https://github.com/IBM/wkc-term-assignment-samples/blob/main/cpd4.6/notebooks/term_prediction_model.ipynb

After entering all the information, the page looks like this:

Create a new notebook

Click Create, and the sample notebook opens in the Jupyter notebook editor.

Alternatively, you can download the sample notebook file from the repository and create a new notebook through the Local file option.

Step 3: Run the sample notebook

Once the kernel has started, you can select Cell > Run All from the menu to run all cells in the notebook. However, it’s strongly recommended to carefully read the comments, run the cells, and review their output step by step rather than running them all at once.

This is a quick overview of the steps in this notebook:

  1. Define settings and parameters.
  2. Create a custom library with logic for feature preparation and scoring.
  3. Extract metadata from a Watson Knowledge Catalog project or catalog for training.
  4. Train (and test) a model based on a scikit-learn pipeline involving the custom preprocessing library, a vectorizer, and a classifier (a minimal sketch of such a pipeline follows this list).
  5. Deploy the custom library and model to Watson Machine Learning.
  6. Create and deploy a custom scoring function supporting multiple assignments per data asset.
  7. Display the settings to enable the metadata enrichment of a project to assign terms based on the deployed ML artifacts.
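
To make step 4 concrete, here is a minimal, self-contained sketch of such a pipeline. The training rows, term IDs, and choice of SGDClassifier are illustrative assumptions, not the notebook’s actual code; the real feature preparation lives in the custom library created in step 2.

# Minimal sketch of a term-assignment pipeline (step 4). All names and
# data below are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical training data: one text sample per column, built from the
# selected asset metadata (table name + column name); labels are term IDs.
X_train = ["CUSTOMERS CLIENT", "CUSTOMERS ADDRESS", "ORDERS ADDRESS"]
y_train = ["term_client", "term_address", "term_address"]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 2), max_features=50000)),
    ("classifier", SGDClassifier(loss="log_loss")),  # log loss enables predict_proba
])
pipeline.fit(X_train, y_train)
print(pipeline.predict_proba(["ORDERS CLIENT"]))  # per-term confidence values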

Some parameters are required and may be changed to make the model more suitable for your case:

  • URL of your Cloud Pak for Data cluster
  • ID of your Watson Machine Learning deployment space
  • “p” for project, because the training data is stored in projects in this blog series
  • ID of the project to be used for training
  • ML parameters: these can be adjusted as needed. They determine how the training data is retrieved, how the scoring function returns its results, which features are selected for the model and scoring function, and the basic settings for the scikit-learn CountVectorizer (a sketch of how the scoring settings are applied follows this list):
parameters = {
    "training": {
        "metadata_scope": "metadata_of_assigned_terms",
        "reviewed_only": True
    },
    "scoring": {
        "max_num_assignments": 2,
        "assignment_threshold": 0.4
    },
    "feature_selection": {
        "term_metadata": ["category", "term_name", "term_description"],
        "asset_metadata": ["table_name", "column_name"]
    },
    "feature_mapping": {
        "ngram_range": (1, 2),
        "min_df": 0,
        "max_df": 1.0,
        "max_features": 50000
    }
}
  • Prepare test data with target tables and their columns. To ensure the test provides useful results, the table names (TAB1, TAB2) and column names (CLIENT, ADDRESS) might need to be changed to values that are compatible with the training data.
test_data = [["TAB1", "CLIENT"], ["TAB2", "ADDRESS"]]
  • Modify the parameters below based on your Cloud Pak for Data version:
# Software specification matching your Cloud Pak for Data runtime; older
# versions use runtime-22.1 with Python 3.9:
# base_software_specification_id = wml_client.software_specifications.get_id_by_name("runtime-22.1-py3.9")
base_software_specification_id = wml_client.software_specifications.get_id_by_name("runtime-22.2-py3.10")
# Matching scikit-learn model type in the model metadata properties:
# wml_client.repository.ModelMetaNames.TYPE: "scikit-learn_1.0"
wml_client.repository.ModelMetaNames.TYPE: "scikit-learn_1.1"
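
To illustrate how the scoring settings take effect, here is a hypothetical helper (a stand-in, not the notebook’s actual scoring function) that turns a classifier’s per-term probabilities into term assignments:

# Illustrative only: applying "max_num_assignments" and
# "assignment_threshold" to a classifier's probability output.
import numpy as np

def select_assignments(probabilities, term_ids, scoring_params):
    # Rank terms by confidence, keep at most max_num_assignments,
    # and drop anything below the assignment threshold.
    top = np.argsort(probabilities)[::-1][: scoring_params["max_num_assignments"]]
    return [
        {"term_id": term_ids[i], "confidence": float(probabilities[i])}
        for i in top
        if probabilities[i] >= scoring_params["assignment_threshold"]
    ]

# With threshold 0.4 and at most 2 assignments, only "term_client" qualifies:
print(select_assignments(np.array([0.55, 0.30, 0.15]),
                         ["term_client", "term_address", "term_other"],
                         {"max_num_assignments": 2, "assignment_threshold": 0.4}))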

Continue to run the remaining cells. When all steps of the notebook have completed, open the deployment space created in Part I. You should see two deployments: one for the model, the other for the scoring function.

The output of the second-to-last cell contains the essential information for enabling a Metadata Enrichment asset to assign terms based on the deployed scoring function:

Deployment space: Custom model space
Deployment: demo_tp_scoring_deployment
Input transformation code: {"input_data":[{"values":$append([ [$$.metadata.name, ""] ], $$.entity.data_asset.columns.[[$$.metadata.name, name]])}]}
Output transformation code: {"term_assignments": predictions[0].values ~> $map(function($x){function($z){$count($z) > 1? $z : [$z]}($x[0] ~> $zip($x[1]) ~> $map(function($y){{"term_id": $y[0], "confidence": $y[1]}})) })}
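
As a quick check that the scoring deployment works end to end, you can invoke it with the WML Python client. This is a hedged sketch: it assumes wml_client is the authenticated client from the notebook and that deployment_id holds the ID printed for demo_tp_scoring_deployment; the payload mirrors the test_data shape used earlier.

# Smoke test of the deployed scoring function (assumed names: wml_client,
# deployment_id). Each row is a [table_name, column_name] pair.
payload = {"input_data": [{"values": [["TAB1", "CLIENT"], ["TAB2", "ADDRESS"]]}]}
result = wml_client.deployments.score(deployment_id, payload)
print(result)  # predictions with term IDs and confidence values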

Things to consider

This sample notebook will not create a model that is equivalent to the existing ML-based term assignment method. It is meant as guidance if you want to create your own custom term assignment based on Watson Machine Learning. A large portion of the code provides templates for the actual logic to be used. The actual logic depends on the metadata to be processed.

The classifier and scoring function used by this notebook are simple, straightforward implementations of some scikit-learn capabilities. A full replacement of the built-in ML-based term assignment requires a more thorough approach.

Conclusion and next steps

In this part, we:

  1. Created a Jupyter notebook from a URL.
  2. Obtained two deployments by running the notebook with the prepared environment and enriched training data.

Now you are ready to configure Metadata Enrichment to assign terms with the custom model. Move on to Part III for the next step.
