Categorize content with minimal training in Watson


IBM Watson Natural Language Understanding offers out-of-the-box categorization with a taxonomy of over a thousand categories. For users looking to categorize content beyond those thousand categories, our team is excited to announce the experimental release of Custom Categories in Watson Knowledge Studio, Natural Language Understanding, and Discovery.

Model customization requires little to no training data. The only information you need is category labels and key phrases uniquely identifying each category. At the moment, the underlying algorithm for training the model uses Wikipedia.

Customer Pain Point

Let’s take a use case of categorizing a list of products like the one below.

Image for post

To address this use case, you could build a database of all products and the corresponding categories they map to. Then use the database to categorize the above list of products. But, every time a new product is released by a manufacturer, you will have to update your database accordingly.
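A minimal Python sketch of that lookup-table approach shows where it breaks down. The table contents and the function name `categorize` are assumptions made for illustration; the product names are examples used later in this post.

```python
# Hypothetical product-to-category table; every new product needs a new row.
PRODUCT_CATEGORIES = {
    "Sony - Alpha a6500": "/Cameras/Mirrorless Cameras",
    "Mortal Kombat 11": "/Video Games",
}


def categorize(product):
    """Return the stored category, or None when the product is unknown."""
    return PRODUCT_CATEGORIES.get(product)


for product in ["Mortal Kombat 11", "Google Pixel 3"]:
    category = categorize(product)
    print(product, "->", category if category else "unknown: database needs an update")
```

Any product missing from the table falls through to the "unknown" branch, which is exactly the maintenance burden described above.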

You can expand this problem of categorizing products to categorizing books, food types, job titles, customer reviews and so on.

Current methods for custom categorization make what should be a simple task quite painful.

Problem Solved

Let’s dive into the steps for building a custom categories model that solves this categorization problem in a matter of minutes.

Open a text editor, paste the following comma-separated values, and save the file in CSV format.

/Laptops, laptops
/Tablets, tablets
/Cell Phones, mobile phones
/Video Games, video games
/Appliances, home appliance, refrigerator, laundry machine
/Cameras, cameras
/Cameras/DSLR Cameras, DSLR cameras
/Cameras/DSLR Cameras/Lenses, DSLR lens
/Cameras/Mirrorless Cameras, mirrorless interchangeable-lens camera

In each row, the text preceding the first comma is the category label, and the remaining comma-separated values are key phrases identifying that category. You can create hierarchical categories up to five levels deep. The example shows a three-level hierarchy: /Cameras/DSLR Cameras/Lenses. You can add more fine-tuned key phrases for each category, but this is a good start for our example use case.
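Before uploading, it can help to sanity-check the file locally. The sketch below parses rows in the format just described and checks the depth limit. The function name `parse_categories`, the rule that labels start with "/", and the error messages are assumptions for illustration, not documented Knowledge Studio validation rules.

```python
import csv
import io

MAX_DEPTH = 5  # the post states hierarchies can be up to five levels deep

SAMPLE = """\
/Cameras, cameras
/Cameras/DSLR Cameras, DSLR cameras
/Cameras/DSLR Cameras/Lenses, DSLR lens
/Appliances, home appliance, refrigerator, laundry machine
"""


def parse_categories(text):
    """Map each category label to its list of key phrases."""
    categories = {}
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        label = row[0].strip()
        phrases = [p.strip() for p in row[1:] if p.strip()]
        depth = len(label.strip("/").split("/"))
        # Assumed checks: leading slash, depth limit, at least one key phrase.
        if not label.startswith("/") or depth > MAX_DEPTH:
            raise ValueError(f"bad category label: {label!r}")
        if not phrases:
            raise ValueError(f"no key phrases for {label!r}")
        categories[label] = phrases
    return categories


for label, phrases in parse_categories(SAMPLE).items():
    depth = len(label.strip("/").split("/"))
    print(f"depth {depth}: {label} <- {phrases}")
```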

To train the model, log in to Watson Knowledge Studio (WKS). If you don’t have a WKS instance or an IBM Cloud account, create one for free here.

Once you are logged in to Knowledge Studio, click the Create Workspace button at the top right and create a categories workspace. Workspaces allow you to store training data and models for specific features in one place.

Image for post

Enter a workspace name and click Create.

Image for post

Once you are in the workspace, all you have to do to train the custom model is upload the CSV file we created earlier. Upload the file and click Train model.

Image for post

The model should take a few seconds to train. Once it’s done, you can evaluate the accuracy by entering products in the text area on the left and checking the prediction results on the right.

If you try the text Sony - Alpha a6500 and Mortal Kombat 11, you should get accurate predictions of /Cameras/Mirrorless Cameras and /Video Games. But if you try Google Pixel 3, you will get a prediction of /Cameras. That is obviously not correct (even though the Pixel 3 has a great camera :-)).
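These manual spot checks can be tallied in a few lines. This is a sketch only: the predicted labels are the ones reported in this post, typed in by hand rather than fetched from Knowledge Studio, and the names `CHECKS` and `mismatches` are illustrative.

```python
# (product text, expected label, label the first model predicted per this post)
CHECKS = [
    ("Sony - Alpha a6500", "/Cameras/Mirrorless Cameras", "/Cameras/Mirrorless Cameras"),
    ("Mortal Kombat 11", "/Video Games", "/Video Games"),
    ("Google Pixel 3", "/Cell Phones", "/Cameras"),
]


def mismatches(checks):
    """Return the checks whose predicted label differs from the expected one."""
    return [c for c in checks if c[1] != c[2]]


wrong = mismatches(CHECKS)
print(f"{len(CHECKS) - len(wrong)} of {len(CHECKS)} correct")
for text, expected, predicted in wrong:
    print(f"fix key phrases: {text!r} expected {expected}, got {predicted}")
```

Keeping a small list like this around makes it easy to recheck the same products after every retraining.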

To fix this issue, we are going to enhance the definition of /Cell Phones by adding Android smart phones and Apple smart phones as key phrases. The revised CSV file, which also spells out the DSLR Cameras key phrase in full, looks like this:

/Laptops, laptops
/Tablets, tablets
/Cell Phones, mobile phones, Android smart phones, Apple smart phones
/Video Games, video games
/Appliances, home appliance, refrigerator, laundry machine
/Cameras, cameras
/Cameras/DSLR Cameras, Digital single-lens reflex camera
/Cameras/DSLR Cameras/Lenses, DSLR lens
/Cameras/Mirrorless Cameras, mirrorless interchangeable-lens camera
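To see exactly what changed between the two files, a small comparison like the following can help. It is a sketch using only the rows relevant to the change; the helper name `key_phrases` is an assumption, and the simple comma split assumes key phrases contain no commas themselves.

```python
def key_phrases(csv_text):
    """Map each category label to the set of its key phrases."""
    result = {}
    for line in csv_text.strip().splitlines():
        label, *phrases = [field.strip() for field in line.split(",")]
        result[label] = set(phrases)
    return result


# Excerpts from the original and revised files shown in this post.
ORIGINAL = """
/Cell Phones, mobile phones
/Cameras, cameras
/Cameras/DSLR Cameras, DSLR cameras
"""

REVISED = """
/Cell Phones, mobile phones, Android smart phones, Apple smart phones
/Cameras, cameras
/Cameras/DSLR Cameras, Digital single-lens reflex camera
"""

before, after = key_phrases(ORIGINAL), key_phrases(REVISED)
for label in sorted(after):
    added = after[label] - before.get(label, set())
    removed = before.get(label, set()) - after[label]
    if added or removed:
        print(label, "added:", sorted(added), "removed:", sorted(removed))
```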

Click the Retrain model button and upload the revised CSV file to update the model. Once the model training is complete, you should see the correct prediction of /Cell Phones for Google Pixel 3.

Here’s the list of products along with the predicted category types.

Image for post

This is pretty impressive considering we spent very little time creating training data for the model. All we had to do was define the category labels using key phrases.

Now that we are happy with the model, we can use it programmatically through the NLU API. To do that, we need to link the model to an instance of Natural Language Understanding. You can also link the model with Watson Discovery, but we will focus on NLU in this post. Click the Deploy model button to select an NLU instance. If you don’t have an instance, create one for free here.

Image for post

After clicking Deploy, note down the model identifier that is presented to you. You will use this ID in API calls to NLU.

Once the model is linked to your NLU instance, you can make API requests to the custom model by passing the model identifier as the model parameter under the categories feature. Here’s an example (the request is on the left, the NLU response on the right):

Image for post
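For readers who prefer text to a screenshot, here is a rough Python sketch of such a request using only the standard library. The request body follows the description above (the model ID goes under the categories feature). The endpoint path `/v1/analyze`, the `version` query parameter, and the `apikey` basic-auth scheme are assumptions based on the public NLU API and should be checked against the API reference. All values in capitals are placeholders, and the request is only prepared here, not sent.

```python
import base64
import json
import urllib.request


def build_payload(text, model_id):
    """Body of an analyze request that routes categories to a custom model."""
    return {
        "text": text,
        "features": {
            "categories": {"model": model_id},
        },
    }


def build_request(service_url, api_key, version, payload):
    """Prepare (but do not send) the HTTP request; URL shape and auth are assumed."""
    token = base64.b64encode(f"apikey:{api_key}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{service_url}/v1/analyze?version={version}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


payload = build_payload("Google Pixel 3", "YOUR_MODEL_ID")
print(json.dumps(payload, indent=2))

# Placeholders only; urllib.request.urlopen(request) would actually send it.
request = build_request("https://YOUR_NLU_URL", "YOUR_API_KEY", "YYYY-MM-DD", payload)
print(request.get_method(), request.full_url)
```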

This experimental release of Custom Categories supports English only and is free to use. It works with NLU or Discovery instances deployed in the US-South region. Go ahead and try out this capability if you have a categorization use case. I would appreciate any comments below on what worked and what didn’t quite work for you.

Discover More

Thanks for reading! Here are some additional resources to get started.

Watson Knowledge Studio | Watson Natural Language Understanding | IBM Watson API Reference

Written by

Product @ Splunk. Previously IBM Watson.
