Train Better with NLU Classifications

Ashwin Goyal
IBM Data Science in Practice
3 min read · Sep 13, 2022

Co-authored with Daniel Firebanks

IBM Watson Natural Language Understanding (NLU) recently released Single Label Classification as a feature for our customers. The service now treats classifiers trained on single-labeled datasets as candidates for score normalization. Customers can train either a single-label or a multi-label classifier from the same single-labeled dataset by including a supported map of training parameters with the request.
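As an illustration, a model-creation request body with such a training-parameters map might look like the sketch below. The field names (`name`, `language`, `training_parameters`, `model_type`) are based on our reading of the NLU Classifications API; check the current API reference for the exact schema.

```json
{
  "name": "my-single-label-classifier",
  "language": "en",
  "training_parameters": {
    "model_type": "single_label"
  }
}
```

Passing `"model_type": "multi_label"` instead would train a multi-label classifier from the same dataset.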

Photo by William Felker on Unsplash

In addition, NLU supports training custom classifiers with multi-labeled datasets. For predictions against such models, scores do not sum to 1, since an input can belong to more than one class; each class's score reflects how well that class aligns with the input text. A single-label classifier, on the other hand, assumes the text belongs to exactly one class and normalizes the confidences so that they sum to 1.
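A minimal sketch of the two score behaviors described above (illustrative only, not NLU's internal implementation; the class names and scores are made up):

```python
def normalize_single_label(scores):
    """Single-label behavior: scale confidences so they sum to 1."""
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

# Multi-label behavior: each class is scored independently (e.g. one
# sigmoid per class), so the values need not sum to 1.
multi_label = {"billing": 0.91, "cancellation": 0.64, "upgrade": 0.08}

# Single-label behavior: the same scores, normalized to sum to 1.
single_label = normalize_single_label(multi_label)
```

Here `sum(multi_label.values())` is 1.63, while `sum(single_label.values())` is exactly 1, and the ranking of classes is unchanged.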

Confidence score comparison — Single-Label Classifier vs. Multi-Label Classifier

This feature is designed to ease customer migration from the Natural Language Classifier (NLC), which was deprecated on Aug 9, 2022, to NLU Classifications, for applications that expect normalized confidence scores when a classifier is trained on a single-labeled dataset.

NLU is always iterating to improve the quality and performance of the models our users train and run. Today, we are proud to announce significant performance improvements coming to NLU Classifications: we observed speed-ups of up to 6x in training times and 4x in inference times*, along with faster model loading, enabling our users to train lighter, faster, and better models. As a proof of concept, we tested these improvements with customers who have Japanese documents, and they reported training times reduced by 50–70% compared to NLC. Now we are bringing these improvements to all the languages that NLU Classifications supports!

Together with improved models, we also wanted to share some of the best practices that we have learned from working with our customers:

  • Dataset size is crucial for our models to learn well. The more data points, the better!
  • A similar number of examples per class is ideal. But we are aware that imbalanced datasets exist in the wild, so we recommend adjusting the prediction threshold accordingly!
  • “Adjusting the prediction threshold” means discarding predictions with a confidence score lower than a predetermined value. You can find an optimal value for your use case by choosing the one that maximizes the metric of your choice on the test set. This applies to both single-label and multi-label models.
  • If it is more important that the model is precise in its predictions (e.g., false positives have a high cost), we recommend a higher prediction threshold.
  • If it is more important that the model identifies as many of the correct examples as possible (e.g., false negatives have a high cost), we recommend a lower prediction threshold.
  • Generally, the f1-score is a better metric than accuracy to measure performance on imbalanced datasets!
  • Make sure that you know whether you are working with a single-label vs. multi-label dataset, as the confidence scores will add up to 1 in the former case but not in the latter.
  • From time to time, look at random samples of the test data and the predictions the model makes, instead of relying only on a given metric. We have sometimes found that customers are concerned about low metric scores, only to see that the examples the model failed on are hard to label even for a human!
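The threshold-tuning advice above can be sketched as a simple sweep: compute F1 on a held-out test set at each candidate threshold and keep the best one. The scores, labels, and helper function below are made up for the sketch; in practice the scores would come from your trained classifier.

```python
def f1_at_threshold(scores, labels, threshold):
    """Binary F1 when predictions below `threshold` are discarded."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confidence scores and true labels for one class on a test set.
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20]
labels = [True, True, False, True, False, False, False]

# Sweep thresholds 0.01..0.99 and keep the one that maximizes F1.
best = max((t / 100 for t in range(1, 100)),
           key=lambda t: f1_at_threshold(scores, labels, t))
```

The same sweep works for a multi-label model by tuning one threshold per class. Raising the threshold trades recall for precision, which is exactly the false-positive vs. false-negative trade-off described in the bullets above.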

If you want to play around and discover what you can do with NLU Classifications, check out this demo Jupyter notebook!

Model training and Inference Demo with NLU Classifications

*From internal testing; we don't guarantee that these exact multipliers will be reflected in every usage.
