Testing Strategies for Chatbots (Part 1)— Testing Their Classifiers

Andrew R. Freed
IBM Data Science in Practice
Sep 24, 2018 · 5 min read
Is your chatbot sorting user utterances into the right intent categories?

Artificial intelligence systems are ultimately software systems, and all software systems need to be tested; AI systems, however, require new testing approaches. In 2016 I wrote generally about testing cognitive systems. In this post I focus specifically on testing chatbots.

There are two major areas to focus on when testing your chatbot. The first is assessing the performance of the classifiers in the system. The second is testing that any branching logic is appropriately routing users through the dialog nodes and updating state as needed. The first is a “unit test” of the intent classifier (covered in this post) and the second is a “unit test” of the dialog routing logic (covered in the next post).

Conversational intent testing

Chatbots like Watson Assistant are trained with “ground truth”: a set of sample utterances labeled with target intents and entities by subject matter experts. We assess the training performance of the chatbot’s classifier by submitting a collection of utterances and checking whether the intent the chatbot returns matches the intent from ground truth.

Unlike traditional unit testing we are not expecting 100% performance from our classifier. Indeed, if we had 100% performance our classifier would surely be overfit! Instead our goal from testing the classifier is to find its strengths and weaknesses. We explore the weaknesses to find patterns and we use these patterns to help improve the classifier through either adding new ground truth or modifying the intent classification scheme.

WA-Testing-Tool is an open-source tool for testing Watson Assistant workspaces; it is available at https://github.com/cognitive-catalyst/WA-Testing-Tool. It tests the workspace’s classifier using K-folds testing over the ground truth, which iteratively breaks the ground truth into training and blind/test sets. (The training set is used to train the model; the blind/test set is only used to test the model.) The tool produces several reports on the classifier’s performance, and these reports give you the data you need to find patterns in classification errors and improve your classifier.
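To make the mechanics concrete, here is a minimal sketch of the k-fold idea in Python using scikit-learn. It is not the tool’s implementation; the utterances, intent names, and fold count are made up for illustration.

# Minimal sketch of k-fold testing over ground truth (illustrative only;
# WA-Testing-Tool performs the equivalent against a live workspace).
from sklearn.model_selection import StratifiedKFold

# Hypothetical ground truth: (utterance, intent) pairs labeled by SMEs.
ground_truth = [
    ("How many points do I have?", "Loyalty_Status"),
    ("What is my loyalty tier?", "Loyalty_Status"),
    ("Show my rewards balance", "Loyalty_Status"),
    ("Use my points for this purchase", "Redeem_Points"),
    ("I want to redeem my rewards", "Redeem_Points"),
    ("Apply points to my order", "Redeem_Points"),
    ("Move points to my partner's account", "Transfer_Points"),
    ("Send 500 points to another member", "Transfer_Points"),
    ("Can I give my points to a friend?", "Transfer_Points"),
]
utterances = [u for u, _ in ground_truth]
intents = [i for _, i in ground_truth]

# Every utterance lands in the blind/test fold exactly once, so each example
# is scored by a model that never saw it during training.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(utterances, intents)):
    # In practice: train a temporary workspace on the train fold, classify
    # the test fold, and record expected vs. predicted intent per utterance.
    print(f"fold {fold}: {len(train_idx)} training / {len(test_idx)} test utterances")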

Improving classification in a sample workspace

In this post we will explore the classification performance of a chatbot classifier trained on sample utterances from the Watson Assistant content catalog (full Watson Assistant workspace: test-workspace.json). Feel free to follow along testing your own workspace.

To run the tool, first create a config.ini file (use the config.ini.sample as a template and plug in your authentication variables), then run the following commands:

python run.py -c config.ini
python utils/intentmetrics.py -i ../data/kfold/kfold-test-out-union.csv -o ../data/kfold/kfold_intent_metrics.csv

After the K-folds test completes, a set of outputs is available for review:

· Summary of classification performance by intent

· List of correctly and incorrectly classified utterances with confidence

· Overall accuracy number

Figure 1: Intent summary report (kfold_intent_metrics.csv) from Watson Assistant testing tool on sample workspace

We use the output to identify patterns of errors. Our first concern is evaluating the intents themselves. We sort the intent summary report by ascending performance, which surfaces the intents with the most classification errors, then start with the worst-performing intent and work iteratively down the list. Figure 1 shows an example intent summary report (kfold_intent_metrics.csv) sorted in increasing quality.
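If you prefer to slice the reports programmatically, a short pandas sketch along these lines produces the same worst-first view. The column names “golden intent” and “predicted intent” are assumptions for this example; check the header of your kfold-test-out-union.csv and adjust the names and path.

# Sketch: per-intent accuracy from the k-fold union report, sorted worst-first.
# Column names and file path are assumptions; adjust to your actual output.
import pandas as pd

df = pd.read_csv("kfold-test-out-union.csv")
df["correct"] = df["golden intent"] == df["predicted intent"]

by_intent = (
    df.groupby("golden intent")["correct"]
      .agg(accuracy="mean", utterances="count")
      .sort_values("accuracy")  # worst-performing intents first
)
print(by_intent)
print(f"Overall accuracy: {df['correct'].mean():.1%}")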

Once we know which intents to improve, we move on to the detailed report. We first sort the report in a useful way (see Figure 2), then apply filters or simply scroll to narrow in on regions of interest.

Figure 2: Suggested sort options for K-folds report (kfold-test-out-union.csv)

Most generally, we take the intent we want to improve and filter the full utterance classification list on that intent. The first thing we look for is intents that are commonly confused with each other. We find these by sorting the “predicted” column and noting which intent(s) are most often incorrectly predicted for the intent we are focusing on. If two or more intents are frequently confused, you can either revise the intent/entity scheme so the intents are more easily distinguished, or provide additional training data for those intents. Figure 3 shows the kfold-test-out-union.csv file filtered on the worst-performing intent. Here we can quickly see that Redeem Points is often confused with Loyalty Status and Transfer Points, suggesting those intents need further refinement or additional training data.

Figure 3: Report (kfold-test-out-union.csv) filtered on the worst-performing intent
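The same filter-and-sort can be scripted. The sketch below tallies which intents were predicted whenever a chosen focus intent was misclassified; again, the column names, path, and focus intent are assumptions for illustration.

# Sketch: for one focus intent, count which intents were predicted instead.
# Column names, path, and the focus intent are assumptions for illustration.
import pandas as pd

df = pd.read_csv("kfold-test-out-union.csv")
focus = "Redeem Points"

misses = df[(df["golden intent"] == focus) & (df["predicted intent"] != focus)]
print(f"{focus} was misclassified {len(misses)} times, most often as:")
print(misses["predicted intent"].value_counts().head(5))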

Our second concern is reviewing the individual utterance results for error patterns that do not cluster around one intent pair. This can be an eyeball test, since we are looking for any other patterns we can find. As above, we can generally improve the Watson Assistant classifier by refining our intent/entity structure or by providing additional training data.

Continuous improvement of the classifier

The data above also yields an overall classification accuracy score, which should be treated as an interesting but not critical data point. (Your target metric should be a business-relevant metric like “chat completion time” or “user satisfaction”, not “classifier performance”.)

Classification performance will eventually reach a point of diminishing returns, where it may take twice as long to add another 2–3% of accuracy. Too much focus on classification performance can lead to overfitting, where the system performs perfectly on the data it has been trained on but cannot generalize to new, unseen data. You should certainly not target 100% classification performance. You will, however, want to continuously monitor your classifier’s performance, especially as you add more training data.

The example above described a single classifier improvement cycle. In any given improvement cycle we will probably only target a couple of intents, ideally the lowest-performing ones or the ones our application uses the most. We anticipate that improving some intents may cause other intents to decrease in performance, so updating your training incrementally is a good idea. Expect to do several improvement cycles as you train your classifier. While doing this, be sure to version your training data as well so you can track the performance of your system, via its training data, over time.
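One lightweight way to track this over time (a sketch, not a feature of the tool) is to log each test run’s overall accuracy against the version of the training data you tested, for example a git tag or commit hash:

# Sketch: append each k-fold run's result to a simple history log, keyed by
# the ground-truth version. File name and columns are arbitrary choices here.
import csv
from datetime import date

def log_run(version, accuracy, path="classifier_history.csv"):
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), version, f"{accuracy:.4f}"])

log_run("ground-truth-v1.3", 0.87)  # hypothetical values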

The classifier is good enough, now what?

When you are satisfied with your ability to classify user utterances into intents, you can now focus on testing your ability to successfully route a user through one or more conversational steps. Testing the conversational dialog routing logic requires a completely different set of tools. We will explore these tools in the next post: testing dialog routing logic.


Technical lead in IBM Watson. Author: Conversational AI (manning.com, 2021). All views are only my own.