Chatbot Validation Methods
Wait, why is there a cute pug at the top of a Chatbot Validation blog post? Answer: clickbait. Guilty as charged.
Chatbots have come a long way since ELIZA. NLP techniques have shifted from rule-based approaches to statistical ones that understand a user’s intent more accurately. However, this also means the bot can be less explainable and less predictable: a change in the training data can directly impact the response given to a user. Because of this, production chatbots need robust testing methods in place to ensure accurate responses are given to users. I’m sure you know all of this, so I’ll cut to the chase: there are three main ways to perform chatbot testing, namely the 80% / 20% split, K-Fold Cross Validation, and Monte Carlo Cross Validation.
80% / 20% Split
The 80% / 20% split is the most basic approach and probably the most common. Users hold back 20% of their GT (Ground Truth, or all the data points for the chatbot) rather than training with the entire GT. Then, after major changes are made to their development chatbot, they use the 20% GT to test the accuracy and ensure nothing has regressed since the last update. Accuracy of a chatbot can be defined as the percentage of utterances that had the correct intent returned.
Imagine we have a chatbot that talks about dogs (since dogs *are better* than cats). I have this chatbot in production already and it has 10 intents. I’m about to release v2 of the DogBot, which changes my GT a bit and adds two more intents. I promote my development version to my staging environment and run a script that sends in each utterance from my held-back 20% GT, comparing the returned intent to the expected intent. If the overall accuracy is the same as or better than my production bot’s accuracy from the last time I ran the tests, then my new changes have the green light to be promoted to production.
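Here’s a minimal sketch of what that kind of regression test could look like in Python. The tiny ground_truth list, the classify() stub, and the 80/20 cut are all stand-ins for the real GT export and the call to the staging bot; the point is just the shape of the check.

import random

# Toy ground truth: (utterance, expected_intent) pairs. In practice this is your full GT export.
ground_truth = [
    ('how big do pugs get', 'breed_size'),
    ('what should I feed my puppy', 'feeding'),
    ('when are golden retrievers full grown', 'breed_size'),
    ('how often should my dog eat', 'feeding'),
    ('do huskies shed a lot', 'grooming'),
]

def classify(utterance):
    # Stand-in for the call to the staging chatbot; returns the top intent for an utterance.
    return 'feeding' if 'feed' in utterance or 'eat' in utterance else 'breed_size'

random.seed(42)
random.shuffle(ground_truth)
split = int(0.8 * len(ground_truth))
train_set, test_set = ground_truth[:split], ground_truth[split:]

# Train the staging bot with train_set, then measure accuracy on the held-back 20%.
correct = sum(1 for utterance, expected in test_set if classify(utterance) == expected)
print('Accuracy: {:.4f}'.format(correct / len(test_set)))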
Why use 80% / 20%? It’s simple and easy to implement. However, it can be less accurate than the other two options. Also, to keep the test unbiased the 20% has to stay held out, which means your production bot never gets trained on your full GT.
K-Fold Cross Validation
The second method, K-Fold Cross Validation, divides the training set (GT) into K parts (folds), then uses one fold at a time as the testing fold and the rest of the data as the training data. The most common test is 5-fold, but you can choose whatever number you’d like. With 5-fold, the training data is split into five folds: the bot is trained with four of the folds, and the fifth fold is used for testing (just like the 20% GT in the above example). Repeat this so each fold has a turn as the testing fold. Afterwards, average the accuracies of the folds together to get the accuracy of the chatbot.
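A minimal sketch of the fold mechanics, assuming the data is a list of (utterance, intent) pairs and evaluate() is a hypothetical helper that trains a bot on the training folds and returns its accuracy on the test fold:

ground_truth = list(range(20))          # placeholder for (utterance, intent) pairs

def evaluate(train, test):
    return 0.9                          # stand-in for "train the bot on train, score it on test"

def k_fold_splits(data, k=5):
    # Yield (train, test) pairs where each fold takes one turn as the testing fold.
    fold_size = len(data) // k
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        yield train, test

accuracies = [evaluate(train, test) for train, test in k_fold_splits(ground_truth, k=5)]
print('Average accuracy: {:.4f}'.format(sum(accuracies) / len(accuracies)))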
K-Fold is better than 80% / 20% because every utterance gets a turn in a testing fold, and with five folds you get five test runs instead of one, making the result more robust. However, this is a bit more complicated to implement since you have to create each fold.
Monte Carlo Cross Validation
Monte Carlo Cross Validation is similar to K-Fold except the data sets are determined randomly: shuffle the data, take the first 80% (or whatever ratio you want) as training data, and designate the rest as testing data. That gives you one split; repeat the shuffle multiple times to get as many splits as you want. Generally, the data is shuffled five times, giving five accuracies to average together.
This provides the most random data sets of the three options, but because each split is drawn independently, the different test sets can contain overlapping data.
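A minimal sketch of the split itself, again treating the data as a placeholder list; scoring and averaging would work just like the K-Fold sketch above.

import random

data = list(range(20))                  # placeholder for (utterance, intent) pairs

def monte_carlo_splits(data, repeats=5, train_ratio=0.8):
    # Each repeat shuffles independently, takes the first 80% as training data and
    # the rest as testing data, so test sets can overlap across repeats.
    for _ in range(repeats):
        shuffled = random.sample(data, len(data))
        split = int(train_ratio * len(shuffled))
        yield shuffled[:split], shuffled[split:]

splits = list(monte_carlo_splits(data))   # five (train, test) pairs to evaluate and average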
Implementation of Cross Validation
We will be using Watson Conversation to build our bot and test its accuracy. To demonstrate, we’ll use the demo workspace that comes with Watson Conversation: the Car Dashboard. There are 27 intents in this workspace (see the demo).
The first step is to get our data in CSV format. Log into your workspace and go to the Intents page. Click the checkbox next to the Intents title to select all of your intents, then export them as a CSV.
That’s all we need to do for our data! The script will handle the rest once the data is in this format. For this script, we are mixing Monte Carlo with a normal K-fold test: the data is shuffled before the folds are created (Monte Carlo), but the folds are still unique (normal K-fold).
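The export should end up as a plain two-column CSV with one example per row: the utterance text in the first column and the intent name in the second, with no header row. The rows below are made up just to show the shape:

turn on the windshield wipers,turn_on
how much fuel is left,fuel_level
play some jazz,music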
You can find the code for the script here: https://github.com/ammardodin/conversation-cross-validation
Clone the repo to your computer and open it with an editor. Open conversation_cv.py and add your Conversation credentials on line 66.
Make sure you have enough space in your instance before proceeding, since the script will create a workspace for every fold.
Next, we need to download our dependencies. I use virtualenv for this, meaning the first step is to create the environment with virtualenv ENV, then activate it with source ENV/bin/activate. Download the dependencies with pip install -r requirements.txt.
Make sure to move the exported CSV to the same folder as the repo you cloned. Then, run this command: python kfold.py --data data.csv --folds 5. The --data flag lets you specify which dataset you are uploading, and the --folds flag lets you choose how many folds to split your data into. A good default here is 5.
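Under the hood, the first thing a script like this has to do is read that CSV back into labelled utterances. This isn’t the repo’s actual code, but with the two-column, header-less format assumed above it only takes a few lines with Python’s csv module:

import csv

# Read the exported intents CSV into (utterance, intent) pairs.
with open('data.csv') as f:
    ground_truth = [(row[0], row[1]) for row in csv.reader(f) if row]

print('{} labelled utterances across {} intents'.format(
    len(ground_truth), len({intent for _, intent in ground_truth})))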
The script will print the accuracy of each fold (measured while the other folds are used for training) to your terminal, as well as the averaged accuracy. Here’s the result of my test using the Car Dashboard workspace and 5 folds:
Accuracy: 0.8929
Accuracy: 0.8889
Accuracy: 0.9147
Accuracy: 0.9167
Accuracy: 0.9127
Average accuracy: 0.9052
Woo! Well done.
Conclusion
Using either cross validation method will give you a more accurate measure of how well your chatbot is performing. The major difference between Monte Carlo Cross Validation and K-Fold is that you don’t have to split the data into K folds upfront and rotate through the combinations. The random shuffle is, in theory, less work: you just pick the top N% as training data and the rest as test data.
Every time a decent-sized change is made to your intents, run this script in your staging environment to double-check your accuracy before proceeding to production. If you want to add on to the script, it would be helpful for it to list which intents were missed the most, so you can easily debug.
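As a rough sketch of that extension, assuming you collect the expected and returned intent for every test utterance, a Counter over the misses is enough to surface the worst offenders (the results list here is made-up sample data):

from collections import Counter

# (expected_intent, returned_intent) pairs collected while testing; made-up sample data.
results = [('fuel_level', 'fuel_level'), ('music', 'turn_on'), ('music', 'lights')]

misses = Counter(expected for expected, returned in results if expected != returned)
for intent, count in misses.most_common():
    print('{}: missed {} times'.format(intent, count))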
Special thanks to Ammar Dodin for writing the script.
Happy coding!