Why Overfitting is a Bad Idea and How to Avoid It (Part 2: Overfitting in Virtual Assistants)
In the previous post we looked at a simple demonstration of overfitting in a classic regression scenario. In this post I will demonstrate overfitting in the context of virtual assistants, or as some call them “chat bots”.
What does accuracy mean for assistants?
Before diving into overfitting I want to briefly discuss how we talk about the accuracy of assistants. An assistant is backed by a classifier that maps utterances to intents. For instance, the utterance "where is your new store opening" could map to a #Store_Location intent. If instead the assistant predicted the #Store_Hours intent, we would say the assistant (or its classifier) made an error.
Watson Assistant tests an utterance against all possible intents, produces a confidence score for each, and by default selects the prediction with the highest confidence. In this example we could imagine #Store_Hours having 0.7 confidence and #Store_Location having 0.6 confidence.
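This default selection is just an argmax over the per-intent scores. A minimal sketch (not Watson Assistant's actual API, and with illustrative confidence values):

```python
# Hypothetical per-intent confidence scores for "where is your new store opening".
confidences = {"#Store_Hours": 0.7, "#Store_Location": 0.6, "#Contact_Us": 0.1}

# By default, the intent with the highest confidence wins.
predicted = max(confidences, key=confidences.get)  # "#Store_Hours"
```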
The simplest accuracy method is to mark each prediction as "right" or "wrong": "where is your new store opening" == #Store_Hours is wrong. However, it is more useful to view this as a combination of two errors. The first error is that the wrong answer, #Store_Hours, had the highest confidence. The second error is that the right answer, #Store_Location, did not have the highest confidence. This subtle distinction from a binary right/wrong gives us additional understanding of the error and how to address it.
More generally we can describe the first condition (wrong answer has highest confidence) as an error of precision. The second condition (right answer does not have the highest confidence) is an error of recall. Every classification error by the assistant has both a precision error and a recall error. In this post I will show how most virtual assistant developers focus on the “recall” errors without considering the “precision” errors.
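The duality is easy to see in code. This hedged sketch computes per-intent precision and recall from (correct, predicted) pairs; the pairs themselves are illustrative:

```python
def precision_recall(pairs, intent):
    """Precision and recall for one intent, given (gold, predicted) pairs."""
    tp = sum(1 for gold, pred in pairs if gold == intent and pred == intent)
    fp = sum(1 for gold, pred in pairs if gold != intent and pred == intent)
    fn = sum(1 for gold, pred in pairs if gold == intent and pred != intent)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

pairs = [
    ("#Store_Location", "#Store_Hours"),   # the motivating error
    ("#Store_Location", "#Store_Location"),
    ("#Store_Hours", "#Store_Hours"),
]
# The single error lowers the precision of #Store_Hours (the wrong winner)
# and the recall of #Store_Location (the right answer that lost).
```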
What does overfitting look like for assistants?
In my previous post I used a simple example so that I could draw it in two dimensions. The classifiers behind virtual assistants use hundreds or thousands of dimensions, which I cannot draw and humans can't visualize. I will use an illustrative technique to reduce the classifier training data to two dimensions. (Mathematically one could use Principal Component Analysis; however, that is far beyond the scope of this post and is not required for understanding the general principles.) I'll assign a green plus to #Store_Location, a purple asterisk to #Store_Hours, and a blue X to #Contact_Us.
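For the curious, the PCA reduction mentioned above takes only a couple of lines, assuming scikit-learn is available; the random matrix below stands in for real utterance features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = rng.random((30, 300))  # toy stand-in: 30 utterances, 300-dim features
coords = PCA(n_components=2).fit_transform(features)  # 2-D points we can plot
```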
The simplest model I can imagine is one using ellipses to identify classifications.
Like our simple linear regression model in the first post, this model seems to quickly grab the essence of our training data. There are clear regions centered in the lower-left, top-middle, and bottom-right of this diagram. We see that most examples are correctly predicted by this model (i.e., the green plus markers are almost entirely contained in the green ellipse) but we also see some errors. The errors are largely located around the edges of the ellipses, suggesting the model is not very confident about those predictions.
This is the most accurate prediction we can make on the training set using a simple model. Now let’s consider a more complex model, one that considers an order of magnitude (or several orders) more input features.
Similar to the regression example of the first post, it is possible to achieve perfect accuracy on the training set by allowing a more complex model. We once again note some suspicious aspects of this model, most notably the “fingers” and “elbows” that appear. The boundary between green and purple (top-middle and bottom-right) regions is especially suspicious — the lack of a clear boundary shape suggests the model is not inferring any generalizable patterns from the data.
Most virtual assistant platforms will not expose the internals of the classifier to you, which limits the complexity of the models you can produce. However, these classifiers have an interesting property: they perfectly remember their training data (Watson Assistant has this property). Thus they produce a model with as much generalizability as possible while still always perfectly predicting the training data. The resulting model is visualized below.
This is all very abstract — what is it good for?
The ability to perfectly predict the intent behind an utterance creates a strong temptation to simply add any missed prediction to the future training set. While an utterance is absent from the training set it may be predicted incorrectly; once it is in the training set, it will never be predicted incorrectly again. What's not to love? Why shouldn't we just indiscriminately add hundreds, thousands, or more training examples into our virtual assistant — isn't more training data better? Why did this blog post start off with a discussion of precision and recall?
Because the classifier perfectly remembers its training data, we can trivially improve the recall for the utterance in question by adding it as an example to the correct intent. But this change affects both the precision and recall of similar utterances, potentially in non-obvious ways.
Our motivating example was “where is your new store opening” being classified to #Store_Hours instead of #Store_Location. Let’s assume you have noticed several variations of this like “Hi, where is your new store opening”, “where is the new store opening”, and “where is your new store opening in North Carolina” each of which is not being classified to #Store_Location (a recall error for this intent). As discussed above we could correct the prediction of each utterance by adding them to the training data — improving recall such that we’ll never get these statements wrong again.
Overfitting a chat classifier can introduce precision problems
First, let's visualize what happens when we add many closely related examples. The intention behind adding specific examples was to fix those examples, but the resulting change to the system may not be obvious.
Unlike the previous post we have achieved overfit through adding examples rather than adding features. The classifier attempted to generalize from the new examples. It vastly increased the size of the #Store_Location space and shrank the size of the #Store_Hours space. (See related article: Every piece of training data has a gravitational pull on an AI system.)
The classifier also learned that “new store”, “store opening”, and “new store opening” are strongly correlated to #Store_Location. This means the confidence of #Store_Location will increase for each of these questions:
· When is the new store opening?
· Is the new store opening for Christmas?
· Coupons for the new store opening?
· Where can I find out when the new store opens?
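The effect on these questions can be illustrated with a deliberately crude token-overlap measure. Real classifiers are far more sophisticated than this, but shared phrases pull confidence in a similar way:

```python
import string

def token_overlap(utterance, example):
    """Fraction of the utterance's tokens that also appear in the example."""
    clean = lambda s: set(
        s.lower().translate(str.maketrans("", "", string.punctuation)).split()
    )
    utt = clean(utterance)
    return len(utt & clean(example)) / len(utt)

# After "where is the new store opening" joins #Store_Location's training
# data, an hours question shares most of its tokens with that example:
score = token_overlap("When is the new store opening?",
                      "where is the new store opening")  # 5 of 6 tokens match
```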
For some of these questions the increase in #Store_Location confidence will be enough to make it the predicted intent. In other words, fixing the recall errors introduced new precision errors! #Store_Location is now a source of precision errors, and several intents including #Store_Hours are now suffering from recall errors.
Even more insidiously, the confidence will be lowered for other questions that used to go to #Store_Location, particularly those lacking the 'new', 'store', or 'opening' phrases. "I need help finding you" might now go to #Appointments. This can create new recall errors for #Store_Location, and depending on the amount of training data you added for the initial problem, you might actually reduce the overall recall of the intent you were trying to improve!
Hopefully you are no longer tempted to reflexively fix all missed predictions by adding new training data.
Improving a chat classifier without overfitting
There are other tools you should have in your toolbox when improving your classifier.
Maintain a balanced and representative training set. In the previous example, adding one example might have sufficed to correctly predict the other examples. When we added many similar examples it skewed the training set and affected many unrelated questions. If 10% of location questions are about new stores, then 10% of the location examples should involve new stores. Chase patterns, not one-offs, when improving your training data.
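A quick balance check is easy to script. This hypothetical helper reports what fraction of an intent's examples contain a given phrase; if only 10% of real location questions mention new stores, a 75% share in training is a red flag:

```python
def phrase_share(examples, phrase):
    """Fraction of training examples containing the given phrase."""
    return sum(phrase in example.lower() for example in examples) / len(examples)

store_location_examples = [
    "where is your new store opening",
    "where is the new store opening",
    "where is your new store opening in north carolina",
    "where is your store",
]
share = phrase_share(store_location_examples, "new store")  # 0.75
```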
Review intents for confusion and improve with entities. If your stores have banks inside them you might be tempted to create #Store_Hours and #Bank_Hours intents. These intents are likely to be confused as most examples will differ by only a word. Consider a single intent (#Hours) using an entity (store or bank) to determine which one a user is asking about. Confused intents can be improved by adding, merging, or deleting other intents as well as adding training examples.
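The intent-plus-entity pattern can be sketched as follows. This is purely hypothetical hand-rolled routing; in practice the entity detection would be configured in the assistant platform:

```python
# One #Hours intent plus a simple entity lookup, instead of separate
# #Store_Hours and #Bank_Hours intents that differ by a single word.
LOCATION_ENTITIES = {"store", "bank"}

def route_hours_question(utterance):
    """Return the intent plus the detected location entity (or None)."""
    for token in utterance.lower().split():
        if token in LOCATION_ENTITIES:
            return "#Hours", token
    return "#Hours", None
```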
Use confirmation or disambiguation when you're not sure. If the top intent has a low confidence you can ask the user for confirmation ("Did you ask about Store Hours?"). If the top intents have similar confidences you can ask the user to disambiguate ("Did you mean Store Hours or Store Location?"). Watson Assistant has built-in disambiguation capability.
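The decision logic can be sketched in a few lines; the thresholds here are illustrative choices, not Watson Assistant defaults:

```python
CONFIRM_THRESHOLD = 0.5     # below this, confirm the top intent with the user
DISAMBIGUATE_MARGIN = 0.15  # closer than this, ask the user to choose

def next_action(ranked):
    """ranked: list of (intent, confidence) pairs, highest confidence first."""
    (top, c1), (second, c2) = ranked[0], ranked[1]
    if c1 - c2 < DISAMBIGUATE_MARGIN:
        return "disambiguate", [top, second]
    if c1 < CONFIRM_THRESHOLD:
        return "confirm", top
    return "answer", top
```

With the earlier scores (0.7 vs. 0.6) this would trigger disambiguation, since the two confidences are within the margin of each other.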
Focus on runtime accuracy not training accuracy. There are several methods, including k-folds cross-validation, to infer future accuracy while working only with the training set. These methods are only predictive if your training data is perfectly representative of the data you will experience at runtime. Instead, limit your use of these methods to identifying glaring error patterns in your model. After your model is deployed, build a test set of data the model has not been trained on and regularly inspect its performance on this set.
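For reference, k-fold cross-validation on intent data takes only a few lines with scikit-learn; the tiny dataset and the bag-of-words Naive Bayes model below are illustrative stand-ins, not Watson Assistant's classifier:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

utterances = [
    "where is your new store opening", "where is the store", "how do I find you",
    "what time do you open", "when do you close", "store hours please",
    "how do I contact you", "what is your phone number", "email address please",
]
intents = ["#Store_Location"] * 3 + ["#Store_Hours"] * 3 + ["#Contact_Us"] * 3

model = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(model, utterances, intents, cv=3)  # one score per fold
```

Remember the caveat above: these scores only estimate runtime accuracy to the extent the training data mirrors real traffic, so use them for spotting glaring error patterns, not as a final accuracy number.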
See these additional posts on improving chat classifiers:
· Testing Strategies for Chatbots (Part 1) — Testing Their Classifiers
For help in implementing best practices, and avoiding overfitting, reach out to IBM Data and AI Expert Labs and Learning.