How To Set the Optimal Confidence Threshold for Your Assistant

Mitchell Mason
IBM watsonx Assistant
5 min read · Sep 6, 2018

Shoutout to Christie Schneider and Adam Benvie for their contributions to my writing, grammar, and thinking. At this point, I’m not sure I even contributed to this myself!

When creating a virtual assistant, you want it to answer as many questions as possible, as accurately as possible. We define those two metrics as coverage and effectiveness. Coverage is the number of times Watson actually tries to answer your users’ questions; effectiveness is how accurate those answers are. Coverage can include intents over 0.20 confidence, entities found in the user input, or even matching context, depending on how you structure your dialog. Basically, it is anytime Watson does not default to anything_else. The most common pattern is to:

1. Match the user’s intent first

2. Look for any entities to respond more precisely to their question

3. Include any context as needed

Let’s focus on intent confidence, as it currently has the biggest variance.
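As a minimal sketch, here is what thresholding an intent result looks like. The response shape below is illustrative only, not the exact Watson Assistant payload, which depends on your API version:

```python
# A minimal sketch of thresholding an intent result. The response shape here
# is illustrative; the real Watson Assistant payload depends on your API version.
CONFIDENCE_THRESHOLD = 0.20  # the default cutoff mentioned above

def route(response, threshold=CONFIDENCE_THRESHOLD):
    """Return the top intent if its confidence clears the threshold;
    otherwise fall back to anything_else."""
    intents = response.get("intents", [])
    if intents and intents[0]["confidence"] >= threshold:
        return intents[0]["intent"]
    return "anything_else"

print(route({"intents": [{"intent": "card_fees", "confidence": 0.85}]}))  # card_fees
print(route({"intents": [{"intent": "card_fees", "confidence": 0.10}]}))  # anything_else
```

Raising `CONFIDENCE_THRESHOLD` is exactly the knob this article is about: the higher it is, the more often the function falls through to `anything_else`.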

To many customers, 0.20 is a bit too risky, or in other words, not effective enough. But raising that confidence threshold yourself is not an obvious task. If you go too high, you lose more and more coverage, leaving your virtual assistant unable to answer confidently; if you keep it too low, your assistant may make too many mistakes.

The screenshot above shows coverage and effectiveness as perfect opposites of each other. This is not actually true in practice. Depending on your data, you may be more effective even at low confidence, and depending on the range of questions users ask, you could have higher coverage even with a high threshold. The above chart represents the theory, which holds true even if the lines are somewhat skewed by your data and the questions your users ask.

The green line shows what happens if you have an extremely low confidence threshold. Watson would try to answer almost any question that comes through your virtual assistant, even when there is quite low similarity between the user’s question and the intent. For example, “What color car do you drive?” could match with “What is the fastest car?” This is something Watson may not have been taught to answer, but your virtual assistant would give an answer anyway, because the wording is so similar.

The red line represents what happens if you have an extremely high confidence threshold. In this scenario, Watson won’t attempt to answer very many questions, but will more than likely be correct when it does. The downside here is that it may not answer some questions that it could have answered accurately. If a user were to type just “fees,” and Watson had been trained on complete sentences about the topic, such as “What are the fees for a credit card?” and “What fees do credit cards have?”, Watson would still not try to answer because the confidence level is not high enough. You would be missing opportunities that should have been taken.

The optimal level is somewhere in the middle, like the blue line in the above chart. The optimal line can be found just before Watson starts to answer more questions incorrectly.

Let’s take an arbitrary sample:

The confidence threshold is 80%. Out of 100 questions, Watson answers 60 questions correctly, answers 2 incorrectly, misses 25 questions it should have answered, and correctly ignores 13.

There is clearly missed opportunity here, so the confidence threshold should be lowered. If we decrease it by 5%, to 75%, we will see a change:

· 65 questions answered correctly

· 3 questions answered incorrectly

· 20 missed questions that should have been answered

· 12 questions correctly ignored
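The counts above can be turned into the two metrics directly. A small worked sketch, treating effectiveness as correct answers over attempted answers (one reasonable reading of the term; the article does not pin down the formula):

```python
# Worked arithmetic for the sample counts above.
# Coverage: share of all questions Watson attempted to answer.
# Effectiveness: share of attempted answers that were correct (an assumption).
def coverage(correct, incorrect, missed, ignored):
    attempted = correct + incorrect
    total = correct + incorrect + missed + ignored
    return attempted / total

def effectiveness(correct, incorrect):
    return correct / (correct + incorrect)

print(coverage(60, 2, 25, 13))         # 0.62  (80% threshold)
print(round(effectiveness(60, 2), 3))  # 0.968
print(coverage(65, 3, 20, 12))         # 0.68  (75% threshold)
print(round(effectiveness(65, 3), 3))  # 0.956
```

Lowering the threshold from 80% to 75% buys 6 points of coverage for about 1 point of effectiveness, which is the trade the next step evaluates.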

We would continue this trend until Watson’s effectiveness decreased by more than its coverage increased; this is how we find our optimal threshold. See how this all comes together in the below chart.

This tells us that our optimal confidence threshold would be between 75% and 80%. With this change, Watson answered 3 more questions correctly, but missed 4 more. With confidence levels higher than this, Watson was still gaining more correct answers than incorrect ones. Any confidence level lower than this trends toward an increase in incorrect answers.
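That stopping rule can be expressed as a threshold sweep. A sketch, assuming the blind-test output is a list of hypothetical (confidence, was_correct) pairs: step the threshold down in increments and stop when incorrect answers start growing faster than correct ones.

```python
# Sketch: sweep the threshold downward in 5-point steps and stop when the
# incorrect count grows faster than the correct count. `results` is a list
# of (confidence, was_correct) pairs from an unthresholded blind-test run.
def counts_at(results, threshold):
    correct = sum(1 for c, ok in results if c >= threshold and ok)
    incorrect = sum(1 for c, ok in results if c >= threshold and not ok)
    return correct, incorrect

def find_threshold(results, start=95, stop=5, step=5):
    t = start  # work in integer percents to avoid float drift
    prev = counts_at(results, t / 100)
    while t - step >= stop:
        nxt = counts_at(results, (t - step) / 100)
        # Lowering further would cost more than it gains: stop here.
        if nxt[1] - prev[1] > nxt[0] - prev[0]:
            return t / 100
        t -= step
        prev = nxt
    return t / 100

# Toy data: correct answers cluster above 0.80, wrong ones below it.
results = [(0.95, True), (0.90, True), (0.85, True), (0.80, True),
           (0.75, False), (0.70, False), (0.65, False)]
print(find_threshold(results))  # 0.8
```

On real data the curve is noisier than this toy example, so in practice you would eyeball the swept counts in a chart, as the article does, rather than trust a single stopping point.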

This chart is determined by running a test. To create your own, you would need your dataset and a sample of blind test questions that Watson was not specifically trained on. This blind set should be representative of real customer questions, and include some noise that Watson should ignore by design. Run the test without applying any confidence threshold, and simply label whether Watson would have answered each question correctly or incorrectly. Then, place your threshold at incremental levels. A correct answer over the threshold goes in the ‘correct’ column, and an incorrect answer over the threshold goes in the ‘incorrect’ column. A correct answer below the threshold goes in the ‘missed’ column, and an incorrect answer below the threshold goes in the ‘ignored’ column, because we wanted Watson to be smart enough not to answer.
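The labeling step can be sketched as a small four-bucket function. The column names come from the text; the test records below are hypothetical (confidence, was_correct) pairs:

```python
# Sketch of the four-bucket labeling described above. Each record is a
# (confidence, was_correct) pair from an unthresholded blind-test run.
from collections import Counter

def bucket(results, threshold):
    counts = Counter(correct=0, incorrect=0, missed=0, ignored=0)
    for confidence, was_correct in results:
        if confidence >= threshold:
            counts["correct" if was_correct else "incorrect"] += 1
        else:
            # Below the threshold: a right answer is a miss,
            # a wrong answer was correctly ignored.
            counts["missed" if was_correct else "ignored"] += 1
    return dict(counts)

records = [(0.91, True), (0.85, False), (0.72, True), (0.40, False)]
print(bucket(records, 0.80))
# {'correct': 1, 'incorrect': 1, 'missed': 1, 'ignored': 1}
```

Re-running `bucket` at each incremental threshold produces the columns for the chart.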

You may also want to set a higher threshold than what is optimal if you are measuring Watson against another system. If your goal is to always have Watson do better than your human agents, and they have a higher ‘accuracy’ (questions answered correctly vs. total questions), then you may raise the threshold even though you lose some coverage. Conversely, if you have an elegant fallback mechanism and are deflecting questions efficiently, you may be willing to lower the threshold beyond what is optimal to increase coverage, even at the cost of some effectiveness.
