Machine Learning Confidence Scores — All You Need to Know as a Conversation Designer

Published in Voice Tech Global, Aug 25, 2021

Recently, we helped a government organization understand why its Interactive Voice Response (IVR) agent failed on a particular user request.

This organization has a self-serve conversational application that allows constituents to manage and renew their driving licenses, vehicle tabs, and professional licenses. However, when users stated, “I want to renew my license,” the bot would always surface a message indicating that it did not understand the utterance.

Transcripts of three conversations with the Gotham IVR

We had two puzzling questions here:

To which Intent (“Driving License,” “Vehicle Tab,” or “Professional License”) do you add the utterance?
Why does the system return a No Match despite having “renew my driving license,” “renew my license plate,” and “renew my real estate license” in the utterances of our language model?

In this article, we will walk you through how we solved this puzzle thanks to our understanding of Machine Learning confidence scores. We’ll also touch on how they impact the conversational products we are building as Conversation Designers and how we can mitigate their impact.

What is a Confidence Score?

A Confidence Score is a number between 0 and 1 that represents the likelihood that the output of a Machine Learning model is correct and will satisfy a user’s request.

The output of all Machine Learning (ML) systems is composed of one or multiple predictions. For example, YouTube ML will predict which video(s) you want to see next; Uber ML will predict the ETA (estimated time of arrival) for a ride.

Each prediction has a Confidence Score. The higher the score, the more confident the ML is that the prediction will satisfy the user’s request.

Microsoft’s guidance for Conversational AI breaks Confidence Scores down as follows:

Over 0.7: the prediction is a strong candidate for answering the user query.
Between 0.3 and 0.7: the prediction can partially answer the request.
Below 0.3: the prediction is probably not a good choice.

Mapping between confidence scores and their meaning
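As a minimal sketch, the three bands above can be written as a small lookup. The function name `interpret_confidence` and the band labels are ours, chosen for illustration. The breakdown does not say which band the exact boundary values 0.3 and 0.7 belong to, so the sketch assumes they fall in the middle band.

```python
def interpret_confidence(score: float) -> str:
    """Map a confidence score in [0, 1] to one of the three bands listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence score must be between 0 and 1, got {score}")
    if score > 0.7:
        return "strong candidate"
    if score >= 0.3:
        # Assumption: the boundary values 0.3 and 0.7 belong to the middle band.
        return "partial answer"
    return "probably not a good choice"


for score in (0.85, 0.4, 0.1):
    print(score, "->", interpret_confidence(score))
```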

Confidence Scores are produced by the trained Machine Learning model, so they depend on the data the model was trained on.

Confidence Scores in Conversational AI

In Conversational AI, ML is essential in many stages of the processing of the user request:

During Natural Language Understanding (NLU): ML predicts the Intent (which represents what the user is looking for) from an utterance (what the user said or typed).
During Automatic Speech Recognition (ASR): ML predicts the transcription from the audio of what the user said.
During Sentiment or Emotion Analysis: ML predicts the sentiment (generally positive, negative, or neutral) or the emotion from the user utterance or from the transcript of the conversation (the back and forth between the user and the agent).
During Natural Language Generation (NLG): ML predicts what to answer from the user utterance.
During Text-To-Speech (TTS): ML predicts the audio from the answer text produced by NLG.

Machine Learning impact on Conversational AI

As mentioned above, each prediction will have a confidence score.

One consequence of this heavy use of ML is that, for Voice applications, the predicted Intent depends on the confidence scores of both the ASR and the NLU stages. This compounded chance of misinterpretation is a serious risk for Conversation Designers who assume that the predictions are always correct.
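To see how quickly the risk compounds, here is a rough sketch. It assumes the two stages are independent, so that the chance of both being right is the product of their scores. That is a simplification for illustration only: platforms do not necessarily combine scores this way, and the numbers are made up.

```python
def combined_confidence(asr_score: float, nlu_score: float) -> float:
    """Chance that both the transcription and the Intent are right,
    under the simplifying assumption that the two stages are independent."""
    return asr_score * nlu_score


# Hypothetical scores: each stage looks strong on its own.
asr_score = 0.9  # confidence in the transcription
nlu_score = 0.8  # confidence in the Intent, given that transcription
print(round(combined_confidence(asr_score, nlu_score), 2))  # 0.72
```

Under this assumption, two stages that each look comfortable on their own end up only just above 0.7 once combined.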

Why Confidence Scores Matter For Conversation Designers

In our government organization case study, when the user said, “I want to renew my license,” the bot responded, “I’m sorry, we can’t help with that. Are you looking for information about driving licenses, vehicle tabs, or professional licenses?”

The first reaction is often to blame the lack of training data and add the utterance to one of our three Intents. Because our three Intents (Driving License, Vehicle Tab, and Professional License) could all legitimately contain that utterance, that option was not available to us.

The No Match was suspicious because, even though ML doesn’t do keyword matching, “renew my license” is very close to some of the existing utterances in the language model, for example “driving license renewal,” “renew my architect license,” or “renew my license plate.”

We decided to investigate what the system understood.

On this project, the team used Google Dialogflow as the NLU engine, with the Phone Gateway as the channel. One key advantage for debugging is that the Dialogflow console lets you view the breakdown of the API results when testing, which shows all the details of the Machine Learning results.

We realized that when the user said “renew my license,” the ML predicted a No Match with a 0.8 score. This led us to discover that in the Agent Settings, there is a parameter called classification threshold (under the ML section).

Machine Learning settings in Dialogflow CX

The documentation states that the model predicts multiple Intents and returns a No Match if the Intent with the highest confidence score is below the threshold.

We finally figured out that the ML was actually matching one of the Intents, but the threshold was hiding it. We discovered this by lowering the confidence score threshold.

Diagnostic info in the Dialogflow CX simulator: the JSON response from the natural language engine, with the 0.4 confidence score for the professional license Intent highlighted
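The thresholding rule described above can be sketched in a few lines. This is an illustration of the rule, not Dialogflow's implementation: the function `pick_intent` is hypothetical, and only the 0.4 score for the professional license Intent comes from our case. The other two scores are invented.

```python
from typing import Dict, Optional


def pick_intent(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the highest-scoring Intent, or None (a No Match) when even
    the best score is below the classification threshold."""
    if not scores:
        return None
    best_intent = max(scores, key=scores.get)
    return best_intent if scores[best_intent] >= threshold else None


# Only the 0.4 comes from the case study; the other scores are invented.
candidates = {
    "Professional License": 0.4,
    "Driving License": 0.35,
    "Vehicle Tab": 0.3,
}

print(pick_intent(candidates, threshold=0.8))  # None: the threshold hides the match
print(pick_intent(candidates, threshold=0.3))  # Professional License
```

With a high threshold the model still has a best guess, but the caller never sees it. Lowering the threshold is what makes that guess visible.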

To provide a better response for a No Match when the Confidence Score met the threshold (e.g., a Confidence Score and threshold both of 0.8), we added logic to change the response for that question.

Now, instead of “I’m sorry, we can’t help with that. Are you looking for information about driving licenses, vehicle tabs, or professional licenses?” the agent responds with “Driving license, vehicle tabs or professional?”.

We often hear that “chatbots don’t work” or “I asked Alexa, and it didn’t know.”

The standard approach to resolve these issues is to add more training data to the model. But as we discovered, it won’t fix the issue if the problem is happening at the Machine Learning level.

We now know that a No Match response has the following causes:

ML did not predict the Intent: In this situation, you can test adding more training data.
ML predicted the Intent, but some underlying rule prevents the matched Intent from being picked up: You need to dig into the tool or platform settings to determine if a rule is causing an unexpected issue.
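As a rough triage aid, the two causes can be told apart by looking at the best score the model produced before any rule was applied. The function name and the wording of the advice are our own. The 0.3 cut-off is an assumption borrowed from the "probably not a good choice" band earlier in this article, not a platform setting.

```python
def diagnose_no_match(top_score: float, noise_floor: float = 0.3) -> str:
    """Suggest a next step after a No Match, given the best raw Intent score."""
    if top_score < noise_floor:
        # Nothing scored meaningfully: the model has not learned this phrasing.
        return "ML did not predict the Intent: test adding more training data"
    # Something did score, so a rule (such as a threshold) filtered it out.
    return "ML predicted an Intent: check the platform rules and settings"


print(diagnose_no_match(0.05))
print(diagnose_no_match(0.4))
```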

How Conversation Designers Can Access Confidence Scores

You may be wondering whether a Conversation Designer can follow the process we used. The answer is “yes.”

Below we show a (non-exhaustive) list of standard tools/platforms and how you can access information about Confidence Scores.

Alexa Developer Console

Alexa does not show the actual Confidence Score but provides two useful data points to debug the NLU issues.

The first one is the utterance profiler: When you look at your Intents in the console, you can press “Evaluate Model” to access it.
The second one is the Device Log: If you are building a custom skill, the device log shows you Considered Intents. It’s helpful to find competing intents and understand why the application behaves a certain way.

Device log in the Alexa Developer Console simulator, with the other considered intent highlighted

Google Dialogflow

In the simulator, you can view the Diagnostic information as you test, which will show you the Confidence Score. The downside is that it only shows the intent with the highest confidence score and not the other considered Intents.
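If you copy the diagnostic JSON out of the simulator, a few lines of Python can pull out the score. The sample below is hand-written and trimmed. We assume the Dialogflow CX response shape, where the matched Intent sits under `queryResult.match` with a `confidence` field. Dialogflow ES exposes the score differently, so check the response you actually get.

```python
import json

# Hand-written, trimmed sample; a real response carries many more fields.
sample_response = """
{
  "queryResult": {
    "text": "renew my license",
    "match": {
      "intent": {"displayName": "Professional License"},
      "matchType": "INTENT",
      "confidence": 0.4
    }
  }
}
"""


def top_match(response_json: str) -> tuple:
    """Return (intent display name, confidence) from a CX-style response."""
    match = json.loads(response_json).get("queryResult", {}).get("match", {})
    intent_name = match.get("intent", {}).get("displayName")
    return intent_name, match.get("confidence")


print(top_match(sample_response))  # ('Professional License', 0.4)
```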

We recommend that Conversation Designers keep two settings in mind:

The classification threshold is essential as it impacts intent detection and, therefore, the way you design your conversation.
The knowledge results preference orchestrates the balance between knowledge base results and the results from the Intents.

Rasa X

Rasa allows full control over the Conversational AI process. One of the great features is the ability to review conversations.

The interface provides a complete breakdown of all Confidence Scores and ML information during the conversations between users and the bot.

Reviewing conversations in Rasa X: the conversations explorer, with the confidence scores and the other considered intents (and their scores) highlighted
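The same information is available outside the interface. A Rasa NLU parse result carries the winning Intent under `intent` and the other candidates under `intent_ranking`, each with a `name` and a `confidence`. The sketch below lists the Intents that score close to the winner. The sample data, the helper name `competing_intents`, and the 0.15 margin are invented for illustration.

```python
from typing import Dict, List

# Invented parse result in the shape Rasa NLU returns.
parse_result = {
    "text": "renew my license",
    "intent": {"name": "professional_license", "confidence": 0.40},
    "intent_ranking": [
        {"name": "professional_license", "confidence": 0.40},
        {"name": "driving_license", "confidence": 0.35},
        {"name": "vehicle_tab", "confidence": 0.20},
    ],
}


def competing_intents(result: Dict, margin: float = 0.15) -> List[str]:
    """Names of Intents whose score is within `margin` of the winner's."""
    top = result["intent"]
    return [
        candidate["name"]
        for candidate in result.get("intent_ranking", [])
        if candidate["name"] != top["name"]
        and top["confidence"] - candidate["confidence"] <= margin
    ]


print(competing_intents(parse_result))  # ['driving_license']
```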

All the other tools

We can’t list all the available setups out there, but for the tool you’re using, here are a few things that you can do:

Reach out to the tool developer and ask them if they have any parameters for Confidence Scores that you should be aware of.
Ask the developers on your team if they know or can investigate.
Reach out to the Voice Tech Global Slack Community :)

Using Confidence Scores In CxD

This article covers Confidence Scores for problem-finding purposes only, but you can also use the knowledge of Confidence Scores for disambiguation in Conversation Design.
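As one possible illustration of that idea, and only a sketch of ours under stated assumptions, an agent could ask a clarifying question whenever the two best Intents score too close together. The function name, the 0.1 margin, the prompts, and the scores are all hypothetical.

```python
from typing import Dict


def next_prompt(scores: Dict[str, float], margin: float = 0.1) -> str:
    """Ask the user to choose when the two best Intents are too close to call."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return "Sorry, I didn't catch that."
    if len(ranked) >= 2 and ranked[0][1] - ranked[1][1] < margin:
        first, second = ranked[0][0], ranked[1][0]
        return f"Did you mean {first} or {second}?"
    return f"Okay, {ranked[0][0]}."


print(next_prompt({"driving license": 0.45, "professional license": 0.41, "vehicle tabs": 0.10}))
# Did you mean driving license or professional license?
print(next_prompt({"driving license": 0.90, "vehicle tabs": 0.05}))
# Okay, driving license.
```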

If you want to know more, you can join our ACXD Course to learn about our strategies for designing for and with ambiguity in mind.

We would love to hear your own experience with confidence scores and language model optimization as Conversation Designers in the comments!