Making voice assistants good conversational partners

Designing for errors [Part 1]

4 min read · Mar 31, 2018

Humans misunderstand each other all the time. They mishear accents, misinterpret meaning, and sometimes even enter a room forgetting what they were going to say. All of this is very common, so why are users not as forgiving when they encounter the same with a machine? The reason is that when humans interact with other humans and a misunderstanding occurs, they collectively resolve it. Machines, however, stop being good conversational partners when they don't help their users resolve these misunderstandings. Humans are wired to speak; speaking comes naturally to us. This is why we need to teach our machines how to speak and be good conversational partners, not the other way round.

Natural language systems invite users to say anything, but it is practically impossible for machines to always understand everything users say. Errors are bound to happen, and we cannot always prevent them. What we can do, however, is make our machines robust enough to tell their users what went wrong and work with them to resolve it. Humans use a combination of verbal and nonverbal communication, like eye gaze and gestures, to resolve misunderstandings with other humans. For simplicity, though, let's focus on voice-based applications in which the only output modality is speech, and see how they can recover from errors using verbal communication alone.

Before designing error recovery strategies, it is important to understand what kinds of errors might occur when users interact with voice-based applications.

Kinds of errors

Recognition rejects: These errors occur when the system fails to obtain any interpretation of the user's input, but knows that there is a problem. These are called non-understandings.

Example 1: Alexa misheard egg as “that/ten”

The image above is an example of a recognition error: Alexa misheard “egg” as “that” (although the voice feedback says “ten”).

False accepts: These errors occur when the system incorrectly interprets the user input and proceeds without realizing that something is wrong; the system doesn't know that there's a problem. These are called misunderstandings. They are generally followed by users trying to initiate a correction, which often leads to further errors.

Example 2: Misunderstanding the user's intent

When the user asked, “Alexa! Can you order me some food?”, it recommended a Snack Chest Care Package from Amazon. This is an example of a misunderstanding.

Speech timeout: This error occurs when the system doesn't hear any speech. This could be because the user said something but the system received no input, or because the user took too long to respond because they didn't know what to say.

Error recovery strategies

Escalating detail: When there are recognition rejects or speech timeouts, the system should provide details about the error and examples of what the users should say. With each subsequent error, more detail should be provided.

Caveat: Be very mindful of how you tell the users what to say. Instead of putting words in their mouth, try to give them intuitive examples of what they can say.
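The escalating-detail idea can be sketched in a few lines. This is a minimal illustration, not a real SDK call: the prompt wording and the `reprompt` function are hypothetical, and a real skill would feed the error count in from its dialog state.

```python
# Sketch of "escalating detail" re-prompting: each consecutive error
# produces a more detailed prompt. Prompts and names are illustrative.

ESCALATING_PROMPTS = [
    "Sorry, I didn't catch that.",
    "Sorry, I still didn't get that. You can ask a cooking question, "
    "for example: 'How long should I boil an egg?'",
    "I'm having trouble understanding. Try asking something like "
    "'How long should I boil an egg?', or say 'help' to hear what I can do.",
]

def reprompt(error_count: int) -> str:
    """Return a prompt with more detail for each consecutive error."""
    # Clamp so repeated failures keep using the most detailed prompt.
    index = min(error_count, len(ESCALATING_PROMPTS) - 1)
    return ESCALATING_PROMPTS[index]
```

Note that the last prompt offers intuitive examples and an escape hatch ("help") rather than scripting the user's exact words.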

Rapid re-prompt: This is similar to escalating detail, except that here the system assumes the user knows what to say and just needs one more chance; it doesn't jump straight to detailed instructions about what the user should say.

In example 1, where Alexa misheard “egg” as “that”, I knew that Alexa could answer my question (since I had asked the same question before). What I didn't know was that it hadn't heard me correctly. If it had re-prompted me with “Sorry? Can you please repeat?”, there would have been a higher chance of getting it right the next time.

Alternate approach: If there is more than one recognition reject, try an alternative approach.

Global error correction: Depending on its nature, every application should establish three thresholds: the maximum number of errors in a single dialog state (the state-specific error count, typically 3), the maximum number of errors counted globally across the session, and the maximum number of disconfirmations (typically 2). When any of these thresholds is reached, the system should have a defined way of helping the user.
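The three thresholds above can be tracked with a small counter object. This is a sketch: the class and method names are invented for illustration, and the global cap of 5 is an assumed value (the text gives typical values only for the other two thresholds).

```python
# Sketch of the three error-correction thresholds described above.
# Typical values from the text: 3 per-state errors, 2 disconfirmations.
# The global cap of 5 is an assumption for illustration.

class ErrorTracker:
    MAX_STATE_ERRORS = 3      # errors in a single dialog state
    MAX_GLOBAL_ERRORS = 5     # assumed session-wide cap
    MAX_DISCONFIRMATIONS = 2  # user rejections of confirmations

    def __init__(self):
        self.state_errors = 0
        self.global_errors = 0
        self.disconfirmations = 0

    def record_error(self):
        self.state_errors += 1
        self.global_errors += 1

    def record_disconfirmation(self):
        self.disconfirmations += 1

    def enter_new_state(self):
        self.state_errors = 0  # the state-specific count resets per state

    def should_escalate(self) -> bool:
        """True when any threshold is reached and the system should fall
        back to its defined recovery path (e.g. offering help)."""
        return (self.state_errors >= self.MAX_STATE_ERRORS
                or self.global_errors >= self.MAX_GLOBAL_ERRORS
                or self.disconfirmations >= self.MAX_DISCONFIRMATIONS)
```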

Confirmation strategies: False accepts can be difficult to detect because the system doesn't know that it is wrong. To prevent errors arising from false accepts, the system can confirm its hypothesis with the user in two ways:

Implicit confirmation: Let the user know what was understood, but don't ask them to confirm. Ex. “Ok, setting an alarm for 4:00 am.”

Three-tiered confidence: Use the confidence score to decide whether confirmation is needed. If confidence is high, don't confirm; if it is low but above the reject threshold, confirm with the user; otherwise, reject.
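The three tiers map naturally onto a small decision function. The threshold values here are illustrative assumptions; in practice they are tuned per application based on recognizer performance and the cost of errors.

```python
# Sketch of three-tiered confidence handling. Threshold values are
# illustrative; real applications tune them empirically.

HIGH_CONFIDENCE = 0.8   # at or above: act without explicit confirmation
REJECT_THRESHOLD = 0.3  # below: treat as a non-understanding

def confirmation_action(confidence: float) -> str:
    """Map a recognizer confidence score to one of three actions."""
    if confidence >= HIGH_CONFIDENCE:
        return "accept"   # proceed, optionally with implicit confirmation
    if confidence >= REJECT_THRESHOLD:
        return "confirm"  # ask the user to confirm the hypothesis
    return "reject"       # re-prompt, as for a recognition reject
```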

In example 2, instead of recommending a random Snack Chest Care Package from Amazon, Alexa could have confirmed by asking: “Do you want to order food from Amazon?”

There are cases when confirmation is not necessary, for example, when the cost of an error is low in terms of time lost and user confusion.

Listening time window: To prevent speech timeouts, carefully choose the time window during which the system keeps listening after the wake word (10 seconds is a good starting point).
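The listening window amounts to a deadline-bounded wait for input. A minimal sketch, assuming a hypothetical `get_audio_chunk` callback that returns audio when the user speaks and `None` while they are silent:

```python
# Sketch of a listening window after the wake word. `get_audio_chunk`
# is a hypothetical callback; 10 seconds is the starting point the
# text suggests.
import time

LISTEN_WINDOW_SECONDS = 10

def wait_for_speech(get_audio_chunk, window=LISTEN_WINDOW_SECONDS):
    """Return the first audio chunk heard, or None on a speech timeout."""
    deadline = time.monotonic() + window
    while time.monotonic() < deadline:
        chunk = get_audio_chunk()
        if chunk is not None:
            return chunk
    return None  # speech timeout: hand off to the error-recovery prompts
```

A `None` result is exactly the speech-timeout case described earlier, so it would feed into the escalating-detail counter rather than ending the session silently.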




Passionate about designing Voice Interfaces. VUI Designer @Amazon.