Why The Overall Voicebot Solution And User Experience Are More Important Than Speech Accuracy

Marco Noel
IBM Watson Speech Services
Apr 27, 2020

Reading the title of this article, I’m sure everyone thinks I’m stating the obvious. But you would be surprised how many times I get into a discussion with customers where the first thing they say is: “I am expecting 90%+ accuracy from speech recognition”.

Based on my 4 years of experience deploying Watson IVR Self-Service voice solutions, I want to set the right expectations around this unrealistic 90%+ speech accuracy target:

  • According to multiple independent research papers, humans deal with a 4–5% Word Error Rate on average. I have listened to call agents in call centers and frequently heard them re-prompt users to make sure they heard them correctly. On any voice engagement, I encourage you to do the same: talk to human agents. Ask how they deal with difficult challenges like a bad line, thick accents, or a noisy environment. You will learn a lot more than you think.
  • When you add real-life environmental factors like a bad phone line, background noise, crosstalk and heavy accents, the Word Error Rate gets even worse. On every voice engagement I did, this is a major reality check for customers: if a human cannot understand what is being said, speech recognition will not do better. It’s not magic.
  • The return on investment (ROI) of chasing such a high level of accuracy is simply not worth it. Each additional point of accuracy requires more time and effort to reach. Training Watson STT, just like any cognitive engine, follows a logarithmic curve (see figure below).
Average amount of effort (in weeks) required to train Watson Speech-To-Text

On typical voice engagements, you usually achieve the bulk of your accuracy improvements within the first 2 weeks. Then, as you can see on the graph, each percentage point requires more work and effort. Of course, results may vary based on the use case and the environment your users call from.

Watson Speech-To-Text delivers a lot of features and controls to train its speech engine and handle noise, such as Language and Acoustic Model customization, Grammars, Speech Activity Detection, and much more (check the documentation for more details). But even with all these great speech features, speech accuracy alone will never guarantee a great user experience and user adoption.
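To make Language Model customization concrete, here is a minimal sketch using the Watson Speech-To-Text Python SDK. The credentials, model name and corpus file are placeholders, not values from a real project:

from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholders: replace with your own API key and service URL
stt = SpeechToTextV1(authenticator=IAMAuthenticator('YOUR_APIKEY'))
stt.set_service_url('YOUR_SERVICE_URL')

# Create a custom language model on top of the narrowband base model
# (the usual choice for telephony audio)
model = stt.create_language_model(
    'claims-ivr-model',
    'en-US_NarrowbandModel',
    description='Custom LM for the claims IVR use case').get_result()
customization_id = model['customization_id']

# Add a corpus of domain utterances, then kick off training
# (in practice, poll get_corpus() until the corpus finishes processing first)
with open('claims_corpus.txt', 'rb') as corpus_file:
    stt.add_corpus(customization_id, 'claims-corpus', corpus_file)
stt.train_language_model(customization_id)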

Then how should you really evaluate and measure your IVR solution?

Start by selecting meaningful business metrics and financial objectives


When you select your use case and build your business case, you should have a list of business metrics and financial objectives. Here are some common ones I have seen:

  • Call deflection: Process NN% of current call volume related to <use case>
  • Call duration: Reduce the call duration by NN% from X minutes to Y minutes
  • Cost savings: Cost related to call centers reduced by X%
  • Call completion: X% of users are completing transactions without transferring to an agent.

You should never lose sight of these very important metrics because that’s how you should really measure your solution.

Speech accuracy, just like all the other metrics for each Watson component, should ONLY be used to train, test and track progress during the development of your solution (this is your unit testing). These metrics should NEVER be used as business measures. The next sections explain why.

A great voice solution is greater than the sum of its parts


Here goes another obvious one! But to be honest, I still frequently deal with customers who focus on the performance of each individual component instead of the overall solution. I think this comes from an “old” legacy IT project mindset, which is totally different from how cognitive projects work.

With Watson Assistant for Voice Interaction (WAVI), you have multiple components like the Voice Gateway/Voice Agent, Watson Assistant, Watson Speech-To-Text and Watson Text-To-Speech. If you have integrations with backend systems, you would have a custom Service Orchestration Engine (SOE).

All these parts work closely together, mostly in support of each other.

In every voice project I have worked on, whenever we dealt with an issue, we always looked at it through the whole solution and avoided any direct finger-pointing at a specific component. Everyone on the team, from the Watson Assistant developer to the Speech-To-Text developer down to the SOE developer, would propose how they would resolve it if asked to fix it. Then we would ask ourselves the following questions:

  • What is the quickest solution?
  • What is the least impactful solution?
  • What is the best long-term solution?

With a voice solution, everything starts with speech recognition, but that does not necessarily mean that every fix has to happen in Speech-To-Text.

It could be as simple as adding an entity synonym in Watson Assistant (which takes seconds), building a quick SOE post-processing routine (which can take just a few hours), or creating a custom word in STT. When adding more training to STT, especially to fix an issue, there is always a risk of introducing regressions, so careful experimentation is always recommended.
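As an illustration of the “custom word” option, here is a minimal sketch with the Watson STT Python SDK; the word and its sounds_like variants are made-up examples, and the customization ID is a placeholder:

from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

stt = SpeechToTextV1(authenticator=IAMAuthenticator('YOUR_APIKEY'))
stt.set_service_url('YOUR_SERVICE_URL')

# Tell the custom language model how a domain term sounds and how to display it
stt.add_word(
    'YOUR_CUSTOMIZATION_ID',
    word_name='WAVI',
    sounds_like=['wavy', 'wah vee'],
    display_as='WAVI')

# Retrain the custom model so the new word takes effect
stt.train_language_model('YOUR_CUSTOMIZATION_ID')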

Users really don’t care about accurate speech transcription in an IVR Self-Service. What they want is their transaction to complete successfully with a good user experience.

Focus on the voice user experience, not the technical features


Let’s use the example of a user checking the status of his insurance claim.

We normally break it down into three major phases:

  • Intent recognition (how the user is asking for claim status in natural language)
  • Authentication (key unique information to identify your policy owner)
  • Transaction (supplemental information needed to complete the claim transaction)

A big misconception I frequently hear is: “Less prompts equal better efficiency and better user experience”. From my experience, this is just not true. This might work fine in a chatbot but it’s the total opposite for a voicebot.

Let’s look at a human example:

  • Agent: “How may I help you?”
  • User: “My name is John Smith and I need to check the status of my claim. My policy number is 123456. My date of birth is January 22, 1980. I filed my claim on April 1, 2020”
  • Agent: “Hello Mr. Smith. Sure I can help you with your claim but would you mind giving me your policy number again please?…”

Normal agent behavior. Ask any human agent and they will tell you the same thing: ask for one data input at a time and make sure you got it right. Unless you are a super-agent…

In Watson Assistant, long utterances like this, with slots and entities, might work great in a text chatbot, but with voice they are very error-prone (e.g. sudden noise, long pauses in the middle while the user searches for information, etc.). A common pitfall is trying to convert your text chatbot as-is into a voicebot. While it is technically very easy to do, I have not yet seen a successful one, mainly because of the painful user experience. Here’s why:

  • Intents: Users do not ask for things the same way in a text chatbot using a keyboard versus talking on their phone. You will need to adapt your intent training in Watson Assistant.
  • Questions/Prompts: Long, detailed and precise questions might be very useful in a text chatbot but can be very annoying and irritating with voice. Keep them short and to the point. Discard things like “Thank you for your response. Would you please provide me your …?” and use something like “What is your …?”. One funny piece of user feedback I got during an engagement was “Thanking me every time I give a valid response does not make this nicer, but more annoying”. Point taken!
  • Answers: While having a detailed answer with links is really cool and valuable in a text chatbot, it’s generally very long and tedious to listen to with voice (spelling a URL letter by letter… ugh!). I’m not even talking about document paragraphs returned from Watson Discovery. Curation of your answers will be required, with some support from SMS text messaging when needed (e.g. a URL pointing to detailed documents).
  • Overall Dialog business flows: A text chatbot can deliver clickable buttons and dropdown lists; voice cannot. To improve your voice user experience, you might, for example, have to break down a single data collection step into multiple smaller steps. Simply watch how your human agents naturally do it with their callers.

INTENT RECOGNITION

For intent recognition with Speech-To-Text, the general training guideline is to keep the utterances short (approx. 10–15 words). When dealing with automated voice systems, humans usually keep it short and simple. Also, as a starting point, try to match the intent and entity training in Watson Assistant with Speech-To-Text by creating a Language Model customization from it.

Check the latest Medium article Andrew R. Freed and I recently published on how to quickly train Speech-To-Text from intents and entities in Watson Assistant.
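As a rough idea of what that looks like in practice, here is a minimal sketch (not the exact approach from the article) that exports the intent examples from a Watson Assistant V1 workspace and feeds them to an STT Language Model customization as a corpus. Workspace ID, customization ID and credentials are placeholders:

from ibm_watson import AssistantV1, SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

assistant = AssistantV1(version='2020-04-01',
                        authenticator=IAMAuthenticator('ASSISTANT_APIKEY'))
assistant.set_service_url('ASSISTANT_URL')

# Export every intent with its user examples
intents = assistant.list_intents('YOUR_WORKSPACE_ID', export=True).get_result()
utterances = [ex['text']
              for intent in intents['intents']
              for ex in intent.get('examples', [])]

# Write the utterances to a corpus file and add it to the STT custom model
with open('assistant_corpus.txt', 'w') as f:
    f.write('\n'.join(utterances))

stt = SpeechToTextV1(authenticator=IAMAuthenticator('STT_APIKEY'))
stt.set_service_url('STT_URL')
with open('assistant_corpus.txt', 'rb') as corpus_file:
    stt.add_corpus('YOUR_CUSTOMIZATION_ID', 'assistant-intents', corpus_file,
                   allow_overwrite=True)
stt.train_language_model('YOUR_CUSTOMIZATION_ID')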

AUTHENTICATION

Now, for the authentication phase, it is important that you select the right data inputs for your business process, especially those that are “unique” (no duplicates):

  • IDs — eg. Member ID, Card ID, Credit Card, Social Security
  • Contract Number — eg. Policy Number, Invoice Number
  • Dates — eg. Date of Birth, Date of Transaction

There are also some that I do NOT recommend using for speech recognition:

  • First Name/Last Name
  • Email Address
  • Street Name

I strongly advise staying away from these. Why? First and foremost, none of them has a commonly accepted spelling or pronunciation. They can have multiple variants, and many have foreign origins, which makes them challenging even for humans who do not know the right pronunciation; the chance of them being mispronounced, and therefore not transcribed properly, goes up. First and last names are also not unique: how do you handle “John Smith” if you have 25 of them? Use the unique data keys instead.
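For illustration only, here is the kind of simple post-processing (for example in the SOE) that makes unique IDs like policy numbers more robust than names: normalize the raw transcription into digits before the backend lookup. The function name and mappings are hypothetical:

# Map common spoken digits to characters before validating against the backend
SPOKEN_DIGITS = {
    'zero': '0', 'oh': '0', 'one': '1', 'two': '2', 'three': '3',
    'four': '4', 'five': '5', 'six': '6', 'seven': '7', 'eight': '8',
    'nine': '9',
}

def normalize_policy_number(transcript: str) -> str:
    """Turn 'one two three four five six' or '12 34 56' into '123456'."""
    digits = []
    for token in transcript.lower().split():
        if token in SPOKEN_DIGITS:
            digits.append(SPOKEN_DIGITS[token])
        else:
            digits.extend(ch for ch in token if ch.isdigit())
    return ''.join(digits)

print(normalize_policy_number('one two three four five six'))  # 123456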

TRANSACTION

Once you have gone through the intent recognition and the authentication, you should have some payload information from the backend systems about the user:

  • Full name
  • Full address
  • Email address
  • Home and Cellphone numbers
  • Dependents
  • Policy coverages
  • Invoices and payments
  • List of past and current claims

You can leverage all this information to personalize and streamline the process. To complete the transaction, you can select the last filed claim, play it back and validate that it is the claim the user is inquiring about. Once confirmed, you can return the status and pending actions by voice, or send them to the email address on record, etc.
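Here is a purely illustrative sketch of that idea: pick the most recently filed claim from the backend payload and turn it into a short confirmation prompt. The payload shape and field names are made up for the example:

from datetime import date

# Hypothetical claims payload returned by the backend after authentication
claims = [
    {'claim_id': 'CLM-001', 'filed_on': date(2020, 2, 12), 'status': 'closed'},
    {'claim_id': 'CLM-002', 'filed_on': date(2020, 4, 1), 'status': 'in review'},
]

# Select the most recently filed claim and play it back for confirmation
latest = max(claims, key=lambda c: c['filed_on'])
prompt = (f"I found a claim filed on {latest['filed_on']:%B %d, %Y}. "
          "Is this the claim you are calling about?")
print(prompt)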

Reprompting is NOT a bad thing


Well, that’s all great, but what happens if the information provided by the user is not recognized properly? Humans have conversation mechanisms to clarify and disambiguate in order to avoid misunderstandings. Re-prompting is one of them. It is very common to build it into your overall solution, but you have to make sure it does not hinder your user experience.

I’m still surprised how some customers consider a re-prompt as a failure of the solution.

The main objective of a re-prompt is to clarify, validate or disambiguate what the user said, give them the opportunity to provide the correct information, and then continue to the next step. This is normal human behavior.

What I have learned over the years is that users do not mind re-prompts… as long as they do not have to repeat themselves too many times! The typical “3 strikes” rule is commonly used as a de facto standard: after 2 failed attempts, transfer the call to a human agent. A good re-prompt design is a very important part of the user experience.
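As a sketch of how that rule translates into logic (in a real WAVI solution this lives in the dialog or orchestration layer; collect_input, validate and transfer_to_agent are hypothetical placeholders):

MAX_ATTEMPTS = 3  # initial prompt plus re-prompts before giving up

def collect_policy_number(collect_input, validate, transfer_to_agent):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        value = collect_input("What is your policy number?")
        if validate(value):
            return value
    # All attempts failed: hand the caller over to a human agent
    return transfer_to_agent()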

My general recommendation for re-prompt on different types of inputs is:

Recommendations on how to handle different voice inputs and re-prompt in case of failure

One of the great features of WAVI is the ability to send SMS text messages during a voice interaction. This is not a Speech-To-Text feature but more of a Voice Agent / Voice Gateway one. You can dynamically send information like URLs and other text as an answer. But one of my favorites is the ability to prompt the user over SMS for a tough voice data input like “First Name/Last Name”, “Email Address”, “Street Name” and others. Once they provide it, the voice conversation continues naturally. This avoids major user frustration and multiple call transfers to human agents.

But what if my users are on a landline or desk phone and cannot receive text messages? Here are some options for this specific situation:

  • Use the existing cellphone number stored in the user’s profile
  • Prompt the user for the cellphone number to be used
  • Avoid using these tough data inputs as prompts

Here’s a Watson Assistant JSON example that sends an SMS message with the vgwActSendSMS command:

{
  "output": {
    "text": {
      "values": [
        "Testing in $city, $state is available at multiple locations. You will receive a text with detailed information on nearby hospitals."
      ],
      "selection_policy": "sequential"
    },
    "vgwActionSequence": [
      {
        "command": "vgwActSendSMS",
        "parameters": {
          "message": "Covid-19 Nearby hospitals in $city, $state - $hospitals_url"
        }
      },
      {
        "command": "vgwActPlayText"
      }
    ]
  },
  "context": {
    "city": "<? input.text ?>",
    "state": "<? entities.city.value ?>",
    "hospitals_url": "https://www.google.com/maps/search/?api=1&query=hospitals$city$state"
  }
}

For more information on how to set it up, check the Voice Gateway SMS feature documentation.

I have also seen customers use DTMF (touch-tone) as a re-prompt mechanism to deal with numeric and alphanumeric data inputs (IDs, dates, etc.), especially in noisy environments. The Voice Gateway / Voice Agent DTMF feature lets you accept voice, DTMF, or both at the same prompt. Here’s a Watson Assistant JSON example that collects DTMF with the vgwActCollectDTMF command:

{
  "output": {
    "text": {
      "values": [
        "Please enter the extension you are trying to reach and then press #."
      ]
    },
    "vgwAction": {
      "command": "vgwActCollectDTMF",
      "parameters": {
        "dtmfTermKey": "#"
      }
    }
  }
}

For more details on the DTMF feature, see our Voice Gateway documentation on DTMF.

To learn more about STT, you can also go through the STT Getting Started video.

Marco Noel is an Offering Manager for IBM Watson Speech Services, focused on educating customers to successfully implement Watson Speech-To-Text, Text-To-Speech and Watson Assistant for Voice Interaction (WAVI). He always loves to see and learn how creative customers are with these technologies.
