Designing for Voice — Part II

Baisampayan Saha
Published in Go-MMT Design
8 min read · Mar 21, 2018


We covered some of the basics in Part I of this article. Let's now understand how the machine figures out what we say and how it responds. Understanding these nuances will help us design better VUIs.

Automatic Speech Recognition (ASR)

As we have already seen, listening to what users are saying and interpreting it correctly is the most important and most challenging task of any voice assistant or service. Let's talk about the Automatic Speech Recognition (ASR) systems found in such services.

Diagram showing the configuration of an ASR engine

In an ASR system —

  1. The audio input from the user is taken in by the front-end and feature-extraction stages and then sent to a classifier that checks for known patterns in it. The same data can also be fed back as training data for the system.
  2. The classifier continuously consults the acoustic/phonetic models and the language models. The language models, or grammars, are the part of the system that voice service providers supply to interpret what users are saying.
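
To make the flow above concrete, here is a toy, self-contained sketch of how those stages might hand data to one another. Every class, function, and score below is made up purely for illustration; real ASR engines expose very different interfaces and do far more work at each stage.

```python
# A toy sketch of the ASR flow described above. Every name and score
# here is hypothetical, purely to show how the stages hand off data.

class AcousticModel:
    """Stand-in for the acoustic/phonetic model: maps audio features
    to candidate word sequences with raw scores."""
    def score(self, features):
        # Pretend the classifier found two competing hypotheses.
        return [("book a flight", 0.6), ("look a fight", 0.4)]

class LanguageModel:
    """Stand-in for the language model / grammar that the voice service
    provider supplies: it re-ranks hypotheses toward expected phrases."""
    grammar = {"book a flight", "cancel my booking", "yes", "no"}

    def rerank(self, hypotheses):
        # Boost hypotheses that appear in the grammar, then pick the best.
        boosted = [
            (text, score + (0.3 if text in self.grammar else 0.0))
            for text, score in hypotheses
        ]
        return max(boosted, key=lambda h: h[1])

def extract_features(audio_frames):
    # Front-end / feature-extraction stand-in (real systems compute
    # MFCCs or similar for every frame of audio).
    return [sum(frame) / len(frame) for frame in audio_frames]

def recognise(audio_frames):
    features = extract_features(audio_frames)              # 1. front-end
    hypotheses = AcousticModel().score(features)           # 2. classifier + acoustic model
    text, confidence = LanguageModel().rerank(hypotheses)  # 3. grammar / language model
    return text, confidence

print(recognise([[0.1, 0.2], [0.3, 0.1]]))  # the grammar-matching hypothesis wins
```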

Take a look at the screenshot below to understand this better. It is taken from the flow.ai Mac app, which is used to create voice and chat interfaces. The screenshot shows a voice-interface prototype; this particular window is for training the prototype to recognise the various answers that count as a “Yes”. I have added several entries that should map to the answer “yes”.

Screenshot of flow.ai showing a screen on training the machine to understand various replies

The above screenshot is an example of training the machine to detect when the user has said “Yes” to a “Yes/No” question. The user can simply say “Yes”, but might instead say any of the options shown in the screenshot. If we do not train the machine to recognise that other possible answers also map to “Yes”, it will often fail to understand what the user is saying.
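
Outside of any particular tool, the same idea can be expressed as a small mapping from many surface phrasings to a single “yes” intent before the dialogue logic sees them. The phrase list and function below are hypothetical, just to illustrate the point.

```python
# Hypothetical sketch: normalise many phrasings onto one "yes" intent.

YES_PHRASES = {
    "yes", "yeah", "yep", "yup", "sure", "of course",
    "ok", "okay", "definitely", "sounds good", "why not",
}

def is_affirmative(utterance: str) -> bool:
    """Return True if the recognised text should count as a 'Yes'."""
    text = utterance.lower().strip(" .!?")
    return text in YES_PHRASES

print(is_affirmative("Yeah!"))       # True
print(is_affirmative("Not really"))  # False
```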

“The most important thing in communication is hearing what isn’t said.” — Peter Drucker

Diagram showing the various statuses of an ASR output

When the user says something to the VUI, the ASR analyses the audio input and sends an output, which a text-to-speech engine then converts into audio for the user to listen to. The output depends on the status of the analysed result. The diagram above shows the different statuses of the analysed result.

When the ASR is very confident that the audio input is a valid entry (that is, the probability is high that the input matches the grammar used in the ASR engine and is a possible, valid answer to the question the VUI asked), the confidence is high. After analysing the audio input, the ASR engine might instead conclude that it is not fully confident that what the user said matches its grammar, but that the input did have some words that matched. Then, instead of rejecting the user’s audio input outright, it can ask a follow-up question to verify the answer; we will see an example of this in the sample conversation below. If the machine cannot decipher the audio input at all, the confidence is very low and such entries are called “garbage”. Tackling garbage is a big topic in itself. There are other types of entries that the machine might not understand —

  1. When the user is silent while the VUI is waiting for a reply, or
  2. When the user spoke too soon for the machine to pick up the audio. This can happen if the user spoke just before the smart window/pause that VUIs use to pick up user replies. This error creeps in mainly in half-duplex systems. While designing, one has to take special care with such scenarios.
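
A minimal sketch of how a VUI might branch on these statuses is shown below. The thresholds, names, and result structure are assumptions for illustration; real platforms expose their own confidence scores and timeout events.

```python
# Hypothetical sketch of branching on an ASR result by confidence,
# plus the no-input case. Thresholds and field names are assumptions.

from dataclasses import dataclass
from typing import Optional

@dataclass
class AsrResult:
    text: Optional[str]      # None when nothing usable was heard
    confidence: float        # 0.0 .. 1.0 from the recogniser
    timed_out: bool = False  # user stayed silent past the listening window

def next_prompt(result: AsrResult) -> str:
    if result.timed_out:
        # No-input: the user said nothing, or spoke before the listening
        # window opened (half-duplex), so nothing was captured.
        return "Sorry, I didn't hear you. Could you say that again?"
    if result.text is None or result.confidence < 0.3:
        # "Garbage": the audio could not be matched to the grammar at all.
        return "Sorry, I didn't catch that. Could you rephrase?"
    if result.confidence < 0.7:
        # Partial match: confirm rather than reject outright.
        return f"Did you say '{result.text}'?"
    # High confidence: accept the answer and move on.
    return f"Got it: {result.text}."

print(next_prompt(AsrResult(text="four", confidence=0.55)))  # confirmation question
```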

Let’s now look at a scenario taken from one of the transcripts of EIG that Bruce Balentine presented during the workshop on “Designing for Voice User Interfaces”.

A Customer Survey

Adrian Lannin has just completed a transaction with an agent at the Woodgrove Bank. At the end of the call, she is asked whether she would be willing to take part in a customer survey for a rebate of $5 on her banking charges. She agrees. She is connected to an IVR, which asks her a series of simple questions regarding her banking habits. Adrian is also able to record a brief audio comment, but she declines. She completes the call with instructions on how to claim her rebate.

Agent connects Ms. Lannin's call to the IVR.

IVR: Welcome to the Woodgrove banking survey... and thank you for participating. Here's how the survey works.

IVR: I'll ask you a few questions about your banking habits. If you would rather not answer a question, just say, "no comment". Are you ready to begin?

[The first yes-no question is easy and designed to give the user immediate practice with beeps. As the IVR finishes talking, a beep is played to indicate that it is the user's turn to talk.]

User: Yes

[The user responds predictably to the question and is immediately rewarded with non-speech grounding in the form of the <Close Beep>.]

IVR: <Close Beep> That's great. Approximately how many years have you been a customer with us?

[The caller has now learned the beep protocol.]

User: About 2 years.

IVR: <Close Beep> Do other members of your household bank with us as well?

User: Yes. (Spoken softly, with an obscuring noise in the background)

[Just a random occurrence of slightly soft speech. This happens a lot with yes and no answers: the speech is not detected because it is too low in energy or too short in duration. So the IVR waits for some time for the user to respond again.]

IVR: (waiting... didn't hear the "yes" for 3 seconds or so)

User: Yes. (spoken more loudly)

[Because she doesn't hear the grounding tone, the user spontaneously repeats the "Yes".]

IVR: <Close Beep> Considering that we offer services such as checking and savings, as well as money market accounts, roughly how many accounts does your entire household bank with us?

User: Uhm, gosh... well, let's see... (pause)

IVR: Just roughly...

[The IVR does not understand sentences like the one spoken above. But if we train the IVR to respond to the various kinds of phrases that can start an answer, it can mimic a conversation between two people.]

User: I'd say 4 accounts

IVR: That was four?

[The IVR confirms whether the user said "four": a good way to validate that the answer was heard correctly.]

User: Yes

IVR: <Close Beep> What's your household's annual income?

User: No comment

[Since the user declines this personal question, the machine is intelligent enough to skip the remaining personal questions and not ask any more of them. This can serve as a "privacy sensitivity" test question.]

IVR: <Close Beep> Using words like "never", "very frequently", or "all the time", how often do you use our web site?

User: Oh, I'd say almost never

[The user has not answered in exactly the form the IVR asked for. The machine can be trained to understand different answers from users that map to one of the options offered by the IVR.]

IVR: <Close Beep> How often do you call our live-agent call center?

User: Hmm... frequently

IVR: <Close Beep> How often do you use our automated telephone system?

User: Not very often.

IVR: <Close Beep> How often do you drive through one of our branch locations?

User: Well, I'm... you know, especially when I'm making deposits... I usually take out some cash, and... well...

IVR: (momentary pause) Would you say, "frequently"?

[The IVR does not understand rambling speech unless it has been trained to. In such scenarios, the IVR can offer an obvious middle-of-scale choice. This is an example of "kinder and gentler" machine coaxing.]

User: Very frequently

IVR: <Close Beep> Finally, do you have any general comments about our banking services?

User: No, I don't think so

IVR: <Close Beep> Thank you for taking part in this survey. To claim your $5 rebate please visit our website at www.woodgrovebank.com. Your rebate code is WGKW. Your pin is four-eight-nine-seven. Would you like to hear this again?

[Again, the IVR asks a "Yes/No" question: has the user heard the information properly, or should the machine repeat it?]

User: No

IVR: Thank you for calling Woodgrove Bank. Goodbye!

(Sound of user hanging up)

The above example shows how machines are trying to be more natural, moving closer to human-to-human conversation.
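
Two of the behaviours in this transcript lend themselves to a small illustration: mapping free-form replies onto the scale the IVR offered, and coaxing with an obvious middle option when the reply doesn't match anything. The sketch below is hypothetical; the phrase lists and names are assumptions, not how EIG's system actually works.

```python
# Hypothetical sketch: map free-form replies to a fixed frequency scale,
# and coax with a middle-of-scale suggestion when nothing matches.

FREQUENCY_SCALE = ["never", "rarely", "sometimes", "frequently", "very frequently"]

SYNONYMS = {
    "never": {"never", "almost never", "not at all"},
    "rarely": {"rarely", "not very often", "hardly ever"},
    "sometimes": {"sometimes", "occasionally", "now and then"},
    "frequently": {"frequently", "often", "a lot"},
    "very frequently": {"very frequently", "all the time", "constantly"},
}

def map_to_scale(utterance: str):
    text = utterance.lower().strip(" .!?")
    for option, phrases in SYNONYMS.items():
        if any(p in text for p in phrases):
            return option
    return None  # rambling or unrecognised answer

def respond(utterance: str) -> str:
    option = map_to_scale(utterance)
    if option:
        return f"<Close Beep> Recorded: {option}."
    # Coax with an obvious middle-of-scale choice instead of rejecting.
    middle = FREQUENCY_SCALE[len(FREQUENCY_SCALE) // 2]
    return f"Would you say, '{middle}'?"

print(respond("Oh, I'd say almost never"))                  # maps to "never"
print(respond("Well, you know, when I make deposits..."))   # coaxing question
```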

Though there has been a significant rise in voice-based services and devices in the market, user take-up is very low. A paper by Roger K. Moore talks about this. He suggests that there is a habitability gap between user expectations and the features offered by voice services.

Contemporary voice-based interfaces have a very human-like voice compared to older interfaces, which had a machine-like voice. Does a machine-like voice work better in voice assistants, which are mostly task-based?

It has been observed that when the voice is machine-like, human expectations do not rise much; each interaction stays very transactional and sometimes limited. But when the interface voice is near-human, it may lead to the uncanny valley effect, in which a near human-like artefact (such as a humanoid robot) can trigger feelings of eeriness and repulsion in an observer: as human likeness increases, so does affinity, up to a point where the artefact starts to appear creepy [Mori, M. (1970). Bukimi no tani (The uncanny valley). Energy, 7, 33–35; Roger K. Moore (2016), Is Spoken Language All-or-Nothing? Implications for Future Speech-Based Human-Machine Interaction].

One example of this could be the privacy concerns around voice interfaces: they are listening to the user all the time. Are these services being used with malicious intent? We don’t know for sure. There have been cases where chatbots turned racist; the same can happen with voice interfaces. While designing for voice, do keep in mind the capabilities of the technology used and the purpose of the interface. As Bruce Balentine says —

“It’s Better to Be a Good Machine Than a Bad Person”.

[Balentine, B. (2007). It’s Better to Be a Good Machine Than a Bad Person: Speech Recognition and Other Exotic User Interfaces at the Twilight of the Jetsonian Age. Annapolis: ICMI Press.]

Always test your hypotheses using prototypes, from low fidelity to high fidelity. Wizard of Oz is one great tool for doing so. The more you test, the more error states you will find that you might not even have considered.

I hope this article covers some basic principles of designing for voice. Happy designing and prototyping for voice interfaces!

If you have missed anything, read Part I of this article via the link below:

Designing for Voice — Part I
