How should a UX designer plan for voice interaction?

丹尼爾 Daniel · uxbreakfast · Sep 14, 2020
Voice assistant at home. Photo: Unsplash

Software mostly relies on graphical interface design to let users interact and complete tasks. Since the invention of smartphones, the graphical user interface (GUI) has matured, and there are many examples and guidelines to draw on: Google’s Material Design, and books on interface principles by masters such as Don Norman. Compared with GUI, the field of the voice user interface (VUI) is still at a very early stage.

In the era before the Internet and smartphones, voice systems were a primary medium for interacting with users and providing services, such as phoning to inquire about a bank account or to book air tickets.

Today, interactive voice systems are equipped with artificial intelligence and are called virtual assistants. On the phone, we can coordinate visual interface design and touch control to give users a smooth experience.

You may ask: will VUI design be much simpler than GUI? Although we don’t need Figma or Sketch to create UX flows and screens, or to build a comprehensive design system, voice is more complicated and time-consuming than we think. Intelligent technology increases the interactivity of voice services, which can complete more tasks, but it also increases uncertainty and generates a large number of user scenarios.

Imagine you are filling out a questionnaire with basic personal information. You just type in a number; you can see the input and correct it instantly if you mistype, or the designer can set constraints (a dropdown menu with age ranges) to reduce errors. In the worst case, you will complain that a button is hard to press. Now switch to a voice assistant that asks, “What is your age?” I answer, “I am forty.” If I am far away from the device, there is a chance it is misrecognized as “fourteen.” In a GUI, as long as there is no error when inputting, the system receives the correct information. In a VUI, we cannot check how accurately we were heard; we simply hope the assistant understood our voice. So how should the system respond? How do we design our VUI? To find out, I read Designing Voice User Interfaces: Principles of Conversational Experiences by Cathy Pearl, head of conversation design at Google. The book is terrific, and I recommend it if you want to know more about designing VUI. In this article, I will introduce several interesting design pain points, plus my own examples, to illustrate the difficulty of developing VUI and some possible solutions.

1: Complement with GUI

Traditional telephone voice systems (Interactive Voice Response, or IVR) use voice feedback only. In the smartphone generation, we use both voice and visuals to give feedback. At the beginning of the design, it is necessary to figure out which information should be spoken and which additional information should be shown on screen.

For example, I want to find the players who scored the most goals in the English Premier League. It would be cumbersome to express this in voice alone: “The first place is Jamie Vardy, who scored 23 goals. Second place is Danny Ings, who scored 22 goals. Third place is…” the voice assistant says.

If the result displays as a table, it is simple and straightforward, and the user can use the data to check the required information further.

Me: Okay, Google. Can you show me the top scorers of the English Premier League?

Google: Here is the information from Squawka.

Some virtual assistants without a display, such as the Amazon Echo, will send the data to the user’s smartphone instead.

Google Assistant shows the result on screen
Showing the table of EPL top scorers
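
To make the voice-or-screen decision concrete, here is a minimal sketch of one way to route a response, assuming a hypothetical assistant back end; names like render_response and MAX_SPOKEN_ITEMS are illustrative, not any real API.

```python
# A toy modality router: short answers are spoken, long lists go to a
# screen or companion app. All names here are hypothetical.

MAX_SPOKEN_ITEMS = 3  # beyond this, a spoken list is hard to follow


def render_response(items, has_screen):
    """Decide whether to speak a result or push it to a display."""
    if len(items) <= MAX_SPOKEN_ITEMS:
        # Short answers work fine as speech alone.
        return {"speech": "; ".join(items), "display": None}
    if has_screen:
        # Long or tabular answers go to the screen, with a short spoken
        # pointer so the user knows where to look.
        return {"speech": "Here is the information.", "display": items}
    # No screen (e.g. a smart speaker): speak the top result and send
    # the full list to the companion app, as the Echo does.
    return {"speech": items[0] + ". I sent the full list to your phone.",
            "display": items}


scorers = ["Jamie Vardy, 23 goals", "Danny Ings, 22 goals",
           "Pierre-Emerick Aubameyang, 22 goals", "Raheem Sterling, 20 goals"]
print(render_response(scorers, has_screen=True)["speech"])
print(render_response(scorers, has_screen=False)["speech"])
```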

2: Dialogue Design and Cognition

In IVR, due to equipment constraints, users press buttons on the phone to execute tasks, which reduces the chance of errors.

However, VUI dialogue relies on cloud AI to interpret the user’s voice and give feedback, and user dialogue is ever-changing. The most common failure is the assistant misunderstanding a pronoun in the conversation rather than simply replying, “Sorry, I don’t understand.” Human conversation is not as literal as computer input: people do not speak at a constant speed when they need to think, and we routinely use pronouns. For example, I was thinking of buying a Kindle Paperwhite e-reader, and I asked Google Assistant:

Me: Okay, Google. What is the price of the Amazon Paperwhite?

Google: Here is the information from amazon.com. (It shows me the USD price of a different model of Amazon e-reader.)

Me: Can you convert the price to Hong Kong dollars?

Google: 1 United States Dollar equals 7 Hong Kong dollars and 75 cents.

The first turn correctly showed the price I wanted to see. But when I asked to convert the price into the local currency, the assistant lost the referent of “the price,” which actually echoed the previous turn: the Paperwhite’s price.

Showing the price of Paperwhite

Now look at an example of good feedback.

Me: Have you heard of Skytree?

Google: Here are some results on the web

Me: Where is it located?

Google: The address for Tokyo Skytree is 1 Chome-1-2 Oshiage, Sumida City, Tokyo, 131-8634, Japan.

Me: What is the best restaurant there?

Google: Here is the summary from byfood, “Where to eat in Tokyo Skytree.” One, Sky Restaurant 634. Two, Skytree Cafe. Three… (and so on, up to number eight)

Google Assistant knows what I mean by “it” and “there,” so the user can finally reach the endpoint.
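
Under the hood, this kind of pronoun handling requires the assistant to carry context between turns. Here is a toy sketch of the idea, with a made-up DialogueContext class; real assistants use far richer language models, so treat this only as an illustration of remembering the last-mentioned entity.

```python
# A toy dialogue context that remembers the last entity the user
# mentioned, so later pronouns can be resolved against it.

PRONOUNS = {"it", "there", "that", "they"}


class DialogueContext:
    def __init__(self):
        self.last_entity = None  # e.g. "Tokyo Skytree"

    def remember(self, entity):
        """Store the most recently mentioned entity."""
        self.last_entity = entity

    def resolve(self, utterance):
        """Return the entity a pronoun in this utterance likely refers to."""
        words = {w.strip("?,.!").lower() for w in utterance.split()}
        if self.last_entity and words & PRONOUNS:
            return self.last_entity
        return None


ctx = DialogueContext()
ctx.remember("Tokyo Skytree")
print(ctx.resolve("Where is it located?"))                # -> Tokyo Skytree
print(ctx.resolve("What is the best restaurant there?"))  # -> Tokyo Skytree
```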

3: Appropriate Confirmation and User Perception

It is essential to confirm important information repeatedly, for example, when you transfer money from your account to a third-party bank account. But over-confirmation makes users impatient, especially for less important actions such as setting meeting reminders.

Me: Can you help me set up a reminder for a meeting with John on September 30?

Google: At what time?

Me: At 3 p.m.

Google: Got it! I’ll remind you on September 30 at 3:00 p.m.

This dialogue uses implicit confirmation. Putting the confirmation at the end does not make users feel it is repetitive. It is very different if we use explicit confirmation.

Me: Can you help me set up a reminder for a meeting with John on September 30?

Google: At what time?

Me: At 3 p.m.

Google: Got it! You want to set up a reminder for a meeting with John on September 30 at 3 p.m. Is that right?

Me: Yes.

Google: Okay, the reminder has been set.

With explicit confirmation, users need to pay attention to the confirmation before they can take the next step. For less essential matters, users will think the virtual assistant is stupid.

To decide which to use, you can set confidence benchmarks for different scenarios, distinguishing important tasks from routine ones.

User: Please buy more paper towels.

System:

[>80 percent confidence, implicit confirmation] Okay, ordering more paper towels…

[45–79 percent confidence, explicit confirmation] I thought you said you’d like to order more paper towels. Is that correct?

[<45 percent confidence] I’m sorry, I didn’t get that. What would you like to buy?
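
Here is a minimal sketch of that benchmark in code, assuming the speech recognizer returns a confidence score between 0 and 1 alongside the transcript; the thresholds mirror the example above and would be tuned per task. For high-stakes tasks such as a bank transfer, you would force explicit confirmation regardless of confidence.

```python
# A sketch of the confidence benchmark above. `confidence` is assumed to
# be a 0-1 score from the speech recognizer; thresholds mirror the example.

def confirmation_strategy(item, confidence):
    """Pick implicit, explicit, or reprompt based on recognizer confidence."""
    if confidence > 0.80:
        # Implicit confirmation: act, echoing the understanding in passing.
        return f"Okay, ordering more {item}..."
    if confidence >= 0.45:
        # Explicit confirmation: the user must verify before we act.
        return (f"I thought you said you'd like to order more {item}. "
                "Is that correct?")
    # Too uncertain to guess: ask again.
    return "I'm sorry, I didn't get that. What would you like to buy?"


print(confirmation_strategy("paper towels", 0.91))
print(confirmation_strategy("paper towels", 0.62))
print(confirmation_strategy("paper towels", 0.30))
```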

Besides, you can also use non-speech confirmation. For example, in smart home operations, the user gives the virtual assistant a lights-on instruction, and the assistant plays a short prompt sound as feedback that the command was received and confirmed. The light turns on, with no spoken dialogue needed.

Some tasks do not have standard answers, such as asking the virtual assistant to tell a joke, or telling it that you did not sleep well last night.

Me: I am not sleeping very well

Google: I hear that happens a lot to thoughtful people. Try asking me for relaxation sounds or nature sounds. 🌲 🌊 😌

Me: Play some relaxation sounds

We generally do not disclose our emotional state to a computer; it feels strange, especially while artificial intelligence’s understanding of human emotions and sensations is at a preliminary stage. But this is precisely a significant future direction for VUI. Initial medical diagnosis, or learning the user’s mental or health status through daily conversation, has much room for development in the direction of mediation.

4: Dialogue Marker

Marking the dialogue lets users know where they are. Compared with GUI, the user may not be concentrating on the conversation, such as when asking for directions while driving, or when listening to banking IVR instructions while having to remember account information at the same time.

Without markers

System: Did you take your medication last night?

User: Yes.

System: Goodbye.

With markers

System: I’ll be asking you a few questions about your health. First, how many hours of sleep did you get last night?

User: About seven.

System: Good job. And how many servings of fruits and vegetables did you eat yesterday?

User: Maybe four.

System: Got it. Last question: were you able to take your medication the previous night?

User: Yes.

System: All right. That’s it for now. I’ll talk to you again tomorrow. Goodbye.

Common dialogue markers are:

• Timelines (“First,” “Halfway there,” and “Finally”)

• Acknowledgments (“Thanks,” “Got it,” “All right,” and “Sorry about that.”)

• Positive feedback (“Good job,” and “Nice to hear that”)
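
Markers like these can be scripted rather than hand-written for every survey. The following toy sketch, with made-up names such as with_markers, wraps a list of questions in timeline markers and acknowledgments:

```python
# A toy generator that wraps survey questions with dialogue markers:
# a timeline marker for the first and last items, acknowledgments between.

ACKNOWLEDGMENTS = ["Got it.", "Thanks.", "All right."]


def with_markers(questions):
    """Yield each question prefixed with a timeline marker or acknowledgment."""
    for i, q in enumerate(questions):
        ack = ACKNOWLEDGMENTS[i % len(ACKNOWLEDGMENTS)]
        if i == 0:
            yield f"First, {q}"
        elif i == len(questions) - 1:
            yield f"{ack} Last question: {q}"
        else:
            yield f"{ack} And {q}"


survey = [
    "how many hours of sleep did you get last night?",
    "how many servings of fruits and vegetables did you eat yesterday?",
    "were you able to take your medication the previous night?",
]
for prompt in with_markers(survey):
    print(prompt)
```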

5: Error Handling

We can divide VUI error scenarios into four categories:

No speech detected

This is an observable problem. The way to handle it is either to do nothing or to tell the user explicitly, “I didn’t hear your response.”

If the conversation cannot move forward without detailed feedback from the user, it is better to say that no response was heard. Otherwise, it is appropriate to do nothing; the user will usually try to repeat themselves after a few seconds.

Speech detected, but nothing recognized

In most cases, the voice recognition software has mistranslated the context and the computer cannot find a correct answer. “Do nothing” or “I didn’t understand your question” are both appropriate responses.
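
A possible sketch of handling these first two error categories, assuming the recognizer returns None for silence, an empty string for unintelligible speech, and a transcript otherwise (all hypothetical conventions):

```python
# A sketch of the first two error categories: "no speech detected" and
# "speech detected, but nothing recognized".

def error_response(result, first_attempt, needs_answer):
    """Return what to say after a recognition attempt, or None to stay quiet."""
    if result is None:  # no speech detected
        # Stay silent on a first failure unless the flow requires an answer;
        # the user will usually just repeat themselves.
        if first_attempt and not needs_answer:
            return None
        return "I didn't hear your response."
    if result == "":  # speech detected, but nothing recognized
        return None if first_attempt else "I didn't understand your question."
    return None  # recognized: continue the normal flow


print(error_response(None, first_attempt=True, needs_answer=False))  # quiet
print(error_response("", first_attempt=False, needs_answer=True))
```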

Something was recognized correctly, but the system does the wrong thing with it

Nowadays, systems have become more humanlike, with enough intelligence to recognize commands that refer to the previous turn of conversation. The majority of these errors come from misunderstanding who or what the user’s “he,” “she,” or “it” refers to, or from words with more than one meaning, for example, “having a cold” versus “being cold.” Less often, the system simply has no handling for that type of answer.

I experienced this before when I asked Google Assistant who scored in the European Cup Final. The result is quite telling.

Me: Okay, Google

(Listening)

Me: What was the result of the last game of Manchester United?

Google: On August 17, they played Sevilla. The final score was 2–1 to Sevilla.

Me: Who scored for Manchester United?

Google: Man United didn’t score. (Actually, the scorer was B. Fernandes.)

Me: Okay, Google

(Listening)

Me: What was the result of the last game of Manchester United?

Google: On August 17, they played Sevilla. The final score was 2–1 to Sevilla.

Me: Who is the scorer for Sevilla?

Google: Here is the result from ESPN.com. (Google listed Sevilla’s top goalscorers for the season.)

The computer recognized my voice and understood my sentences. Still, it did not have enough intelligence to resolve the pronoun (sometimes it is implicit, as in my commands) or to understand that “the scorer” referred to that match.

If the conversation is cut into single turns and each is reviewed alone, the responses are correct. But within the conversation, they are entirely wrong. Anticipating various scenarios and the connections between turns may help improve the accuracy of interpreting what users mean.

Something was recognized incorrectly

For native English speakers, it is usually enough to repeat the sentence once or twice for the system to get it. But non-native speakers, who speak with a slightly off-tone accent or a lazy tone, sometimes need to repeat several times before the system recognizes them correctly. To solve this problem, we need to gather user samples and analyze the data so the system can correct individual words and refine the whole sentence. (That’s what Google Assistant usually does for me and my lazy tone.)

6: Give Some Help

Google Assistant provides a guide when it has been waiting a few seconds for user feedback

VUI’s scenarios are extensive. Unlike in a GUI, we cannot guide the user with graphic constraints like buttons, icons, and menus.

It is helpful to give some guidance about what the assistant can do for the user. On a smartphone, the assistant shows a help menu when the user is quiet for a few seconds. If the device has no graphic support, the VUI will usually state what sorts of things it can help with at the start, and repeat once if the user has not responded for around 10 seconds.

In IVR, it is best to give a brief, quick introduction of what instructions the user can give before the detailed menu. For example, “You can check your account, change your personal information, or contact our customer service manager” is said at the very beginning of a banking IVR.
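
A minimal sketch of the timeout behavior described above, assuming a hypothetical event loop that calls on_tick once per second; the 10-second repeat and the banking capability list follow the examples in this section.

```python
# A toy help timer: after 10 seconds of silence it re-announces what the
# assistant can do, and it repeats that hint only once.

CAPABILITIES = ("You can check your account, change your personal "
                "information, or contact our customer service manager.")
HELP_AFTER_SECONDS = 10
MAX_HELP_REPEATS = 1  # offer the hint once more, then stop nagging


class HelpTimer:
    def __init__(self):
        self.silent_seconds = 0
        self.repeats = 0

    def on_tick(self, user_spoke):
        """Call once per second; returns a help prompt or None."""
        if user_spoke:
            self.silent_seconds = 0
            return None
        self.silent_seconds += 1
        if (self.silent_seconds >= HELP_AFTER_SECONDS
                and self.repeats < MAX_HELP_REPEATS):
            self.repeats += 1
            self.silent_seconds = 0
            return CAPABILITIES
        return None


timer = HelpTimer()
for second in range(12):
    hint = timer.on_tick(user_spoke=False)
    if hint:
        print(f"After {second + 1}s of silence: {hint}")
```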

7: Not a Restart

If the user uses the app frequently, the system can respond differently, tracking previous history to keep up with what the user says: shortening instructions, or asking for confirmation of previously stored data instead of starting from scratch. Keep updating the user data to learn the user’s preferences and patterns. Of course, the system has to ask the user’s permission first.
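
As a sketch of this consent-then-remember pattern, here is a toy preference store; names like UserProfile are hypothetical and do not correspond to any real assistant API.

```python
# A toy preference store: nothing is saved without consent, and a stored
# slot lets the assistant confirm a value instead of re-asking it.

class UserProfile:
    def __init__(self):
        self.consented = False
        self.slots = {}  # e.g. {"pharmacy": "Mannings, Causeway Bay"}

    def remember(self, key, value):
        """Store a value only if the user granted permission."""
        if self.consented:
            self.slots[key] = value

    def prompt_for(self, key, question):
        """Shorten a returning user's turn by confirming the stored value."""
        if key in self.slots:
            return f"Should I use {self.slots[key]} again?"
        return question


profile = UserProfile()
profile.consented = True
profile.remember("pharmacy", "Mannings, Causeway Bay")
print(profile.prompt_for("pharmacy", "Which pharmacy should I order from?"))
```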

8: Disambiguation

Who is Billy?

Places and names can share the same words, and the system is frustrated by “who” again. For example, with “Please call Billy,” the system finds more than one contact named Billy in the contact list. What should it do?

“Okay, Google, text Billy.”

“There are three contacts named Billy. Which Billy?” (The smartphone’s screen shows the contacts named Billy, each with a different last name, for me to choose.)

Without the help of a GUI, it is harder to move the user to the next step, but we can still respond by letting the user know how the options differ.

“Okay, Google, text Billy.”

“There are three contacts named Billy. The first one is Billy Lee. The second one is Billy Chan. The last one is Billy Ma. Whom would you like to text?”

“The last one.”

“Okay, got it.” (Proceeds to text Billy)
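
Here is a toy sketch of that voice-only listing strategy, assuming a plain list of contact names; functions like disambiguate and pick are made up for illustration and handle up to three matches for brevity.

```python
# A toy voice-only disambiguation: list the matching contacts aloud and
# accept ordinal replies like "the last one".

ORDINALS = {"first": 0, "second": 1, "third": 2, "last": -1}


def disambiguate(name, contacts):
    """Find contacts sharing a first name and build the clarifying question."""
    matches = [c for c in contacts if c.split()[0].lower() == name.lower()]
    if len(matches) <= 1:
        return matches, None
    listing = ". ".join(f"The {label} one is {contact}"
                        for label, contact in zip(["first", "second", "last"],
                                                  matches))
    question = (f"There are {len(matches)} contacts named {name}. "
                f"{listing}. Whom would you like to text?")
    return matches, question


def pick(matches, reply):
    """Map an ordinal answer like 'the last one' back to a contact."""
    for word, index in ORDINALS.items():
        if word in reply.lower():
            return matches[index]
    return None


contacts = ["Billy Lee", "Billy Chan", "Billy Ma", "John Wong"]
matches, question = disambiguate("Billy", contacts)
print(question)
print(pick(matches, "The last one."))  # -> Billy Ma
```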

Another case is tough for the system to clarify. I would say this is not its fault.

“Okay, Google, text sister.”

“All right, what’s your sister’s name?”

“妹妹” (Cantonese for “younger sister,” which is how the contact is saved in my phone. Of course the system couldn’t recognize it, so I chose the contact from the dropdown on the screen, and the assistant then understood who my sister is. It is a good example of the system learning through user interaction with just a little effort.)

“Sure, what’s the message?” (An SMS message form field appears)

“I want to use WhatsApp.”

“Ready to send it?” (“I want to use WhatsApp.” was filled in as the message…)

Conclusion

VUI is far more complicated than GUI’s UX, where the user is restricted to clicking buttons and the design usually caters for the various endpoints already. In VUI, the system can more realistically focus on one primary step or instruction per turn. If the user needs to break away from the existing flow while the system is listening for a response, there is usually a conflict. Preparing choices for endpoint actions can reduce this type of error.

Cathy Pearl’s book covers methods and software for designing VUI, with more examples of where VUI runs into ambiguity and barriers, and things we have to pay attention to when using an avatar. It is worth reading.
