Photo by Florian Klauer on Unsplash

4 Principles of Voice User Interface Design

Kathya Sarria
Published in NicaSource
6 min read · Jul 27, 2020

Conversational Design

Cathy Pearl, head of conversation design outreach at Google, defines conversational design as thinking about an interaction with a VUI system beyond one turn.

Humans rarely have conversations that only last one turn. Design beyond that one turn; imagine what users might want to do next. Don’t force them to take another turn, but anticipate and allow it. Having a conversation with a system that can’t remember anything beyond the last interaction makes for an unintelligent and not very useful experience. A good rule of thumb is to let the user decide how long the conversation will be.

Here is an example of a conversation that goes beyond one turn. Notice how the system keeps the context from one turn to the next:

USER:
Ok, Google. Who was the 16th President of the United States?
GOOGLE:
Abraham Lincoln was the 16th President of the United States.
USER:
How old was he when he died?
GOOGLE:
Abraham Lincoln died at the age of 56.
USER:
Where was he born?
GOOGLE:
Hodgenville, KY
USER:
What is the best restaurant there?
GOOGLE:
Here is Paula’s Hot Biscuit:

Google showing the best restaurant in Hodgenville, KY
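
The context carry-over in the dialog above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Google Assistant works: the `ConversationContext` class, the `REFERENCES` table, and `resolve_references` are hypothetical names, and a real system would use proper language understanding rather than word matching.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConversationContext:
    """Remembers the most recent person and place mentioned in the dialog."""
    person: Optional[str] = None
    place: Optional[str] = None


# Hypothetical table: which remembered slot a word refers to, and how to phrase it.
REFERENCES = {
    "he": ("person", "{}"),
    "she": ("person", "{}"),
    "there": ("place", "in {}"),
}


def resolve_references(utterance: str, context: ConversationContext) -> str:
    """Replace words like 'he' or 'there' with what earlier turns established."""
    resolved = []
    for word in utterance.split():
        core = word.strip("?.,!")  # keep punctuation so it can be re-attached
        slot, template = REFERENCES.get(core.lower(), (None, None))
        entity = getattr(context, slot) if slot else None
        if entity:
            resolved.append(word.replace(core, template.format(entity)))
        else:
            resolved.append(word)  # nothing remembered: leave the word as spoken
    return " ".join(resolved)


context = ConversationContext(person="Abraham Lincoln")
print(resolve_references("How old was he when he died?", context))

context.place = "Hodgenville, KY"
print(resolve_references("What is the best restaurant there?", context))
```

If nothing has been remembered yet, the utterance passes through unchanged, which is the point at which a system would have to ask the user what they mean.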

Setting User Expectations

Five ways we can set user expectations:

  1. If a prompt that offers two choices draws a lot of “yes” responses, consider rewording it so the expected answers are clearer, such as “What would you like to do: send it or change it?”
  2. It’s important to set user expectations early on. How does your app introduce voice? You can offer a “tour” to first-time users, and provide educational points along the way.

As Margaret Urban, interaction designer at Google, says: when someone has successfully completed a VUI interaction, it’s a bit of an endorphin boost, and the user glows with completion and satisfaction.

  3. Urban offers a good analogy about designing with breadth. Perhaps you’ve designed a system that allows people to set an alarm, but you didn’t give them a method to cancel it. She likens this to giving someone a towel for a shower, but no soap. If you set an expectation that you can accomplish a task, think about the corresponding (symmetrical) task that goes with it.
  4. When asking the user for information, it’s often better to give examples than instructions. If you’re asking for date of birth, for example, rather than say “Please tell me your date of birth, with the month, day, and year,” use “Please tell me your date of birth, such as July 22, 1972.” It’s much easier for users to copy an example with their information than to translate the more generic instruction.
  5. Another way you can violate turn-taking is by asking the question before the system has finished speaking. For example, a common IVR structure is, “Would you like to hear that again? You can say ‘yes,’ ‘no,’ or ‘repeat.’” Users often begin to speak as soon as the question has finished. This leads to frustration because either they can’t interrupt, or they interrupt just as the system begins the next sentence, then stop talking, and the flow is broken. With good prompt design and very careful voice coaching, it is possible to make this work, but in general, you should avoid it by putting the instruction first and the question at the end.
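
A few of the points above (symmetrical tasks, examples over instructions, and instruction before question) can be sketched as small helpers. This is a minimal sketch with hypothetical function names and intent names such as `set_alarm`; it only shows the shape of the prompts, not a real VUI framework.

```python
from typing import Dict, List, Set


def options_then_question(options: List[str], question: str) -> str:
    """Put the instruction first so the prompt ends on the question."""
    quoted = [f'"{option}"' for option in options]
    if len(quoted) > 2:
        listing = ", ".join(quoted[:-1]) + ", or " + quoted[-1]
    else:
        listing = " or ".join(quoted)
    return f"You can say {listing}. {question}"


def ask_with_example(request: str, example: str) -> str:
    """Give the user an example to copy instead of a format description."""
    return f"{request}, such as {example}."


def missing_counterparts(supported: Set[str], counterparts: Dict[str, str]) -> List[str]:
    """List symmetrical tasks that a supported task implies but the system lacks."""
    return sorted(
        other
        for task, other in counterparts.items()
        if task in supported and other not in supported
    )


print(options_then_question(["yes", "no", "repeat"], "Would you like to hear that again?"))
print(ask_with_example("Please tell me your date of birth", "July 22, 1972"))
print(missing_counterparts({"set_alarm"}, {"set_alarm": "cancel_alarm"}))
```

Because the prompt now ends on the question, the moment users naturally start speaking is also the moment the system stops.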

Sample Dialogs

Sample dialogs are not just a way to design what the system will say (or display) to the user; they are a key way to design an entire conversation. Designing prompts one at a time often leads to stilted, repetitive, and unnatural-sounding conversations.

Steps:

  1. Pick five of the most common use cases for your VUI, and then write out some “blue sky” (best path) sample dialogs for each case.
  2. Also write a few sample dialogs for when things go wrong, such as the system not hearing the user or misunderstanding what they say.
  3. When you’ve written a few, or even as you write, read them out loud: often, something that looks great written down sounds awkward or overly formal when you say it.
  4. Use a tool to write the script: A great tool for this is the screenwriting software Slugline, but any place you can write text will do.
  5. After you’ve written some sample dialogs, a very useful design exercise is to do a “table read”: read them out loud with another person. Another great use of sample dialogs is to record them, using either voice talent or text-to-speech (whichever your system will use). This costs slightly more than simply writing them, but it is an even more powerful way to know whether the design sounds good before investing in more expensive design and development time.
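
If you prefer to keep sample dialogs as data rather than in a screenwriting tool, the steps above can be sketched roughly as follows. The `render_script` and `repeated_system_lines` helpers are hypothetical names for this illustration; the repetition check is only a rough hint and is no substitute for reading the dialog out loud.

```python
from collections import Counter
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker, line)


def render_script(title: str, turns: List[Turn]) -> str:
    """Format a sample dialog as a script that two people can read aloud."""
    lines = [title.upper(), ""]
    for speaker, text in turns:
        lines.append(f"{speaker.upper()}:")
        lines.append(text)
    return "\n".join(lines)


def repeated_system_lines(turns: List[Turn]) -> List[str]:
    """Return system lines used more than once, which may sound repetitive."""
    counts = Counter(text for speaker, text in turns if speaker.lower() == "system")
    return [text for text, count in counts.items() if count > 1]


# An error-path dialog: the system fails to hear the user twice.
no_input_dialog = [
    ("system", "What city would you like the weather for?"),
    ("user", "(silence)"),
    ("system", "Sorry, I didn't hear you."),
    ("user", "(mumbles)"),
    ("system", "Sorry, I didn't hear you."),
]

print(render_script("Weather: no input", no_input_dialog))
print(repeated_system_lines(no_input_dialog))
```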

Disambiguation

There can be times when the user provides some but not all of the details needed to take action. For example, the user might ask for the weather in a city whose name exists in more than one state: “What’s the weather in Springfield?”

Here are three ways we can implement disambiguation:

  1. If possible, rely on any known information to determine the answer without having to ask the user. For example, the Amazon Echo requires the user to specify the home location as a part of the setup; thus, when you ask, “What’s the weather?” Alexa produces local conditions automatically.
  2. Other contextual clues can also be used. If the user just looked up a restaurant in Springfield, Illinois, and then asks, “What’s the weather in Springfield?” you can pretty safely bet that they mean the one in the location they just referenced.
  3. If no contextual information is available, the system will need to ask the user to clarify:

USER:
What’s the weather in Springfield?
SYSTEM:
Did you mean the one in Illinois or Maryland?
USER:
Illinois.
SYSTEM:
It’s 65 degrees…

If the system has high confidence for the word “Springfield,” it can use the reference word “one” rather than explicitly stating the name again. Also, be sure to allow the user flexibility in their response: the user should be able to say “Springfield, Illinois” or just “Illinois” or even “the first one” (as if picking from a list).
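
The disambiguation order described above (use known context first, ask only when necessary, then accept a flexible answer) can be sketched like this. The function names, the `ORDINALS` table, and the simple substring matching are assumptions made for this example, not part of any particular voice platform.

```python
from typing import List, Optional

# Hypothetical ordinal words a user might say when choosing from a spoken list.
ORDINALS = {"first": 0, "second": 1, "third": 2}


def resolve_from_context(candidates: List[str], recent_state: Optional[str] = None) -> Optional[str]:
    """Pick a candidate without asking, if context allows; otherwise return None."""
    if len(candidates) == 1:
        return candidates[0]
    if recent_state in candidates:
        return recent_state  # the user just referenced this location
    return None


def clarifying_question(candidates: List[str]) -> str:
    """Ask the user to choose, referring to the city as 'the one'."""
    return f"Did you mean the one in {' or '.join(candidates)}?"


def interpret_choice(reply: str, candidates: List[str]) -> Optional[str]:
    """Accept 'Illinois', 'Springfield, Illinois', or 'the first one'."""
    text = reply.lower()
    for state in candidates:
        if state.lower() in text:
            return state
    words = text.replace(".", " ").replace(",", " ").split()
    for word, index in ORDINALS.items():
        if word in words and index < len(candidates):
            return candidates[index]
    return None  # still unclear: the system should re-prompt


states = ["Illinois", "Maryland"]
if resolve_from_context(states) is None:
    print(clarifying_question(states))
print(interpret_choice("the first one", states))
```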

Disambiguation might also be required when the user answers with more information than your VUI can handle.

SYSTEM:
What is your main symptom?
USER:
I’m throwing up and I have a fever.
SYSTEM:
OK. Which one would you say is your primary symptom?
USER:
Uh, my fever I guess.
SYSTEM:
OK, fever…
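
Handling an answer that contains more than the system can use might look like the sketch below. The `SYMPTOM_PHRASES` table and both function names are made up for this illustration, and phrase matching stands in for real speech understanding.

```python
from typing import List

# Hypothetical phrases that map to the symptoms this system understands.
SYMPTOM_PHRASES = {
    "fever": ["fever"],
    "vomiting": ["throwing up", "vomiting"],
}


def extract_symptoms(utterance: str) -> List[str]:
    """Return every known symptom mentioned in the utterance."""
    text = utterance.lower()
    return [
        name
        for name, phrases in SYMPTOM_PHRASES.items()
        if any(phrase in text for phrase in phrases)
    ]


def next_prompt(utterance: str) -> str:
    """Ask the user to narrow down when more than one symptom is given."""
    found = extract_symptoms(utterance)
    if len(found) > 1:
        return "OK. Which one would you say is your primary symptom?"
    if len(found) == 1:
        return f"OK, {found[0]}..."
    return "Sorry, I didn't catch that. What is your main symptom?"


print(next_prompt("I'm throwing up and I have a fever."))
print(next_prompt("Uh, my fever I guess."))
```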

Interaction Should Be Time-Efficient

When designing visual experiences, we try to limit the number of clicks a user must take to complete an action. The more clicks, the more cumbersome and tedious the experience feels. The same holds for voice-driven interaction, as well. Imagine asking a user for their address:

APP:
What is your street address?
USER:
1600 Pennsylvania Avenue
APP:
What city?
USER:
Washington
APP:
What state?
USER:
DC
APP:
And what is your zip code?
USER:
20009

In this example, the user must go through four calls and responses before completing a single task. Now compare that to a single interaction:

APP:
What is your complete address?
USER:
1600 Pennsylvania Ave, Washington, DC 20009.

This time there is only a single interaction. Although you might want to have a confirmation prompt (“I heard you say 1600 Pennsylvania Ave. Is that correct?”), there are still half as many interactions to complete the same task, making the design feel more responsive.
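
As a rough sketch of the single-turn approach, the code below tries to pull every address field out of one utterance and falls back to the step-by-step questions only when that fails. The regular expression and function names are assumptions for this example; it expects a comma-separated transcript like the one above, which real speech recognition output may not provide.

```python
import re
from typing import Dict, Optional

# Assumed transcript shape: "<number> <street>, <city>, <2-letter state> <5-digit zip>"
ADDRESS_PATTERN = re.compile(
    r"^\s*(?P<street>\d+\s+[^,]+),\s*(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5})\.?\s*$"
)


def parse_address(utterance: str) -> Optional[Dict[str, str]]:
    """Extract all address fields from one utterance, or None if it does not fit."""
    match = ADDRESS_PATTERN.match(utterance)
    return match.groupdict() if match else None


def next_prompt(utterance: str) -> str:
    """Confirm a complete address, or fall back to asking field by field."""
    address = parse_address(utterance)
    if address is None:
        return "What is your street address?"
    return f"I heard you say {address['street']}. Is that correct?"


print(parse_address("1600 Pennsylvania Ave, Washington, DC 20009."))
print(next_prompt("1600 Pennsylvania Ave, Washington, DC 20009."))
```

Keeping the field-by-field questions as a fallback means users who give a partial answer can still complete the task.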

References:

  1. Pearl, C. (2016). Designing Voice User Interfaces: Principles of Conversational Experiences. O’Reilly Media.
