Designing for Voice — Part I

Baisampayan Saha
Published in Go-MMT Design
Mar 21, 2018

This article is based on one of Bruce Balentine’s workshops on “Designing for Voice User Interfaces”. It mainly talks about principles used in designing IVRs, but I think many of these principles can also be applied to modern VUIs.

Before we do a deep dive into the various nuances of designing for voice interfaces, let’s first understand how devices that are purely voice interfaces, or are multi-modal (voice + visual feedback), work.

Diagram explaining how voice user interfaces (VUIs) work.
  1. The user speaks to the device.
  2. The device mic picks up the speech and sends it to the cloud servers. The heavy lifting of speech recognition is done on the server side. This system is known as Automated Speech Recognition (ASR).
  3. After speech recognition, the reply is then converted from text to speech.
  4. The output is then sent back to the device as speech. If the device is multi-modal, the device may also include visual cues and feedback. (A rough sketch of this loop follows below.)
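To make that loop concrete, here is a minimal sketch in Python. The function names (record_audio, recognise_speech, plan_reply, synthesise_speech, play_audio) are hypothetical stand-ins for whatever device and cloud APIs a real assistant would call; they are not part of any specific SDK.

    # A minimal, hypothetical sketch of the request/response loop described above.
    # None of these functions belong to a real SDK; each stands in for a device or cloud call.

    def record_audio() -> bytes:
        """Capture audio from the device microphone (stubbed here)."""
        return b"...raw audio..."

    def recognise_speech(audio: bytes) -> str:
        """ASR: the heavy lifting happens on the server; this stub just returns text."""
        return "what time is my train"

    def plan_reply(utterance: str) -> str:
        """Dialogue logic: decide what the assistant should say back."""
        return "Your train departs at 1:15 pm from platform 2."

    def synthesise_speech(text: str) -> bytes:
        """TTS: convert the reply text back into audio."""
        return b"...synthesised audio..."

    def play_audio(audio: bytes, show_visual_cue: bool = False) -> None:
        """Play the reply; a multi-modal device may also show a visual cue."""
        print("(speaker)", len(audio), "bytes of audio",
              "+ visual cue" if show_visual_cue else "")

    utterance = recognise_speech(record_audio())    # steps 1 and 2
    reply_text = plan_reply(utterance)              # decide what to say back
    play_audio(synthesise_speech(reply_text),       # steps 3 and 4
               show_visual_cue=True)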

Instructions

Most voice interfaces are instructional and act as personal voice assistants. When two people talk, one can interrupt the other and start speaking, and the other person still understands the context and replies accordingly. But can voice interfaces understand the context accurately and respond accordingly? We will come back later to how VUIs decipher human speech and queries.

On average, a person can recall a maximum of 5–6 words in short-term memory when a list of words is spoken. In most cases, the first word is recalled best, as it gets the most time to “rehearse” in the mind. The last few words are also recalled because they were heard most recently, and a few words from the middle find their way onto the list as well. So while designing for voice interfaces, we must remember that when the interface speaks out the instructions of any task to the user,

  1. It should not be more than 5–6 steps. The fewer the steps, the lower the cognitive load on the user.
  2. The more important entries in a list should be placed at the beginning, the end and the middle, in that order of priority.
  3. The instructions should be succinct and clear.

Let’s see an example of how to structure an instruction. Below is an announcement for the delay of a particular train. (Adapted from Bruce Balentine’s workshop manual)

“Attention, passengers travelling to Guwahati on Rajdhani Express, departing from Bangalore at 12:15 pm today and arriving at Guwahati at 04:30 pm. The train is delayed. New departure time is 01:15 pm with arrival time of 05:30 pm at platform number 2. Please stand by for additional announcements. We apologise for any inconvenience.”

Let’s remove prepositions, unnecessary verbs & facts from the passage above.

“Attention, passengers traveling •• Guwahati on Rajdhani Express, departing •• Bangalore •• 12:15 pm today •• arriving •• Guwahati •• 04:30 pm. The train •• delayed. New departure time •• 01:15 pm •• arrival time •• 05:30 pm •• platform number 2. Please stand by •• additional announcements. We apologise •• inconvenience.”

Let’s now restructure and organise the text again.

“Attention, passengers traveling •• Guwahati on Rajdhani Express, ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• •••••••••The train •• delayed. ••••••••••••••••••••••••••••••••••••• New departure time •• 01:15 pm •• ••••••••••••••••••••••••••••••••• arrival time •• 05:30 pm •••••••••••••• •••••••••••••••••••••••••• platform number 2 ••••••••••••••••••••••••••••••••••••••••••••••••”

If you notice, we have removed quite a bit of text from the passage. We have assumed that there is only one Rajdhani train going to Guwahati that day, so there is no need to tell the user the current departure and arrival times; the user is already aware of them. Lastly, we have removed the last two lines. There is no need to announce information that is not required at that point in time; it would only increase the cognitive load.

Let’s refine the above text one more time.

“••••••••••••••••••••••••••••••••••••••••••••••••• Rajdhani Express, •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• ••••••••• •••••••••••••••••••••••• delayed.••••••••••••••••••••••••• ••••••••••••••••••••••••••••••••••••• ••• departure time •• 01:15 pm •• •••••••••••••••••••••••••• arrival time •• 05:30 pm ••••••••••••• •••••••••••••• ••••••••••••••••••••• platform number 2 ••••••••••••• •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••”

Now if we arrange the information above again,

“Rajdhani Express, ••••••••••••••••••••••••••••••••••••••••••••••• This is a delay announcement ••••••••••••••••••••••••••••••••••••• Departure time •• 01:15 pm •• •••••••••••••••••••••••••••••••••••• Arrival time •• 05:30 pm ••••••••••••••••••••••••••••••••••••••••••• Platform number 2 ••••••••••••••••••••••••••••••••••••••••••••••••”

The dots can be considered as white spaces or pauses. So whenever designing for voice:

  1. Write down the text you intend the interface to say first, and then remove the clutter: unnecessary prepositions, verbs and facts
  2. Try to read numbers out one digit at a time whenever possible
  3. Add plenty of pauses or white space between each set of information. Pauses between sentences should be a minimum of 200 milliseconds (see the sketch below)
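As an illustration of these three rules, the sketch below assembles the trimmed announcement as an SSML string with explicit pauses and digit-by-digit numbers. SSML and its break and say-as tags are supported by many TTS engines, but exact support varies, so treat this as an assumption to check against your engine’s documentation.

    # Hypothetical sketch: assemble the trimmed announcement as an SSML string.
    # <break> and <say-as> are standard SSML tags, but check your TTS engine's docs.

    PAUSE = '<break time="200ms"/>'     # minimum recommended pause between chunks

    def digits(number: str) -> str:
        """Ask the TTS engine to read a number one digit at a time."""
        return f'<say-as interpret-as="digits">{number}</say-as>'

    chunks = [
        "Rajdhani Express.",
        "This is a delay announcement.",
        f"Departure time {PAUSE} {digits('0115')} pm.",
        f"Arrival time {PAUSE} {digits('0530')} pm.",
        f"Platform number {PAUSE} {digits('2')}.",
    ]

    ssml = "<speak>" + f" {PAUSE} ".join(chunks) + "</speak>"
    print(ssml)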

List

There will be many instances when the interface has to speak out a list of items for the user to act on. The list of items may have smaller lists nested inside it. So how do we design for lists?

Say we have 20 words, one of which is ATM. The first step would be to chunk the list of words into groups of similar items. Below are 3 variations of lists derived from the 20 words.

Now let’s narrate the lists to different users and ask them under which word from the list they would expect to find ATM. This would help us know which list works better. If users are not able to relate ATM to any word in a list, then we need to work on the words again. Another way of solving this would be to card sort the list of words and come to a conclusion.

If you check the length of the lists, none is more than 5 words. This is particularly important if it is a purely voice interface, as the human mind cannot hold many words in memory on hearing them. If the list has to be long, then there should be a mechanism by which the user can go back to the start of the list and hear it again, or stop the VUI and repeat from a specific section.
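One way to picture this is to keep the menu as small named groups of no more than five items, with a simple way for the user to hear a group again. This is only a sketch; the groupings and category names below are invented, and in practice they would come out of card sorting with real users.

    # Hypothetical sketch: a voice menu kept to groups of five or fewer items,
    # with a way for the user to say "repeat" and hear the current group again.

    MENU = {
        "banking services": ["ATM", "deposits", "loans", "cards", "statements"],
        "travel services": ["flights", "trains", "hotels", "cabs"],
        "support": ["complaints", "feedback", "talk to an agent"],
    }

    def speak(text):
        print("VUI:", text)

    def read_group(name):
        items = MENU[name]
        assert len(items) <= 5, "keep spoken lists within short-term memory limits"
        speak(f"{name}: " + ", ".join(items))

    current = "banking services"
    read_group(current)      # first pass through the group
    read_group(current)      # the user said "repeat", so the same group is read again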

If it is a multi-modal interface, the user can see the list, and the voice assistant can then point directly to the relevant item on it. If the user is finding it difficult to follow, the voice assistant should be intelligent enough to prompt again with instructions.

Let’s take a hypothetical example of a VUI inside Photoshop. Below is a screenshot of the Photoshop CC start screen. The user has asked for help in creating a new document. The GUI highlights the button and complements the VUI.

Screenshot of Photoshop UI highlighting a button and complementing the VUI.

The screenshot below is a perfect example of lists in a GUI. On the right-hand side there is a list of items that the user must fill in to create a document. The user might not have a conceptual understanding of the terms mentioned in the list. In such scenarios, let the user ask for help with a specific item. Let’s say the VUI’s name is WIKI. If the user does not know about “resolution”, the user can ask the VUI — “WIKI, tell me more about resolution”, rather than the VUI explaining each and every term in the list.
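A small sketch of this help-on-demand idea, assuming a hypothetical assistant called WIKI and a hand-written glossary (neither of which exists in Photoshop), might look like this:

    # Hypothetical sketch: answer "WIKI, tell me more about <term>" from a glossary
    # instead of explaining every field in the new-document dialog up front.

    GLOSSARY = {
        "resolution": "The number of pixels per inch; higher values suit print.",
        "color mode": "How colours are stored, for example RGB for screens.",
    }

    def handle(utterance: str) -> str:
        text = utterance.lower()
        prefix = "wiki, tell me more about "
        if text.startswith(prefix):
            term = text[len(prefix):].strip()
            return GLOSSARY.get(term, f"Sorry, I don't have anything on {term} yet.")
        return "Sorry, I didn't catch that."

    print(handle("WIKI, tell me more about resolution"))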

Screenshot of Photoshop UI for creating a new document

n-way Branches

A particular type of list is shown in the image above. These are called n-way branches. Such lists are always used in full-duplex conversation (more on that in the next section), as the system has to listen for user replies at all times. In a two-way n-way branch conversation, two entries are spoken together, followed by a smart pause, and then the next pair is spoken. As shown in the image, “Flight or Train” would be spoken first, then a smart pause, then “New or Old”, followed by another pause and the next item. We can go up to three-way branches effectively without many errors arising from the cognitive load on the user. But when four-way branches are used, keep in mind that the list items should be easy to recollect or be very familiar from day-to-day life. Otherwise, the chances of errors creeping in are high, forcing the system into redundant re-prompting.
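Roughly, a two-way branch prompt could be driven like the sketch below. The pairs are taken from the example above, the pause length is an arbitrary assumption, and listen_during_pause stands in for a real always-on ASR.

    # Hypothetical sketch of a two-way n-way branch: speak a pair, then leave a
    # smart pause during which the always-on ASR listens for the user's choice.

    import time

    BRANCHES = [("flight", "train"), ("new", "old")]   # pairs from the example above
    SMART_PAUSE_SECONDS = 1.5                          # assumed value; tune with real users

    def speak(text):
        print("VUI:", text)

    def listen_during_pause(seconds):
        """Stub for whatever the ASR hears during the pause (None = silence)."""
        time.sleep(seconds)
        return None

    for left, right in BRANCHES:
        speak(f"{left} or {right}?")
        choice = listen_during_pause(SMART_PAUSE_SECONDS)
        if choice:
            speak(f"You chose {choice}.")
            break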

Conversations

Conversations are of two types: half-duplex and full-duplex. In a half-duplex conversation, each speaker takes a turn to speak: when one speaks, the other stops and listens. In full duplex, either party can interject while the other is speaking and start speaking.

The image above depicts a conversation between a user and a VUI. In half-duplex conversations, each party takes a turn to speak. Interjections are not possible, as the window for user input is open only at specific times. In the diagram above, the black bars are the turns of the conversation between the user and the VUI, and the grey blocks indicate the short pauses between them.

Bear in mind that these pauses are the only time when the VUI is receptive to user input. In a full-duplex conversation, however, these rules don’t apply: the user can interject the VUI at any time, and the VUI is always receptive to user input. The pauses are also called smart windows.

The VUI speaks one item from the list at a time, followed by a short pause that allows the user to respond. This pause can also be marked by a beep to let the user know that it’s time to speak. When the user starts speaking, the automated speech recognition (ASR) system starts simultaneously and stops either when the speaking window is over or when the user has stopped speaking. This is usually the case in half-duplex conversations.

In full-duplex conversations, the ASR is always on. The reason for stopping the ASR while the VUI speaks is that it would otherwise take its own voice as input, which could lead to completely hilarious and unnecessary results.

An example of a full-duplex system is when the user wears headphones while speaking. The user speaks into the headphone mic, and when the VUI speaks, it is heard in the headphones, so the ASR does not pick up the VUI’s own speech. So when the user connects headphones, a device with a voice assistant can instantly switch to full-duplex conversation mode.
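To make the contrast concrete, here is a toy sketch of the two turn-taking styles, assuming hypothetical speak and listen primitives. In half duplex the ASR runs only in the pause after the prompt; in full duplex the ASR is always on and the prompt is cut short (barge-in) the moment the user starts speaking.

    # Hypothetical sketch contrasting half-duplex and full-duplex turn-taking.
    # The lambdas below stand in for real microphone and ASR plumbing.

    def half_duplex_turn(prompt, listen_in_pause):
        """Speak the whole prompt, then open a fixed listening window."""
        print("VUI:", prompt)
        return listen_in_pause()              # ASR runs only inside this pause

    def full_duplex_turn(prompt_words, user_started_speaking, listen):
        """Speak word by word, but stop (barge-in) the moment the user talks."""
        for word in prompt_words:
            if user_started_speaking():       # ASR is always on in full duplex
                break                         # cut the prompt short
            print("VUI:", word)
        return listen()

    # Toy usage: in the full-duplex case the user interrupts after three words.
    print("User:", half_duplex_turn("Choose first class, second class or general",
                                    lambda: "first class"))
    interruptions = iter([False, False, False, True])
    reply = full_duplex_turn(
        "Choose first class, second class or general".split(),
        lambda: next(interruptions),
        lambda: "second class",
    )
    print("User:", reply)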

Diagram showing a VUI speaking out a list to the user

The image above is an example of the VUI speaking out a list of actionable items that the user has to choose from, with a smart pause for the user to respond. Let’s assume the user says “Trends in technology”, but the ASR could only understand “trends”. The various scenarios are —

  1. The user says “trends in technology”. It must have been said during the smart pause or after a few other items in the list.
  2. Since the only item containing the word “trends” among those spoken before the user responded is “trends in technology”, ask the user a yes/no question — “Did you say trends in technology?”
  3. If the user says “yes”, tell the user about it. If the user says “no”, ask another yes/no question about the other item that starts with “trends”.
  4. If the user is still confused and does not answer, then after a smart pause tell the user about both entries — “Which one did you choose — trends in fashion or trends in technology?” Do remember to put pauses while reading out the list items.

Always remember: in a yes/no question, if the machine has to re-confirm, it should never ask “Is that a No?”, even if the user replied “No”. Always ask “Is that a Yes?”
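The recovery flow above could be sketched roughly as follows. The item list, helper names and answers are made up for illustration; the only rules taken from the text are matching on the single recognised word, confirming with yes/no questions, and re-confirming positively with “Is that a yes?”.

    # Hypothetical sketch of recovering from a partial ASR result ("trends")
    # by confirming candidates with yes/no questions, always phrased positively.

    ITEMS = ["headlines", "trends in fashion", "trends in technology", "sports"]

    def ask_yes_no(question, get_answer):
        print("VUI:", question)
        answer = get_answer()                  # "yes", "no", or None for silence
        if answer is None:
            print("VUI: Is that a yes?")       # re-confirm positively, never "Is that a no?"
            answer = get_answer()
        return answer == "yes"

    def disambiguate(heard_word, get_answer):
        candidates = [item for item in ITEMS if heard_word in item]
        # Ask about the most recently spoken candidate first.
        for item in reversed(candidates):
            if ask_yes_no(f"Did you say {item}?", get_answer):
                return item
        # Still unresolved: read the candidates back, with pauses between them.
        print("VUI: Which one did you choose:", " ... or ... ".join(candidates), "?")
        return None

    answers = iter(["yes"])                    # the user confirms the first guess
    print("Chosen:", disambiguate("trends", lambda: next(answers)))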

Let’s take another scenario. Instead of a particular item, the user says “that one”.

Diagram showing a VUI speaking out a list to the user along with a user input

In this scenario, the user responded while the VUI was speaking “Trends in technology”. There is a delay in the user’s response, called the recognition window: the time the user takes to comprehend what the VUI has said before responding. So which item does the VUI choose? A few possibilities are —

  1. Since the user responded while the VUI was speaking the 3rd item in the list, the user must be talking about either the 2nd item or the 1st item.
  2. To confirm, we would first ask a yes/no question about the 2nd item, which was spoken just before the 3rd — “Did you choose headlines?”
  3. If the user responds with “yes” or “no”, proceed accordingly.
  4. If the user does not respond, wait for some time to allow the user to answer the earlier yes/no question, and then ask another question about both entries (a rough sketch of this logic follows the list).
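A crude sketch of that reasoning, assuming we timestamp the start of each list item as it is spoken and assume a fixed recognition-window delay (both numbers below are invented):

    # Hypothetical sketch: map a late-arriving "that one" back to the item the user
    # most likely meant, by subtracting an assumed recognition-window delay.

    ITEM_START_TIMES = {          # seconds into the prompt at which each item began
        "sports": 0.0,
        "headlines": 2.0,
        "trends in technology": 4.0,
    }
    RECOGNITION_WINDOW = 1.2      # assumed delay between deciding and being recognised

    def item_meant(response_time: float) -> str:
        """Pick the item that was playing when the user actually made the choice."""
        decision_time = response_time - RECOGNITION_WINDOW
        started_before = {item: t for item, t in ITEM_START_TIMES.items()
                          if t <= decision_time}
        return max(started_before, key=started_before.get)

    # "That one" arrives at 4.5 s, while the 3rd item is playing; after subtracting
    # the recognition window we confirm the 2nd item first, as described above.
    print(f"Did you choose {item_meant(4.5)}?")   # -> Did you choose headlines?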

While designing for voice, it is important to test the rate of speech, i.e. how quickly the text-to-speech (TTS) engine speaks to the user, as well as the pauses. Test with various users to find the optimum pace of speech; it should be neither too slow nor too fast.

If there is a GUI for the voice assistant, it would be nice to give the user controls to adjust the pace of the speech and the length of the pauses.

Recognition window

Diagram showing a VUI speaking out a list and user responding to it

The image above depicts a piece of speech in which the VUI speaks out 3 options. All three options are spoken together, with a very small pause between each. The VUI asks the user to choose one — “Choose one from the three options — first class, second class or general”.

Keep in mind that this is continuous speech, so it has to be a full-duplex conversation: the ASR has to be always on to be able to understand a user response at any time.

The user might say “Yes, that one, first class”. Since the response contains “first class”, it is certain that the user chose the first item.

The user may also anticipate that there is another class called “3rd class” and say, “3rd class, that one.” During the initial stages, the VUI should ask a yes/no question to verify whether the user meant the last item, but if there is evidence that many people respond this way, then the words “3rd class” should be included in the parameters, or grammar, of the ASR, so that when this set of words is spoken it maps to the “general” class.

In another scenario, let’s assume the user first says “that one, 2nd class… no… no… choose first class”. In such scenarios, if there is a “no” in the speech, do not consider the first matched item; choose the item spoken after the “no”.
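That correction rule could be sketched like this, with a fixed, made-up vocabulary: if the recognised utterance contains a “no”, only the option mentioned after the last “no” is kept.

    # Hypothetical sketch: when the user corrects themselves mid-utterance,
    # keep only the option mentioned after the last "no".

    OPTIONS = ["first class", "second class", "general"]

    def pick_option(utterance: str):
        """Return the chosen option, honouring a spoken "no" as a self-correction."""
        text = utterance.lower().replace("1st", "first").replace("2nd", "second")
        if " no" in f" {text}":
            # Ignore everything said before the last "no".
            text = text.rsplit("no", 1)[-1]
        for option in OPTIONS:
            if option in text:
                return option
        return None

    print(pick_option("Yes, that one, first class"))                     # first class
    print(pick_option("that one 2nd class... no... no... first class"))  # first class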

Grid-based layouts

A grid-based layout can also be used for a voice system. This system was explained quite effectively in Bruce Balentine’s workshop.

  1. When the VUI is speaking to a user, the length of the whole speech is divided into 3 parts.
  2. Each part is then divided into 10 equal parts, each marked by a very low tick sound.
  3. At the start of each part, the VUI says “Start”, “Middle” or “Last” accordingly.
  4. The pace of the ticks depends on the length of the speech. For example, if the speech is long, the ticks are heard at longer intervals of time, and if the speech is shorter, the ticks are heard at shorter intervals.

Visualised, it may look something like this (the dots represent pauses) —

Short speech — Tick……Tick……Tick……Tick……Tick……Tick……Tick……Tick
Longer speech — Tick………….Tick………….Tick………….Tick………….Tick………….

With a few sessions of listening, users would be trained to recognise whether a reply from the VUI is short or long. Such a system can be used effectively when users ask VUIs to read out messages and emails. The system can have smart windows after each section (Start, Middle & Last) to respond to user replies. This can be quite effective when the user feels a message is too long or unimportant and wants to skip to the next message, pick up data from a section, or repeat a particular section.
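Roughly, the tick spacing could be derived from the estimated length of the speech, as in the sketch below (the section names come from the list above; the numbers are arbitrary):

    # Hypothetical sketch: derive tick spacing from the estimated speech length,
    # splitting it into Start / Middle / Last sections of ten ticks each.

    SECTIONS = ["Start", "Middle", "Last"]
    TICKS_PER_SECTION = 10

    def tick_plan(speech_seconds: float):
        """Return the tick interval to use in each of the three sections."""
        section_length = speech_seconds / len(SECTIONS)
        interval = section_length / TICKS_PER_SECTION
        return [(label, interval) for label in SECTIONS]

    for label, interval in tick_plan(30.0):   # a longer, 30-second message
        print(f"{label}: tick every {interval:.1f} s")
    for label, interval in tick_plan(9.0):    # a short, 9-second message
        print(f"{label}: tick every {interval:.1f} s")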

Find the link to Part II of the article below.

Designing for Voice — Part II
