Voice User Interface and How the Technology Works

Qian Yu
6 min read · Feb 5, 2018


Product/UX designers need to understand the feasibility of the technology they work with and collaborate with engineers to create valuable products for users. In this post, I’ll explain how the technology works from a designer’s perspective, without digging too deep. 🕵️

Simply put, there are 5 main steps in a voice interaction. Even though they are introduced here as separate steps, keep in mind that to the user they happen almost simultaneously, so that the voice AI feels responsive.
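
Before going step by step, here’s a bird’s-eye sketch in Python of how the five steps might fit together. Every name in it is an illustrative placeholder, not any vendor’s real API.

```python
# A bird's-eye sketch of the five steps, each one stubbed out.
# All names are illustrative placeholders, not a real SDK.

def wait_for_wake_word() -> None:       # 1. runs locally on the device
    print("wake word detected")

def record_command() -> str:            # 2. capture what the user says next
    return "set a timer for 15 minutes"

def analyze(command: str) -> dict:      # 3. ASR + NLU, usually in the cloud
    return {"domain": "timer", "intent": "create_timer",
            "entities": {"duration": "15 minutes"}}

def prepare_answer(nlu: dict) -> str:   # 4. gather what's needed to respond
    return f"Got it. Your timer is set for {nlu['entities']['duration']}."

def speak(answer: str) -> None:         # 5. play the reply through Text To Speech
    print("[TTS]", answer)

wait_for_wake_word()
speak(prepare_answer(analyze(record_command())))
```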

1. Wake Word and Wake Word Recognition

A wake word, or wake phrase/command, is the “open sesame” for products with a voice interface. Just like “hey Siri”, “hey Google”, or “Alexa”, you say these keywords to invoke or engage the product.

Why do you need a wake word?

Saying a wake word can feel inconvenient, but you don’t want the product to react to everything the user says, or to everything it hears. The audio could come from the TV, or be a question your user asked another person. It gets annoying for your users to hear “sorry, I’m not sure how to help you with that” five times every ten minutes.

The process of detecting a wake word is called “wake word recognition”. Usually this happens locally on the device, instead of streaming the audio to the cloud, for two main reasons:

1. Privacy. In order to recognize the wake word, your device needs to always listen out for any auditory input. Not many people would like having their daily conversations continuously “spied” on by a remote company.

2. Timing. Wake word recognition takes a little longer if it happens in the cloud. It’s not a lot, but it’s definitely long enough for users to notice. Responding quickly helps your product feel more intelligent.
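
Here’s a minimal sketch of that local gating, assuming the device can already turn short audio chunks into text on-device. A real product would run a dedicated keyword-spotting model on the raw audio instead, but the idea is the same: nothing is streamed to the cloud until a wake phrase is heard.

```python
# Minimal sketch of local wake word gating. The wake phrases and the
# simulated "heard" chunks are made up for illustration.

WAKE_PHRASES = {"hey assistant", "ok assistant"}

def contains_wake_phrase(chunk: str) -> bool:
    """Return True if a locally transcribed chunk starts with a wake phrase."""
    text = chunk.lower().strip()
    return any(text.startswith(phrase) for phrase in WAKE_PHRASES)

# Things the device "hears" during the day: TV audio, side conversations, etc.
heard_chunks = [
    "and in tonight's news...",
    "hey assistant set a timer for 15 minutes",
    "could you pass the salt",
]

for chunk in heard_chunks:
    if contains_wake_phrase(chunk):
        print("Wake word detected: start streaming the command to the cloud.")
    else:
        print("Ignored locally: nothing leaves the device.")
```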

Besides a verbal wake word, there are other ways to invoke the voice AI, such as the push-to-talk model. It’s similar to using a walkie-talkie: you press a button and ask your question, except in this case you’re talking to a voice AI instead of a remote person.

There are two types of buttons, hard and soft. For example, you may hold the physical home button on an iPhone (hard), or tap the on-screen icon (soft), to invoke Siri.
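
From the system’s point of view, a verbal wake word, a hard button, and a soft button are just three triggers for the same “start listening” path. A tiny sketch, with hypothetical handler names:

```python
# Sketch: three invocation triggers funnel into one listening handler.

def start_listening(trigger: str) -> None:
    print(f"Listening for a command (invoked via {trigger})")

def on_wake_word() -> None:           # verbal: "hey Siri", "Alexa", ...
    start_listening("wake word")

def on_hard_button_hold() -> None:    # physical button, e.g. holding the home button
    start_listening("hard push-to-talk button")

def on_soft_button_tap() -> None:     # on-screen icon
    start_listening("soft push-to-talk button")

on_wake_word()
on_hard_button_hold()
on_soft_button_tap()
```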

So which would you choose, a verbal wake word or a push-to-talk button?

A well-functioning verbal wake word requires a lot of training to decrease the chance of “false positives”, i.e. the device reacting when your users didn’t mean to invoke it. Creating a good verbal wake word also takes collaborative effort, usually involving branding and marketing teams, as users will think of the wake word as the AI’s identity.

Compared to a verbal wake word, the push-to-talk model is a safer choice. It won’t cause as many false positives if the button is well designed. Plus, saying the wake word every time you want to start a conversation isn’t a natural human behavior, just as you wouldn’t always call out a friend’s name before talking to them. Even my coworkers sometimes forget to say the wake word, and they are specialists who work in this field. Experienced users may prefer to press a button and give a command if the device is within reach.

2. The user says a command

The voice AI expects a user’s command after recognizing a wake word or wake action, such as “… set a timer for 15 minutes” or “… play Spotify”. This audio is collected and passed on to the next step.

Don’t forget, not everyone knows what they can say after invoking the voice AI. A user might feel clueless or too self-conscious to say anything. This is where your well-designed “first-time experience” should kick in. I will talk about this further in another post.

3. Analyzing the command

In order to understand what the user is saying, the audio needs to be translated into text by Speech To Text (STT). This is then sent to the Automated Speech Recognition (ASR) and Natural Language Understanding (NLU) systems in the cloud for further analysis.

These terms may vary depending on which ASR or NLU vendor you use, but the logic is usually quite similar.

The audio is now translated into a string of text. In this example (see the diagram above), the AI first understands that the user asked something about a timer, with “set” meaning creating (instead of canceling) and “15 minutes” being the duration of the timer.

A domain is normally smaller than, or similar in scope to, a feature. For example, a timer domain may fully support all the commands a user would ask about a timer, e.g. creating or deleting a timer. An email domain may contain intents (see step 2 in the diagram above) such as reading or replying to an email, and its entities may include an email’s title and sender. Some features may require support from more than one domain. For example, to write a new email the AI needs to understand who the recipient is, so a contact domain might also be required.
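
Since the diagram isn’t reproduced here, this is roughly what the NLU output could look like for the two examples above. Field names vary by vendor, so treat these as illustrative, not a real schema.

```python
# Illustrative NLU results: a domain, an intent within it, and the entities
# (details) that intent needs. Field names are made up for this example.

timer_result = {
    "domain": "timer",
    "intent": "create_timer",                 # "set" = create, not cancel
    "entities": {"duration": "15 minutes"},
}

email_result = {
    "domain": "email",
    "intent": "compose_email",
    # Resolving "Alex" to an actual person may need a contact domain too.
    "entities": {"recipient": "Alex"},
}

print(timer_result)
print(email_result)
```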

4. Prepare the answer

Whether your voice AI understands the command depends on your ASR and NLU capabilities. If a user asks for something within the product’s scope, now is the time to collect all the necessary info to move forward.

A question like “what is one plus one?” sounds super easy to a human. However, if the product doesn’t support any math-related domains, then it simply can’t answer it. An error message would have to be played to your user, such as “I’m not sure how to help you with that”. You may want to design a few variations and be creative, so it feels a little less disappointing for users.

If your users often ask questions from certain unsupported domains, it might not be a bad investment to develop surface-level support for these. You don’t need to build the backend thoroughly, but you do need a good enough understanding of the kinds of tasks that will be asked. For example, if your user says “tell me a joke”, an answer like “I’m not very good at entertaining people” sounds less disappointing than “I’m not sure what you mean by that”.
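
Here’s a rough sketch of how this step might branch: answer what’s in scope, give a friendlier surface-level reply for domains you know users ask about, and fall back to a generic error otherwise. The domain names and replies are hypothetical examples.

```python
import random

# Hypothetical routing for step 4. Domains, intents, and replies are examples.

# Surface-level replies for domains users keep asking about but you
# haven't actually built yet.
SURFACE_REPLIES = {
    "joke": "I'm not very good at entertaining people.",
}

# A few variations so the error feels a little less robotic.
GENERIC_ERRORS = [
    "I'm not sure how to help you with that.",
    "Sorry, that's not something I can do yet.",
]

def prepare_answer(domain: str, intent: str, entities: dict) -> str:
    if domain == "timer" and intent == "create_timer":    # fully supported
        return f"Got it. Your timer is set for {entities['duration']}."
    if domain in SURFACE_REPLIES:                          # surface support only
        return SURFACE_REPLIES[domain]
    return random.choice(GENERIC_ERRORS)                   # out of scope

print(prepare_answer("timer", "create_timer", {"duration": "15 minutes"}))
print(prepare_answer("math", "add_numbers", {}))
print(prepare_answer("joke", "tell_joke", {}))
```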

5. Text To Speech and Execution

Now your voice AI has everything it needs to respond. Normally, the answer is designed as text and spoken through Text To Speech (TTS). You may also record some canned messages as replies; just make sure the voice and tone are consistent with the TTS.

Continuing with the previous timer example, the voice AI would now use TTS to play a response similar to “Got it. Your 15-minute timer has been created”, and usually this would be the last step.

But if the task can’t be completed yet, the voice AI needs to ask follow-up questions. For example, your user might simply say “set a timer” without specifying the duration. Then it’s the voice AI’s job to anticipate the user’s needs and ask for the missing information to move the process forward.
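
A small sketch of that last step, with the TTS call stubbed out as a print statement: it speaks the confirmation when the duration is known and asks a follow-up question when it isn’t.

```python
# Sketch of step 5: speak the reply, or ask for missing information first.
# text_to_speech is a stand-in for whatever TTS engine the product uses.

def text_to_speech(message: str) -> None:
    print(f"[TTS] {message}")

def respond_to_timer_request(entities: dict) -> None:
    duration = entities.get("duration")
    if duration is None:
        # The user only said "set a timer": ask for the missing duration.
        text_to_speech("Sure, a timer for how long?")
    else:
        text_to_speech(f"Got it. Your timer is set for {duration}.")

respond_to_timer_request({"duration": "15 minutes"})   # complete command
respond_to_timer_request({})                           # "set a timer", no duration
```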

Qian Yu

UX Lead for Webex Assistant, a multi-modal conversational AI at Cisco Collaboration. 👨🏻‍💻 www.thousandworks.com