Can VUI Technology Meet Its Market Demand?

Quango Team
Quango Inc.
Jun 1, 2018 · 5 min read


Part 2 in our series on how voice user interfaces are changing the marketing and advertising landscape features a Q&A with author and VUI expert Cathy Pearl. Read Part 1 here.

Smart speakers are the latest craze in digital technology. One in six Americans owns a smart speaker, a 128% increase since January 2017. In the first part of our series on voice user interface (VUI) technology, we explored how voice-based technologies are being introduced in industries like healthcare, education, hospitality, and transportation. But, for all of VUI's innovative potential, are we putting the cart before the horse?

Recently we spoke with Cathy Pearl — VUI designer, engineer, and author of Designing Voice User Interfaces: Principles of Conversational Experiences — to gain insight into the strengths and weaknesses of voice-based technology, and what role she thinks VUI will play in the coming years.

At this stage of VUI development, what are some of the biggest issues that need to be addressed to improve user experience?

Two things. One: many of these systems are built by developers or designers who are new to voice and unfamiliar with its best practices, so we end up with VUIs that are difficult to use. The other obstacle is discoverability. Amazon Alexa, for example, has something like 30,000 skills. Who knows what they are? Which ones are good? Even when I've activated a skill on my own, I have trouble remembering the correct series of words to invoke it. Within a skill or action, it can be difficult for the user to know what they can (or can't) say next. Designers need to carefully craft conversations to make it clear how users should complete a task sequence.

What is the ideal conversational design sequence between a user and a VUI device?

The main thing is to make it super clear what the user needs to do, or what they can do, at every step. There's a fine balance, however: we don't want to build an interactive voice response menu monstrosity where we reel off [too many] options. You want to make the most common flow clear, and make help options available when your system doesn't understand, which will always happen. We can't skimp on design for when things go off track.
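Pearl's point about designing for the moment the system doesn't understand can be sketched as an escalating-reprompt loop: each failure gets a progressively more explicit prompt. This is a minimal illustration; the function name, prompt text, and the "order" domain are all invented for the example, not taken from any real VUI platform.

```python
# Minimal sketch of escalating error prompts in a voice dialog.
# All names and prompt wording below are hypothetical examples.

REPROMPTS = [
    "Sorry, I didn't catch that. You can say 'order' or 'track my order'.",
    "You can say 'order' to start a new order, or 'track my order' to check an existing one.",
    "Let's try the main menu instead.",
]

def handle_no_match(error_count):
    """Return the next prompt, giving more guidance with each failure."""
    index = min(error_count, len(REPROMPTS) - 1)
    return REPROMPTS[index]
```

The idea is that the first miss gets a short apology plus the most common options, while repeated misses fall back to a safer path rather than looping forever.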

Also, let the user determine how long the conversation will be. If I ask my Google Home, “What’s next on my calendar?” and it responds, “Dentist appointment,” I should be able to follow up with additional questions like, “How long will it take to get there?”
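The calendar example above depends on the system remembering the previous turn so a follow-up like "How long will it take to get there?" can resolve "there." Here is a toy sketch of that context carryover; the session structure, intent names, and event data are invented for illustration and don't reflect how Google Home actually works.

```python
# Illustrative sketch: remember the last answer so follow-up
# questions can refer back to it. All data here is made up.

session = {}

def answer(intent):
    if intent == "next_event":
        event = {"title": "Dentist appointment", "place": "123 Main St"}
        session["last_event"] = event          # remember for follow-ups
        return event["title"]
    if intent == "travel_time":
        event = session.get("last_event")
        if event is None:
            return "Travel time to where?"     # no context to draw on
        return f"Getting travel time to {event['place']}."
```

Without the stored `last_event`, the follow-up would have to fail or ask a clarifying question, which is exactly the short, system-controlled conversation Pearl argues against.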

Can voice recognition engagement operate solely on its own, detached from visual- or text-based support?

It’s absolutely possible to have rich, satisfying conversations that are voice-only. There are simple tasks (like turning on the lights), medium tasks (such as reordering cat food), and complex tasks (like engaging in small talk) that can be great voice-only experiences. Sometimes, however, a visual assist is needed — especially if there is a lot of information to share.

How can data improve VUI’s development, reach, and reliability?

Data can help in two ways. First, it can help systems understand different types of voices and accents. Right now, most of the data collected comes from white males, so the speech models work best for those users. We have historically lacked data for children's voices, or voices with regional accents, so those users are less likely to be understood. The more data we get, the better the technology will work for everyone. Step one is simply to understand the words a user speaks.

Step two: What now? It's important to realize that humans ask for things in different ways. Everyone has their own narrative, even if it's just to order pizza. As designers and developers, we need to accept that the way we think people will respond is not, in fact, always the way they will respond. When we look at logs of what people say to their systems, we can build a system that understands the ways people speak. For example, if I ask, "When do I have to leave?" I may get a response like: "October 2, 2018," or "next Tuesday," or even "tomorrow." The speech recognition worked perfectly, but now the system has to interpret those different answers and decide how best to respond.
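The log-driven approach Pearl describes often boils down to mapping many phrasings of the same request onto one intent. A crude way to show the idea is pattern matching; the patterns and intent name below are illustrative stand-ins, not a real natural-language-understanding model, which would typically use trained classifiers rather than regular expressions.

```python
# Sketch: normalizing varied user phrasings into a single intent.
# Patterns and the intent label are invented for illustration.
import re

DEPARTURE_PATTERNS = [
    r"\bwhen (do|should) i (have to )?leave\b",
    r"\bwhat time .*leave\b",
]

def classify(utterance):
    """Return 'departure_time' if any pattern matches, else None."""
    text = utterance.lower()
    for pattern in DEPARTURE_PATTERNS:
        if re.search(pattern, text):
            return "departure_time"
    return None
```

In practice, reviewing logs of real utterances tells you which phrasings to cover, which is why the data matters as much as the matching technique.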

What are your thoughts on the type of voice used in voice-activated devices? Are certain genders or tones viewed as more satisfying?

There is a misperception that in order to be a helpful assistant, a voice must be female. Most of the popular assistants use a female voice (although some can be set to a male voice instead). However, the most important factor for a voice is to be clear and understandable. In addition, you want your voice to match your brand or message. If you’re using VUI technology in healthcare, for example, [it doesn’t matter if] the voice is male or female. What matters is that it is caring and empathetic.

How far away are we from having personal conversations with a VUI system?

In some regards, we’re already there. I have personal conversations with my Amazon Echo and Google Home — they know a lot about me. We’ve asked Amazon Echo to sing “Happy Birthday” with us. My son asks for help with his homework. And our users at Sensely will tell our avatar, Molly, when they’re stressed out.

But if the question is, "How far away are we from having a conversation indistinguishable from one with a human?" (i.e., passing the Turing Test), then we have a long way to go. VUIs are not good at context, switching subjects, or remembering what you talked about last week. For now, we are working to make conversations more pleasant and human-like, but we're a long way from Samantha in the movie Her.

Got an idea for your own VUI marketing campaign, but need help getting to the finish line? We’re here for you.

Interview has been edited for clarity and length.
