I recently got to take part in a project at work creating a prototype of a product idea with a voice user interface. We used the Amazon Echo and the Alexa Skills Kit to build an app that demoed some use cases we thought might be relevant to our customers.
The prototype didn’t pull any real data, so the focus of the project was entirely on the user experience, with some exploration into what is currently easy, difficult, or even possible with voice. From these explorations in designing and building a prototype for a product whose only interface is voice, I learned a few things I wanted to share.
Special thanks to Wolf Paulus for his pioneering efforts in voice and for teaching me and my colleagues many of the points listed here.
1. Start simple, then learn from your users.
One of my biggest realizations when I first started looking into voice interfaces was that we didn’t have to get everything perfect from the beginning. Providing a few valuable features is enough to start, and then you can figure out what else is missing by collecting the verbatim requests from your users. If you wanted to create a voice-based calculator app, for instance, you would probably be fine starting with the basic operations and replying “sorry, I can’t do that yet” to the rest. Then, if you notice a lot of users requesting things like “What is 9% of 124?”, you can reasonably determine that percentage calculations are important to your users and should be built next. Of course, for this to work the developers and designers of the voice interface need access to logs of the user requests that could not be processed. Unfortunately, the Alexa Skills Kit doesn’t provide this to developers quite yet, but you can be sure Amazon, Apple, and the other big players in voice are collecting this information to make their assistants better over time.
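The “start simple, log the rest” strategy can be sketched in a few lines. This is a hypothetical toy, not the Alexa Skills Kit API: a calculator handler that answers the basic operations it knows and records every unhandled request verbatim for later review.

```python
# A minimal sketch of "start simple, then learn from your users":
# handle a few basic operations and log everything else verbatim so
# unmet needs can be reviewed later. Names and log format are made up.
import re

unhandled_log = []  # in a real skill this would go to persistent storage

def handle_calculator(utterance):
    """Answer basic arithmetic; log anything we can't do yet."""
    match = re.fullmatch(r"what is (\d+) (plus|minus|times) (\d+)",
                         utterance.lower())
    if not match:
        unhandled_log.append(utterance)  # keep the verbatim request
        return "Sorry, I can't do that yet."
    a, op, b = int(match.group(1)), match.group(2), int(match.group(3))
    results = {"plus": a + b, "minus": a - b, "times": a * b}
    return f"The answer is {results[op]}."
```

Once enough users hit the fallback with requests like “What is 9% of 124?”, the log itself tells you percentages should be built next.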
Voice has an amazing capability to surface user needs and expectations in a natural way. In traditional screen-based interfaces, when users want to do something they have to try to find that capability on their own, and as a maker you have very little insight into the tasks users are trying to perform but can’t find in your product. I believe voice interfaces will become smarter and more robust at a much faster rate than apps of the past, since feedback about what users want and expect is so readily available to the makers.
2. Listen for keywords, ignore the rest.
At least with the Alexa Skills Kit, the developer of a voice app must provide lists of the phrases a user might say to indicate a specific request. When first building out these lists, you can keep things much simpler than you might expect. Instead of worrying about sentence structure for every way someone could ask for something, focus on which keywords indicate a given request. For example, someone who wants to know the weather might say “what’s the weather?”, “weather forecast for today”, “forecast for San Francisco”, and many other phrases. But the words “weather” and “forecast”, and perhaps a few more, are the key parts of the request that indicate your app should fetch weather data. Listening for these words is the key to getting functional interactions into your app quickly. You should list a few options for sentence structure, but in my experience you don’t need complete coverage if specific words distinguish the different requests you expect from a user.
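The keyword-first approach above can be sketched as a plain lookup, independent of any particular voice platform. The intent names and keyword sets here are illustrative assumptions, not part of the Alexa Skills Kit:

```python
# Hypothetical sketch of keyword-first intent matching: rather than
# enumerating every sentence structure, check the utterance for the
# few words that distinguish each request.
INTENT_KEYWORDS = {
    "GetWeather": {"weather", "forecast"},
    "PlayMusic": {"play", "song", "music"},
}

def match_intent(utterance):
    """Return the first intent whose keywords appear in the utterance."""
    words = set(utterance.lower().replace("?", "").split())
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:  # any distinguishing keyword is enough
            return intent
    return None
```

“What’s the weather?”, “weather forecast for today”, and “forecast for San Francisco” all land on the same intent without any of those sentence structures being listed explicitly.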
3. Filtering through lots of options and combinations is a killer use case for voice.
Do you hate drop-downs? I know I do. Slots in the Alexa Skills Kit let you listen for certain words in certain positions in a sentence structure. In the weather example, you could listen for a city or location, drawn from a huge list of locations, to get the weather for. But since weather isn’t the most compelling example (weather apps and Google do a decent job without voice), imagine setting up a document to print.
Instead of spending a bunch of time getting frustrated with drop-downs and hierarchical menus, digging through options you don’t even understand to find what you want, you could just shout at the printer: “Hey printer! Print landscape, in color, on the largest paper you have”. This kind of beautiful thing is possible today with the magic of slots and the voice interaction tools already available.
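A slot behaves roughly like a named blank with a known list of possible values. Here is a toy sketch of that idea for the printer example; the slot names and value lists are my own assumptions, not a real printer vocabulary:

```python
# A sketch of slot-style extraction: each "slot" has a known list of
# values, and we scan the utterance for whichever value was spoken.
PRINT_SLOTS = {
    "orientation": ["landscape", "portrait"],
    "color_mode": ["in color", "black and white"],
    "paper": ["largest paper", "letter", "a4"],
}

def fill_slots(utterance):
    """Return a dict of slot name -> matched value (None if unspoken)."""
    text = utterance.lower()
    return {
        slot: next((v for v in values if v in text), None)
        for slot, values in PRINT_SLOTS.items()
    }
```

One spoken sentence fills three drop-downs at once, which is exactly why filtering through lots of options is such a strong use case for voice.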
4. Voice in its current state is best for one-off requests.
For example, “How many tablespoons are in a cup?”, “Add a reminder to get milk”, or “Print landscape, in color, on the largest paper you have”. While this may sound simple, it still involves understanding the user’s environment, anticipating what they might ask of your device, and generating keywords and sample phrases to listen for that map to the appropriate action. Good apps will additionally remember context and allow for follow-up requests such as “What’s the weather in San Francisco?” followed by “What about tomorrow?” As soon as the interaction between user and device lasts longer than a simple exchange based on a one-off request, things get complicated…
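Remembering context for follow-ups can be as simple as carrying a session dictionary between requests. This sketch is entirely hypothetical (the function, the context shape, and the hard-coded city check are all made up for illustration):

```python
# A sketch of follow-up handling: remember the last location in session
# context so "What about tomorrow?" can reuse it.
def handle_weather(utterance, context):
    """Resolve location and day, falling back on remembered context."""
    text = utterance.lower()
    if "san francisco" in text:          # toy stand-in for a location slot
        context["location"] = "San Francisco"
    day = "tomorrow" if "tomorrow" in text else "today"
    location = context.get("location")
    if location is None:
        return "Which city would you like the forecast for?"
    return f"Fetching {location}'s forecast for {day}."
```

The second request never names a city, yet it resolves correctly because the first request left one in the session context:

```python
context = {}
handle_weather("What's the weather in San Francisco?", context)
handle_weather("What about tomorrow?", context)  # reuses San Francisco
```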
5. Conversations are really difficult to get right.
The design and implementation of conversations in voice interface products is really difficult. We rarely stop to appreciate how complex our everyday interactions with other humans actually are. Open-ended questions like “How was your day?” have a mind-boggling number of reasonable responses! Until we reach a very different level of artificial intelligence, the best bet for designing a back-and-forth conversation between a user and your voice interface is to indicate to the user the responses you are expecting, e.g. “Would you like to search by artist, album, or song title?”.
Yes-or-no questions work well because they feel natural and user responses fit into just three categories: affirmative, negative, or neutral. For example, instead of “How was your day?” a voice interface could ask “Did you have a good day today?” and listen for affirmative answers (“yes”, “absolutely”, “good”) or negative ones (“no”, “terrible”, “not really”). Then, if the device knows the user had a bad day, it could say something like “Sorry to hear that. Would you like to listen to your Feel Good playlist to cheer you up?”, again looking for a yes-or-no reply. This strategy of indicating possible responses eases the burden on the device to deal with an overwhelming variety of input, and it also makes it easier for the user to know what is expected of them. Voice is a very new interaction model for many people, and it is often unclear what is possible.
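Sorting free-form replies into those three buckets can be sketched with simple word lists. The lists below are illustrative only; a real product would need far broader coverage:

```python
# A sketch of three-bucket yes/no classification: map a free response
# into affirmative, negative, or neutral. Word lists are illustrative.
AFFIRMATIVE = {"yes", "yeah", "absolutely", "good", "great"}
NEGATIVE = {"no", "nope", "terrible", "bad", "not really"}

def classify_response(utterance):
    """Return 'affirmative', 'negative', or 'neutral' for a reply."""
    text = utterance.lower().strip(".!?")
    if text in NEGATIVE or any(w in NEGATIVE for w in text.split()):
        return "negative"
    if any(w in AFFIRMATIVE for w in text.split()):
        return "affirmative"
    return "neutral"  # anything unrecognized; re-prompt the user
```

Everything that falls into the neutral bucket is a cue to restate the expected responses, which keeps the conversation on the rails the designer laid down.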
6. It is easy to be self-conscious when talking to technology.
Some people don’t want to even begin interacting with voice interfaces. I’ve used an iPhone for years and almost never interact with Siri. I use my phone all day, usually surrounded by other people, and I don’t want to look like a crazy person or to bother people around me. I think the most successful voice interfaces are designed for situations when the user is alone, such as in the car, or with trusted friends/family, such as when the user is at home and might use an Amazon Echo.
7. Always listening is the future of voice interfaces.
Ignoring the debates about privacy for now, I think it’s clear from a user experience perspective that voice interfaces will get higher adoption if they are always listening. Apple’s new “Hey Siri” feature to wake up Siri without pressing a button is just one of many indications that we are headed in that direction. As soon as a user has to press a button or interact with a more traditional UI on their phone or computer to start voice interactions then the magic goes away. As soon as I touch the home button on my iPhone or the microphone in Google Search it is simpler for me to go to an app or start typing than to say my request out loud. I’ve been well trained to use phone and computer interfaces and to change my habits there must be a significant benefit. Being able to mix cookie dough in my kitchen and ask the Echo to skip to the next song at the same time (instead of interrupting my current task, cleaning my hands, and walking across the room to my phone) is enough of a benefit to change my habits.
8. Voice interfaces will have a personality whether you intend them to or not.
Voice is a very emotion-rich user interface. Pretty much since the debut of Siri, people have been falling in love with her or getting angry at her lack of understanding. Users expect a lot from voice interfaces because they have been designed to sound as human as possible. Even voice-based assistants like Siri and Alexa, which are meant to be as neutral as possible in their responses, have a personality just by virtue of the tone and sentence structure of what they say. Both Apple and Amazon seem to understand the power of personality, as they keep adding Easter eggs to their products.
As voice interfaces become more ubiquitous, and people begin designing voices for specific tasks and situations, I think it will become even more apparent that the design of voice interfaces is the design of personalities. There are a lot of interesting directions the tech industry could take this: a voice interface that helps you make a doctor’s appointment could be designed to be reassuring and calming; a voice designed to be your personal trainer and health coach could be reproachful or strict, if that’s what works for the user. As more companies get into voice interaction, personality is something we will all need to be conscious of. Designers of voice interfaces need to be careful to add personality in a way that helps the end user quickly get value from the product without reinforcing stereotypes. It’s a tricky line to walk, but worth the challenge.
Anything I missed? Let me know in the comments. I’m still new to voice interface design and would love to learn more.