Intro to Designing for Voice
What you need to know to start designing voice user interfaces
Voice User Interfaces have finally arrived. In recent years, the technology has reached the tipping point of being good enough. Depending on the market, takeoff for systems making heavy use of it is either in full swing or right around the corner. Whatever predictions you believe about how big or important they will become, voice is a new mode of interaction that is going to enhance communication with pretty much every connected device.
There is a high chance, depending on your product and its contexts of use, that voice will enhance or even replace your visual UI in the near future. In fact, most products will offer many modes of interaction as gateways to the service: some screen-based, some voice only, and some a mix of the two.
We at USEEDS are fully aware that voice interaction is going to have a profound impact on any digital service, and subsequently on anyone working on these services and products. Being ready for the imminent future of ambient computing, as a designer, means being able to present relevant content through the means, media and modalities that fit the user's situational and temporal context in any given moment.
This, in a way, can be seen as the ultimate responsive design. It’s not about choosing voice for the sake of itself, but rather to always be ready with the best mode of interaction for a given use case.
That’s why in the past months, as well as the ones ahead, my main focus at work has been on designing for voice interaction. While I’m super excited about this, getting started in a field you know close to nothing about can feel a bit overwhelming at first. I’ve gotten into the groove a bit now and can confidently tell you that it’s still really exciting and an incredible challenge, but not nearly as scary as it seems at first. To give anyone an easy intro to the field and whet your appetite for more, I’ve put together a brief overview of the most important things to know when you want to start designing for voice.
Structure of Voice Platforms
You could, of course, start by building your own AI assistant. But for the sake of simplicity and applicability, we are going to focus on piggybacking on existing platforms from major players. Amazon, Google, Apple and Microsoft all have their own voice assistant that they are trying to spread far and wide. By building apps on their platforms, we can make use of their built-in smarts, huge scale for distribution and continuously improving systems. Plus, in an effort to persuade you to build for their platform, each gives you tools that make getting started easy as pie.
Before you can get going, though, it is important to understand — roughly — how these voice platforms are structured. Bear with me here, it’s not that complicated. There are three parts to the system. The device you’re talking to down on earth, plus two layers for handling data in the cloud.
The device that is used for the voice assistant — no matter if it’s a dedicated speaker in your home, your smartphone, TV, fridge, car or whatever — acts as nothing more than a mic, a speaker and a small processor for hotword detection. Once the user activates it and speaks their command, the device sends the audio to the assistant service in the cloud. The assistant layer takes care of all the heavy lifting, such as voice transcription and natural language understanding, and delegates to other services. Once it figures out a command is meant for your app, it transcribes the voice input into a string and sends it over to your app. The app uses this input to do whatever it does — calculate something or grab some more information — and finally composes a fitting reply to present to the user. The app sends this reply off to the assistant, which in turn sends it back to your device. The device then, you guessed it, speaks the reply and eagerly awaits its next instruction.
What that means for us is that we don’t have to take care of the dirty details of picking up the user’s voice and making sense of their words. Instead, we can focus on getting the logic and interaction bits right.
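To make this division of labor concrete, here is a minimal sketch of the app layer in Python. The intent and slot names (GetWeather, city) are hypothetical, and each platform's real request and response formats differ, but the shape of the exchange is the same: a structured request comes in, a spoken reply goes out.

```python
# Minimal sketch of the "app" layer. By the time our code runs, the
# assistant platform has already transcribed the user's speech and
# matched it to an intent, so all we receive is a small structured
# request. Intent and slot names here are hypothetical.

def handle_request(request: dict) -> dict:
    """Take a parsed request from the assistant and compose a spoken reply."""
    intent = request.get("intent")
    slots = request.get("slots", {})

    if intent == "GetWeather":
        city = slots.get("city", "your area")
        # A real app would call a weather API here.
        reply = f"Right now it is sunny in {city}."
    else:
        reply = "Sorry, I didn't get that. Could you rephrase?"

    # The assistant layer turns this text back into speech on the device.
    return {"speech": reply}

print(handle_request({"intent": "GetWeather", "slots": {"city": "Berlin"}}))
```

Everything before and after this function — audio capture, transcription, speech synthesis — is handled by the platform.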
Any area of expertise you are new to comes with its own special set of words and abbreviations. Though often necessary, this jargon makes it even harder to get into the field. As if all the deep knowledge you’re missing wasn’t enough, all these strange words — that may or may not be super important — make everything seem much harder than it actually is. To get this hurdle out of the way, here are the most important things you should have heard about:
- VUI: abbreviation for voice user interface. Describes an interface in which voice is the primary mode of interaction.
- CUI: abbreviation for conversational user interface. Describes an interface in which the interaction occurs in a discursive, continuous dialog between the user and the machine. A CUI should also abide by the principles of conversation, most notably the cooperative principle. It may or may not use voice as a mode of input and output.
- NLU: abbreviation for natural language understanding. Describes the ability of a machine to organically map naturally spoken queries to the correct intent.
- Skill: The name for apps running on top of the assistant platform by both Amazon and Microsoft. Google calls them Actions.
- Intent: A functional unit containing multiple possible paths of conversation for completing a user request. In a way, it’s what the user wants the assistant to do when triggered by a certain request.
- Utterance: A certain phrase a user can say to invoke a specific intent, answer a question, etc… You’ll define utterances to train the natural language understanding engine to recognize a broad set of related inputs.
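To make the difference between an intent and its utterances concrete, here is a hypothetical interaction model sketched as a Python structure. The actual schema varies by platform (Alexa and API.AI each use their own JSON formats), but the idea is the same:

```python
# Hypothetical interaction model: one intent, trained with several
# sample utterances. The NLU engine generalizes from these samples,
# so a phrasing you never wrote down can still match the intent.
interaction_model = {
    "intents": [
        {
            "name": "GetWeather",
            "utterances": [
                "what's the weather like",
                "how is the weather in {city}",
                "do I need an umbrella today",
                "tell me the forecast for {city}",
            ],
        }
    ]
}

# {city} marks a slot: a variable part of the utterance that the NLU
# engine extracts and hands to your app alongside the intent name.
```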
Ok, now that you’re up to speed on the basics, let's get to the process of actually designing a voice app. There is already a ton of valuable information out there, most notably the Design section for Actions on Google. I recommend you check it out! What follows is a brief summary of a 5-step process that has worked really well for us at USEEDS. I will stay high-level for now, so that it’s easier to get a good overview quickly, and will describe each step in more detail in a separate post later.
1. Use Case
Finding the right use case for your app is undoubtedly your most important task. It’s what can make the difference between heavy use and irrelevance.
Let me illustrate that with an extreme example: If an app does something that is highly useful for you, you will probably use it and try to make it work even if the UI is horrible. In contrast, imagine an app that has a gorgeous UI, but essentially does nothing. Would you use that?
The use case essentially describes what a user wants to achieve with your product and why. The important thing to remember is that the use case describes the user’s situation and intention, the why, but ignores the way it is achieved, the how.
The Job Story
It makes sense to describe a use case in the form of a job story. This format puts emphasis on the situational context of the user. That context directly influences their motivation and fuels their special needs in a solution. Here is how to frame a job story, borrowed from Alan Klement’s article: “When [situation], I want to [motivation], so I can [expected outcome].”
This focus on context makes the job story especially well suited for VUIs. That is because VUIs enable a state of ambient computing in a way that has not been possible with traditional UIs and devices. In that state, seamless access to computational services is given at any time, any place and in any way. And that is where VUIs can really shine. To assess whether your use case is any good, I’ve found three important criteria to check against as a reference.
- Convenience: A user can now do something more easily, faster or more efficiently than before. Using the system takes physical or cognitive load off of them.
- Accessibility: A user can now do something they could not do before. That may be because they need their hands and motor skills for another task, such as driving or cooking. It may also be because of a physical handicap, such as visual or motor impairment.
- Frequency: How often a user wants to do something. If a use case is super convenient and accessible, but you only need it once a year, the effort for setting the system up might outweigh its benefits.
2. Assistant Persona
Once you know what your app will do, it is important to define who it is speaking on your behalf with your users. The assistant persona, in a way, is analogous to the visual appearance of a graphical interface. Getting it right makes the difference between a smooth experience and a feeling of mistrust and doubt. Obviously, that is not what we want!
To define a fitting persona, start by giving it a name. Choose a real name, such as Alexa, or a descriptive one like Number Genie, whatever fits your brand and app use case. You will not need to refer to that name throughout your VUI — although you can — but it’s invaluable for the design process. For the sake of having a complete picture, give your assistant a little backstory. What is its job? What’s its personality like? How do people perceive it?
To figure out how it will actually speak, start by collecting a set of attributes that your brand is known for among your customers. Next, imagine how these attributes translate to speech properties of your assistant. How would it talk, what kind of words would it use, at what speed or in what tonality would it speak? Write down everything you can think of in a style guide, ideally including dos and don’ts for illustration. It’s always easier to transfer from concrete examples than to comprehend abstract descriptions.
Once you have a well enough idea of who your persona is, give it an avatar to make it even more tangible. All this will help tremendously with trying to put yourself in its position when writing the dialogs.
3. Scripting, Acting & Intents
Now it’s time to get to the actual meat of the VUI: the dialogs. You’ll have to define what the user can say, as well as what the machine might reply. The process here is very similar to actual screenwriting. Based on a certain setting of contextual preconditions, you’ll sketch out a turn by turn dialog with two partners: your user and the assistant.
Now our work on the assistant persona comes in handy, as it helps us imagine the assistant as an actual person full of motivation and intentionality behind expressions. Without that, it’s all too easy to fall back into a more robotic, less natural scheme, since this is generally easier. It’s kind of funny how something that comes naturally to us in conversation all of a sudden becomes really complicated when you’re trying to analyze and construct it.
To get started, just jot down a hypothetical dialog, off the top of your head, for one of the ways your conversation could go. Sketch out the happy path where everything goes according to plan. Both partners are understanding each other well. Both are cooperatively pushing the conversation forward to an efficient and successful end.
Once you got a version of the ideal flow down, repeat the process over and over, exploring variations, alternative flows and all sorts of error paths. These can include some related or additional functionality, missing input data, unclear user input, etc… You’ll get the hang of it and come up with ever more during the process. This gives you a good sense for all the situations your app should be able to cover. And subsequently about the logical structure of your app.
Once you’ve got all of the important dialogs down on paper, it’s time for action! Grab a partner, or even ask a duo for help, and act out the conversation as in a screenplay. This step is crucial because it helps you hear what your written words will sound like. And that’s often weird, too long or much more inelegant than you first thought. Written language is just really different from spoken language.
Going through these dialogs multiple times, making adjustments to the wording and iterating on it, just makes the outcome so much better. I know it’s uncomfortable and you might feel silly reciting a 3 line dialog over and over. But trust me, it’s worth it!
Now that we know what interactions we want to cover, we can derive the technical intents we need to achieve that. You do that by going through your scripts and analyzing whether a dialog is just a special case of an existing functionality, or whether it is indeed different, additional functionality. Every functional unit forms its own so-called intent.
At this stage, it makes sense to map out a functional diagram of your skill: every individual intent becomes a tree structure of possible errors and detours that eventually reaches the user’s goal or directs them to another intent.
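One simple way to mirror such a functional diagram in code is a dispatch table that maps each intent name to its own handler, with a fallback for anything unrecognized. The intents and wording below are hypothetical, a sketch rather than any platform's actual API:

```python
# Hypothetical dispatch table: each intent from the functional diagram
# gets its own handler function; unknown intents fall back to a reprompt.

def get_weather(slots: dict) -> str:
    return f"Checking the weather for {slots.get('city', 'your area')}."

def set_reminder(slots: dict) -> str:
    return f"Okay, I'll remind you to {slots.get('task', 'do that')}."

def fallback(slots: dict) -> str:
    return "Sorry, I can't help with that yet."

INTENT_HANDLERS = {
    "GetWeather": get_weather,
    "SetReminder": set_reminder,
}

def dispatch(intent: str, slots: dict) -> str:
    """Route a recognized intent to its handler and return the reply text."""
    return INTENT_HANDLERS.get(intent, fallback)(slots)

print(dispatch("SetReminder", {"task": "buy milk"}))
```

Keeping one handler per intent also mirrors the error paths from your scripts: each handler can branch on missing slots or detours without touching the others.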
4. Refine and Diversify
Now, our job is to take our dialogs and bring them into a format that makes sense for the machine and will produce effective and engaging conversation. To do that, split up your dialogs into utterances and replies. Everything the user says is an utterance, while everything the system outputs is a reply.
Within one turn of a dialog, we now have one specific utterance and one specific reply. While that works when you know the exact script, it’s not very helpful in real life. People simply don’t all talk the same way.
Imagine that five people have essentially the same informative question to ask you. Depending on their background, personality or even mood, each of them will ask the question in a slightly, or even completely, different way. Yet they all want the same answer. Formality, content and even sentence structure may be completely different.
Equally, when you reply the same bit of information to each of them — even though your personality stays the same — your wording will always be slightly different. That’s what makes interaction natural and not feel pre-programmed. And natural is what we’re aiming for.
For both utterances and replies, find and write many variations of the same expression. Utterances should vary greatly, covering the biggest possible spectrum of wording. These are used to train the NLU algorithm to eventually recognize the correct intent no matter how the query is formulated. Diversifying replies follows the same ideal of making conversation feel natural, though its spectrum is much less eclectic. Write a multitude of interchangeable replies for every turn in the conversation, but always remember to stay in character as your assistant persona.
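A minimal sketch of this reply diversification, assuming a hypothetical reminder skill: keep a pool of interchangeable, in-persona replies for one turn and pick one at random, so repeated interactions don't sound canned.

```python
import random

# Hypothetical pool of interchangeable replies for one conversational
# turn. All variants carry the same information and stay in the
# assistant persona's voice; picking one at random keeps repeated
# interactions from sounding pre-programmed.
CONFIRM_REPLIES = [
    "Done! Your reminder is set.",
    "All set, I'll remind you.",
    "Got it, the reminder is saved.",
]

def confirm_reminder() -> str:
    """Return one randomly chosen confirmation for the reminder turn."""
    return random.choice(CONFIRM_REPLIES)

print(confirm_reminder())
```

Utterance variation, by contrast, lives in the interaction model you feed the NLU engine, not in code like this.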
5. Test & Iterate
Now you should have everything in place to give the first version of your voice app a go! Lastly, as with most computational products, the final step closes the loop for continuous iteration and optimization. Your interaction most likely will not be perfect on the first try, and that’s ok.
Before you get to work building out the complex logic powering your app, it makes sense to prototype and test your concept. That way you don’t waste resources in case you went wrong somewhere. Depending on the tools you use, you can quickly mock up a static demo with predefined input and output. Google’s API.AI offers testing tools right in its app builder GUI, while tools like SaySpring let you test your prototype on an actual assistant device. Or you can just build a simple demo app without any logic behind it. With the tools and demos provided by Amazon, mocking up an Alexa Skill is really quick and easy.
Now go and test your prototype with potential users. As always, this is a quick and helpful way to identify design shortcomings or technical debt. Use what you learn by observing and analyzing the tests to improve your design, and you should be well prepared to offer a solid voice interaction.
I’ll say it again just to make it oh so clear: testing, observation, and iteration are key to designing great voice interfaces. Keep that and the above process in mind and you should be well prepared to build your own app on top of an assistant platform in no time!
Originally published at https://chatbotslife.com on June 2, 2017.