Designing for the Voice User Interface
Apple just released the HomePod at WWDC 2017, Google has Home and positioned itself as an AI company at Google I/O 2017, Amazon’s Alexa has been speaking through Echo for a while, and AI is taking over the world through voice-based interaction. Although voice-based interaction has been around for quite some time, from voice commands in cars and Siri on the iPhone to voice input methods on smartphones, it has only become enticing (more humane) with the advancement of AI technology. And it will continue to be the most prominent interface through which interaction with AI is carried out. (Yeah! Sooner or later I can have my own version of Samantha.)
But why speech?
- speech is the fundamental form of communication; even kids who cannot read or write can communicate via speech. Although there are more expressive channels like writing and reading, people greet, exchange ideas, debate, and persuade using speech in every culture.
- speech is the most natural form of communication; interacting with a Voice User Interface (VUI) feels like talking to a real human instead of a machine. Asking Echo to put the Hue LED lights into party mode costs less cognitive effort than figuring out what to do on a graphical user interface.
- speech is hands-free, letting users make efficient use of fragmented time by multitasking.
A Very Brief History
If we look back, audio/voice interfaces first emerged in the 1950s and have gone through the following stages:
- stage 0, voice input: basic voice recognition that converts speech input to text. (Bell Laboratories designed the “Audrey” system in 1952, which recognized digits spoken by a single voice.)
- stage 1, voice command: simple voice-based commands
- stage 2, natural language: the ability to understand and react in natural language
- stage 3, context awareness: of both the user and the device itself
- stage 4, personality: look at the customer reviews for Echo; users treat Alexa as a human instead of a machine.
- stage 5, emotion (and ultimately, control over humans?): remember Samantha from the movie “Her”?
Designing for a voice interface is totally different from designing a graphical one. There are no pixels to push. And the skill sets, tools, and guidelines familiar to a graphical UX/UI designer are no longer relevant.
If I were to design a voice user interface, the first challenge that comes to mind is low affordance.
“…the term affordance refers to the perceived and actual properties of the thing, primarily those fundamental properties that determine just how the thing could possibly be used. Affordances provide strong clues to the operations of things. Plates are for pushing. Knobs are for turning. Slots are for inserting things into. Balls are for throwing or bouncing. When affordances are taken advantage of, the user knows what to do just by looking: no picture, label, or instruction needed.” — Don Norman, 1988
Looking at a graphical user interface gives users rich visual clues (sometimes even overwhelming ones) about what they can do. But a voice interface usually presents very few clues. How do we provide just enough affordance without disturbing the user with long descriptive narratives?
Lack of Context
Speech by its nature is volatile. A graphical interface responds to the user’s input and won’t go away unless the user dismisses it, but a voice interface vaporizes as time elapses. The cognitive payload each round of voice interaction can carry is limited. How can we leave breadcrumbs for users to find out what context they are in? And how many breadcrumbs are necessary?
Full of Context, hidden between the lines
Natural language itself is difficult for machines to understand. One of the major reasons is the vast amount of context information hidden between the lines.
Humane Touch, and how skeuomorphic should it be?
How do we make the device feel less like a machine, personalize it, and add a humane side to it? How much human touch should be added? How far should we go down the skeuomorphic lane? And what type of personality should the product present?
I don’t have the answers to these questions yet. I guess the most efficient way to find out is to build a voice user interface of my own and try things out. Looking around my apartment, I found 2 wifi routers, 1 air purifier, 1 vacuum cleaning bot, 1 wireless stereo, and 1 wifi camera, all from XiaoMi. They are all connected to a centralized app on my phone. You can even define user scenarios via the app, like an IFTTT service. But they don’t have a product like Echo to control these connected appliances. I’ll try to hack things a bit to build a demo around this and post some of my learnings when I’m done.
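As a very first pass at such a demo, the glue between speech and the appliances could be a trivial keyword-based intent matcher that maps a transcribed utterance to a device command. This is a minimal sketch of that idea only; the device names and actions are hypothetical stand-ins, not the actual XiaoMi APIs, and a real system would use proper natural language understanding rather than keyword spotting.

```python
# Minimal sketch: map a transcribed utterance to a (device, action) command
# using keyword matching. Device names and actions are hypothetical.

INTENTS = {
    ("purifier", "on"): ("air_purifier", "power_on"),
    ("purifier", "off"): ("air_purifier", "power_off"),
    ("vacuum", "start"): ("vacuum_bot", "start_cleaning"),
    ("camera", "on"): ("wifi_camera", "power_on"),
}

def parse_command(utterance: str):
    """Return the (device, action) pair for the first matching intent,
    or None when nothing matches -- the cue to ask a clarifying question."""
    words = utterance.lower().split()
    for (device_kw, action_kw), command in INTENTS.items():
        if device_kw in words and action_kw in words:
            return command
    return None

print(parse_command("turn the purifier on"))
```

Even this toy version surfaces the design questions above: when `parse_command` returns None, the interface has to recover gracefully through dialogue, since there is no screen to fall back on.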
BTW, I’m pretty sure MI is building one, and it won’t be long before we see a MI voice home controller. MI is investing heavily in AI and is an active contributor to the AI and deep learning research and open source communities. This could benefit them in creating their own version of Alexa. On the other hand, microphone array technology is mature and familiar to them; the XiaoMi notebook ships with a digital array microphone. The one open question is the form factor: will it be a standalone speaker similar to Echo, or will it be incorporated into the MI TV?