Designing with Voice AI
The use of voice for communication has been with us for about 50,000 years. But for most of the history of computing, voice as an interface has been out of reach, and we settled on less error-prone modes of communicating with machines. This has been changing slowly since the 1950s, when the first speech recognition systems were developed to recognize spoken digits. In a classic story on pretotyping, IBM tested a Wizard-of-Oz setup for speech transcription — a hidden typist played the part of the software — and found that most users grew tired of using their voice for longer periods of dictation.
It’s only recently that voice interfaces have become reliable enough for simple commands and queries. This is an exciting step because reliability drives adoption: according to ComScore, 50% of all searches will be voice-based by 2020. More usage inevitably means more feedback, which will shape future designs.
But we can’t plateau here, stuck in the humorous and infuriating world of #SiriFail. When creating the next iterations of voice systems, designers should keep in mind the nature of voice as a mode of communication.
The Nature of Voice
Now that voice interfaces are becoming more practical, there’s a temptation for designers to add them to every conceivable product. However, it’s prudent to keep in mind the characteristics of voice.
Voice interaction is a natural way for humans to respond to rapidly changing circumstances. In team-based sports or workplace collaboration, it’s the easiest way of reacting to change. As Field Marshal von Moltke the Elder might have put it had he been a coach, “No play survives contact with the opposing team.”
Voice interfaces have an effectively infinite surface area. Similar to a terminal app (or even a chat/messaging app), new functionality can be added without cluttering the interface. Contrast this with the meme of the overcrowded GUI with buttons and other widgets jammed into every available space. Of course this strength comes with a downside — more on that in a moment.
Voice is one-dimensional, time-based data. It doesn’t have a spatial component. But we deal with a lot of multi-dimensional, nonlinear information. Have you ever booked a trip or listened to movie schedules via an interactive voice response system? If so, you know the pain of using a voice interface to navigate tabular information. The affordance of scannability is noticeably missing.
Command-based voice interfaces can easily interrupt the flow of human activity and conversation. Some controls are much better left to non-verbal interfaces. This is well illustrated by those meetings in which someone is presenting while another person is controlling the slides. The natural flow of the content is constantly interrupted by the speaker requesting, “Next slide, please.” A voice-based presentation controller would bake that interruption into every slide change.
The flip side of having an infinite surface area is the problem of discoverability. Voice systems suffer from this problem. It’s part of the reason why Amazon regularly sends emails to Echo customers to remind them of capabilities and voice commands to try. Of course, humans have a discoverability problem as well. To get around this we sometimes wear badges and name tags with slogans like, “Ask me about the daily discount!”
The most important design consideration when working with voice is to prefer multimodal systems. Target “voice plus x”, where x is another mode of communication that compensates for the weak points of voice. Just as face-to-face conversation naturally stacks speech with gesture, gaze, and expression, multimodal designs will be more resilient and less annoying for users.
The first and most obvious complement to voice is a visual interface. This has been covered elsewhere, but is worth reinforcing. Screens are getting larger, smarter, and more numerous. They represent the easiest way to display confirmations of voice interactions, and to provide access to multi-dimensional data for scannable navigation. We’ve already started to see this in the consumer world with Siri and Alexa available on various screens and screen-connected consumer devices.
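The “voice plus screen” split can be sketched in a few lines: answer briefly by voice, and hand the multi-dimensional detail to a display where it can be scanned. This is a minimal illustration with hypothetical names (`Response`, `book_flight`, the flight fields), not any particular assistant’s API.

```python
from dataclasses import dataclass

@dataclass
class Response:
    speech: str          # short spoken confirmation
    display: list[str]   # scannable detail rendered on a screen

def book_flight(query_results: list[dict]) -> Response:
    """Summarize by voice, but push the tabular detail to the screen."""
    speech = (f"I found {len(query_results)} flights. "
              f"The cheapest is ${query_results[0]['price']}.")
    display = [f"{r['airline']}  {r['depart']}  ${r['price']}"
               for r in query_results]
    return Response(speech, display)

# Hypothetical search results, cheapest first
flights = sorted(
    [{"airline": "AA", "depart": "08:10", "price": 420},
     {"airline": "BA", "depart": "11:35", "price": 380}],
    key=lambda r: r["price"],
)
resp = book_flight(flights)
```

The design point is the split itself: the voice channel carries a one-sentence summary, while the screen restores the scannability that voice lacks.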
Another complement to voice is gesture. We use gestures all the time in human spoken communication, and they provide a second channel of information to modify the content of our speech. Gestures are a useful input mechanism when combined with voice. In Cisco Emerge, we’re using gestures combined with voice in TeamTV, our always-on TV channel for distributed teams. The combination is a useful way to provide a control interface to a video system without requiring a keyboard or a control panel. A wake wave activates the voice control system. This is a good way to avoid having to constantly listen to wake words, which break up the flow of human interactions.
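A wake wave can be modeled as a tiny state machine: speech is ignored until the gesture detector opens a listening window. This is a sketch of the idea only — the class and gesture names are invented for illustration, and real gesture detection sits upstream.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()       # microphone input ignored; only the gesture detector runs
    LISTENING = auto()  # wake wave seen; speech is now interpreted as commands

class WakeWaveController:
    """Minimal state machine: a wave gesture opens a one-command listening window."""
    def __init__(self):
        self.state = State.IDLE
        self.commands: list[str] = []

    def on_gesture(self, gesture: str) -> None:
        if gesture == "wave":
            self.state = State.LISTENING

    def on_speech(self, utterance: str) -> None:
        if self.state is State.LISTENING:
            self.commands.append(utterance)
            self.state = State.IDLE  # one command per wake wave

ctl = WakeWaveController()
ctl.on_speech("mute the room")   # ignored: no wake wave yet
ctl.on_gesture("wave")
ctl.on_speech("mute the room")   # accepted
```

Because the system only listens inside the window, ordinary conversation never needs to be peppered with wake words.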
With the improvement of technologies like computer vision, it is becoming practical to combine voice interfaces with context from an interaction. One example is using facial recognition to determine who is present in a meeting. In multi-user interactions, the voice system can then recognize which person is issuing a given command, and if a person’s name appears in an utterance, the system knows who is being referred to. This enables richer interactions and brings AI capabilities closer to the content.
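The attribution logic can be reduced to a simple resolution rule: a name spoken in the utterance wins; otherwise the command refers to the recognized speaker. Everything here is a hypothetical sketch — the participant list is assumed to come from an upstream face-recognition step that is not shown.

```python
# Assumed output of an upstream face-recognition step: who is in the room
participants = {"alice": "seat-1", "bob": "seat-2"}

def resolve_target(utterance: str, active_speaker: str) -> str:
    """Return the participant an utterance refers to.

    A participant's name in the utterance takes priority; otherwise
    the command is attributed to the recognized active speaker.
    """
    words = utterance.lower().split()
    for name in participants:
        if name in words:
            return name
    return active_speaker

named = resolve_target("give bob the floor", active_speaker="alice")
implicit = resolve_target("share my screen", active_speaker="alice")
```

Even this naive rule shows why vision context matters: without knowing who is speaking, “share my screen” is ambiguous.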
As speech recognition and NLP/NLU improve, the content of conversations will become available to voice-based systems. This will enable a slew of new improvements, where needs can be met in more responsive and anticipatory ways. Instead of purely issuing commands to voice systems, we’ll be able to conduct conversations with each other while voice systems extract actionable content from these conversations to act upon. And yes, it is almost impossible to speculate about the future these days without sounding like a Black Mirror episode.
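To make the idea of “extracting actionable content” concrete, here is a deliberately naive stand-in for real NLU: flag transcript lines that open with commitment phrases. The cue list and function are invented for illustration; a production system would use a trained intent model, not string matching.

```python
# Crude commitment cues standing in for a real NLU intent model
ACTION_CUES = ("i'll", "i will", "let's", "can you")

def extract_actions(transcript: list[tuple[str, str]]) -> list[str]:
    """Flag (speaker, utterance) pairs that sound like commitments or requests."""
    actions = []
    for speaker, utterance in transcript:
        lowered = utterance.lower()
        if any(lowered.startswith(cue) for cue in ACTION_CUES):
            actions.append(f"{speaker}: {utterance}")
    return actions

meeting = [
    ("Ana", "The launch slipped a week."),
    ("Ben", "I'll update the release notes."),
    ("Ana", "Can you also ping the beta testers?"),
]
todos = extract_actions(meeting)
```

The point is the interaction pattern: people talk to each other, and the system quietly harvests the to-do list instead of waiting to be commanded.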
Aside: Virtual Reality Interfaces
When designing virtual reality apps that involve creating content, a common weakness is the lack of a convenient input method for text. Using a keyboard while in VR is problematic. Voice interfaces provide a partial solution to this limitation. In Spark VR, we’ve created an experimental build with an integrated AI system that responds to voice input. So far it’s a promising way to interact with the system.
The improvements made in voice interfaces in the last decade are a cause for celebration. Finally it’s possible to leverage a natural human communication mode for our interactions with machines. We’re clearly just scratching the surface of what is possible, and it’s my hope that the design of future systems will leverage voice in a more natural and effective way.
About Cisco Emerge
At Cisco Emerge, we are using the latest machine learning technologies to advance the future of work.
Find out more on our website.
Note: This article is adapted from a talk I gave at Leaders Meet: Innovation in London in January.