The Paradox of Voice User Interfaces

I’ve been thinking lately about how more and more consumer products are incorporating Voice User Interfaces (VUIs). It makes sense: using our voice feels natural, since it’s how most of us communicate with each other every day. So why not communicate with computers the same way?

Despite the recent wave of VUIs, the desire to talk to computers is not so new. Their development began almost 40 years ago, long before the internet as we know it existed.

I came across a short video from the MIT Media Lab, from 1979, demonstrating a voice- and gesture-based interaction. The video shows a man sitting in a chair, pointing a remote at a large projected screen. He uses the remote to move a cursor on the screen, then uses his voice to tell the computer to draw a shape where the cursor is. He draws a few shapes in different colors, then directs the computer to move some of them around the screen relative to one another.

“Put That There” https://youtu.be/RyBEUyEtxQo
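What makes the demo work is a neat trick: each recognized word is paired with wherever the pointer was aimed at the moment the word was spoken, so deictic words like “that” and “there” resolve to concrete screen positions. Here’s a rough sketch of that pattern in Python. It’s purely speculative; none of these names come from the Media Lab’s actual system.

```python
# Speculative sketch of the "Put That There" pattern: the recognizer
# hands us an utterance as (word, pointer_position) pairs, so words
# like "that" and "there" can be resolved to screen coordinates.

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

shapes = []  # each shape: {"kind": str, "color": str, "pos": (x, y)}

def handle(utterance):
    """Handle one spoken command, given as (word, pointer_pos) pairs."""
    words = [w for w, _ in utterance]
    if not words or words[0] != "put" or "there" not in words:
        return  # a real system would cover many more commands
    there = utterance[words.index("there")][1]
    if "that" in words and shapes:
        # "put that there": move the shape nearest to where the
        # pointer was aiming when "that" was spoken.
        at = utterance[words.index("that")][1]
        nearest = min(shapes, key=lambda s: dist(s["pos"], at))
        nearest["pos"] = there
    else:
        # "put a large green circle there": create a new shape
        # where the pointer was aiming when "there" was spoken.
        shapes.append({"kind": words[-2], "color": words[-3], "pos": there})
```

Everything downstream of the recognizer is simple bookkeeping; the hard part is getting the right words out of the audio in the first place.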

At the end of the video, the demonstrator tells the computer to “put a large green circle there.” He points the remote towards the left side of the screen, and a large red circle appears. Oops.

What’s striking to me about this demonstration is how similar it feels to using one of today’s voice-controlled products. The frustration the demonstrator experiences at the end of the video when the computer misunderstands him is all too familiar.

It’s frustrating when a computer misunderstands you, just as it’s frustrating when another human misunderstands you. We already experience that friction in our day-to-day lives, and here we are building it into our products. Talk about the paradox of making technology more human.

There’s more hiding in this paradox. Misunderstandings often come from not being specific enough with our words, yet the real power of voice user interfaces comes from exactly that: specific language. Instead of clicking through multiple menus or layers of a UI to select an option, we can speak a single precise command. Our voice gives us direct access to every action. And yet it’s often difficult to be entirely specific with words.
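To make the “direct access” point concrete, here’s a toy comparison. The command table and the `doc` object are invented for illustration; nothing here is a real product’s API.

```python
# Toy illustration: a voice command names its action directly,
# while a GUI reaches the same action by traversing menus.

actions = {
    "export as png": lambda doc: doc.export("png"),
    "export as pdf": lambda doc: doc.export("pdf"),
    "rotate ninety degrees": lambda doc: doc.rotate(90),
}

# GUI path to the same action, roughly:
#   File > Export > Choose format > PNG > Confirm
# VUI path: say "export as png". The utterance *is* the path.

def run_voice_command(doc, utterance):
    action = actions.get(utterance.strip().lower())
    if action is None:
        # The flip side of direct access: anything off-script fails.
        raise ValueError(f"Didn't understand {utterance!r}")
    return action(doc)
```

The flat lookup is the whole appeal, and also the whole problem: every action is one utterance away, but only if you produce the exact words the system expects.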

For example, think about the interface for a color picker. In a graphical user interface, it doesn’t take much to select a color. But without any sort of visual interface, how would you describe that color with your voice? There are over 16 million colors in the RGB color space, yet we have only 50, maybe 100, everyday words to describe them.
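The arithmetic makes the mismatch stark. Here’s a quick sketch; the seven-word vocabulary is obviously illustrative, not anyone’s real palette.

```python
# 24-bit RGB encodes 256**3 = 16,777,216 distinct colors. With a
# generous spoken vocabulary of 100 color words, each word has to
# cover roughly 168,000 of them.

NAMED = {
    "red":    (255, 0, 0),
    "green":  (0, 128, 0),
    "blue":   (0, 0, 255),
    "yellow": (255, 255, 0),
    "purple": (128, 0, 128),
    "white":  (255, 255, 255),
    "black":  (0, 0, 0),
}

def nearest_name(rgb):
    """The best a purely verbal color picker can do: snap an
    arbitrary color to the closest named one."""
    return min(NAMED, key=lambda n: sum((a - b) ** 2 for a, b in zip(NAMED[n], rgb)))

print(256 ** 3)                      # 16777216 colors
print(256 ** 3 // 100)               # 167772 colors per word, at 100 words
print(nearest_name((70, 130, 180)))  # steel blue comes back as "purple"
```

Whatever color you have in mind, the round trip through words throws most of it away.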

In the video, the demonstrator names a few basic colors. He’s probably limited to a preset list. But if he weren’t, it would be pretty difficult to describe an exact color in the RGB space. And whose fault is that? The human’s, for having only a handful of words with which to describe one of 16 million colors? Or the computer’s, for not being able to infer what the human wants from the context of the vocal command?

I suppose it may be neither’s fault. Picking colors is certainly a contrived example where VUIs obviously fall short, since colors are visual, not aural. But it does show that the limits of our interfaces are not always defined simply by what we can and cannot program them to do. Sometimes it’s our own limits as humans, like our inability to name 16 million colors, that keep us from communicating effectively.

Almost 40 years later, we’re still searching for the best ways to incorporate VUIs into our products. I don’t believe we’ve gotten it right just yet, but once they understand the difference between green and red, well… things can’t be too tough from here.