The case for voice interactions

Ilya Belikin
4 min read · Feb 11, 2018

--

I think one could be much more productive talking to a computer.

And this is not about conversational assistants like Siri, Alexa, Cortana, or the one from Google. I think personification and the attempt to imitate human conversation make for a nice niche product. It is good to have in your car, at your kitchen counter, and on the go. But voice interactions are a much broader topic.

Touch is natural for humans, and its introduction made interactions with computers simple, intuitive, and fast. Even with arguably bad UI design, devices with a responsive touch-based interface are delightful.

My grandpa figured out how to use WhatsApp on a cheap Android tablet back in his village in the Russian Far East, and now we have the miracle of being instantly connected. (I just shared photos of the Falcon Heavy launch with him!)

Like touch, voice is natural for humans. It is fast and expressive. Spoken words, put in context with a look, an emotion, and a gesture, are the primary way humans interact with each other.

But typing is faster

Is it? Have you ever felt that you should stop sending messages and just call the person already?

Context is the key. The bandwidth of speech is low, but we use rich context to pack a lot of information into a short conversation: we establish a common language, share stories and references to introduce high-level concepts, and exploit emotional variety.

We value humans who have the intelligence to read a room and act appropriately in a cultural context. We appreciate those who can understand us from just one look, one gesture, or a fragment of a word.

We will cherish computers that can do the same.

Contextual understanding

The modern computer can see where you are looking (with Face ID or Windows Hello) and feel a gentle or a decisive touch (with 3D Touch). And of course, it knows what you are doing.

Knowing the context, it could make sense of what you are saying. Look at the sound icon on your laptop and say “mute”; no need to click. Tap an empty spot on the canvas and say “rect” to create a rectangle, add “fill it red” to make it red, keep your finger down and say “duplicate” to get a copy.

Use a soft, calm voice. It is a smart machine, and it is close to you; there is no need to shout at it. It knows you and reacts to your voice; it ignores background noise and even colleagues chattering nearby.
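To make the idea a bit more concrete, here is a minimal sketch of what such contextual command dispatch could look like. Everything in it is hypothetical: the Context and Action types, the command phrases, and the pairing of a recognized word with whatever you are looking at or touching are illustrations, not a real API.

    // Hypothetical sketch of contextual voice commands.
    // None of these types come from a real framework; they only
    // illustrate pairing a recognized word with the current context.

    enum Context {
        case soundIcon                          // gaze rests on the volume control
        case emptyCanvas(x: Double, y: Double)  // finger on a blank spot of the canvas
        case shape(id: Int)                     // finger held on an existing shape
    }

    enum Action {
        case mute
        case createRectangle(x: Double, y: Double)
        case fill(shapeId: Int, color: String)
        case duplicate(shapeId: Int)
        case ignore
    }

    // The same short word means different things in different contexts,
    // and a word without a matching context is simply ignored.
    func dispatch(_ phrase: String, in context: Context) -> Action {
        switch (phrase, context) {
        case ("mute", .soundIcon):
            return .mute
        case ("rect", .emptyCanvas(let x, let y)):
            return .createRectangle(x: x, y: y)
        case ("fill it red", .shape(let id)):
            return .fill(shapeId: id, color: "red")
        case ("duplicate", .shape(let id)):
            return .duplicate(shapeId: id)
        default:
            return .ignore
        }
    }

The point is not these particular names but the shape of the logic: context narrows what a word can mean, and anything that does not fit the context is safely ignored instead of misfiring.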

It is just not normal

Of course. Nothing new is normal while it is new. Some people still sincerely ridicule the look of AirPods. But once you see a subway car with four people in a row wearing them…

People were appalled by wireless headsets and the “they are talking to themselves” look. Or by selfies. Or bicycles. Or glasses. Or trains. We adapted to all of that and consider it part of the norm today.

I think we will get used to mumbling to our computers just the same.

It is noisy

A typical open-plan office is already a noisy place because humans talk to each other from time to time as well. That is why many prefer to sit with headphones on. (Which also have microphones… I am just saying.)

Having built software products for the last decade, I can tell you that many designers and engineers mumble to their computers anyway. Today it is mostly useless nagging and complaints, though.

(In my experience, users often say what they think to their computers too, and it would only be fair to deliver those remarks to software authors. Finally, the WTF meter* could be implemented properly to close the feedback loop.)

Noise isolation is a big problem of open-plan office layouts, and it should be addressed with proper materials, enough space, and well-designed furniture. Then both casual conversations between colleagues and a constant chat with computers will be more bearable.

Who could deliver this?

I think the two obvious leaders are Apple and Microsoft. Both control the full stack and can deliver this next generation of human-computer interactions.

Recent releases from Apple include Face ID and Core ML, enabling attention tracking as well as on-device, real-time voice recognition. These seem to me like the building blocks of the exact future I am talking about.
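On the speech side specifically, Apple’s Speech framework already exposes live transcription to third-party apps. Below is a rough Swift sketch, assuming an app that has already obtained microphone and speech-recognition permissions; whether recognition runs fully on the device depends on the OS version and language, so treat that part as an assumption rather than a guarantee.

    import Speech
    import AVFoundation

    // Rough sketch: stream microphone audio into SFSpeechRecognizer
    // and print partial transcriptions as they arrive.
    final class VoiceCommandListener {
        private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
        private let audioEngine = AVAudioEngine()
        private let request = SFSpeechAudioBufferRecognitionRequest()

        func start() throws {
            request.shouldReportPartialResults = true

            // Feed microphone buffers into the recognition request.
            let inputNode = audioEngine.inputNode
            let format = inputNode.outputFormat(forBus: 0)
            inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
                self.request.append(buffer)
            }

            recognizer?.recognitionTask(with: request) { result, error in
                if let result = result {
                    // In a real product this string would go through the
                    // contextual dispatcher sketched earlier.
                    print(result.bestTranscription.formattedString)
                }
                if error != nil || (result?.isFinal ?? false) {
                    self.audioEngine.stop()
                    inputNode.removeTap(onBus: 0)
                }
            }

            audioEngine.prepare()
            try audioEngine.start()
        }
    }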

The emergence of augmented- and mixed-reality applications will demand voice input. Just as touch did not immediately kill the pointer-and-keyboard desktop, augmented reality is not going to kill screens too fast either. Instead, they will all co-exist and influence each other profoundly.

The way we interact with computers will evolve and become even more personal and even more human.

Next

Check out The case against voice interactions, a short follow-up where I talk a bit more about Personal Assistants and the role they play in the emergence of Natural Interfaces.

References

  1. Put That There by Chris Schmandt (1979)
  2. Open offices are overrated by Vox
  3. *WTF/m by Thom Holwerda
  4. The Mother of All Demos by Douglas Engelbart (1968) — coherent experience of human-computer interactions.
  5. Dynamicland, a similar modern-day project by Bret Victor and Alan Kay.
  6. How to Invent the Future I — CS183F by Alan Kay.

The last three references are foundational for any HCI conversation, so I decided to add them here despite there being no direct mentions.

--

Ilya Belikin

Founder of Posit, a network of design practice for good. Hong Kong.