Voice vs. Visual Interface

Anurag Gaggar
A little more conversation
3 min read · Apr 28, 2017

Smartphones have had a profound impact on our lives (chances are you're reading this on one right now). However, they are now widely regarded as having reached the top of their S-curve, and changes in smartphones over the last couple of years have been more incremental than disruptive. AI, Machine Learning, and Virtual and Mixed Reality (MR) are the most popular buzzwords in any discussion of the next big thing. In one such conversation at the lunch table today, we were discussing voice interfaces and how they have been getting better and more popular of late. I have been using the Amazon Echo for a few months now and thought I'd pen down my experience with voice interfaces. Where do they score over a visual interface (say, a smartphone touchscreen), and where do they lose out?

Image Source: Amazon.com

Voice recognition has improved tremendously and has become genuinely usable of late. Small gains in recognition accuracy make an outsized difference in the experience (90% vs. 95% word accuracy is a huge difference), because errors compound across the words in a command. In my experience with the Echo and an Indian English accent, the accuracy has not been at a level that offers a frustration-free experience all the time. It gets a full command right about 7 times in 10, even though the accuracy of recognizing individual words is probably closer to 90% (Google works better for me). However, this is one area that is bound to keep getting better.
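
To see why word-level accuracy understates the problem, here is a back-of-the-envelope sketch (in Python, with hypothetical numbers) that assumes each word in a command is recognized independently:

```python
# Back-of-the-envelope: if per-word recognition errors were independent,
# the whole command succeeds only when every word is recognized correctly.
def command_accuracy(word_accuracy: float, words_per_command: int) -> float:
    """Probability that an entire spoken command is recognized correctly."""
    return word_accuracy ** words_per_command

for word_acc in (0.90, 0.95):
    print(f"{word_acc:.0%} per word -> "
          f"{command_accuracy(word_acc, 4):.0%} for a 4-word command")
# 90% per word -> 66% for a 4-word command
# 95% per word -> 81% for a 4-word command
```

Under this (admittedly simplified) model, 90% word accuracy lands right around the 7-in-10 command success rate described above, while a seemingly small bump to 95% pushes it past 8 in 10.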

A voice interface beats a visual interface for repetitive tasks: tasks that don't require back and forth, where you know what you intend to do and don't need to put much thought into it. For instance: setting a timer in the kitchen, ordering a pack of blades for your razor, or ringing your phone when you can't find it. You can do these tasks hands-free, while doing something else in parallel.
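
What makes these one-shot commands a good fit is that a single utterance maps to one intent plus a few parameters, with no dialogue state to carry between turns. A minimal sketch of that idea (hypothetical handlers, not any real assistant's API):

```python
import re

# Hypothetical one-shot command router: each utterance maps to exactly one
# intent (a regex here) plus its slots, with no dialogue state between turns.
HANDLERS = {
    r"set a timer for (\d+) (second|minute|hour)s?":
        lambda m: f"Timer set for {m.group(1)} {m.group(2)}s.",
    r"ring my phone":
        lambda m: "Ringing your phone now.",
}

def handle(utterance: str) -> str:
    for pattern, action in HANDLERS.items():
        match = re.fullmatch(pattern, utterance.lower().strip())
        if match:
            return action(match)
    return "Sorry, I didn't catch that."

print(handle("Set a timer for 10 minutes"))  # Timer set for 10 minutes.
print(handle("Ring my phone"))               # Ringing your phone now.
```

Real assistants use trained models rather than regexes, of course, but the interaction shape is the same: one utterance in, one action out, no follow-up required.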

However, visual interfaces score over voice in many ways:

  1. Voice commands are not ideal in a public setting. They work well in the context of your house or car, but not in a meeting room or office.
  2. Your eyes seem to have a higher-priority channel to your brain than your ears do. Think about why people close their eyes when they meditate, sing, or listen to music. I don't know the innards of the nervous system, but I'd guess it is much harder for the brain to ignore what the eyes are reporting than what the ears are.
  3. Visual information persists, and you can consume it at your own pace without issuing a command to anything external. A voice narration plays out linearly; if you miss a portion of it, you have to explicitly ask for it again. Even as these interfaces evolve, controlling your own reading will always be faster and easier than controlling an external device that is emitting sound.
  4. A picture is worth a thousand words. Similarly, one view of a smartphone screen can communicate far more information than a voice interface can in a few seconds (see the rough estimate after this list). This is why going back and forth on a voice interface is tedious.
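
On that last point, a back-of-the-envelope comparison of the two channels' information rates, assuming a typical speaking pace of about 150 words per minute and a text-heavy phone screen showing on the order of 200 words at once (both figures are ballpark assumptions):

```python
# Rough information-rate comparison; all figures are ballpark assumptions.
SPEECH_WPM = 150        # typical conversational / text-to-speech pace
WORDS_PER_SCREEN = 200  # order of magnitude for a text-heavy phone screen

seconds_of_speech = 5
spoken_words = SPEECH_WPM / 60 * seconds_of_speech
print(f"~{spoken_words:.0f} words spoken in {seconds_of_speech} s, "
      f"vs. ~{WORDS_PER_SCREEN} words visible in a single screen view")
# ~12 words spoken in 5 s, vs. ~200 words visible in a single screen view
```

That is more than an order of magnitude, before even counting what layout, images, and color convey on a screen.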

Devices like the Amazon Echo and Google Home are becoming mainstream. "Google", "Siri", and "Alexa" could soon be among the most frequently spoken words. However, the visual interface ain't going anywhere. It will be interesting to see how these two formats merge and how MR scales up. Keep your eyes and ears open!
