The Now and Future of Speech Interfaces

An exploration via prototypes and thought

We have been dreaming about speech as an interface for a long time. It’s not an easy problem, but we are finally close to making it a reality.

Doesn’t he look so happy talking to a computer? Joaquin Phoenix from the movie Her (2013)

To build a product with a kickass speech interface, you need two things:

  1. Audio Hardware: A good microphone and speaker. This could be a device for the home, car or office. It could also be a smart headset or earplugs to be used while commuting, working out or walking the dog.
  2. Speech Recognition Software: A speech interface needs to understand what users are talking about. The first step is having a high quality speech to text engine. The second step is understanding intent from the resulting text. The third step is enabling developers to build on this platform a new world where we talk to objects and, in a few years, cannot imagine living without doing so.

Over the last couple of weeks, I have explored how far along we are in this journey and what is possible when building a speech interface today. I built an iPhone app, using Google’s recently released Speech Recognition API, that lets users issue voice commands to remember their thoughts.

As a shortcut to intent recognition, I expect users to want to remember a book, movie/tv show or a place when they issue commands starting with read, watch or visit respectively. Given an intent, I use the respective search API from either Goodreads, Open Movie Database (OMDB), or Yelp to fetch content. For example — “watch The Big Short” will fetch the movie “The Big Short” using the OMDB API and save it locally.
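The verb-prefix shortcut above can be sketched in a few lines of Python. This is a minimal illustration, not the app's actual code: the `INTENTS` table, the intent labels and the `parse_command` function are names assumed here, and the search calls to Goodreads, OMDB or Yelp are left out.

```python
from typing import Optional, Tuple

# Hypothetical mapping from the leading verb to (intent label, search service).
INTENTS = {
    "read": ("book", "Goodreads"),
    "watch": ("movie_or_tv_show", "OMDB"),
    "visit": ("place", "Yelp"),
}


def parse_command(transcript: str) -> Optional[Tuple[str, str, str]]:
    """Return (intent, service, query), or None if the command is not understood."""
    words = transcript.strip().split(maxsplit=1)
    if len(words) < 2:
        # Either empty input or a verb with nothing to search for.
        return None
    verb, query = words[0].lower(), words[1].strip()
    if verb not in INTENTS:
        return None
    intent, service = INTENTS[verb]
    return intent, service, query


print(parse_command("watch The Big Short"))
```

Anything that does not start with one of the three verbs returns `None`, which is where a real app would have to ask the user to rephrase.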

Speech to Text is pretty much solved. Text to Intent is still a hard problem.

Speech to Text

Due to massive investments by Google, Amazon, Microsoft and Apple, I think this first step of having a high quality speech to text engine is pretty much solved. Alexa is able to transcribe my speech to text very well.

Google’s Speech API (while still in beta!) is VERY accurate and trained to transcribe text even in noisy environments. It expects developers to provide raw, unprocessed audio data and gives them a lot of flexibility in how they use the API. For example, you can ask it to detect single or multiple statements, provide streaming audio input or upload individual files, and ask it to signal the end of a single utterance, the end of speech or the end of audio. The only time this API didn’t do well was when I was speaking with a TV show playing in the background.

Text to Intent

Now that speech to text accuracy is high, I expect a lot of investment in improving intent detection from text in the near future. You can get quite far with a simple rule-based intent detection system. From the pattern of commands my Amazon Echo expects, I’m guessing Alexa uses many rules in its intent detector. Building a highly versatile machine learning solution for intent detection is going to be difficult, simply because in many cases the same text can have very different ‘correct’ meanings.
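To make the ambiguity concrete, here is a small sketch of a rule-based detector built from command templates with named slots. The two rules and the `interpretations` function are hypothetical and only illustrate the idea; they are not how Alexa or any other product is known to work.

```python
import re

# Hypothetical rules: (intent name, command template with named slots).
RULES = [
    ("play_song_by_artist",
     re.compile(r"^play (?P<song>.+) by (?P<artist>.+)$", re.IGNORECASE)),
    ("play_title",
     re.compile(r"^play (?P<title>.+)$", re.IGNORECASE)),
]


def interpretations(text):
    """Return every (intent, slots) pair whose template matches the text."""
    results = []
    for intent, pattern in RULES:
        match = pattern.match(text.strip())
        if match:
            results.append((intent, match.groupdict()))
    return results


for intent, slots in interpretations("play Stand by Me"):
    print(intent, slots)
```

Both rules match "play Stand by Me": one reads it as the song "Stand" by the artist "Me", the other as the title "Stand by Me". The rules cannot tell which reading the user meant, so a real system needs context or a ranking step on top.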

One of the main problems that makes <text> parsing so challenging is that human languages show remarkable levels of ambiguity. It is not uncommon for moderate length sentences — say 20 or 30 words in length — to have hundreds, thousands, or even tens of thousands of possible syntactic structures. A natural language parser must somehow search through all of these alternatives, and find the most plausible structure given the context … Humans do a remarkable job of dealing with ambiguity, almost to the point where the problem is unnoticeable; the challenge is for computers to do the same.
 — via Google Research Blog

There are several ongoing efforts in this area, including services that let you train Natural Language Processing (NLP) models based on your rules and training data. The founders of Siri are working on a new company called Viv that looks like a much more context-aware version of Siri. Google has been launching new NLP APIs and Open Source initiatives over the past few months.

The Future Of Speech Interfaces

A new interface modality is a massive driver in creating new experiences or significantly improving upon existing clunky-yet-functional experiences. It enables new use cases that would otherwise have been too hard. For example, direct manipulation afforded by touch screens in smartphones made several existing experiences like Communication, Maps and Self Expression much easier and more useful. It also led to several entirely new experiences, like the musical instrument app Ocarina by Smule.

Similarly, adoption of speech interfaces could lead to improved experiences of two kinds:

  • A speech interface for existing apps: All existing apps could add a speech interface that accepts quick commands. A music app could play a song, a shopping app could buy things that come to mind or a communication app could send a friend a message on your behalf while you are cooking dinner. In fact, many quick tasks that today require me to take my phone out of my pocket, fiddle with a passcode screen and finally open the app could easily be done through a speech interface for that app. This would help people be more present and less distracted.
  • Unlock new experiences: Speech is a great interface to issue commands to a computer while doing other things. Speech could be a fantastic interface in an immersive virtual reality environment. Speech interfaces could also help us cross language barriers at work and during travel. Speech could also help us formulate complex search queries much better than typing, and hence make it easier to retrieve information.

A speech interface needs to be quick, non-interruptive and accurate. The ideal is a fire-and-forget experience that lets users issue complex commands, guesses the right intent and executes the command, all within milliseconds.

It is an exciting future and I can’t wait to live in it.

Thanks to Gaurav Dosi, Chinmay Jain and @prup for reading drafts and providing feedback.