The Art Of Conversation Is The Tipping Point

Dominic
Jun 7, 2016

Earlier this week Mary Meeker of KPCB presented and published a 213-slide dossier on Internet trends for 2016. A good chunk of this was devoted to the rise of ‘Voice as Computing Interface’, which I found particularly fascinating.

Mary’s presentation centred largely on voice and voice recognition, suggesting that the inflexion point for broader adoption of the voice interface might be when speech recognition hits 99% accuracy, up from its current 95% rate, citing Andrew Ng, Chief Scientist at Baidu:

“As Speech recognition accuracy goes from say 95% to 99%, all of us in the room will go from barely using it today to using it all the time. Most people underestimate the difference between 95% and 99% accuracy — 99% is a game changer…

No one wants to wait 10 seconds for a response. Accuracy, followed by latency, are the two key metrics for a production speech system…”

I wholeheartedly agree with these points on accuracy and latency, but I want to propose an alternative game changer. Firstly, let me explain why I agree:

In 2010, when I began looking at Natural Language & Speech Recognition, there were very few ASR (Automatic Speech Recognition) SDKs available. The two best choices were Nuance Dragon and Google Voice. Back then Nuance was far superior on accuracy, and Google Voice was lightning quick.

These ASRs evolved quite quickly through statistical learning, and Google Voice had matched or beaten Nuance on accuracy within about 18 months. (I still found Nuance proved inaccurate for some of my colleagues with an accent, or those who over-pronounced words; here’s a funny video illustrating the issue.) Nevertheless, Nuance ASR is impressive.

Improving Accuracy

One of the improvements we saw over those early months was ASRs returning a number of possible ‘recognitions’, each with a confidence score. This allows us to improve on the ASR output by parsing all returned recognitions above a certain confidence score and correcting mistakes before sending the result forward for further Natural Language processing to perform an action. By doing a little NL preprocessing with true language rules, you can also capture and correct other mistakes.

To illustrate, consider the words raining & reigning. If I ask Siri “Who is reigning in the UK at the moment?” this is interpreted literally as “Who is raining in the UK at the moment?” and I’m presented with a list of search results about the weather.

By capturing these subtle differences using a combination of analytics and true language rules, this can be corrected to “who is reigning in the UK at the moment” and answered with a more useful set of search results about Queen Elizabeth II.
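To make the approach concrete, here is a minimal sketch in Python: take the n-best recognitions with their confidence scores, apply a few true-language homophone rules, and pass the best candidate forward for NLP. The ASR output format, the rule table and the function names are hypothetical, and a real system would lean on far richer analytics.

```python
# A minimal sketch, not production code: the ASR hypothesis format,
# the homophone table and the confidence threshold are all assumptions.

# Homophone corrections keyed on context words that suggest the fix.
HOMOPHONE_RULES = [
    # (misrecognised word, correction, context words hinting at the correction)
    ("raining", "reigning", {"who", "king", "queen", "monarch", "throne"}),
]

def correct_homophones(text):
    """Apply true-language rules to fix likely ASR homophone errors."""
    words = text.lower().split()
    corrected = []
    for word in words:
        replacement = word
        for wrong, right, context in HOMOPHONE_RULES:
            if word == wrong and context.intersection(words):
                replacement = right
                break
        corrected.append(replacement)
    return " ".join(corrected)

def pick_best_recognition(hypotheses, min_confidence=0.6):
    """Parse every recognition above a confidence threshold, correct
    obvious mistakes, and return the best candidate for NLP."""
    candidates = [
        (confidence, correct_homophones(text))
        for text, confidence in hypotheses
        if confidence >= min_confidence
    ]
    return max(candidates)[1] if candidates else None

# Made-up ASR output for the example above:
hypotheses = [
    ("who is raining in the uk at the moment", 0.92),
    ("who is reigning in the uk at the moment", 0.71),
]
print(pick_best_recognition(hypotheses))
# -> "who is reigning in the uk at the moment"
```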

Improving Latency

As Andrew Ng points out, there is plenty of latency here, particularly if your requests are crossing continents to foreign servers. A typical request and response via mobile can have six streams going back and forth between the phone and the internet before the phone can respond to the user:

  1. Phone Sends Audio Stream to ASR Engine
  2. ASR Engine Sends Back Text Recognition(s) To Phone
  3. Phone Sends Text Recognitions to NLP Engine
  4. NLP Engine Sends Back Response To The Phone
  5. Phone Sends NLP Response To TTS Engine
  6. TTS Engine Sends Audio Back To Phone
  7. Phone Responds To User

Now there are ways to reduce this latency/lag by connecting the ASR & TTS directly to the NLP engine, or by placing the NLP engine ‘on device’. I’ve experimented with both. The latter produces a much cleaner architecture, is far easier to implement and manage, and provides great results.
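As a rough illustration of the difference, here is a back-of-the-envelope latency model for the two architectures. The per-hop and per-engine figures are illustrative assumptions, not measurements of any real system.

```python
# A back-of-the-envelope model, not a benchmark: every figure below is
# an assumption chosen purely to illustrate the architectural trade-off.

NETWORK_ROUND_TRIP_MS = 150                           # assumed phone <-> cloud round trip
PROCESSING_MS = {"asr": 300, "nlp": 100, "tts": 200}  # assumed engine processing times

def cloud_pipeline_ms():
    """Phone talks to the ASR, NLP and TTS engines separately: three round trips."""
    return 3 * NETWORK_ROUND_TRIP_MS + sum(PROCESSING_MS.values())

def on_device_nlp_ms():
    """NLP runs on the phone, so only the ASR and TTS round trips remain."""
    return 2 * NETWORK_ROUND_TRIP_MS + sum(PROCESSING_MS.values())

print(cloud_pipeline_ms())   # 1050 ms with these assumed figures
print(on_device_nlp_ms())    # 900 ms with the NLP round trip removed
```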

I agree that the last 4% Andrew Ng talks about will be the hardest piece to crack, but I don’t see voice recognition being the tipping point. Here’s why:

Life-like Conversations Will Be The Tipping Point

On slide 116 Mary Meeker suggests “Voice = Should Be Most Efficient Form Of Computing Input”

Amazon Echo Is A Well-Played Bet

If you’ve been watching Apple’s patent filings closely, you’ll have noticed a ‘Siri in a box’ patent a few years ago, and the most obvious box for that was Apple TV. I’d always envisioned connected microphones embedded in all manner of home items, from decorative ornaments to light switches.

Jeff Bezos’s bet on Echo/Alexa is near perfect — right product, right time. Controlling the gateway for the user is all important in order to provide the best possible user experience, and Echo is the perfect product for doing this in the home. It’s also a brilliant piece of kit!

These Virtual Assistants are well suited as platforms for third parties too. All the knowledge in them is a content play, and as such needs a diverse group of contributors for it to become both broad and deep, and thus useful. That’s far too big a task for one organisation, and would stifle innovation. The Amazon Echo playbook of a home device + SDK is probably the best I’ve seen so far for a VA.

Voice is really only one piece of the puzzle, and whilst the above slide alludes to it being required, Natural Language Processing is the key piece. Moreover, ‘conversational Natural Language’ will become the tipping point.

For Alexa to become truly embedded in our lives three things need to happen:

1. Conversations Across Devices:

https://www.youtube.com/watch?v=GWpWnRDuUAw

Without heading into a features fist fight: generally all these VAs (Siri, Alexa, Google) are great at delivering results for simple instructions on any particular device (e.g. search, directions, etc.), but I’ve yet to see anyone emulate what I’d always set out to do with Indigo: carry the conversation across devices. (more on that here)

This is important because we are so ‘on-the-go’ these days, and a mobile is not necessarily the ideal device for every occasion; the convenience of Alexa in the home has taught us this. When these conversations and actions can be carried across any device in the home, office, car or your hand, they will become truly ubiquitous and a useful part of daily life.
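One way to picture this is a dialogue state that lives in a store shared by all of a user’s devices, so whichever device hears the next utterance can pick the conversation up mid-flow. The sketch below is hypothetical and glosses over sync, identity and privacy, which are the genuinely hard parts.

```python
# A hypothetical sketch of cross-device conversation state; every name
# here is invented for illustration.
from dataclasses import dataclass, field

@dataclass
class ConversationState:
    topic: str = ""
    slots: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

# In practice this would be a cloud-synced store; a plain dict stands in here.
SHARED_STATE = {}

def handle_utterance(user_id, device, utterance):
    """Whatever the device, append to the same conversation state for this user."""
    state = SHARED_STATE.setdefault(user_id, ConversationState())
    state.history.append((device, utterance))
    return state

# Start the conversation on the Echo in the kitchen, continue on the phone:
handle_utterance("dominic", "echo", "order my usual pizza")
state = handle_utterance("dominic", "phone", "actually make it a large")
print(state.history)
# -> [('echo', 'order my usual pizza'), ('phone', 'actually make it a large')]
```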

2. A supplementary interface for other input types:

Input via Text, Gesture & IoT Sensors Are Important

The always-on voice features are great in the home setting, but there are times when a text-based interface would be much more useful (i.e. when you don’t want to be overheard, or seen to be talking to a device). Before too long, gesture and other IoT sensor input will become possible, and thereby useful, in the same way we are using GPS, gyroscope, and light-sensor inputs from mobile phones right now in our apps.

The best UIs hide the complexity of an action and present the user with the easiest possible path to it. For example, Siri will often present the user with two tap-button options before performing an action like deleting a reminder, as this is far easier for the user than going through the speech process again. Therefore other input types are needed.

This doesn’t mean Alexa is ultimately limited. With conversations carried across devices, Alexa could easily pair with other devices that have a screen, or other methods of input to contribute intelligence to the conversation or provide visual responses.

3. More Conversations, Fewer Instructions:

Here’s the crucial piece. Currently most VAs are managing instructions through a limited set of NLP functions like ‘slot filling’. So an instruction like “Alexa, order a pizza” leaves a number of slot variables open (what size pizza, what type of pizza, what time you would like it delivered). It’s easy enough to go back to the user with simple questions in order to fill these slots, or even to infer the variables from other data (like previous orders). But slot filling in this way is not true disambiguation or discussion, which is what leads to much greater intelligence, the completion of more complex tasks, and thereby greater usefulness.
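For illustration, here is roughly what that slot-filling pattern looks like, including inferring open slots from a previous order before falling back to follow-up questions. The slot names, questions and order history are made up for the example.

```python
# A rough sketch of slot filling for "Alexa, order a pizza".
# The slots, questions and previous order are invented for illustration.

PIZZA_SLOTS = ["size", "type", "delivery_time"]

FOLLOW_UP_QUESTIONS = {
    "size": "What size pizza would you like?",
    "type": "What type of pizza would you like?",
    "delivery_time": "What time would you like it delivered?",
}

PREVIOUS_ORDER = {"size": "medium", "type": "margherita"}  # assumed order history

def fill_slots(slots):
    """Infer what we can from previous orders, then ask about anything still open."""
    for name, value in PREVIOUS_ORDER.items():
        slots.setdefault(name, value)
    return [FOLLOW_UP_QUESTIONS[name] for name in PIZZA_SLOTS if name not in slots]

slots = {}                # nothing extracted from "order a pizza" itself
print(fill_slots(slots))  # -> ['What time would you like it delivered?']
print(slots)              # -> {'size': 'medium', 'type': 'margherita'}
```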

To become truly more useful, these VAs need to handle more of ‘the mechanics of conversation’ than they do now. That means understanding context rather than pair-word matching through a semantic taxonomy. It means handling topic switching whilst maintaining context, and it means having in-context memory (both in-session & multi-session).

Language itself has a defined set of rules, e.g. grammar, yet Siri will interpret “Should I check the weather?” and “check the weather” in exactly the same way, because it is merely slot filling and using simple word classification to construct an understanding.
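As a toy example of the difference, even a crude grammar cue (a modal question versus an imperative) plus a little in-session memory lets a system treat those two utterances differently. This is a deliberately simplistic, hypothetical sketch, not a description of how any shipping assistant works.

```python
# A toy, hypothetical sketch: real conversational NL needs far more than this.

session_memory = {"topics": []}   # in-session context the VA can refer back to

def classify_utterance(text):
    """Crude grammar cue: a modal question versus an imperative instruction."""
    lowered = text.lower().rstrip("?").strip()
    if lowered.startswith(("should i", "could you", "do i need to")):
        return "question"
    return "instruction"

def handle(text):
    session_memory["topics"].append("weather")   # remember what was discussed
    if classify_utterance(text) == "question":
        # A question deserves advice or a yes/no answer, not just a raw result.
        return "Rain is forecast this afternoon, so yes, it's worth a look."
    # An instruction is simply carried out.
    return "Here is today's forecast: light rain, 14°C."

print(handle("Should I check the weather?"))
print(handle("check the weather"))
```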

Language is relatively finite too; few new words are added to the Oxford English Dictionary each year. From a dictionary you can start to build classification (like a thesaurus), rules for grammar, and so on. Once you have these building blocks of language in place, these systems can become more sophisticated, and at that point we can build more AI & ML into the solutions to deliver greater automation and prediction.

When Siri, Alexa, and Google Home can work seamlessly across devices and maintain contextual understanding across an entire conversation, we’ll be cooking with gas!

…and that’s exciting.

Related Topics:

Hey Siri, Let me Introduce You

