Supporting Infrastructure

Leor Grebler · Social Robots · Aug 23, 2016

Behind every Internet-connected, voice-interactive device is a huge amount of infrastructure needed to support it. As users, we make a request to Siri or Google Now and get the correct response back in hundreds of milliseconds.

This requires the coordinated effort of hundreds of machines running cloud computing algorithms. For the first part, a speech recognition system such as the one behind Google Now might use your local language settings, your search history, your contacts, maybe even your email content, to optimize a language model specifically for you and then compare it against a larger population model. Or voodoo. Then, the most likely response is streamed back and updated on the fly with a cool text visualization. Maybe 500 ms for a result?
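One rough way to picture that personalization step: interpolate a per-user language model with a population model when scoring candidate transcripts. A minimal sketch in Python, where the weighting scheme and the `log_prob` interface are illustrative assumptions, not Google's actual pipeline:

```python
# Hypothetical sketch: blend a per-user language model with a
# population model when scoring candidate transcripts.

def score_transcript(candidate, user_lm, population_lm, user_weight=0.3):
    """Log-linear interpolation of two language-model scores.

    user_lm and population_lm are assumed to expose a log_prob(text)
    method; user_weight controls how hard the personal model
    (contacts, search history, etc.) pulls the result.
    """
    return (user_weight * user_lm.log_prob(candidate)
            + (1.0 - user_weight) * population_lm.log_prob(candidate))

def best_transcript(candidates, user_lm, population_lm):
    # Pick the candidate the blended model scores as most likely.
    return max(candidates,
               key=lambda c: score_transcript(c, user_lm, population_lm))
```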

For this to even start, the Android device and the Google server need to establish a secure tunnel to stream the audio. This too requires dedicated infrastructure to be on standby.
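The client side of that tunnel can be pictured as a TLS socket pushing small audio chunks as they're captured. A sketch with a hypothetical endpoint (the host, port, and framing are placeholders, not Google's protocol):

```python
# Illustrative client side of a secure audio stream: open a TLS
# connection and push raw audio chunks to a hypothetical endpoint.

import socket
import ssl

HOST, PORT = "speech.example.com", 443  # placeholder endpoint
CHUNK_SIZE = 3200  # ~100 ms of 16 kHz, 16-bit mono audio

def stream_audio(audio_chunks):
    context = ssl.create_default_context()  # verifies the server cert
    with socket.create_connection((HOST, PORT)) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname=HOST) as tls:
            for chunk in audio_chunks:
                tls.sendall(chunk)  # encrypted in transit
```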

After the speech recognition, the natural language engine has to parse the text into an intent. Again, more machines with pre-built classifier engines are called up (building a classifier for an NLU engine can also take a huge amount of memory). Once a classifier is built, the per-query effort is smaller, but it's still not insignificant.
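A toy version of that classification step, using a scikit-learn pipeline as a stand-in (the real engines are far larger, but the shape is similar):

```python
# Toy intent classifier: map recognized text to an intent label.
# The training examples and labels here are invented for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = [
    "what's the weather in denver",
    "call mom",
    "set a timer for ten minutes",
    "will it rain tomorrow",
]
intents = ["weather", "phone_call", "timer", "weather"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(utterances, intents)  # the memory-hungry build step

print(classifier.predict(["what's the weather like today"]))
# expected: ['weather']
```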

Once the server returns an intent for the request, it needs to fulfill that request. Is the request a local function (e.g. make a phone call)? Is it a query for information? Does it require the user to disambiguate?
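That routing decision could be sketched as a simple dispatcher; the intent names and slot fields below are illustrative:

```python
# Sketch of intent routing: local device function, remote query,
# or a case where the user needs to disambiguate.

LOCAL_INTENTS = {"phone_call", "timer", "volume"}

def route(intent, slots):
    if intent in LOCAL_INTENTS:
        return ("local", intent, slots)          # handle on the device
    if slots.get("ambiguous"):
        return ("disambiguate", intent, slots)   # ask the user to clarify
    return ("remote_query", intent, slots)       # fetch from a service

print(route("weather", {"location": "Denver"}))
# -> ('remote_query', 'weather', {'location': 'Denver'})
```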

Then, the query needs to be made, the results parsed/formatted for language (e.g. “The weather in Denver is 60 degrees and sunny” vs “location=Denver; temp=60; weather=sunny”), and then they need to be rendered to audio through speech synthesis.
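That last mile, using the weather example above, is a small templating step before a speech synthesizer takes over (the TTS call below is a placeholder):

```python
# Turn structured results into a spoken sentence before handing
# them to speech synthesis.

def render_weather(result):
    """Format machine-friendly fields into natural language."""
    return (f"The weather in {result['location']} is "
            f"{result['temp']} degrees and {result['weather']}.")

result = {"location": "Denver", "temp": 60, "weather": "sunny"}
sentence = render_weather(result)
print(sentence)  # -> The weather in Denver is 60 degrees and sunny.
# synthesize(sentence)  # hypothetical TTS step
```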

Google is already working on hardware specifically designed for this purpose. New chips could reduce the infrastructure requirements for speech searches. Doubtless, there's been a lot of optimization of these processes.

Amazon, Apple, and Nuance, among others, have developed infrastructure similar to the above. Even for the Ubi, where we could use API calls, we still needed to put together a significant amount of infrastructure to support these requests and their responses. It was not trivial.

Companies looking to go it alone or rebuild something similar need to consider the extraordinary efforts that were made over the past ten years to get voice interaction to this level. In fact, voice technology is one of those very clear demonstrations of Moore’s law and exponential growth — where technology is used as a tool to create new tools to improve performance.

Next time you make a query to any of these voice engines, pause to reflect on the immense amount of computing power you're about to call up. (But don't pause too long — it'll be detected as end of speech.)
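That end-of-speech detection is itself a small piece of the pipeline. A toy endpointing sketch, with the energy threshold and frame counts as illustrative values rather than any vendor's:

```python
# Toy end-of-speech detector: declare the utterance over after
# enough consecutive low-energy audio frames.

import struct

def rms(frame):
    # Root-mean-square energy of a 16-bit little-endian PCM chunk.
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    return (sum(s * s for s in samples) / max(len(samples), 1)) ** 0.5

def end_of_speech(frames, silence_threshold=500.0, max_quiet_frames=8):
    quiet = 0
    for frame in frames:
        quiet = quiet + 1 if rms(frame) < silence_threshold else 0
        if quiet >= max_quiet_frames:
            return True  # a long enough pause: end of utterance
    return False
```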
