Chatty devices, by Sandy van Helden

How voice assistants seemingly came from nowhere

If you weren’t paying attention, you’re probably wondering why, all of a sudden, we’re conversing with devices the way we do with other people. Instead of clicking buttons on a screen, in 2016 you can simply talk out loud in a room and have a speaker with a voice assistant inside understand both what you said and what you meant.

This article is a part of the “Do you speak human?” lab — enabled by SPACE10 to explore conversational interfaces and AI. Make sure you dive into the entire publication.

Over the last few years, inventions such as Apple’s Siri, Google Assistant and Microsoft Cortana have turned the way we think about talking to our devices on its head.

Siri, which launched in 2011 along with the iPhone 4S, was incredible at the time: an assistant that could understand you, and even crack jokes — how could a computer be so charming?

“Siri understands context allowing you to speak naturally when you ask it questions, for example, if you ask “Will I need an umbrella this weekend?” it understands you are looking for a weather forecast.”
 — iPhone 4S launch, 2011

Fast-forward to now and devices like Amazon’s Echo and Google Home, which are small speakers that sit on your countertop with powerful microphones, are able to hear you from anywhere in the room and understand what you want, then carry out that task within a second or two.

Saying “OK Google, dim the lights” and it actually happening was something of a pipe dream just a few years ago — but now it’s technology all of us can own for just over a hundred dollars.

Your new best friend.

Amazon Echo and Google Home are the personification of voice assistants: they give them a spot in the home where you can interact with them throughout your day, rather than treating them as just a tool on your pocket computer. That’s the key to starting a long journey toward our dystopian Her-like future.

Slow, but steady progress

How did we get here, exactly? It’s been a long, slow road. As far back as 1970, IBM already had a computer that could take in a simple sentence and understand at least the words themselves, with one catch: it took over an hour to crunch the data.

The biggest problem, other than sound quality, is that a sentence could quite literally start with any word, and a five-word sentence drawn from a vocabulary of 20,000 words could have 3.2 x 10²¹ possibilities. In other words, it’s a huge task to figure out what you actually said before even tackling what you mean.
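That figure is plain exponentiation: each of the five positions in the sentence can hold any of the 20,000 words. A quick check in Python, assuming words may repeat and order matters (the function name is just for illustration):

```python
def sentence_possibilities(vocabulary_size: int, sentence_length: int) -> int:
    """Count ordered word sequences, allowing the same word to repeat."""
    return vocabulary_size ** sentence_length

count = sentence_possibilities(20_000, 5)
print(f"{count:.1e}")  # 3.2e+21
```

Add a sixth word and the count grows by another factor of 20,000, which is why guessing by brute force was never going to work.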

Nuance is one of the biggest speech recognition and text-to-speech companies in the world, and just happens to be the company that provided the technology for Siri. Siri itself actually started out as a military-funded project, but was eventually spun out into its own company.

But what’s interesting about Nuance isn’t just that it’s a speech recognition company, but that it’s also heavily invested in artificial intelligence, which is required to interpret the words you’re saying.

Neural networks set out to replicate the brain’s learning system, by Sandy van Helden

Google, Amazon and Microsoft are also making enormous investments in artificial intelligence — or neural networks to be precise — to help understand what you really mean from the millions of possibilities in every sentence.

The problem with computers is they’re not very good at understanding reason or context like humans are. The order of words in a sentence can drastically alter its meaning, but a computer can’t intuitively know why that’s the case. Neural networks address that by allowing computers to make sense of the world by training themselves on what they see.

I didn’t teach Google what a beach is, but it knows somehow

A good example of this in action is Google Photos: it uses a neural network to learn the contents of your photos, then lets you search based on the things shown in them without you ever tagging a single one.

If you search “beach” you’ll get every photo of a beach in your collection — that’s because Photos has seen so many photos of a beach that it’s learnt what it is. Creepy? Maybe. Useful? Absolutely.

Neural networks learn from incoming data, using algorithms to train themselves to understand the world better. So the more you use your voice assistant, the smarter it gets. Neural networks, however, take immense resources to crunch data, far more than is available in your phone.
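To make “training itself” a little less mysterious, here’s a minimal sketch of a single artificial neuron (a perceptron) learning a rule from labelled examples instead of being told it. Everything here is a toy assumption: the two yes/no features (has_sand, has_sea) and the labels are made up, and real systems like Google Photos use vastly larger networks fed with raw pixels.

```python
def predict(weights, bias, features):
    """Fire (1) if the weighted sum of the inputs is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total > 0 else 0

def train(examples, epochs=10):
    """Perceptron rule: nudge the weights whenever a guess is wrong."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            weights = [w + error * x for w, x in zip(weights, features)]
            bias += error
    return weights, bias

# Hypothetical labelled photos: (has_sand, has_sea) -> is it a beach?
examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train(examples)
predictions = [predict(weights, bias, features) for features, _ in examples]
print(predictions)  # [0, 1, 1, 1]
```

Nobody wrote an “if sand or sea” rule anywhere in that code. The neuron started out knowing nothing and settled on the right answers after a few passes over the examples.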

Where’s the real breakthrough, then? It lies in two places that converged, seemingly all at once: bandwidth and cloud hosting.

Voice processing was always difficult in the past because your machine’s hardware alone wasn’t fast enough to actually crunch the data needed, let alone learn from its experiences — and until recently, your connection was probably too slow to send that somewhere else to be processed.

In 2016, with high-speed LTE and domestic internet connections the norm, the equation has flipped. Now you’ve got a big enough, always-on internet pipe that’s able to send your voice data in the blink of an eye. The bandwidth problem has, for the most part, been solved.
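For a rough feel of the difference, here’s a back-of-the-envelope sketch. The numbers are assumptions rather than anything a vendor has published: a five-second clip of uncompressed 16 kHz, 16-bit mono audio, sent over a 56 kbps dial-up line versus a 10 Mbps connection.

```python
def upload_seconds(clip_seconds: float, link_bits_per_second: float,
                   sample_rate_hz: int = 16_000, bits_per_sample: int = 16) -> float:
    """Time to send an uncompressed mono clip, ignoring latency and overhead."""
    clip_bits = clip_seconds * sample_rate_hz * bits_per_sample
    return clip_bits / link_bits_per_second

dial_up = upload_seconds(5, 56_000)        # assumed 56 kbps modem
broadband = upload_seconds(5, 10_000_000)  # assumed 10 Mbps connection
print(f"dial-up: {dial_up:.1f} s, broadband: {broadband:.3f} s")
# dial-up: 22.9 s, broadband: 0.128 s
```

Real assistants compress and stream audio, so actual figures will differ, but the gap between “go make a coffee” and “blink of an eye” is the point.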

In the last decade there’s been a fundamental shift in how computing is done in business. Before, if you wanted to build an online service you’d need to buy server hardware, rent a physical space, pay an internet provider for access and a whole lot more.

Now, thanks to Amazon Web Services, you can have access to the most powerful computer hardware in the world without getting up from your couch, for a few cents an hour. That happens to be perfect for running a voice assistant, even on a massive scale.

Siri, if you strip away the fancy interface, is actually a product of Moore’s Law: the rule in computing that the number of transistors in a chip doubles about every two years. The only reason it was released when it was, and not sooner, is that previously the computing power simply wasn’t available to deliver a meaningful, conversational experience.
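Doubling every two years sounds modest until you compound it. A small sketch of the arithmetic, assuming a clean two-year doubling period (real chips only roughly follow the rule, and the function name is made up for this example):

```python
def transistor_growth(years: float, doubling_period_years: float = 2.0) -> float:
    """How many times larger the transistor count is after `years`."""
    return 2 ** (years / doubling_period_years)

print(transistor_growth(10))  # 32.0
print(transistor_growth(20))  # 1024.0
```

Ten years buys you a chip 32 times denser, and twenty years buys you one about a thousand times denser.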

In other words, computers needed to get powerful enough to both understand you and synthesise a sentence in just a few seconds. Comparing the processing power it takes to crunch a Siri query with that of a normal web search puts that in perspective:

“The computational resources required for a single [Siri] query is in excess of 100 times more than that of traditional web search.”

Data, data, data

One other problem stood in the way of voice assistants getting off the ground: they didn’t actually know anything at all. Each assistant starts out from zero, and needs to be trained.

Data processing is done in the cloud, by Sandy van Helden

If you used early voice recognition software like Dragon Dictate, which was released in 1990, you probably remember that to actually have it understand you, it was necessary to read phrases to the computer for hours on end. For modern assistants that’s actually still somewhat true, but how it’s done might surprise you.

As mentioned earlier, the more voice data a company has to crunch, the better the assistant. Google used an interesting method to grow its own database of voice samples, likely without you even knowing it: a free telephone directory service called GOOG-411.

Marissa Mayer, then a Google VP, explained at the time, “The speech recognition experts that we have say, ‘If you want us to build a really robust speech model, we need a lot of phonemes, which is a syllable as spoken by a particular voice with a particular intonation…1-800-GOOG-411 is about that: Getting a bunch of different speech samples so that when…we’re trying to get the voice out of video [or other tasks requiring voice recognition], we can do it with high accuracy.”

Another problem was the microphones themselves: it’s pretty hard for a computer to understand what you’re saying with all that background noise.

Far-field microphones, used in both Google Home and Amazon Echo, are special, powerful arrays of microphones that let devices zero in on your voice regardless of where you are in a room, or what background noise there might be.

The concept of microphone arrays and far-field technology isn’t new, but the algorithms used to detect, as well as follow, your voice in a room are new. High-quality audio is fundamental to getting the computer to understand the query in the first place, as well as reducing confusion or incorrect queries.
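The article doesn’t say which algorithms Echo or Home actually run, so treat the following as an illustration of one classic textbook idea, delay-and-sum beamforming, not a description of either product. Sound from the direction you care about reaches each microphone at a slightly different time; shift each channel to undo that delay and average them, and the voice lines up while sound from elsewhere does not. The signals and delays below are invented, and chosen so the interfering hum cancels exactly, which a real room would never be kind enough to do.

```python
def delayed(signal, samples):
    """Shift a signal later in time by `samples`, padding the start with zeros."""
    return [0] * samples + signal[:len(signal) - samples]

def delay_and_sum(mic_signals, steering_delays):
    """Advance each channel by its steering delay, then average the channels."""
    usable = len(mic_signals[0]) - max(steering_delays)
    output = []
    for t in range(usable):
        aligned = [signal[t + d] for signal, d in zip(mic_signals, steering_delays)]
        output.append(sum(aligned) / len(aligned))
    return output

# Hypothetical "voice" and an alternating background hum.
voice = [0, 2, 4, 2, 0, -2, -4, -2, 0, 0]
hum = [1 if t % 2 == 0 else -1 for t in range(len(voice))]

# Mic A hears both immediately; mic B hears the voice 2 samples late
# and the hum 1 sample late, because they come from different directions.
mic_a = [v + h for v, h in zip(voice, hum)]
mic_b = [v + h for v, h in zip(delayed(voice, 2), delayed(hum, 1))]

# Steer toward the voice by undoing its known delays (0 and 2 samples).
focused = delay_and_sum([mic_a, mic_b], [0, 2])
print(focused)  # [0.0, 2.0, 4.0, 2.0, 0.0, -2.0, -4.0, -2.0]
```

The output matches the original voice with the hum gone. Working out those delays on the fly, as you move around the room, is the hard part that newer algorithms tackle.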

Amazon was the first company to use one of these powerful microphone arrays to solve the voice quality part of the equation, and everyone else is rapidly following suit: you can already find them in other products like Google Home and Sense.

Not quite there yet

Now that you know Siri and Google Assistant are akin to black magic, it’s time to pull back the curtain: they’re still not very good.

It’s easy to flummox Siri by asking it a question like “What’s the best Thai food around here?” and then following up with “How far away is that?” Siri still has no idea what “that” refers to, because it’s already forgotten your previous search.

Google Assistant, however, is one step ahead of Siri in this regard. You could ask it “What’s that movie with Jennifer Lawrence?” Then, when it answers, you could immediately follow up and ask “What year was that released?” You’ll get the correct answer.
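The difference comes down to whether the assistant keeps any state between turns. Here’s a deliberately tiny sketch of that idea. It is not how Siri or Google Assistant actually works; the class, its methods and the example subjects are all hypothetical, and real systems resolve references with far more than a word swap.

```python
class Conversation:
    """Toy dialogue state: remembers the subject of the last answer."""

    def __init__(self, remember_context: bool):
        self.remember_context = remember_context
        self.last_subject = None

    def answer(self, subject: str) -> None:
        """Record what the assistant just talked about, if it keeps context."""
        if self.remember_context:
            self.last_subject = subject

    def resolve(self, query: str) -> str:
        """Swap the word 'that' for the remembered subject, if there is one."""
        if self.last_subject is None:
            return query
        words = [self.last_subject if w == "that" else w for w in query.split()]
        return " ".join(words)

forgetful = Conversation(remember_context=False)
forgetful.answer("the Thai restaurant")
print(forgetful.resolve("how far away is that"))
# how far away is that

attentive = Conversation(remember_context=True)
attentive.answer("the movie")
print(attentive.resolve("what year was that released"))
# what year was the movie released
```

The forgetful version is left staring at the word “that” with nothing to attach it to, which is exactly the follow-up failure described above.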

Assistant’s short-term memory is better than Siri’s, but neither tool knows enough about you yet to be really useful. Google’s own Assistant team gave a great example of how easy it is to take the magic away:

Right now, the Google Assistant will perform to expectations if you ask it to book a table at a Mexican restaurant near you. But if you ask it for a table at “one of my usual places,” you’re taking a Thelma and Louise drive into the Flummoxed Valley.
“Sorry,” it will say, “I can’t help with that.”

What does usual mean? Assistant knows where you’ve been lately, and how far you might be willing to drive, but there are so many permutations of what “usual” could imply based on time of day or location that it’s a confusing query.

The key is understanding the true meaning of places, things and context. Google knows where you are, where you’ve been, how far you generally travel, how long you spend at home, when you last went overseas and a whole lot more — but it doesn’t really know what any of those signals mean.

What’s the biggest challenge to getting there, then? According to Scott Huffman, VP of Engineering for Google Assistant, it’s making the assistant useful enough that people keep using it:

“Honestly, the challenge for us is going to be to have enough of the conversational capability — which we think we do — to convince people to keep doing it.”

The holy grail of voice assistants is exactly this: understanding the true meaning of disparate data. We’ve gotten to the point where they can figure out what we’re saying, and talk back to us, but they still don’t know how to connect all of the dots together.

If you look at how far we’ve come since Siri first debuted in 2011 it’s actually pretty incredible: when Siri launched, you’d say your query, wait for a while, then get an answer.

Now you’re able to talk to a voice assistant in real time and get a response within a second or two. It only took five years to get to this point, and it’s already normal.

Advancements in processing power, bandwidth and neural networks themselves have facilitated this transition — and another five years from now we’ll find it hard to understand how we ever lived without voice assistants.