Alexa Voice Service and the Arrival of Voice

Leor Grebler
Published in Social Robots
Jun 14, 2016 · 4 min read

Today, Apple opened up Siri to third-party developers, following Google and Amazon, who've both been at it for well over a year.

We’re finally about to see the coming of age of voice as a primary interface with the Internet. It’s been several hard years and billions of dollars spent, but there are key indicators of this impending arrival, one being the adoption of Alexa Voice Service.

Alexa Voice Service was opened to developers in June 2015.

A quick primer: Alexa Voice Service (AVS) is an API from Amazon that lets hardware makers build voice interaction into their products and offer the same interaction as the Amazon Echo. It was launched nearly a year ago, and in the coming months we'll see the first hardware products with it integrated hit the market, such as Triby (a fridge-mounted speaker) and the Pebble Core (now on Kickstarter).

While various speech APIs have been around for over a decade, what makes AVS different is that it combines several elements of voice interaction into one API.

For voice interaction, the elements are typically:

  1. Wake up word detection
  2. Speech recognition (aka speech-to-text or automatic speech recognition)
  3. User intent identification (sometimes as basic rules and more often now as natural language understanding)
  4. Fulfillment (e.g. integrations with different Internet services or connected devices)
  5. Acknowledgement (sometimes as beeps, sometimes as speech synthesis or text-to-speech)
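To make that division of labor concrete, here's a rough sketch in Python of what a device maker would traditionally have to wire together on their own. Every function is a stub standing in for a component you'd build or license; none of it refers to a real vendor API.

```python
# Illustrative voice-interaction pipeline with stubbed components.
# Each function is a placeholder for something the device maker would
# have to source or build; none refers to a real vendor API.

def detect_wake_word(audio: bytes, keyword: str = "alexa") -> bool:
    """1. Wake word detection (runs locally and continuously on-device)."""
    return True  # stub: a real detector scores the audio against the keyword

def transcribe(audio: bytes) -> str:
    """2. Speech recognition (speech-to-text)."""
    return "play jazz in the kitchen"  # stub transcription

def parse_intent(text: str) -> dict:
    """3. User intent identification (simple rules here; often NLU in practice)."""
    if text.startswith("play "):
        return {"action": "play_music", "query": text[len("play "):]}
    return {"action": "unknown", "query": text}

def fulfill(intent: dict) -> str:
    """4. Fulfillment (call a music service, smart-home device, etc.)."""
    if intent["action"] == "play_music":
        return f"Playing {intent['query']}."
    return "Sorry, I can't help with that."

def synthesize_speech(reply: str) -> bytes:
    """5. Acknowledgement (text-to-speech, or just a beep)."""
    return reply.encode("utf-8")  # stub: a real TTS engine returns audio

def handle_utterance(audio: bytes) -> bytes:
    if not detect_wake_word(audio):
        return b""
    return synthesize_speech(fulfill(parse_intent(transcribe(audio))))

print(handle_utterance(b"\x00" * 16000))  # fake 1-second, 16 kHz audio buffer
```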

What’s different about AVS is that it does the last four of these five steps. This takes a lot of the hard work out of integrating voice into a product. Rather than needing to build all of these components, a hardware maker just needs to call one API.

This represents huge savings when it comes to implementing voice as a feature. If a speaker maker, for example, wants to add voice interaction to its existing Internet-enabled speaker line, it doesn't need to build integrations with speech recognition services and natural language understanding engines. It doesn't need to build music integrations with Spotify, TuneIn, or other music services. It also doesn't need to implement text-to-speech services. These are all bundled together with AVS, and the speaker maker can offer its users the same features that an Amazon Echo user can access.
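At a high level, the bundled model looks like a single round trip: capture audio after the user triggers the device, send it along with an access token, and play back whatever audio comes back. The endpoint URL, field names, and multipart layout below are illustrative assumptions, not the actual AVS contract; Amazon documents the real authentication flow and message format separately.

```python
# Sketch of the "one API call" model. The URL, field names, and multipart
# layout are illustrative assumptions, not the actual AVS contract; consult
# Amazon's AVS documentation for the real authentication and message format.
import requests

AVS_LIKE_ENDPOINT = "https://voice-service.example.com/v1/recognize"  # placeholder URL

def ask_cloud_assistant(audio: bytes, access_token: str) -> bytes:
    """Send captured speech to a bundled voice service and return reply audio."""
    response = requests.post(
        AVS_LIKE_ENDPOINT,
        headers={"Authorization": f"Bearer {access_token}"},
        files={
            # Request metadata (device state, audio profile, etc.) plus raw audio.
            "metadata": ("metadata", '{"profile": "near-field"}', "application/json"),
            "audio": ("audio", audio, "audio/l16; rate=16000; channels=1"),
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.content  # reply audio to play through the device speaker
```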

However, there are still gaps to be filled in implementing AVS on a device.

First, the current API only supports push-to-talk functionality, meaning the user experience would be similar to the Amazon Echo Tap: pressing a button on the device and then speaking a command. Inevitably, device makers will want hands-free interaction like the Echo's, meaning that a local wake word detector will need to run on the device.
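A minimal sketch of that on-device piece, assuming a generic keyword-spotting engine: an always-on loop scores short audio frames locally and only opens a cloud session once the wake word fires. The score_frame function, threshold, and frame size are placeholders for whatever spotter a device maker licenses, not a specific product's API.

```python
# Minimal always-listening loop. score_frame() stands in for whatever
# keyword-spotting engine the device maker licenses; the threshold and
# frame size are illustrative values, not tuned recommendations.
import collections

FRAME_MS = 20           # analysis frame length
WAKE_THRESHOLD = 0.85   # spotter confidence needed to trigger

def score_frame(frame: bytes) -> float:
    """Stub for a local keyword-spotting model returning P(wake word)."""
    return 0.0

def listen_forever(mic, start_cloud_session):
    recent = collections.deque(maxlen=50)   # ~1 s of buffered context
    while True:
        frame = mic.read(FRAME_MS)
        recent.append(frame)
        if score_frame(frame) >= WAKE_THRESHOLD:
            # Only after the wake word fires locally is audio handed off
            # to the cloud service (e.g. AVS).
            start_cloud_session(prefix_audio=b"".join(recent), live=mic)
            recent.clear()
```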

The other challenge is adding far-field capabilities similar to the original Echo's, so the device can be spoken to from farther than arm's length. This means implementing digital signal processing, typically with multiple microphones, that boosts the signal of the speaker's voice while keeping environmental noise at the same or a reduced level. The Echo has a seven-microphone array and multiple processors on board to handle this. Device makers will need to find comparable technologies that won't significantly increase the bill of materials cost.
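One of the simplest DSP building blocks used for this is delay-and-sum beamforming: estimate how much later the talker's voice arrives at each microphone, align the channels, and average them so the voice adds up coherently while diffuse room noise partly cancels. The NumPy sketch below assumes the per-microphone delays are already known (e.g. from cross-correlation); a real far-field front end layers echo cancellation, noise suppression, and adaptive beam steering on top.

```python
# Delay-and-sum beamformer sketch (NumPy). Assumes the per-microphone delays
# toward the talker are already estimated, e.g. via cross-correlation; a
# production front end also does echo cancellation and noise suppression.
import numpy as np

def delay_and_sum(channels: np.ndarray, delays_samples: list[int]) -> np.ndarray:
    """channels: (n_mics, n_samples) array; delays_samples: per-mic delay toward the talker."""
    n_mics, n_samples = channels.shape
    aligned = np.zeros((n_mics, n_samples))
    for m in range(n_mics):
        d = delays_samples[m]
        # Shift each channel so the talker's wavefront lines up across mics.
        aligned[m, : n_samples - d] = channels[m, d:]
    # Averaging keeps the coherent voice at full level while uncorrelated
    # room noise drops by roughly 1/sqrt(n_mics).
    return aligned.mean(axis=0)

# Toy usage: two mics, the second hears the talker 3 samples later.
fs = 16000
t = np.arange(fs) / fs
voice = np.sin(2 * np.pi * 220 * t)
mic1 = voice + 0.3 * np.random.randn(fs)
mic2 = np.roll(voice, 3) + 0.3 * np.random.randn(fs)
enhanced = delay_and_sum(np.stack([mic1, mic2]), delays_samples=[0, 3])
```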

Other considerations a hardware maker will have to weigh when using AVS are branding and data ownership. With AVS, you get Alexa: her knowledge, her voice, and her nuances. There's no tailoring of the interaction. There's also no control over the data being sent to AVS or the replies coming back. This may affect how a device maker markets its Alexa-enabled product.

The tradeoff is that Amazon has already primed the market with four million Echo users who are familiar with Alexa and who might be more willing to have another Alexa-enabled device in their home rather than a Google or Apple device.

It's likely that other major players will aim to bundle their voice services with integrations. Currently, Google, Nuance, IBM, API.AI, Microsoft, and Houndify offer only discrete speech services. We'll probably see a few of these companies come out with complete voice service APIs that bundle all of the steps of voice interaction, such as a Cortana or Google Now API.

The next battle will be over how to incentivize device makers to add one service rather than another to their products.

Independent daily thoughts on all things future, voice technologies and AI. More at http://linkedin.com/in/grebler