Why doesn’t Alexa understand me?

A comparison of the leading voice platforms.

In this article, we compare the leading voice platforms and measure their ability to provide raw user input text in an open-domain context. We also measure their ability to provide raw transcription outside of their voice platform frameworks.


Introduction

Cuida Health is a leader in adapting voice-first technology for seniors. There are tens of thousands of voice user interface (VUI) applications currently available. However, the vast majority of these are simple call-and-response bots (e.g. order a pizza or buy some flowers), restricted and scoped to a very narrow domain. The user journey is linear, the number of contexts is small, and the intents are distinctive and well-defined. What would it take to go beyond these basics?


Motivation

At Cuida Health, we believe that VUIs will quickly evolve beyond simplistic functional tools, just as mobile apps quickly moved beyond drag-and-drop designs into richer engagements and interactive experiences. The annual Alexa Prize (1) to advance conversational artificial intelligence showcases the future of moving beyond transactional and into conversational VUIs.

Current tools from Amazon and Google for building apps for their smart speakers are geared towards transactional apps that can be easily encapsulated in a state machine, with a small number of discrete steps and defined transitions. However, natural human conversations flow more easily without such restrictions.

Consider the prototypical example for voice user interfaces, ordering a pizza. Can we define all the necessary intents and utterances? Even if we could, is this an ideal voice experience? Or is the user jumping through artificial hoops that are artifacts of the current toolkits?

<How many pizzas would you like to order?>
<What size pizza would you like?>
<What kind of sauce would you like?>
<What kind of crust would you like?>
<How much cheese would you like?>
<What crust flavor would you like?>
<What toppings would you like?>
<Is this for pickup or delivery?>
<What is your address?>

We believe that for many users, a more natural conversation will lead to a better experience. There is a large user experience gap between ordering by interrogation and ordering at a counter with a person. We think a more conversational dialogue will help bridge that gap and encourage richer experiences on these new voice platforms.

<How can I help you?>
Do you have any specials
<We currently have a $9.99 special for one medium single topping pizza. Are you interested?> 
What would you recommend for two people
<A dinner box for $19.99 is available. Would you like to order this combo?>
Nah. I’ll take two large pepperoni pan pizzas and a bottle of Pepsi
<Got it. Two large pepperoni pizzas with pan crust, regular cheese. And a two liter of Pepsi. Is that correct?>
Yep, and add sausages on one and mushrooms on the other

Supporting more natural conversation means moving beyond specific prompts with narrow domain responses towards open-ended questions with less constrained responses. In some cases, it is not ideal to construct every step of dialogue inside a model of strictly pre-defined intents. In conversational settings, an open-ended response is sometimes more natural, where the raw user input can be post-processed and refined by more customized and context-aware logic.


Methodology

[Figure 1 — Common NLP Components]

Alexa Skills Kit (ASK) Model

For capturing raw user input on Amazon's ASK, we investigated both the custom slot option and the newer AMAZON.SearchQuery option for unpredictable input (2,3). AMAZON.SearchQuery requires a carrier phrase, which is generally unavailable in open-ended scenarios, so we focused on custom slots. After defining these models, which include a single “CatchAll” intent with a single “FreeText” slot, we measured the performance of different combinations of samples and utterances.
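As a rough illustration (not our exact model), the shape of such an interaction model looks like the following, written here as a Python dictionary that is dumped to the JSON format ASK expects. The invocation name and the slot-type values are placeholders; in our experiments the slot values were sets of random sentences (see Results).

    import json

    # Minimal sketch: a single CatchAll intent whose only sample utterance is a
    # bare custom slot, so the slot absorbs whatever the user says.
    interaction_model = {
        "interactionModel": {
            "languageModel": {
                "invocationName": "my catch all skill",  # placeholder
                "intents": [
                    {
                        "name": "CatchAll",
                        "slots": [{"name": "FreeText", "type": "FREE_TEXT"}],
                        "samples": ["{FreeText}"],
                    },
                    {"name": "AMAZON.StopIntent", "samples": []},
                ],
                "types": [
                    {
                        "name": "FREE_TEXT",
                        # Custom slot values; we filled these with random sentences.
                        "values": [
                            {"name": {"value": "tell me something interesting"}},
                            {"name": {"value": "what is the weather like today"}},
                        ],
                    }
                ],
            }
        }
    }

    with open("interaction_model.json", "w") as f:
        json.dump(interaction_model, f, indent=2)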

Our fulfillment endpoint for this skill was hosted on AWS Lambda. The best approximation of what the user said is provided to the endpoint as the slot value for the intent.
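A stripped-down handler along these lines (a sketch rather than our production code) pulls that best-effort text out of the slot on the incoming IntentRequest:

    def lambda_handler(event, context):
        """Minimal Alexa fulfillment sketch: echo back the CatchAll slot value."""
        request = event.get("request", {})
        raw_text = ""
        if request.get("type") == "IntentRequest":
            slots = request.get("intent", {}).get("slots", {})
            # Alexa's best approximation of what the user said arrives as the slot value.
            raw_text = slots.get("FreeText", {}).get("value", "") or ""

        # Downstream, context-aware logic would post-process raw_text here.
        return {
            "version": "1.0",
            "response": {
                "outputSpeech": {"type": "PlainText", "text": "You said: " + raw_text},
                "shouldEndSession": True,
            },
        }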

[Figure 2 — ASK model]

Actions on Google Model

Compared with ASK, capturing raw user input on Actions on Google is more straightforward. We created a Dialogflow model with a Fallback Intent.

Our fulfillment endpoint for this action was hosted on Google Firebase. The best approximation of what the user said is provided to the endpoint as a property.
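The sketch below shows the same idea as a generic Python webhook rather than a Firebase function, assuming the Dialogflow v2 request format, in which the recognized text arrives as queryResult.queryText:

    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route("/fulfillment", methods=["POST"])
    def fulfillment():
        body = request.get_json(force=True)
        # Dialogflow v2 webhook requests carry the recognized text in queryResult.queryText.
        raw_text = body.get("queryResult", {}).get("queryText", "")
        # Hand the raw text to downstream, context-aware logic here.
        return jsonify({"fulfillmentText": "You said: " + raw_text})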

[Figure 3 — Dialogflow Model]

Datasets

To better approximate real-world conditions, we drew from two datasets of audio recordings from a variety of real speakers in various audio settings (4): Common Voice (CV) and LibriSpeech (LS).

The CV samples were generally noisier, frequently including background sounds, speakers with different accents and genders, and a wide range of audio levels. The LS samples were generally cleaner, consisting of audiobooks read by volunteers.

Both datasets provided reference transcripts for each recording. The CV sentences were geared towards short, conversational utterances. The LS sentences were longer and more structured, reflecting the written nature of the underlying content.

We converted samples into the preferred audio input format for both platforms: 16-bit linear PCM at 16 kHz, single channel, little-endian byte order (5,6).
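Any resampling tool can handle this step; as one example (ffmpeg is our assumption here, not a platform requirement), the conversion looks like:

    import subprocess
    from pathlib import Path

    def to_pcm16_mono_16k(src: Path, dst: Path) -> None:
        """Convert an audio sample to 16-bit little-endian PCM WAV, 16 kHz, mono."""
        subprocess.run(
            [
                "ffmpeg", "-y", "-i", str(src),
                "-ac", "1",              # single channel
                "-ar", "16000",          # 16 kHz sample rate
                "-acodec", "pcm_s16le",  # 16-bit signed little-endian PCM
                str(dst),
            ],
            check=True,
        )

    # to_pcm16_mono_16k(Path("sample.mp3"), Path("sample.wav"))  # illustrative paths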


Testing

For each environment under test, we sent the audio file to the respective service and recorded the text value handed to our endpoint. We compared this hypothesis from the model against the reference transcript, using word error rate (WER) as our metric (7). We then calculated accuracy across a range of WER thresholds to generate a performance curve, on which we can compare voice platforms and evaluate the impact of various model settings (intents, utterances, etc.) within a platform.
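Conceptually, the metric and the curve look like the following simplified sketch; real scoring would typically also normalize case and punctuation before comparison:

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: word-level edit distance divided by reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # Standard dynamic-programming edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    def accuracy_curve(pairs, thresholds):
        """Fraction of (reference, hypothesis) pairs at or below each WER threshold."""
        scores = [wer(ref, hyp) for ref, hyp in pairs]
        return [sum(s <= t for s in scores) / len(scores) for t in thresholds]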


Results

For skills trying to capture unrefined user input, Google does a better job of providing access to the raw text. Google Actions did not require us to define any language model beyond the fallback intent, and it passed the user input through more cleanly.

Amazon ASK has a built-in fallback intent, AMAZON.FallbackIntent (8). However, it does not expose the user input; it only indicates that an out-of-domain request occurred. AMAZON.SearchQuery was also unsuitable, since it requires pre-defined carrier phrases that negate the goal of capturing open-ended user input.

[Figure 4 — Common Voice dataset]
[Figure 5 — LibriSpeech dataset]

To try to improve Alexa's performance, we experimented with different variations of a CatchAll intent, attempting to create a language model that was as broad as possible so it would pass through unrefined user input. We tried creating slots with varying numbers of random sentences: a small set of 22 examples, a medium set of 600, and a large set of 10,000. There did not appear to be a strong correlation between the size of the random set for the custom slot and the skill's ability to pass through user input via that slot. We also tried various other schemes, such as a repeated gibberish word (9), to capture unconstrained user input.
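Generating those slot-value sets is a simple sampling step; a sketch, assuming the interaction-model dictionary from the earlier example:

    import random

    def catchall_slot_values(sentences, n, seed=0):
        """Sample n random sentences and format them as ASK custom-slot-type values."""
        picked = random.Random(seed).sample(sentences, n)
        return [{"name": {"value": s}} for s in picked]

    # Illustrative: we compared sets of 22, 600, and 10,000 sentences.
    # interaction_model["interactionModel"]["languageModel"]["types"][0]["values"] = \
    #     catchall_slot_values(corpus_sentences, 600)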

In the end, we were unable to devise any workarounds for Alexa ASK that would meaningfully increase a developer's ability, when writing a conversational skill, to access the user input in an unconstrained, open-domain context. An intent with a small set of randomly generated sentences performed comparably to the other, more complicated variations that we tested.

[Figure 6 — CatchAll on Common Voice]

Given its behavior in open-ended conversational settings, we tested how Alexa ASK performs in a constrained setting, which is probably more aligned with its intended transactional use cases (e.g. ordering pizza). Starting with exact matches, we then degraded the training utterances to measure the impact on performance as the training set shifted away from matching the audio exactly, moving from a purely in-vocabulary context towards an out-of-vocabulary context.

When the user input exactly matches the utterances the model was trained on, we see nearly perfect results. However, when the model is trained on utterances that are randomly shuffled variations, or in-sequence variations with dropped words, the skill's ability to accurately pass through the input is significantly degraded.

For example, if the skill was trained on a model containing the utterance “are you gonna throw a rock”, it would pass through the input correctly when it heard the audio file for “are you gonna throw a rock”. However, when trained on a model containing the transcripts in randomly shuffled word order, like “throw rock gonna you a are,” ASK would take the same audio file input and pass through “I’m gonna throw rock.” Similarly, when trained on a model containing the transcripts with 25% of the words dropped, like “are gonna a rock”, ASK passes through “are gonna a rock,” given the same audio file input.
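The degradations themselves are simple transformations of the reference transcripts; a sketch is shown below (the example outputs in the comments mirror the transcripts above rather than any particular random seed):

    import random

    def shuffled(utterance: str, seed=0) -> str:
        """Randomly reorder the words of a training utterance."""
        words = utterance.split()
        random.Random(seed).shuffle(words)
        return " ".join(words)

    def drop_words(utterance: str, fraction=0.25, seed=0) -> str:
        """Drop roughly `fraction` of the words, keeping the rest in order."""
        rng = random.Random(seed)
        words = utterance.split()
        kept = [w for w in words if rng.random() > fraction]
        return " ".join(kept) if kept else utterance

    # e.g. "are you gonna throw a rock" -> "throw rock gonna you a are" (shuffled)
    #                                   -> "are gonna a rock"           (words dropped)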

[Figure 7 — OOV on Common Voice]

So does Google simply have better underlying ASR technology? We don't think so. In a comparison of the standalone transcription services provided by Google and Amazon over the same dataset, it appears that Amazon offers a high-quality transcription service.
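As a hedged sketch of how such standalone transcription calls can be made with the vendor SDKs (boto3 and google-cloud-speech are our choices here; bucket, file, and job names are placeholders):

    import boto3
    from google.cloud import speech

    # --- Amazon Transcribe: asynchronous job over audio stored in S3 ---
    transcribe = boto3.client("transcribe")
    transcribe.start_transcription_job(
        TranscriptionJobName="cv-sample-0001",                        # placeholder job name
        LanguageCode="en-US",
        MediaFormat="wav",
        Media={"MediaFileUri": "s3://my-bucket/cv-sample-0001.wav"},  # placeholder URI
    )
    # Poll get_transcription_job() until it completes, then download the transcript
    # JSON from TranscriptionJob.Transcript.TranscriptFileUri.

    # --- Google Cloud Speech-to-Text: synchronous recognition ---
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    with open("cv-sample-0001.wav", "rb") as f:                       # placeholder path
        audio = speech.RecognitionAudio(content=f.read())
    response = client.recognize(config=config, audio=audio)
    hypothesis = response.results[0].alternatives[0].transcript if response.results else ""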

[Figure 8 — Transcription Services]

When we compare the performance of the transcription services against the voice platforms, Google Speech and Google Actions are fairly comparable, while Amazon Transcribe outperforms Alexa. It remains an open question whether there are alternative ways of retrieving the raw user input on Alexa that would perform comparably with the Transcribe service. AMAZON.FallbackIntent would seem to be a natural place to expose the raw user input for advanced applications, in a similar vein to the Fallback Intent on Google Actions. It seems likely that Amazon's Alexa Prize contestants have access to some form of raw dialogue for their conversational agents, so we hope these advanced access points will soon become available for all developers to build upon.

[Figure 9 — Transcription vs Voice Platform]

Analysis

  • Google is better at exposing the raw user input, within the scope of voice platforms for developers.
  • Amazon has better transcription as a technology service.
  • Trying to get ASK to pass through open-ended raw user input is an off-label exercise.

For VUIs seeking to include complex open-ended dialogue in unconstrained domains, Google Actions offers a voice platform that allows developers to easily obtain the raw text of what the user spoke. Having access to the raw ASR output, without having to define a specific language model, enables developers to build advanced features without being constrained by the particulars of the SDKs. For example, with the raw text, a voice agent could ask users open-ended feedback questions about certain features instead of robotic yes/no prompts, aggregate sentiment analysis on the responses, and proactively learn to customize dialogue based on the feedback.
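As a hedged illustration of that kind of post-processing (nothing the platforms provide out of the box; VADER via NLTK is just one convenient choice):

    # pip install nltk; then: python -m nltk.downloader vader_lexicon
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    def aggregate_sentiment(responses):
        """Average VADER compound score over raw, open-ended user responses."""
        sia = SentimentIntensityAnalyzer()
        scores = [sia.polarity_scores(text)["compound"] for text in responses]
        return sum(scores) / len(scores) if scores else 0.0

    # Illustrative feedback gathered via an open-ended prompt instead of a yes/no question.
    print(aggregate_sentiment([
        "I love the morning check in",
        "the reminders come a little too often",
    ]))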

Amazon ASK does not currently expose an analogous mechanism for elegantly acquiring what the user said. In closed domains and well-defined transactions, this does not pose as much of a drawback. It is fairly easy to scope out a pizza transaction into specific intents (toppings, crust, payment, confirmation, etc.). As applications move beyond informational phone-tree menus towards richer engagements, the inability to get the raw input starts posing more of a challenge.

While there are numerous approaches to defining models on ASK that appear to match broad domains, they do a much poorer job of passing through unfiltered user input and, as one would expect, are biased towards the sample utterances and slot values of the required language model. As the documentation indicates, the Alexa “recognizer guarantees coverage on all utterances specified by a skill developer.” (10)

It is unclear whether the structure of the Alexa ASR and NLU subsystems allows a developer to create models that minimize the recognition refinements and simply pass through the ASR output. We expect that Alexa Prize contestants have access to the ASR output for building conversational bots, perhaps via private APIs. However, none of our machinations for approximating raw input performed as well as we hoped.


Conclusion

For advanced voice interface applications, and within the scope of the Google Assistant and Amazon Alexa SDKs, Google exposes higher-quality raw user input text. Outside the constraints of their voice platforms, however, Amazon offers a higher-quality transcription service.

At Cuida Health, we believe that smart speakers are the initial wave of a broader shift towards voice interfaces. Given the novelty of voice platforms and most developers' unfamiliarity with natural language processing, it is not surprising that the first generation of VUIs is mostly transactional state machines.

We believe that even greater value will be unlocked once VUIs start supporting open-ended conversational engagements on top of basic transactional features. The experiences should be dynamic. The dialogue should be different each time. The agent should have a personality and be less scripted. Once users start to engage with VUIs rather than simply transact, a much broader set of possibilities opens up.


About Us

Cuida Health is a leader in optimizing voice-first technology for seniors.

Our mission is to help seniors age comfortably in their homes by staying connected to the community around them and focusing on both their emotional and physical health.

LiSA (Language Interface for Senior Adults) is an entertaining and engaging virtual persona that is a coach, a companion, and an in-home assistant, and works on both Amazon Echo and Google Home devices.