The heavy lifting required to build Alexa Skills & conversational interfaces.

Published in voiceflow · 8 min read · Aug 5, 2018

NLU (natural language understanding) isn’t necessarily the main challenge in moving from web to voice development — it takes thankless gruntwork behind the scenes to bulletproof your services.

Consider this very basic web form:

Slap in some client-side validation to check for a numeric value and you’re done. This problem’s been solved for decades.

Now let’s say you’re building an Alexa Skill and want to prompt for the same information. You hope it goes a little something like this…

Alexa (blue) prompting & user (black) replying.

You create an Intent which takes a couple of entities (aka “slots”) to capture the quantity and unit (metric vs. Imperial) and provide some sample utterances:

I weigh {QUANTITY} {UNIT}
I weigh {QUANTITY}
{QUANTITY} {UNIT}
{QUANTITY}
About {QUANTITY} {UNIT}
About {QUANTITY}
etc. etc.

Having Alexa or any NLU system understand this is quite straightforward.

But that’s not where the work lies. Users & Alexa have other plans…

Voice development is far less predictable

Actually, you could receive several things:

The user could say their weight without specifying the unit (“142.5”), or Alexa could fail to hear their weight correctly (“42.5”), or might correctly identify the unit but not the quantity (“42 pounds”).

Alexa might detect something which isn’t even a number and doesn’t make sense in this context (depicted by the ‘xxx?!’) although it could also be the user deliberately switching topic. Or there could be silence if the user doesn’t reply at all, or Alexa fails to detect their voice.

Very quickly you’ll learn that in voice development, you’re continually building around the unpredictability and unconstrained nature of the input. With the web, you can constrain the user’s input — in this case, to a numeric value (their weight) and one of two select options (“lbs” or “kg”). With voice, it’s unbounded.

You can mitigate some of the risk by providing more guidance to the user in the initial prompt:

At least we’re providing guidance that we’d like the unit specified (pounds or kilograms) and that decimals are also accepted. It still doesn’t mean the user has to provide them.

Anything can and does happen: people will continually surprise you, and the tech’s not at 100%, no matter how much you work on NLU.

Hey, even we humans mishear.

So let’s return to our conversational flow and consider the implications for us:

The grunt work — data validation

From left to right:

If we received a numeric quantity but no unit, we prompt for the unit (pounds or kilograms).
If we got both quantity and unit, we probably need to confirm with the user that what we heard was correct. (“Probably” depending on the cost of making a mistake.)
If what we received was neither a quantity nor valid unit, try again.
If necessary, we can provide a bit more guidance in the reprompt.

Actually, there’s another possibility too:

If the number sent to us is so absurdly small or large (e.g. 2.5 or 2500) that it can’t be right regardless of the unit, there’s no point asking the user to confirm the unit… we can check the number’s correct first.
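The checks above can be sketched as one small function. This is only an illustration: the slot values are assumed to arrive as raw strings, the `NextStep` names are made up for this example, and the plausibility bounds are guesses rather than anything Alexa provides. It also simplifies by asking again whenever no usable number was heard.

```python
from enum import Enum
from typing import Optional


class NextStep(Enum):
    ASK_AGAIN = "ask_again"            # no usable number was heard
    CHECK_QUANTITY = "check_quantity"  # number is implausible in any unit
    ASK_UNIT = "ask_unit"              # plausible number, unit missing
    CONFIRM = "confirm"                # plausible number and a valid unit


# Assumed plausibility bounds, chosen for illustration only.
MIN_PLAUSIBLE = 20.0    # below this is unlikely even in kilograms
MAX_PLAUSIBLE = 1000.0  # above this is unlikely even in pounds
VALID_UNITS = {"pounds", "kilograms"}


def next_step(quantity: Optional[str], unit: Optional[str]) -> NextStep:
    """Decide what to say next from the raw QUANTITY and UNIT slot values."""
    try:
        value = float(quantity) if quantity is not None else None
    except ValueError:
        value = None  # the slot held something that is not a number

    if value is None:
        return NextStep.ASK_AGAIN
    if not (MIN_PLAUSIBLE <= value <= MAX_PLAUSIBLE):
        return NextStep.CHECK_QUANTITY
    if unit not in VALID_UNITS:
        return NextStep.ASK_UNIT
    return NextStep.CONFIRM
```

Note the order: the range check comes before the unit check, so an absurd number never triggers a pointless question about pounds versus kilograms.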

This might be a good time to remind ourselves that all of this is essentially one form field on the web:

Yeah.

So how do we handle this?

Referring to the red, numbered arrows in turn:

If the user told us the unit and we already knew the quantity, we can now seek confirmation whether we heard their weight correctly.
As with any input from Alexa, it’s possible that what we received isn’t recognized at all. In which case, we can ask again (often rephrasing it). But what if that happens yet again? You need to keep track of the number of times you ask any prompt, and exit gracefully if it exceeds a reasonable number (say, three). This is yet more thankless work, and skipping it is why so many poor skills out there get themselves into endless loops.
We could learn that we heard the user wrong, in which case we’re back where we started.
In the best-case scenario, we have the correct information — success!
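Here is a rough sketch of the attempt counting described above. The `session` dict stands in for whatever per-session storage your skill uses, and the function name, the `"attempts"` key and the exit wording are all invented for this example.

```python
from typing import Dict, List, Tuple

MAX_ATTEMPTS = 3  # the "reasonable number" from the text; tune to taste


def reprompt_or_exit(session: Dict, prompt_id: str, prompts: List[str]) -> Tuple[str, bool]:
    """Handle an unrecognized reply to prompt_id.

    prompts[0] is the original wording; later entries are rephrasings.
    Returns (speech, end_session).
    """
    attempts = session.setdefault("attempts", {})
    attempts[prompt_id] = attempts.get(prompt_id, 0) + 1  # one more failed reply

    if attempts[prompt_id] >= MAX_ATTEMPTS:
        # Bail out politely instead of looping forever.
        return "Sorry, I'm having trouble understanding. Let's try again later.", True

    # Use the next rephrasing; reuse the last one if we run out.
    index = min(attempts[prompt_id], len(prompts) - 1)
    return prompts[index], False
```

With three wordings, the user hears the original prompt, then two rephrasings, and the third unrecognized reply ends the session.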

But consider the other options …

I don’t want to write code based on this mess. And neither do you. While a conversational flow is useful when considering the main pathways at the design stage, a developer needs a better way of representing things…

Enter states

Let’s redraw that in terms of states:

Whilst that appears more manageable, it doesn’t negate the need to do a massive amount of validation.

Btw, this also represents how conversational engines like Dialogflow work: by mapping utterances to intents whilst considering input contexts (i.e. incoming states) based on the data already known.
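To make the idea concrete, here is a minimal sketch of the weight dialogue as a transition table. The state and event names are my own labels for this example (they are not Dialogflow or Alexa terms), and the table is deliberately incomplete.

```python
from typing import Dict, Tuple

# (current state, what we just received) -> next state
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("ASK_WEIGHT", "quantity_and_unit"): "CONFIRM_WEIGHT",
    ("ASK_WEIGHT", "quantity_only"): "ASK_UNIT",
    ("ASK_WEIGHT", "implausible_quantity"): "CHECK_QUANTITY",
    ("ASK_UNIT", "unit"): "CONFIRM_WEIGHT",
    ("CHECK_QUANTITY", "yes"): "ASK_UNIT",
    ("CHECK_QUANTITY", "no"): "ASK_WEIGHT",
    ("CONFIRM_WEIGHT", "yes"): "DONE",
    ("CONFIRM_WEIGHT", "no"): "ASK_WEIGHT",
}


def advance(state: str, event: str) -> str:
    """Return the next state; anything unexpected keeps us where we are."""
    # Noise, silence or an off-topic reply is not in the table,
    # so we stay in the same state and reprompt.
    return TRANSITIONS.get((state, event), state)
```

Every row in that table is still a case you have to validate and test, which is the point: the representation is tidier, the workload is the same.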

“Aaaah 💡”, you think, “it’s basically a finite state machine and there are libraries for that”. Except… well, they don’t scale very well.

Let’s up the ante, and add another field to our web form:

Woo.

Guess what it means for your conversational flow diagram or state diagram? OK, time’s up, here’s one I baked earlier:

And this doesn’t even consider global intents like “help”, “start over”, “go back”, “cancel”… things which we don’t need on the web because, you know, users have like a back-button, can scroll and don’t have to reply within 8 seconds.

It’s at this stage that any developer coming from mature web frameworks would be forgiven for asking themselves whether they really want to go to THIS much effort building around every eventuality.

Get ready to manage context

Dialogflow may route based on incoming state but Alexa doesn’t work like that.

Alexa is essentially a flat list of intents. When it hears “yes”, it doesn’t know whether the user is confirming their age or their weight. When it hears “78”, it doesn’t know if you’ve just asked them for their weight or their age.

Bottom-line, if you’re developing for Alexa, it’s totally up to you to manage all context and the routing which comes with it.
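One way to do that routing, sketched under some assumptions: you store an `"awaiting"` marker in the session each time you ask something, and you dispatch on the pair (intent, awaiting). `AMAZON.YesIntent` is Alexa's built-in yes intent; `NumberIntent`, the handler names and the session keys are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

Session = Dict[str, str]
Handler = Callable[[Session, Optional[str]], str]


def on_weight_number(session: Session, value: Optional[str]) -> str:
    session["weight"] = value or ""
    session["awaiting"] = "confirm_weight"
    return f"I heard {value}. Is that your weight?"


def on_age_number(session: Session, value: Optional[str]) -> str:
    session["age"] = value or ""
    session["awaiting"] = "confirm_age"
    return f"I heard {value}. Is that your age?"


def on_weight_yes(session: Session, value: Optional[str]) -> str:
    session["awaiting"] = "age"
    return "Thanks. And how old are you?"


def on_age_yes(session: Session, value: Optional[str]) -> str:
    session["awaiting"] = "done"
    return "Great, that's everything."


# The same intent maps to different handlers depending on what we last asked.
ROUTES: Dict[Tuple[str, str], Handler] = {
    ("NumberIntent", "weight"): on_weight_number,
    ("NumberIntent", "age"): on_age_number,
    ("AMAZON.YesIntent", "confirm_weight"): on_weight_yes,
    ("AMAZON.YesIntent", "confirm_age"): on_age_yes,
}


def route(intent_name: str, session: Session, value: Optional[str] = None) -> str:
    handler = ROUTES.get((intent_name, session.get("awaiting", "")))
    if handler is None:
        return "Sorry, I didn't catch that."  # intent makes no sense here
    return handler(session, value)
```

So “78” is stored as a weight or an age purely because of what the session says we were waiting for, not because of anything Alexa told us.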

The joy doesn’t end there — content management

Remember this prompt?

Putting aside that it’s not a very natural way to speak, it’s long. It might be acceptable the first time you hear it, but you don’t need to hear all of it the second time around. So if you’re building services which will be used repeatedly, it’s good to serve content, prompts included, based on the experience of the user.

You don’t need to do that with websites because advanced users are going to fast-track/shortcut themselves to where they want to go, clicking along their usual pathways while barely skimming the text. The web is non-linear, whereas an audio stream is strictly linear: you can’t skim it.

Even if you have a shorter version for advanced users, they don’t want to hear the exact same wording each day. So you’re probably going to provide some sort of variation in wording.

This is, thankfully, one area where frameworks can help. But then again, you can roll this yourself too.
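If you do roll it yourself, it might look something like this sketch. The prompt wordings, the visit thresholds and the function names are all made up for illustration; the only idea taken from the text is "shorter prompts for experienced users, with some variation".

```python
import random
from typing import Dict, List, Optional

# Illustrative wordings only: long for newcomers, terse for regulars.
PROMPTS: Dict[str, List[str]] = {
    "newbie": [
        "What's your weight? You can answer in pounds or kilograms, and decimals are fine.",
    ],
    "intermediate": [
        "What's your weight, in pounds or kilograms?",
        "How much do you weigh, in pounds or kilograms?",
    ],
    "advanced": ["Your weight?", "Weight today?", "And your weight?"],
}


def experience_level(visits: int) -> str:
    """Bucket users by visit count; the thresholds are arbitrary assumptions."""
    if visits < 3:
        return "newbie"
    if visits < 10:
        return "intermediate"
    return "advanced"


def pick_prompt(visits: int, last_prompt: Optional[str] = None, rng=random) -> str:
    """Pick a wording for this user, avoiding the one they heard last time."""
    options = PROMPTS[experience_level(visits)]
    # Fall back to the full list if excluding last_prompt leaves nothing.
    fresh = [p for p in options if p != last_prompt] or options
    return rng.choice(fresh)
```

Passing `rng=random.Random(some_seed)` makes the choice repeatable, which helps when you come to test it.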

Some might think this is a tedious chore, but not you, you badass, you love your craft.

So what do you take away from this?

First off, don’t despair. If you’re creating a simple skill, just acting on commands issued by the user, you’ll circumvent most of this. Plus, it’ll get easier with time — as the speech recognition, NLU and developer frameworks further improve.

In the meantime:

Constrain input — that’s the name of the game in conversational UI. You want to minimize the unpredictability. Some good voice UX design can help but just accept that things won’t be predictable.
Manage context — you’re going to spend a lot of time doing this! Websites have a lot of implicit context: where the user is on the site, which links they click — it all lets you know where they are in the journey. In voice, it’s all in your hands.
Validate. You were constraining input with the web form: using select lists and client side validation to constrain input to the server… but you still had to treat any request as potentially suspect. Context & input validation isn’t difficult, it’s just tedious. Like, 1996-CGI-script tedious.
Personalization: we speak about it a lot on the web but how many actually do it? With voice you have no choice — build content for newbie, intermediate and advanced users; for what you already know about them. This can be the fun stuff.
Test, test and test again! Testing websites is MUCH easier because there is a finite number of paths. You don’t have to allow for Alexa mishearing somebody’s weight as ‘4’ or ‘fort’. The only way that happens with a website is if somebody hacks your form submission or types in a URL incorrectly, in which case they get a 404 or 500… and they can get themselves out of that. In voice, just as in life, anything can happen.

Good luck and thank you for reading this far.

p.s. If you ARE having difficulty with getting Alexa to grok users’ utterances, these tips may help.

Tips and gotchas using Alexa custom slots

I see a lot of similar questions regarding custom slots in the Alexa Developer Forums so am summarizing key points here…

medium.com

How Utterances & Slot samples affect Intent-matching in Alexa Skills

This is perhaps the biggest area of confusion and uncertainty for developers building Alexa Skills. This article…