Alexa Skill State Machines, Explained

Declan McKelvey-Hembree
6 min read · May 6, 2018


NOTE: The below assumes you are working in node.js with the alexa-sdk npm package.

Last week at work I had to develop a proof-of-concept Alexa skill for a pitch. I didn’t think it’d be a particularly tough build-out, but that was before I took a close look at Alexa’s documentation, or total lack thereof. The ASK (Alexa Skills Kit) seems to be in a constant state of flux, changing and iterating so rapidly that the Amazon documentation is flimsy and most of the tutorials and code samples are out of date.

Of course, that means that this article will likely soon also be obsolete, but it’s current at the time of my writing this (5/6/2018), so I’ll record this for posterity anyway.

Depending on which Alexa introductory workflow you went through, you may not even know that the Alexa conversation-handling model supports state machines at all: the Amazon-supplied example fact skill doesn’t use them, and neither does the example quiz skill. The adventure game example uses a state machine with a single state and a bunch of self-looping transitions. However, for complex, multi-turn conversations, the built-in state machine functionality is one of the best ways to keep your code from becoming an unreadable mess.

Intents And Handlers

In most of the Alexa examples, handlers look like this (from the fact skill):

const GetNewFactHandler = {
  canHandle(handlerInput) {
    const request = handlerInput.requestEnvelope.request;
    return request.type === 'LaunchRequest'
      || (request.type === 'IntentRequest'
        && request.intent.name === 'GetNewFactIntent');
  },
  handle(handlerInput) {
    const randomFact = cookbook.getRandomItem(data);
    const speechOutput = GET_FACT_MESSAGE + randomFact;
    return handlerInput.responseBuilder
      .speak(speechOutput)
      .withSimpleCard(SKILL_NAME, randomFact)
      .getResponse();
  },
};

//...some other handler consts and then...

exports.handler = skillBuilder
  .addRequestHandlers(
    GetNewFactHandler,
    HelpHandler,
    ExitHandler,
    FallbackHandler,
    SessionEndedRequestHandler
  )
  .addErrorHandlers(ErrorHandler)
  .lambda();

In this system, each handler corresponds to an intent, i.e. there’s one handler for each thing your skill can do, and which handler gets used is decided by the boolean return value of each handler’s canHandle function. When this Lambda function gets a request from Alexa, it iterates over the list of handlers that were added in addRequestHandlers and uses the first one whose canHandle function returns true.
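Conceptually, the dispatch works something like the sketch below. This is a simplification for illustration only, not the SDK’s actual implementation:

// Simplified sketch of the "first canHandle wins" dispatch described
// above. Not the SDK's real code, just the general idea.
function dispatch(handlerInput, requestHandlers) {
  for (const handler of requestHandlers) {
    if (handler.canHandle(handlerInput)) {
      return handler.handle(handlerInput);
    }
  }
  // The real SDK falls through to its error handlers at this point.
  throw new Error('No registered handler could process this request');
}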

It turns out that this one-to-one mapping of intent to handler doesn’t make much sense when working with conversations. For example, saying “Yes” or “No” in a conversation is sent to the Lambda function as an AMAZON.YesIntent or an AMAZON.NoIntent. In the model above, that means you’d have a single handler code block for whenever your user says “Yes” or “No”. If your skill asks the user a yes-or-no question in more than one place, the first thing that handler has to do is figure out which conversation it’s in. Aside from leading to some weirdly-organized code, that’s extremely counter-intuitive. When I ask someone a question, I’m waiting for a response to that specific question, not just perking my ears up whenever someone says ‘yes’ and trying to figure out which question they’re saying yes to.
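To make that concrete, here’s a hypothetical sketch of what a lone Yes handler tends to turn into in this model. The lastQuestion session attribute and the branch values are things I’m inventing purely for illustration; they aren’t part of the SDK or of Amazon’s samples:

const YesIntentHandler = {
  canHandle(handlerInput) {
    const request = handlerInput.requestEnvelope.request;
    return request.type === 'IntentRequest'
      && request.intent.name === 'AMAZON.YesIntent';
  },
  handle(handlerInput) {
    // Without states, the first job is working out which question the
    // user is actually answering. 'lastQuestion' is a made-up session
    // attribute used purely for illustration.
    const sessionAttributes = handlerInput.requestEnvelope.session.attributes || {};
    let speechOutput;
    if (sessionAttributes.lastQuestion === 'wantsAnotherFact') {
      speechOutput = 'Here is another fact.';
    } else if (sessionAttributes.lastQuestion === 'wantsToKeepPlaying') {
      speechOutput = 'Great, back to the game.';
    } else {
      speechOutput = "I'm not sure what you're saying yes to.";
    }
    return handlerInput.responseBuilder
      .speak(speechOutput)
      .getResponse();
  },
};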

So this way of setting up your handlers, though it works well for simple “command” skills, seems likely to get extremely confusing once you start trying to have multi-turn conversations with your skill.

Other Alexa examples (like the snippet below, from the adventure game skill) set up their handlers like this instead:

const handlers = {
  'LaunchRequest': function() {
    console.log(`LaunchRequest`);
    if (this.event.session.attributes['room'] !== undefined) {
      var room = currentRoom(this.event);
      var speechOutput = `Hello, you were playing before and got to the room called ${room['$']['name']}. Would you like to resume? `;
      var reprompt = `Say, resume game, or, new game.`;
      speechOutput = speechOutput + reprompt;
      var cardTitle = `Restart`;
      var cardContent = speechOutput;
      var imageObj = undefined;
      this.response.speak(speechOutput)
        .listen(reprompt)
        .cardRenderer(cardTitle, cardContent, imageObj);
      this.emit(':responseReady');
    } else {
      this.emit('WhereAmI');
    }
  },
  'ResumeGame': function() {
    console.log(`ResumeGame:`);
    this.emit('WhereAmI');
  },
  //...other intent_key:function_value pairings
};

//...and then...

const alexa = Alexa.handler(event, context);
alexa.registerHandlers(handlers);
alexa.execute();

In this scheme, there’s only one handlers object, which contains all the intents as keys, each mapped to the function to execute when that particular intent is passed to the Lambda function. This works essentially the same way as the previous system, just with different syntax, which means it also shares all the problems and limitations described above. Clearly, we need something else.

Like this:

const Alexa = require('alexa-sdk');

//handler for initial state
const newSessionHandlers = {
  'LaunchRequest': function(){
    this.response.speak("Try asking for a fortune cookie.")
      .listen("Try again.");
    this.emit(':responseReady');
  },
  'FortuneCookieIntent': function(){
    this.handler.state = "HAPPIER_FORTUNE";
    this.response.speak("Here's a neutral fortune: You love Chinese food. Would you like a happier fortune?")
      .listen("Try again.");
    this.emit(':responseReady');
  },
  'Unhandled': function(){
    this.response.speak("I'm not sure how to help you with that. Try asking for a fortune cookie.")
      .listen("Try again.");
    this.emit(':responseReady');
  }
};

//handler for the fortune cookie follow-up question
const happierFortuneHandlers = Alexa.CreateStateHandler("HAPPIER_FORTUNE", {
  'AMAZON.YesIntent': function(){
    this.response.speak("You will live long enough to open many fortune cookies.");
    this.emit(':responseReady');
  },
  'AMAZON.NoIntent': function(){
    this.response.speak("You will die alone and poorly dressed.");
    this.emit(':responseReady');
  },
  'Unhandled': function(){
    this.response.speak("You can either say yes or no.")
      .listen("You can either say yes or no.");
    this.emit(':responseReady');
  }
});

exports.handler = function (event, context) {
  const alexa = Alexa.handler(event, context);
  alexa.registerHandlers(newSessionHandlers, happierFortuneHandlers);
  alexa.execute();
};

So this system defines multiple handler objects using the Alexa.CreateStateHandler function. Crucially, that function takes a state name as its first argument, which means that, in our example, we can define one behavior for a user saying yes at one point in the conversation and a different behavior for a yes at another point. The actual state change happens by assigning to the this.handler.state value. Let’s take a closer look at this example fortune cookie skill and see how states solve problems for our conversational model.

Conversation States And Intent Transitions

[State diagram for the example fortune skill: it waits for you to ask for a fortune, gives you one, then gives you a happier or sadder fortune depending on whether you say yes or no when Alexa asks if you’d like a happier fortune.]

To be precise, the this.handler.state property stores our current state, and changes to that property happen in handler code that’s triggered by intents. So intents are actually our transitions between states.
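Isolated from the fortune example, the transition pattern looks something like this. The intent and state names here are placeholders of my own, not anything defined by the SDK:

'AskQuestionIntent': function(){
  // Transition: from here on, incoming intents are routed to the
  // handlers registered for ANSWER_STATE instead of this object.
  // 'AskQuestionIntent' and 'ANSWER_STATE' are placeholder names.
  this.handler.state = "ANSWER_STATE";
  this.response.speak("Here's a question. Yes or no?")
    .listen("Yes or no?");
  this.emit(':responseReady');
},

(alexa-sdk also exposes this.emitWithState('SomeIntent') if you need to hand control directly to a handler registered for the newly-set state, rather than waiting for the user’s next utterance.)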

The other bit of magic that allows this to work is the ‘Unhandled’ function you can define in your state handlers. It functions as a sort of catch-all, and runs when the state receives an intent it doesn’t have a specific function for. This lets you make every intent a state doesn’t explicitly handle self-loop. For example, in the fortune skill above, AMAZON.YesIntent and AMAZON.NoIntent in the initial state both trigger the ‘Unhandled’ function and self-loop. It also has the bonus side effect of avoiding a hard crash, which is what happens if a state receives an intent it has no function for and no ‘Unhandled’ to fall back on.

The big advantage here is that you can keep track of where you are in a conversation and only react to the intents that make sense for that part of the conversation. That means you don’t have to figure out what to do if, for example, the user tries to open the conversation by just saying ‘yes’.

So, to bring it all together, here’s a template for building these state machines:

const Alexa = require('alexa-sdk');

const initialHandlers = {
  //note that we're not using Alexa.CreateStateHandler here,
  //because this is our initial, entry state the user is in when
  //they first invoke the skill
  'intentA': function(){
    //code to react to intent A here.
    //change this.handler.state in here to
    //create a transition to a different state.
  },
  'intentB': function(){
    //code to react to intent B here.
    //change this.handler.state in here to
    //create a transition to a different state.
  },
  //...as many intents as you'd like
  'Unhandled': function(){
    //code to run when an intent doesn't
    //have a specific function above
  }
};

const stateXHandlers = Alexa.CreateStateHandler("STATE_X_NAME", {
  'intentA': function(){
    //same as above
  },
  'intentC': function(){
    //same as above, note that you can
    //have an arbitrarily different set of
    //specific intent functions in each state
  },
  //...as many intents as you'd like
  'Unhandled': function(){
    //same as above
  }
});

const stateYHandlers = Alexa.CreateStateHandler("STATE_Y_NAME", {
  //again, whatever intents you'd like in here
  //and an Unhandled function
});

exports.handler = function (event, context) {
  const alexa = Alexa.handler(event, context);
  alexa.registerHandlers(
    initialHandlers,
    stateXHandlers,
    stateYHandlers
    //...as many state handlers as you'd like
  );
  alexa.execute();
};

Further Reading/Documentation

I’m not sure why the Alexa tutorials don’t mention this crucial feature, but you can find more examples and explanation of the state machine functionality near the bottom of the alexa-sdk npm package page.
