alexafsm, A Finite-State Machine Python Library for Building Complex Alexa Skills

Vu Ha
Published in AI2 Blog
May 11, 2017 · 6 min read

--

Question to Alexa:

Alexa, are state machines the secret behind your skills?

Why?

In building Alexa skills, developers have a number of toolkits to choose from. For JavaScript developers, the Alexa team offers the official Node.js SDK, which provides support for handling session attributes, skill state persistence, response building, and behavior modeling. For Pythonistas, John Wheeler’s Flask-Ask seems to be the go-to option, as it is built on top of Flask, a popular micro-framework for building web applications. While these tools are excellent choices, they leave out a key tool for modeling complex conversational logic with more than a handful of intents and states: the finite-state machine (FSM). Enter alexafsm, an open-source Python library from the Allen Institute for Artificial Intelligence. The elevator pitch for alexafsm is:

With alexafsm, developers can model dialog agents with first-class concepts such as states, attributes, transitions, and actions. alexafsm also provides visualization and other tools to help understand, test, debug, and maintain complex FSM conversations.

How?

FSM is a fairly straightforward concept that many developers and computer scientists are familiar with. According to Wikipedia, it is “an abstract machine that can be in exactly one of a finite number of states at any given time. The FSM can change from one state to another in response to some external inputs; the change from one state to another is called a transition. An FSM is defined by a list of its states, its initial state, and the conditions for each transition”. In the context of modeling dialogs, these concepts carry the following meaning:

  • A state (sometimes approximately modeled with context, e.g. in Google’s API.AI Action SDK) is the state of the dialog. Examples are the initial state, the end state, waiting for some user input, and so on. States are often annotated with attributes, such as previously provided intent/slot values. Alexa uses the request['session']['attributes'] construct to pass such attributes back and forth with a skill’s backend; an abridged example request is sketched after this list.
  • External inputs are user inputs. Alexa user inputs typically arrive to the FSM in the form of parsed intent and slot values. External inputs are also called triggers or events in FSM parlance.
  • User inputs are handled by a skill developer by writing intent-handling logic that makes a transition to a new state.
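To make these concepts concrete, here is a rough, abridged sketch of the kind of request an Alexa skill backend receives, written as a Python dict; the intent, slot, and attribute names are hypothetical:

```python
# Abridged, hypothetical Alexa IntentRequest. The parsed intent and slots act as
# the FSM trigger, while session attributes carry the dialog state between turns.
request = {
    'session': {
        'attributes': {  # state the skill backend stored on the previous turn
            'state': 'has_result',
            'query': 'podcast player',
        },
    },
    'request': {
        'type': 'IntentRequest',
        'intent': {
            'name': 'SearchIntent',  # hypothetical intent name
            'slots': {
                'Query': {'name': 'Query', 'value': 'podcast player'},
            },
        },
    },
}
```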

For complex dialogs, the FSM approach offers an attractive modeling paradigm, enabling the developer to focus on the essentials of the dialog logic. When we started prototyping Alexa skills at the institute, our initial prototypes started out as a straightforward collection of intent handlers. As our prototypes evolved, these intent handlers grew in complexity, specifically in the number of if-then-else statements and the depth to which these statements were nested. The nested statements are effectively handlers for different states and contexts, routing, for example, a simple “Yes” confirmation to the appropriate logic depending on the dialog state. We saw that this quickly becomes a challenge to write, test, debug, and maintain for all but the simplest dialogs, and that an FSM approach would be highly desirable.
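To make the pain point concrete, here is a small, hypothetical example of what a pre-FSM handler for a bare “Yes” confirmation tends to turn into, with ad-hoc flags standing in for explicit states:

```python
def handle_yes_intent(attributes: dict) -> str:
    """Hypothetical pre-FSM handler: routing a bare "Yes" by inspecting ad-hoc flags."""
    if attributes.get('searching'):
        if attributes.get('has_result'):
            return 'Here are more details about that skill...'
        return 'OK, what would you like to search for instead?'
    if attributes.get('exiting'):
        return 'Goodbye!'
    # At this point we no longer know what the user was saying yes to.
    return 'Sorry, I am not sure what you mean. You can ask me to search for skills.'
```

Every new intent and every new dialog context multiplies these branches, which is exactly the bookkeeping an FSM makes explicit.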

After some research, we settled on using Tal Yarkoni’s Python FSM library called transitions. Here’s an example of how alexafsm is used to build an Alexa (meta) skill that helps users search for skills. The dialog enters the state below when the user is informed about a skill that may match her need.

Example of using alexafsm to implement state transition logic. The state “has_result” can be reached via different routes with corresponding conditions and prepare/after actions.
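The example above appears in the original post as a code screenshot. As a rough sketch of the same idea written directly against the underlying transitions library (the state, trigger, and callback names below are hypothetical, loosely following the caption, and do not reflect alexafsm’s actual API), it might look like:

```python
from transitions import Machine

class SkillSearchPolicy:
    """Hypothetical dialog policy; callbacks are referenced by name in the transitions."""

    def m_search(self):  # prepare action: run the (here, stubbed) search
        self.result = f'skills matching "{self.query}"'

    def m_has_result(self):  # condition: did the search return anything?
        return bool(getattr(self, 'result', None))

    def m_describe_result(self):  # after action: craft the speech for the user
        self.speech = f'I found {self.result}. Would you like to hear more?'

policy = SkillSearchPolicy()
machine = Machine(
    model=policy,
    states=['initial', 'has_result', 'is_describing'],
    initial='initial',
    transitions=[
        # Several routes converge on has_result, each declared with its own
        # conditions and prepare/after actions instead of hand-coded if-then-else.
        {'trigger': 'search', 'source': 'initial', 'dest': 'has_result',
         'prepare': 'm_search', 'conditions': 'm_has_result', 'after': 'm_describe_result'},
        {'trigger': 'next_result', 'source': 'has_result', 'dest': 'has_result',
         'conditions': 'm_has_result', 'after': 'm_describe_result'},
        {'trigger': 'describe', 'source': 'has_result', 'dest': 'is_describing'},
    ],
)

policy.query = 'podcast player'
policy.search()  # initial -> has_result, provided the condition holds
print(policy.state, policy.speech)
```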

Note how we captured a number of ways the dialog enters this state in a succinct, declarative way! alexafsm’s GitHub repo contains the complete implementation of the search meta skill.

What Else?

In addition to the core FSM modeling, we made the following design decisions:

  • We wrote an FSM validation tool that performs a thorough check of the following: 1) All Alexa intents have corresponding events/triggers in the FSM. 2) All states have either inbound or outbound transitions. 3) All transitions are specified with valid source and destination states. 4) All conditions and prepare/after actions are handled by methods in the Policy class. Once the FSM passes validation, we can be confident that the dialog agent is free of a large class of coding errors.
  • In addition to the visualization feature of transitions (via pygraphviz), we wrote a simple toString formatting utility for the FSM, which we found easier to read than a complex diagram. Check out an example of this format.
  • VoiceLabs analytics is supported out of the box. VoiceLabs is indispensable for skill developers to understand how users interact with their skills, so that they can continually fine-tune them and optimize user experience and retention. VoiceLabs’ Last Path, for example, helps identify whether users request something the skill does not support, or whether they are confused, repeating themselves, getting frustrated, and so on. This type of analytics insight dovetails perfectly with alexafsm’s features. A potential evolution for VoiceLabs is to support FSM-based dialog agents, for example by visualizing states and transitions instead of intents.
  • alexafsm is lightweight and agnostic with respect to the choice of web framework. Our example skill is written with Flask, but CherryPy, Pyramid, and others are also straightforward to use with alexafsm.
  • As we were getting started with a new research dialog project at the institute, Python 3.6 hit the shelves (last month, Amazon announced support for Python 3.6 in their Lambda serverless offering; how timely!). We adopted it for its more extensive support for type hinting and string interpolation. String interpolation is handy for Alexa speech templating: it is so convenient for writing speech patterns that it essentially obviates the need for templating solutions (such as Jinja) that keep speech assets separate from code. See which option you prefer in the comparison sketched after this list.
  • alexafsm development was bootstrapped from Audrey Roy Greenfeld’s cookiecutter-pypackage, from which we picked up support for tox/pytest/flake8 and PyPI and Travis integration.
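The string interpolation comparison referenced above is shown as an image in the original post; as an illustrative stand-in (the skill name and rating below are made up), the f-string version keeps the speech right next to the logic, while the Jinja version routes it through a separate template:

```python
from jinja2 import Template

skill_name, rating = 'Ambient Rain', 4.5

# Python 3.6 f-string: the speech lives right next to the dialog logic.
speech = f'I found {skill_name}, rated {rating} stars. Want to hear more?'

# Equivalent Jinja template: speech assets are kept separate from the code.
template = Template('I found {{ name }}, rated {{ rating }} stars. Want to hear more?')
speech_jinja = template.render(name=skill_name, rating=rating)

assert speech == speech_jinja
```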

When Not?

For developers building skills with only a couple of intents and dialog turns, alexafsm would not be as compelling. For those interested in taking a crack at the Alexa Prize, building a social bot that can hold a conversation for 20 minutes, alexafsm may not be a good fit either (we dread to think about designing an FSM for a 20-minute chat). For other scenarios, give alexafsm a try and see if you like it. We like it a lot for our skill-search meta skill.

What’s Next?

alexafsm’s GitHub repo has more detailed documentation and includes an FSM-based implementation of an Alexa skill that helps users search for skills. We’d love to hear your feedback, suggestions, and pull requests.

Looking farther into the future, let’s take a look at the following schematic diagram of spoken dialog systems, taken from a recent research paper on the institute’s Semantic Scholar service:

See if you can spot the role of the FSM in this diagram!

The concept of a state machine is front and center in (spoken) dialog research. Historically, research focused on modeling and handling the uncertainty that comes from speech and language understanding errors using the decision-theoretic framework of (Partially Observable) Markov Decision Processes. This framework assumes hand-crafted states and transitions. Recent trends in deep reinforcement learning and end-to-end neural training of dialog systems aim to overcome expensive hand-crafted FSMs by attempting to learn directly from data (the no-pipeline crowd). The reality check, however, is that most commercially deployed dialog systems are still scripted with rules, with explicit FSM modeling optional.

The field of AI dialog agents is perhaps at its most exciting point in a long time. Big players are betting heavily on a future with ubiquitous voice-first experiences (Google Assistant, Microsoft Cortana, and Apple Siri, in addition to Amazon Alexa). Startups are experimenting with a dizzying array of conversation-based user experiences. The gap between industry and research remains large, but will hopefully shrink in the future. With alexafsm we hope to play a small part in closing this gap. Perhaps the answer to the question

Alexa, are state machines the secret behind your skills?

could one day be:

You nailed it, it’s called alexafsm!
