Voice-enable your APIs with Amazon Alexa
Like most developers I think in terms of APIs. It’s a shorthand that helps describe what a service does in a way that is consistent and familiar. If done right, the API describes things in terms of nouns and verbs. When I look at the API for my local library, for example, I see that I can get a list of locations or fetch a single book.
It’s well understood how to invoke an API. The simplest shorthand is curl.
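For instance, a hypothetical Edamam-style search might look like this (the endpoint, parameters, and credentials are illustrative placeholders, not a working request):

```shell
curl "https://api.edamam.com/search?q=chicken&calories=0-1500&app_id=YOUR_ID&app_key=YOUR_KEY"
```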
Even without reading the docs you get a sense of an API’s shape (this one likely searches for chicken recipes with fewer than 1,500 calories). It’s a developer-to-developer shorthand that makes it easy to share and connect.
Such conveniences form the backbone of the API economy. Web apps and mobile apps alike connect and use APIs. But how do you surface an API using voice? How do you surface it in ways the user can reasonably understand to make it “usable”? How do you voice-enable your API with Amazon Alexa?
The Recipe Use Case
I’m a chef when I’m not at work. It’s a way to unwind that takes me out of my head and into the moment. It permeates a lot of what I talk about and ends up making its way back to my work (I use cooking examples a lot in my code).
So true to form, I’m going to surface the Edamam.com API with a verbal user interface (VUI). If you’re not familiar with Edamam, they provide recipe and nutrition analysis APIs to individuals and businesses. You can search for recipes by ingredient, calories, or a facet like high-protein.
You can also search for recipes directly on their website, which gives a good sense of how to surface their API visually. They’ve approached it in a way consistent with what one would expect from an API such as theirs: a search box at the top for the query, some checkboxes for various filters, and a list where the results are displayed.
Clicking on an item will load the specific recipe along with instructions and ingredients.
Verbal with a Twist
In order to surface the Edamam.com API as a verbal user interface, the same basic patterns must be supported. The user must be able to search, refine, choose, and view. It’s straightforward in principle, but Alexa adds a complication: it supports both visual and non-visual devices. This influences what is considered “usable”, as certain design patterns do not translate well to all devices.
For example, here is the verbal interaction to find a recipe using the Echo Dot (which has no visual UI). Notice how it avoids reading a long list of choices once the results are found (NOTE: the Alexa Skill name for all examples in this document is “My Ratatouille”).
For users with an Echo Show (which has a screen), the output is adjusted and the user is prompted to choose from a carousel of options. They can browse the options by scrolling back and forth.
Regardless of the edge cases, VUI development is reasonably straightforward once you understand the basic interaction patterns. I’ll conclude this document with a high-level description of the major development steps, starting with how to configure Alexa and finishing with how to take action and communicate back to the user once Alexa resolves the user’s intent.
Define Intents and Slots
The Alexa development tools help you take loosely-structured speech and convert it into organized intents with predictable values (slots). For the recipe skill, the intents listed here reflect the types of interactions necessary for a user to find, choose, and review a recipe. They describe the interaction from the end user’s perspective, and they should be similar to the types of interactions that already exist on the visual website. Here are just a few.
- Find recipes
- Choose a recipe
- Show the ingredients
- Show the instructions
- Send the recipe
- Stop the conversation
When defining intents, it’s important to understand the extra information the user will need to provide to properly communicate the intent. For example, our users don’t just want to find recipes; they want to find recipes containing a specific ingredient. This extra information is referred to as a “slot” (in the case of the find recipes intent, the slot is “ingredient”).
Once you declare an intent, you must provide sample utterances and patterns. Alexa then does the rest, using machine learning to determine the most likely user intent at runtime, even if the utterance isn’t an exact match.
Alexa encourages you to define a category for each slot. This helps Alexa better interpret what the user means by limiting what to expect. For example, the ingredient slot used for the find recipes intent is defined as AMAZON.Food, which helps Alexa understand less-common utterances like “chicken cordon bleu” when spoken out of the blue.
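To make this concrete, here’s a sketch of what the interaction model for the find recipes intent might look like in the Alexa developer console (the intent name and sample utterances are my own illustrative choices; the overall JSON shape follows the Alexa interaction model schema):

```json
{
  "interactionModel": {
    "languageModel": {
      "invocationName": "my ratatouille",
      "intents": [
        {
          "name": "FindRecipesIntent",
          "slots": [{ "name": "ingredient", "type": "AMAZON.Food" }],
          "samples": [
            "find recipes with {ingredient}",
            "search for {ingredient} recipes",
            "what can I cook with {ingredient}"
          ]
        },
        { "name": "AMAZON.StopIntent", "samples": [] }
      ]
    }
  }
}
```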
Define Action Handlers
Once Alexa determines the user’s intent and any slots required to fulfill the intent, it will contact an action handler of your choosing. This is referred to as the “service endpoint”, and you configure it at the time you define your intents and slots. This is where you do your work and fulfill the intent.
Alexa recommends that developers use the AWS Lambda service to fulfill the user’s intent (to take “action”). Typically, this means the developer authors a Lambda function in the language of their choosing (such as NodeJS). For example, in the case of the recipe skill, the Lambda function could call Edamam.com and return a list of matching recipes.
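Here’s a minimal sketch of what such a handler might look like in NodeJS, without the Alexa SDK dependency. The intent name, slot name, and the stand-in for the Edamam.com call are assumptions for illustration; the response shape follows the Alexa Skills Kit JSON format.

```javascript
// Build a response in the Alexa Skills Kit JSON response format.
function buildResponse(speech, endSession) {
  return {
    version: "1.0",
    response: {
      outputSpeech: { type: "PlainText", text: speech },
      shouldEndSession: endSession,
    },
  };
}

// Entry point: Alexa POSTs the resolved intent and slots to this handler.
function handleRequest(event) {
  const req = event.request;
  if (req.type === "IntentRequest" && req.intent.name === "FindRecipesIntent") {
    const ingredient = req.intent.slots?.ingredient?.value;
    if (!ingredient) {
      // Slot not filled yet: re-prompt the user instead of failing.
      return buildResponse("What ingredient would you like to cook with?", false);
    }
    // Placeholder for the real Edamam.com search call.
    return buildResponse(`Searching for ${ingredient} recipes.`, false);
  }
  return buildResponse("Sorry, I didn't catch that.", false);
}
```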
This approach works well in most cases. But when the skill requires frequent, repeated interaction, it can get cumbersome to manage the complexities of the conversation, which can be dynamic, with interruptions and changes in intent. Given these challenges, we like to organize the interaction using process modeling semantics. Not only is it easier to visualize the interaction at the business level, it’s also possible to design reusable modules to augment and adapt the use case (more on that later).
Here is the central module for the recipe application. This is our service endpoint that is called by Alexa once the user’s intent has been determined (1). The router is then called (2) to send the intent to the proper module. For this example, let’s assume the user said “Find Recipes”, which will cause the “Find” module (3) to load.
The “Find” module is tasked with calling Edamam.com and returning a list of recipes (see below). If the slot has not yet been filled, we can prompt the user for the ingredient (2). As is always the case, the user can say anything, including asking to stop (3b) or asking for help (3c).
If the user provides the ingredient (3a), the search will be executed. If no recipes were found, the user is routed back to the main module where they are prompted to choose a different intent (4). If recipes were found, the call is routed to the “Choose” module where the user will be prompted to make a selection (5).
We design our conversations to stay shallow, making it easy for users to change contexts. Regardless of the module’s primary purpose, each module will typically allow the user to randomly navigate in any direction for optimal usability.
For example, here is the module that shares the recipe ingredients. Once the user is presented the list (1), we prompt them to send the ingredients to their phone. They can choose this primary path (2), but they can just as easily ask for help (3a), ask to stop (3b), or specify any other intent (3c). When that occurs, the call is sent to the central router where it is then forwarded to the appropriate module, starting the cycle all over again.
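The router-and-modules pattern described above can be sketched in a few lines (the intent names and module return values here are placeholders of my own, not the actual Intwixt implementation): every module handles its primary intent, and anything it doesn’t recognize falls back to the central router.

```javascript
// Each module is a function keyed by the intent it fulfills.
const modules = {
  FindRecipesIntent: (slots) => `find:${slots.ingredient || "?"}`,
  ShowIngredientsIntent: () => "ingredients",
  "AMAZON.StopIntent": () => "stop",
};

// Central router: dispatch to the matching module, or fall back
// to the main menu so the user can pick a different intent.
function route(intentName, slots = {}) {
  const module = modules[intentName];
  return module ? module(slots) : "main-menu";
}
```

Because every module defers unknown intents back to `route`, the conversation stays shallow: the user can jump from viewing ingredients to a new search without walking back up a deep menu tree.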
This is the full set of all modules for the skill (created using Intwixt). By default, this skill is designed to call the Edamam.com API (the left module highlighted below).
However, because this is a pluggable system, the other modules have no knowledge of where the data is coming from. It is easy enough to search the New York Times instead (the right-hand module highlighted above). Here are the results, indistinguishable from those that come from Edamam.com.
In fact, the modules created for this skill were designed to have no knowledge of what is being searched for. Instead, they provide a generic set of capabilities for searching, refining, and viewing lists of instructional information using a verbal user interface. It might be recipes, but it might also be a list of invoices, an employee lookup service, or whatever API I decide to surface.
So many use cases can be enhanced with VUI. Personally, I use this skill for cooking and regret not building it sooner. It’s the perfect medium when I need to be hands free. But my favorite part is that I can add yet another module to search my own recipes. I finally have a way to access my own library, and another, and another, and another….