Getting Started with Amazon Alexa Skills
Ever since I saw the trailers for Alexa, I’ve wanted to know what sort of sorcery went on in the background to make it work. I imagined it would be a difficult feat, filled with some serious natural language processing and machine learning jiu-jitsu. Thankfully, like most modern-day technologies, it’s intuitive, well documented, and has a relatively low learning curve! There are plenty of resources out there for learning about Alexa skills. One that I found invaluable was the Cloud Guru course, and it’s free! Amazon also has a set of resources that will get you ramped up.
The long and short of it is that Alexa has “skills,” at least from a third-party developer standpoint. I’m not entirely sure how the Alexa dev teams program new features into her, but the core technology is likely similar to what we outsiders will be using.
So what’s a skill? I’ll defer to Joy Chen’s post on Alexa skills for more details. The relevant gist here: a skill is a combination of an interface driven by our voice (a voice user interface) and the logic that handles what was spoken.
The voice interface
I imagined this would be the most complicated part of skill development, but it was quite simple once I got past the initial fear of trying something new. The voice user interface (VUI) for an Alexa skill is essentially a set of key phrases, each containing target phrase parts, that Alexa passes to your logic (an AWS Lambda) to deliver your desired output. From a development standpoint it’s a surprisingly static and fixed affair: you enter a set of phrases, such as those below, and designate a target word or phrase for your Lambda to process. (Note: you’re not limited to a single target word per phrase.)
help me get {location}, // 'to the store', 'home', 'to Iceland', etc
let me go {location},
help us getting {location},
how do I get {location},
can I get {location},
fly me {location},
take him {location} please,
she is going {location}
Amazon expects you to have a minimum of 50 of these phrases, and to keep adding more for the lifetime of the application, and that’s for a very good reason. There are simply too many ways for an individual to say the same thing, and for all the ways one individual can say it, another individual will find yet another set of ways to say it.
During the one-week development of my application, the phrases (aka utterances) for a single intent were growing exponentially. I ended up using an online tool to generate phrases based on permutations. I got a bit fanatical about covering all the potential voice inputs for various users, and for the various intents within the application. This, however, led me to another limitation, one I have yet to find in the documentation…
Alexa’s intent schema, aka the JSON file that stores all of the intents and slots, has a rough limit of 10,000 lines. After chopping down hours’ worth of permutations, I thought through what features my set of intents shared and sought a way to reduce the duplicated utterances and to limit the number of intents needed for the skill.
Make the machine work for you
Amazon feeds your intents to a machine learning algorithm, which creates the model the skill uses to route the user’s voice input to our Lambda logic! Aha! So, instead of creating permutations of many phrases with target words, I decided to give each sentence part a set of words, and give the words a set of “synonyms”. Ideally, these synonyms would be used to identify similar target items, like puppy, dog, doggy, doge, etc. You get the point, but they could also work to identify sentence parts! Or at least that was the theory. Here’s what this looks like in the developer console for an Alexa skill (followed by an example sentence for each):
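Roughly, each sentence part becomes a slot type whose values carry synonyms. The slot names below match the interaction model further down; the specific words and synonyms are purely illustrative:
pronoun: me (synonyms: us, him, her, them) // “help me get to the store”
verb: get (synonyms: go, take, fly, bring) // “fly us to Iceland”
determiner: to (synonyms: to the, toward, into) // “take him to the store please”
root_request: store (synonyms: shop, market, the mall) // “how do I get to the store”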
The beauty of this approach is that it really works! What’s more, the sentence parts, or slots, are readily reusable! Needless to say, I was able to generate tons more phrases, or rather, let the machine generate these phrases for me, with minimal work on my part. While the whole point was to manually add fewer utterances per intent (to fit within the limitation), Alexa was ultimately able to handle a hefty number of phrasings with this approach! Granted, I’m no linguist, so many of the sentences the machine generates are likely overly repetitive. I went out on a limb, assumed the algorithm de-duplicates equivalent utterances, and called it a day. Digging deeper into de-duplication is a problem for another post.
If you’re wondering what my interaction model looks like, here’s a severely truncated version:
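(This sketch follows the skill-builder JSON format; the intent and invocation names are placeholders, and only the pronoun type is fleshed out.)
{
  "interactionModel": {
    "languageModel": {
      "invocationName": "travel helper",
      "intents": [
        {
          "name": "GetLocationIntent",
          "slots": [
            { "name": "pronoun", "type": "pronoun" },
            { "name": "verb", "type": "verb" },
            { "name": "determiner", "type": "determiner" },
            { "name": "root_request", "type": "root_request" }
          ],
          "samples": [
            "help {pronoun} {verb} {determiner} {root_request}",
            "{verb} {pronoun} {determiner} {root_request} please",
            "how do I {verb} {determiner} {root_request}"
          ]
        }
      ],
      "types": [
        {
          "name": "pronoun",
          "values": [
            { "name": { "value": "me", "synonyms": ["us", "him", "her", "them"] } },
            { "name": { "value": "I", "synonyms": ["we", "he", "she", "they"] } }
          ]
        }
      ]
    }
  }
}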
In the example above, the root_request is going to be my target word. This is what the Lambda uses to decide what Alexa will say to the user. The types here only show the pronoun, but use your imagination and pretend that verb, determiner, and root_request have a similar structure to the pronoun type. Not only can I use this structure to reuse types within this skill, but I can also port the types to any of my future apps! Ahh, beautiful. An example of a fully functional interaction model can be found HERE!
Quick script to format input
To find synonyms and structure them in a way that’s easily appended to the interaction model JSON file, this tiny Python script saved me some time as well:
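(A minimal sketch of that kind of script, with the words and synonyms hard-coded purely for illustration; it prints the slot-value structure a type’s values list expects.)
# format_synonyms.py: turn a word and its synonyms into the
# slot-value structure the interaction model JSON expects.
import json

def slot_value(word, synonyms):
    return {"name": {"value": word, "synonyms": synonyms}}

if __name__ == "__main__":
    values = [
        slot_value("dog", ["puppy", "doggy", "doge"]),
        slot_value("cat", ["kitty", "kitten", "feline"]),
    ]
    # Paste the output into a slot type's "values" array
    print(json.dumps(values, indent=2))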
Now, the logic that makes Alexa speak
Now that you’re ramped up on how we defined our many utterances, let’s get back to handling those intents I showed you earlier. The thing about these target requests and synonyms is that they’re nested nice and deep within the incoming request from the voice interface. For example, let’s say you have a target word, “cat”, but your user says “kitty”; you’ll have to find the name of the target word, not just the synonym that was spoken. If you just read the slot’s value, you’re still gonna get the synonym “kitty”. You can certainly handle this in the Lambda, but instead I chose to retrieve the slot’s resolved keyword (that nested name) rather than handling the synonym for each case. Don’t believe me? Give it a shot, and also check out the docs!
This little snippet will make your life a good bit easier. Again, less code, more skills!
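The general shape of it looks like this (a sketch; the slot name “pet” is just an example), pulling the canonical value out of the entity resolution data on the request:
// Pull the canonical (resolved) value for a slot out of the
// entity resolution data, falling back to the raw spoken value.
function getResolvedSlotValue(intent, slotName) {
  const slot = intent.slots[slotName];
  const authority = slot &&
    slot.resolutions &&
    slot.resolutions.resolutionsPerAuthority[0];

  if (authority && authority.status.code === 'ER_SUCCESS_MATCH') {
    // The user said "kitty", but this returns the target word "cat"
    return authority.values[0].value.name;
  }
  return slot && slot.value;
}

// Inside an alexa-sdk handler:
// const pet = getResolvedSlotValue(this.event.request.intent, 'pet');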
Now, for some Alexa skill SDK knowledge
Before you deploy your Lambda, make sure you’ve imported the Alexa SDK. If you’re using Node, as I am, run npm install alexa-sdk. Here’s the root index.js file, and a dogmatic way of handling multiple handlers. You can use the handlers to divide the separate functionalities of your skill. For example, one handler handles cats, the other handles dogs!
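In sketch form, using the handler module names from the project structure shown later (swap in your skill’s own application ID):
'use strict';
// index.js: the Lambda entry point that wires the handler modules together
const Alexa = require('alexa-sdk');
const rootHandler = require('./handlers/root_handler');
const otherHandler = require('./handlers/other_handler');

exports.handler = function (event, context) {
  const alexa = Alexa.handler(event, context);
  alexa.appId = 'amzn1.ask.skill.your-skill-id'; // your skill's application ID
  alexa.registerHandlers(rootHandler, otherHandler);
  alexa.execute();
};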
You could also handle your cats and dogs in the same file, but… you know… it could get messy.
Duplicate code?
Some of the intents might share some of the same logic. In a previous project, Shaun Husain used Object.assign() to remove generic code from multiple handlers. Again, less code, and a more manageable code base:
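For instance, a generic handler module can hold nothing but the intents every part of the skill needs (the responses here are illustrative):
// generic_handler.js: intents shared across the whole skill
module.exports = {
  'AMAZON.HelpIntent': function () {
    this.emit(':ask', 'You can ask me how to get somewhere. What would you like to do?', 'What would you like to do?');
  },
  'AMAZON.CancelIntent': function () {
    this.emit(':tell', 'Okay, cancelled.');
  },
  'AMAZON.StopIntent': function () {
    this.emit(':tell', 'Goodbye!');
  }
};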
Below is a more common use case. Amazon’s built-in intents have the same input and the same output everywhere, so we won’t need to duplicate them in all the other handlers. All you have to do is import the generic handler into your specific handlers, and extend the specific handler with your generic handler’s intents. And voilà! I’ll show this trick in action in the next code snippet:
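A sketch of the trick (the cat handler and its response are, of course, made up):
// handlers/cat_handler/index.js: extend a specific handler with the generic intents
const genericHandlers = require('../generic_handler');

const catHandlers = {
  'CatFactIntent': function () {
    this.emit(':tell', 'Cats sleep for around sixteen hours a day.');
  }
};

// Object.assign merges the generic intents in, so help, stop, and cancel
// work here without being re-implemented.
module.exports = Object.assign({}, genericHandlers, catHandlers);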
Extras for the inquisitive mind
Aside from the basic structure and dogma of the Alexa SDK, I wanted to delve a little deeper. I assumed that Alexa could handle anything a standard, modern-day application should, meaning it can integrate with external APIs to retrieve information, and also send information to external APIs. Think I’m crazy? You’d be wrong!
I started by using node-fetch to retrieve some fake data for a pretend blog from JSONPlaceholder, and then had Alexa speak the title of the fake blog post. Here’s my handler for the request, and Alexa was thankfully cooperative:
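In sketch form (the intent name is made up; the URL is JSONPlaceholder’s real fake-post endpoint):
// Fetch a fake post and have Alexa read its title
const fetch = require('node-fetch');

module.exports = {
  'GetPostTitleIntent': function () {
    fetch('https://jsonplaceholder.typicode.com/posts/1')
      .then(response => response.json())
      .then(post => {
        // Arrow functions keep `this` bound to the alexa-sdk handler
        this.emit(':tell', `The latest post is titled ${post.title}`);
      })
      .catch(() => {
        this.emit(':tell', 'Sorry, I could not reach the blog right now.');
      });
  }
};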
What next? Let’s see if we can get her to send me a text message. If you’ve never used Twilio, it’s a well-refined API for sending SMS messages or handling voice input, and it’s fun. First, set up your Twilio account. Then set up an intent and the Lambda to send your message. Just a heads-up, my number is not 5555555555; switch it with your own number and start getting text messages from Alexa! Note, Twilio will issue you your own number; just make sure to add that as the from number.
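A sketch of that handler (the intent name is made up, the credentials come from environment variables, and both numbers are placeholders for your Twilio number and your own):
// Send a text message from an Alexa intent via Twilio
const twilio = require('twilio');
const client = twilio(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);

module.exports = {
  'SendTextIntent': function () {
    client.messages.create({
      body: 'Hello! Alexa says hi.',
      from: '+15555555555', // the number Twilio issued to you
      to: '+15555555555'    // your own number (again, mine is not 5555555555)
    })
    .then(() => this.emit(':tell', 'Okay, I sent your text message.'))
    .catch(() => this.emit(':tell', 'Sorry, something went wrong sending that text.'));
  }
};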
A bit boring, but essential
A predictable project structure is gonna give you peace of mind, especially when you’re working with other team members, but also just for your own sanity. I was inspired by the “react directory structure” for this project. It provided a simple structure to follow, and kept require paths nice and short (so no long “require” statements).
.
├── index.js
├── handlers
│   ├── root_handler
│   │   ├── constants.js
│   │   ├── handler.js
│   │   ├── index.js
│   │   └── prompts.js
│   └── other_handler
│       ├── constants.js
│       ├── handler.js
│       ├── index.js
│       └── prompts.js
├── utilities
│   ├── text_parsing.js
│   └── api_utils.js
└── constants
    └── app_constants.js
The code was heavy on static text, and I did what I could to pull that text out of the logic. Here, the prompts files hold all of the responses Alexa utters to the user, which keeps the handler logic fairly clean.
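A prompts file can be as simple as a map of named responses (the wording is illustrative), which the handler requires and emits by key:
// handlers/root_handler/prompts.js: all of Alexa's responses in one place
module.exports = {
  WELCOME: 'Welcome! Where would you like to go?',
  HELP: 'You can ask me how to get somewhere, for example: help me get to the store.',
  GOODBYE: 'Safe travels!'
};

// In handler.js: this.emit(':tell', prompts.GOODBYE);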
In closing
A skill is an app that uses the human voice as its user interface of choice. You can get creative with the logic in your Lambda and the VUI alike, pretty much the same as you would with any other app. The biggest difference is how much freedom you need to give the user so they can interact seamlessly with your skill. Instead of just having one button do one thing in a graphical UI, you’ll have a potentially large set of phrases, all to handle one thing.
Voice assistants are still a fairly new frontier in tech, but Amazon is pushing to grow its pool of skill developers, so check out their incentives. The learning curve is shallow, about a solid week before feeling comfortable, and the SDK is small and decently documented. Give it a shot!