Skillful Development for Amazon Alexa

In this post I want to share some of what we learned while developing our first skill for Amazon’s Alexa voice-assistant platform. This is not a tutorial; rather, I want to give you a sense of the challenges involved in developing Alexa skills.

If you have played around with Alexa skill development before, I would be curious to hear about your experiences!

Amazon’s little helpers

There are multiple ways to write code for an Alexa skill. As far as Amazon is concerned, you just need to provide a REST endpoint and set up your skill to point to it. As long as your API sends Alexa valid responses, it doesn’t matter which language or framework you used to write it.
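To make that concrete, here is a minimal sketch of a helper that builds a valid response object, based on the public Alexa response format (the speech text is made up); any endpoint returning JSON of this shape will satisfy Alexa:

```javascript
// Build a minimal Alexa-compatible response object.
// Any endpoint that returns JSON of this shape works, regardless of
// the language or framework used to produce it.
function buildResponse(speechText, endSession) {
  return {
    version: '1.0',
    response: {
      outputSpeech: {
        type: 'PlainText',
        text: speechText
      },
      shouldEndSession: endSession
    }
  };
}

const res = buildResponse('Hello from my skill!', true);
console.log(JSON.stringify(res, null, 2));
```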

One option for hosting your skill’s API is of course Amazon’s own AWS cloud, and that seems like a natural fit: when you develop an Alexa skill you are dependent on Amazon anyway, so why not go all the way?! You can then implement your skill by simply defining handlers for specific events, such as onIntent for when Alexa sends you a user intent.
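The event-handler approach boils down to one entry point that inspects the request type and forwards to the matching callback. This is a simplified, self-contained stand-in (the function names and return values are illustrative, not Amazon’s API):

```javascript
// Sketch of the "bare" approach: one Lambda-style entry point that
// inspects the request type and forwards to onLaunch / onIntent /
// onSessionEnded callbacks.
function onLaunch() { return 'Welcome!'; }
function onIntent(intent) { return `Handling ${intent.name}.`; }
function onSessionEnded() { return 'Bye.'; }

function handler(event) {
  switch (event.request.type) {
    case 'LaunchRequest': return onLaunch();
    case 'IntentRequest': return onIntent(event.request.intent);
    case 'SessionEndedRequest': return onSessionEnded();
    default: return 'Unknown request type.';
  }
}

console.log(handler({ request: { type: 'IntentRequest', intent: { name: 'AnswerIntent' } } }));
// → Handling AnswerIntent.
```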

Because this approach would likely involve writing the same boilerplate code over and over again, we opted to use Amazon’s own JavaScript SDK. The SDK — or, as Amazon mostly calls it, the Alexa Skills Kit (ASK) — lets you define intent handlers directly instead of reacting to individual events. It also gives you a couple of default response methods (like :tell and :ask) and some (rudimentary) form of state management. This all seems nice.
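The SDK’s programming model essentially amounts to a dispatch table from intent names to handler functions. The following is a simplified, self-contained sketch of that idea, not the actual alexa-sdk code; the :tell/:ask helpers here only mimic the shape of the SDK’s response methods:

```javascript
// Simplified sketch of the ASK programming model: a map from intent
// names to handler functions, plus :tell / :ask style response helpers.
// This mimics the SDK's shape; it is not the real alexa-sdk.
const responses = {
  ':tell': (speech) => ({ speech, shouldEndSession: true }),
  ':ask': (speech, reprompt) => ({ speech, reprompt, shouldEndSession: false })
};

const handlers = {
  'LaunchRequest': () => responses[':ask']('Welcome! Ready for the first question?', 'Are you ready?'),
  'AMAZON.StopIntent': () => responses[':tell']('Goodbye!')
};

function dispatch(intentName) {
  // Look up the handler for the intent; fall back to a generic reply.
  const handler = handlers[intentName];
  return handler ? handler() : responses[':tell']("Sorry, I can't do that.");
}

console.log(dispatch('LaunchRequest').speech);
// → Welcome! Ready for the first question?
```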

In our skill we ask the user a series of questions and use the answers to compute a result value. When we started to implement this, we were initially not happy with what the SDK gave us: for every step in the conversation we would need to set the state, define intents and slots, and then check the state in the actual handler code. Since we were not sure how tedious this might get (it turns out it’s actually not that tedious once you understand how all of this works), we looked for a library that could help us with that. And lo and behold: Amazon provides a special skill mode designed specifically for situations like ours, where a skill needs to collect some information from the user. This special mode is the “Dialog Model”, and the way it works is that your code passes control flow off to the model, which then asks the questions you defined in the skill builder.

I will talk more about the drawbacks of the skill builder in a moment, but what specifically sucks about the dialog model are limitations like this: once you use it in your skill, you can no longer ask the user yes/no questions at all! Our skill consists of more than just this one dialog, and there were features for which we did want to ask yes/no questions. Just not possible. This meant we could not use the dialog model and had to implement the conversation flow ourselves. That is not a big deal in and of itself, but why have a dialog model at all if it restricts skills so much?!
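For context, delegating to the dialog model essentially means returning a Dialog.Delegate directive instead of output speech. Roughly (this shape follows the Alexa dialog interface as I understand it), the response looks like this:

```json
{
  "version": "1.0",
  "response": {
    "directives": [
      { "type": "Dialog.Delegate" }
    ],
    "shouldEndSession": false
  }
}
```

Alexa then takes over and prompts the user for whatever slots the skill builder says are still missing.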

Do-it-your-state

Once we realized we couldn’t make our skill work without yes/no questions, we had to give up on the dialog model and build the conversation flow logic ourselves. The state management that the SDK gives you is fairly minimal and would have forced us into a somewhat verbose and unfocused style of code.

Because we wanted state management with enough syntactic sugar that we would no longer need to furiously copy-and-paste code together when adding states or individual intents, we decided to write a small helper library that offered exactly that. The main achievement of this library is that it lets you create state hierarchies (which are not necessarily reflected in the user’s interaction with the skill, only in the code), where states can inherit intent handlers from their parents or choose to override them.
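The core idea can be sketched in a few lines. This is an illustrative sketch of the concept, not our actual library; the state and intent names are made up:

```javascript
// Minimal sketch of hierarchical state management for intent handlers.
// A child state inherits its parent's handlers and can override them.
function createState(handlers, parent) {
  return { handlers, parent };
}

function resolveHandler(state, intentName) {
  // Walk up the state hierarchy until a handler for the intent is found.
  for (let s = state; s; s = s.parent) {
    if (s.handlers[intentName]) return s.handlers[intentName];
  }
  return null;
}

const rootState = createState({
  'AMAZON.HelpIntent': () => 'Here is some general help.',
  'AMAZON.StopIntent': () => 'Goodbye!'
});

const questionState = createState({
  'AnswerIntent': () => 'Thanks, next question...'
}, rootState);

// The question state handles its own intent...
console.log(resolveHandler(questionState, 'AnswerIntent')());
// ...and falls back to the root state for everything else.
console.log(resolveHandler(questionState, 'AMAZON.StopIntent')());
```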

This is not meant to advertise our specific solution. I just have the feeling that if you want to be at all serious about Alexa skill development, you will end up writing similar utility functions yourself.

But enough about the technology. The most fundamental conceptual challenge in creating our skill has been how to structure the user interaction. This may be due to our lack of experience in designing voice interfaces, but some really interesting issues come up:

Conversations inherently progress over time, so it is tempting to focus strongly on a specific sequence of questions and answers when designing a conversational interface. It turns out this is not the best experience for the user — at least based on my experience talking to our own skill.

I have not spent much time researching the topic of VUI design, so take the following thoughts with a grain of salt, but I came up with these guidelines:

  • You don’t want to restrict the user to saying or asking things in a specific order. Not only might the user want to use a different order, but such restrictions make it feel like you are not talking to an intelligent thing but to a script. You rather want to allow the user to jump around all the different statements or questions that you are prepared to deal with.
  • You don’t want to tell the user exactly what they are supposed to say. This makes your skill feel like a glorified phone answering machine: “For more information, press ‘1’.”
  • You don’t want to structure your skill in a strictly hierarchical way. Compared to a graphical interface, where the user can look at the menu structure to understand the information hierarchy of a site or app, voice interfaces are not really discoverable. If you don’t want to dictate what the user is supposed to say, they will sometimes have to guess. A conversational interface feels smart if the developer correctly anticipated what the user will say.

Designing a skill that does all this is not easy. One conclusion might be that Alexa skills just should not be “large” enough to be affected by these issues, but rather focused on doing one thing. But maybe not. Maybe it is just about figuring this stuff out. This can be a really exciting time in which developers (and/or UX designers?) get to learn about a whole new way of interacting with the user.

code > web

Probably my biggest pet peeve with the current development workflow for Alexa skills is the dependence on Amazon’s web interface. While deploying the actual code for your skill works just fine using AWS Lambda, not everything concerning your skill is contained in the code. Specifically, you have to build the interaction model, specifying all your skill’s intents and slots, using this web interface. Not only is the site kind of slow and the interface a bit confusing, but more importantly: you cannot automate any interaction with this skill builder, and there is no way around it!

You will typically have your code in git and deployable to Lambda via a script, but

  • it is not possible to version your interaction model or keep a history of its past revisions,
  • it is not possible to automatically deploy your interaction model from code, and
  • it is thus not possible to keep the code and the interaction model in sync.

Everything you do to approximate these things amounts to manually copy-pasting content from one place to another. This is not how you efficiently develop anything. I seriously do not know why Amazon does this, and — at least for me — it is a very big turn-off.
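What makes this especially frustrating is that the interaction model itself is just JSON, exactly the kind of artifact you would want in git next to your code. A fragment roughly approximating the skill builder’s JSON (the exact schema may differ; the invocation, intent and slot names are made up) looks like this:

```json
{
  "languageModel": {
    "invocationName": "my quiz",
    "intents": [
      { "name": "AMAZON.HelpIntent", "samples": [] },
      {
        "name": "AnswerIntent",
        "slots": [{ "name": "Answer", "type": "AMAZON.NUMBER" }],
        "samples": ["the answer is {Answer}", "{Answer}"]
      }
    ]
  }
}
```

There is nothing about this data that could not live in version control and be pushed to Amazon by a script.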

The same goes for testing. Amazon does not offer any library for unit-testing skills. We tested our skill either by talking directly to Alexa, which gets annoying pretty quickly and does not give you any meaningful information when something goes wrong, or by using Amazon’s web interface again. The Alexa developer portal includes a text-based skill tester: you write what you would say to Alexa, it generates a proper JSON request object, sends it to your endpoint and presents you with the JSON response. This is not bad, but when the endpoint fails, you just see a generic error message. To debug the code you need to go to where the code is actually executed: in our case, AWS Lambda. In Lambda, you can create test events for your functions, each consisting of a JSON request object that gets sent to the function. So in order to debug our code, we ended up copying the request we wanted from the Alexa testing form and pasting it into the Lambda test event configuration. This process is annoying as hell.

Ideally, I would like to do two things locally, on my machine:

  • While developing a new feature, things may not work at first, and I want to craft requests that specifically trigger the failing case so I can see what is wrong. For that I would need a local way (without the testing form on the website) to generate compatible JSON requests.
  • When changing something internal to the code without affecting the user interface (i.e. the conversation itself: the speech samples, responses or the conversation flow), I want to run a suite of test cases against my skill to verify I didn’t inadvertently break anything.

In both cases I would need a way to locally send requests to my skill and receive a response.
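Both wishes boil down to the same sketch: build an Alexa-shaped IntentRequest locally and feed it straight to a handler function, with no web form in between. The request shape follows the Alexa request format; the handler and intent behavior here are illustrative stand-ins:

```javascript
// Build an Alexa-style IntentRequest locally, so failing cases can be
// reproduced without the web-based testing form.
function buildIntentRequest(intentName, slots = {}) {
  return {
    version: '1.0',
    session: { new: false, sessionId: 'local-test-session' },
    request: {
      type: 'IntentRequest',
      requestId: 'local-test-request',
      intent: { name: intentName, slots }
    }
  };
}

// A stand-in for the skill's handler, returning a response object.
function handleRequest(event) {
  const name = event.request.intent.name;
  const text = name === 'AMAZON.HelpIntent' ? 'Help text.' : 'Unknown intent.';
  return {
    version: '1.0',
    response: {
      outputSpeech: { type: 'PlainText', text },
      shouldEndSession: name !== 'AMAZON.HelpIntent'
    }
  };
}

const req = buildIntentRequest('AMAZON.HelpIntent');
const res2 = handleRequest(req);
console.log(res2.response.outputSpeech.text); // → Help text.
```

A unit-test suite would then just be a list of such request/response pairs run with your test framework of choice.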

Now this is possible — I guess — if you build it yourself. But I expected more help from Amazon’s side. As it stands, these issues make developing a skill for Alexa kind of a pain!

The ugly truth

Apart from the limitations of the way Amazon wants you to create your skill in general, there were some major (presumably unintended) inconveniences that cost us time and a lot of nerves.

I just said that testing Alexa skills is a pain because you are forced to use the testing interface on Amazon’s website. To make matters worse, this interface occasionally just wouldn’t work. Sometimes it responded to a request with something like “The remote endpoint did not respond.”, although testing the Lambda function directly showed that it was working just fine and returning proper Alexa responses. And sometimes it could not even generate a request from the text input provided to it. These outages usually did not last very long, but they were enough to shake your confidence in your own skill actually working.

Speaking of outages: one weekend, the skill builder broke skills in such a way that after rebuilding the interaction model, your skill simply would not work anymore. It took Amazon around three days to fix that. This does not feel like a productive environment.

Above I talked about how the dialog model prohibits yes/no questions via the AMAZON.YesIntent and AMAZON.NoIntent. Well, I tried to use them anyway, just to see what would happen. And guess what: it permanently destroyed our skill! Not only did the skill not work with the AMAZON.YesIntent or AMAZON.NoIntent in place (which was not unexpected, because after all that’s what the documentation says), but even after removing these intents, the skill remained broken and unfixable. We had to set up a new skill and re-enter the whole interaction model in the skill builder. And if our skill had already been in production, we might have had to submit it to Amazon’s certification process again. Seriously, Amazon?!

In the end, it doesn’t even matter…

After complaining quite a lot, I do have to say that despite all the hassle we went through, we did manage to build the skill and publish it to the store. And since we do see value in voice assistants, we will continue to work on our skill.

Personally, I would like the technology and workflow to mature before I consider working with it again. And I do not say that lightly: I usually find that the value of a technology’s innovations outweighs many of the immaturities of an ecosystem in its early stages. But with Alexa I just cannot come to any conclusion other than dreading the time I spent working on it.

But: I also think that the fate and future of the whole area of voice assistants will not be driven, or even decided, by developers. What it really comes down to is whether these interfaces provide value to the user and whether users embrace them as a preferred interface (even if just for select use cases). If users’ demand for more and higher-quality voice-assisted interfaces grows, so will the pressure on developers as well as on platform owners like Amazon to deliver. I am optimistic that Amazon will improve the developer experience, but in the end, only time will :tell.

UPDATE: Since this article was written, Amazon has published the ASK CLI, which will potentially solve a lot of the issues I complained about.