Alexa Conversations: A Developer’s Opinion from re:MARS

Gillian Armstrong
LibertyIT
Jun 11, 2019

tl;dr: Alexa Conversations is a huge leap forward in giving Alexa more functionality and making it easier for Alexa users to have a more natural and seamless experience. From what has been shared, it would seem that Alexa is getting more conversational by shifting skills to be more transactional. The new format allows Alexa to access pieces of the functionality within those skills in a non-linear fashion, interweaving them with other skills. There are aspects of this that make things easier for developers; however, it could have consequences for an individual skill’s ability to control the experience or differentiate itself. Ultimately, however, it will be a better experience for Customers, and as developers that should always be our highest concern.

At the time of writing there are three places where those not in the preview can get information on Alexa Conversations and how it works:

  • The blog posts on the Alexa blogs
  • The demo video from the Keynote at re:MARS
  • The sessions given at re:MARS

I don’t work for Amazon, nor do I have any insider knowledge, so all the information I share below (along with my own opinions) comes from these. Hopefully there will be a lot more information shared in the coming months.

Part 1: The Customer Experience Demo

At the Amazon re:MARS keynote a video was shown demonstrating the new Alexa Conversations functionality. Below I break it down into sections and take a look at how the information is flowing through. I’ve captured the transcript at the top, added in a still from the video on the left, and added information about what I think is happening to the right of that. Any bolded input is given explicitly by the user, any underlined input is carried over from earlier in the conversation, and all other inputs are inferred from context.

We start off with the User asking for showtimes for the cinema. Alexa pulls up this information using the Atom Tickets skill without the user needing to explicitly invoke it by name.

The user selects one of the films, referring to it only by the time. Alexa infers the location and sends through the information to get a price for the movie.

The user is now concerned it might be too late when it finishes up, so checks how long the film is. Alexa is able to infer they are talking about the length of Dark Phoenix and retrieve that value without losing the context of the conversation.

The user now decides to check earlier show times. They don’t need to repeat the film name as Alexa infers they are still talking about Dark Phoenix.

This time they go ahead, select a time and book their tickets. This is all completed through the Atom Tickets skill. However, Alexa then asks “Will you be eating out?”. There is no gap in the vocal output of Alexa between the confirmation of booking tickets and asking about eating out. Since it’s likely that the booking response would come from Atom, and the proactive question about eating out from Alexa, this is very impressive. And very interesting.

When the user confirms they will be eating out, they give Alexa specific direction to find a nearby Chinese restaurant. It’s not clear what prompts would have been next if they had just said “yes”. Note that we have seamlessly moved to serving the content from Yelp.

Next the user asks for more information about a specific restaurant.

Although Alexa does not proactively offer to book at this point, it is smart enough to take the current context (looking at information for Mott 32, and a night out for 2 people) and move seamlessly to booking a table through the OpenTable skill.

Alexa then proactively asks if the user wants to book a cab. Ever the helpful user, they explicitly ask for an Uber, and Alexa uses the information it already has to book that without any need for additional input.

At that point Alexa is out of proactive suggestions — and simply asks “anything else?”. The User asks to see the trailer (showing context carried through again), and here the demo ends.

All in all a pretty impressive demo, although obviously showing a very ideal situation! It does, however, leave a few open questions:

  • If Alexa is being proactive how does it decide which skill to move to next? I assume we can expect it to be something similar to the Skills Arbitration that is currently taking place.
  • How long is context maintained? When does it get “reset”?
  • What conversations can Alexa handle, and what does a skill developer need to do to get their skill surfaced?

Part 2: The Developer Experience

At re:MARS there were two sessions about Alexa Conversations, “A23: Why ‘Alexa Conversations’ Matters To Customers and Developers” and “M13: Alexa Conversations: How Developers can build natural, extensible voice conversations”. I attended both, and the below information is gathered from those sessions.

In A23 the first thing that was talked about was what makes a Conversation Natural. The presenter called out four things:

  • Robustness (you can use a variety of ways to express the same semantic meaning)
  • Flexibility (non-linear dialog)
  • Contextual Dialog (remembering information given earlier in the conversation)
  • Proactive Dialog (anticipating the next action)

Alexa Conversations seeks to improve all four of these things for Alexa users.

They also called out that Alexa Conversations has the following benefits for skill developers:

  • Easier discovery for users
  • Fewer steps for users
  • A lot less code (Atom Tickets said that they had been able to reimplement with only a third of the code, moving from 5500 lines of code to only 1700)
  • A lot less modelling (Atom Tickets said they had gone from 800 Data Points to 13, though it’s not clear what exactly counts as a “Data Point”)

In A23 they showed the below picture:

In M13 they represented the developer’s piece like this:

The blog lays out what a skill developer needs to provide: “With Alexa Conversations, developers provide (1) application programming interfaces, or APIs, that provide access to their skills’ functionality; (2) a list of entities that the APIs can take as inputs, such as restaurant names or movie times; and (3) a handful of sample dialogs annotated to identify entities and actions and mapped to API calls. Alexa Conversations’ AI technology handles the rest.”

1. APIs

For each action a customer could take, a stateless API needs to be provided, no doubt so that it can be called at any point in the conversation. If you currently have a skill, it’s likely your APIs are not designed this way, since you currently have to handle Dialog Management yourself. It was mentioned that you may want to store some state on the backend yourself, but that no state will be passed to you.

No information was given on what format these APIs would need to take, or whether they could be direct calls against a Lambda or something more traditional like an HTTPS request.
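To make the idea concrete, below is a minimal sketch of what one such stateless action API might look like if it were exposed as a Python Lambda-style handler. This is entirely my own assumption: the name GetShowtimes, the input fields and the response shape are invented for illustration, and the real interface may look nothing like this. The point is simply that everything the call needs arrives as explicit inputs, and nothing is read from or written to session state.

```python
# A hypothetical stateless action API for a movie-ticketing skill, written as
# a Python Lambda-style handler. The name GetShowtimes, the input fields and
# the response shape are invented for illustration; Amazon has not published
# the real interface.

def query_showtimes_backend(location, date):
    """Stand-in for the skill's own data source; a real skill would call its
    backend service here."""
    return [{"movie": "Dark Phoenix", "showtimes": ["4:30 PM", "7:30 PM", "9:45 PM"]}]


def get_showtimes_handler(event, context):
    # Everything the call needs arrives as an explicit input.
    location = event["location"]  # e.g. "San Francisco"
    date = event["date"]          # e.g. "2019-06-08"

    # No Alexa session or dialog state is read or written here; anything the
    # skill wants to remember, it stores on its own backend.
    listings = query_showtimes_backend(location, date)

    # Return plain data. Turning it into speech is handled by an NLG template
    # (see the Sample Dialog section below), not by this function.
    return {"location": location, "date": date, "listings": listings}
```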

2. Entities

A key part of any transaction is collecting the right data. You need to clearly call out what entities each API requires. It would appear that Alexa may always collect these on your behalf. For instance, a quote from Atom Tickets said “Because Atom isn’t even brought into the conversation until Alexa already knows the movie, location, time and tickets, we can deliver stellar experience” [emphasis mine].

I can see this working really well with built-in slots, but I wonder if the experience changes for custom ones.
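Purely as a way to picture it, the “list of entities that the APIs can take as inputs” might boil down to something like the mapping sketched below. The API names carry on from my hypothetical sketch above, the AMAZON.* types are loose analogues of today’s built-in Alexa slot types, and “MovieTitle” and “RestaurantName” stand in for custom entities; no real schema for this has been shown.

```python
# A hypothetical mapping of each action API to the entities it needs. The API
# names carry on from the sketch above, the AMAZON.* types are loose analogues
# of existing built-in Alexa slot types, and "MovieTitle" / "RestaurantName"
# stand in for custom entities.
api_entities = {
    "GetShowtimes": {"location": "AMAZON.City", "date": "AMAZON.DATE"},
    "BookTickets":  {"movie": "MovieTitle", "showtime": "AMAZON.TIME",
                     "ticket_count": "AMAZON.NUMBER"},
    "ReserveTable": {"restaurant": "RestaurantName", "party_size": "AMAZON.NUMBER",
                     "time": "AMAZON.TIME"},
}
```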

3. Sample Dialog

In the sessions this is where they started: designing a sample dialog. This is great, as it puts the focus on the fact that this is a conversation right up front. Apparently you will only need 3–5 example dialogs for each of your key actions.

The sample dialog is then annotated by the developer to show how each of the responses from Alexa would be formed — a combination of an API call and an NLG template.
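The annotation format itself was not shown in detail, so the sketch below is my own guess at what an annotated dialog boils down to conceptually: user turns marked up with the entities they contain, and each Alexa turn pointing at the API call that produces its data and the NLG template that renders it. None of the structure or field names are real.

```python
# A conceptual sketch of one annotated sample dialog. The structure and field
# names are invented; the real annotation is done through Amazon's tooling.
sample_dialog = [
    {"user": "what's playing at the cinema tonight",
     "entities": {"date": "tonight"}},                   # location inferred from the device

    {"alexa": {"api": "GetShowtimes",                    # stateless API from earlier
               "inputs": ["location", "date"],
               "response_template": "ListShowtimes"}},   # NLG template (see below)

    {"user": "how much is the 7:30 showing",
     "entities": {"showtime": "7:30 PM"}},               # movie carried over from context

    {"alexa": {"api": "GetTicketPrice",
               "inputs": ["movie", "showtime", "location"],
               "response_template": "QuotePrice"}},
]
```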

Alexa then handles the rest of the annotation around the user requests. From the Alexa Science Team’s blog: “Based on the developer’s sample data, the system represents the examples using a formal language that specifies syntactic and semantic relationships between words of a dialog. The representations can be converted back into natural language in many different ways, automatically producing dialog variations that are one to two orders of magnitude larger than the developer-provided data. These are used to train a recurrent neural network for modeling dialog flow”.

We were assured the NLG templates would have a lot of power built in, being able to automatically handle plurals and other language complexities, and even allowing you to create one that is simply a single variable so you can send back the whole response from your API. We didn’t see any examples of the NLG templates, but I am assuming they will tie into the (still fairly new) APL (Alexa Presentation Language) to handle multi-modal interactions.
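Since no templates were shown, the snippet below is just my own stand-in, written as ordinary Python functions, for the behaviours described: filling in values from the API response, handling pluralisation, and the minimal “single variable” template that passes the API’s text straight through. The real templates will presumably be declarative rather than code.

```python
# Hand-rolled stand-ins for NLG templates, purely to illustrate the behaviours
# described above. The real templates were not shown and will presumably be
# declarative rather than Python functions.

def list_showtimes(movie, showtimes):
    # Fill in values from the API response and handle pluralisation.
    noun = "showtime" if len(showtimes) == 1 else "showtimes"
    return f"I found {len(showtimes)} {noun} for {movie}: {', '.join(showtimes)}."

def passthrough(api_response_text):
    # The simplest possible template: a single variable that returns whatever
    # text the skill's API already produced.
    return api_response_text

print(list_showtimes("Dark Phoenix", ["4:30 PM", "7:30 PM", "9:45 PM"]))
# -> I found 3 showtimes for Dark Phoenix: 4:30 PM, 7:30 PM, 9:45 PM.
```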

We got to see a quick glimpse of the visual editor that would be provided to make annotating the conversations easier.

I was curious as to how the new format would change how skills were built overall, but looking at the screenshot I noticed that all the other options down the left hand side were blurred out. Taking a closer look, it appears very likely that this isn’t a screenshot at all but a mock-up. Look at how the skill name is missing from the top and how none of the tabs along the top are selected. This means that the actual screen may change a lot before it gets into mainstream developers’ hands.

The following slide was shown to represent how much less overall work was needed by the developer. I think this is for the Atom skill as the lines of code match up.

If the NLG templates include APL, then I’m impressed with the 370 lines. APL is most kindly described as verbose: an amazing idea that quickly descends into something as anger-inducing as CSS.

It’s very clear that most of the reduction has come from shifting away from Intent Handling and instead mapping utterance structures to stateless API calls with templated (NLG) responses. However, it’s unclear whether there is any custom dialog handling now, or how much information is passed to the skill. Since the APIs need to be stateless, I wonder how much information you will be given on the current session, and how that will change the analytics you gather for your skill. Will you still be able to “hold the turn” within your skill, or will you always have to hand back to Alexa after each response?

For those of you with existing skills worried about the amount of work a rewrite will be, they did mention it would be possible to transition your skill over to the new format gradually, so you wouldn’t have to do it all at once. They did not say whether your skill would eventually need to be fully in the new format to remain available.

Conclusion

I’ve got to admit it’s all very clever. There is a real problem in the chatbot world of Conversational Design being forgotten or done poorly. This puts a focus on it, and shifts the power towards Alexa, so that skill developers no longer need to put as much thought into it. By taking control of Dialog Management, Context and State, Alexa can better shape the overall User Experience and Conversation.

Developers will need to shift their thought processes, and it will change the balance of power a developer has when someone is in their skill. However, it will undoubtedly be better for the Customer, and so I cannot begrudge anyone that.

They are actively looking for developers who have a good use case they would be able to release in the next few months to take part in the preview and work directly with the engineering teams to refine this. If that’s you, apply for the preview here; this is a pretty exciting opportunity.

I love conversations! Let me know what your thoughts are here, or over on Twitter @virtualgill.
