Alexa, open WhatCocktail 2/2
Exploring personalisation
In 2017 my team and I were asked to build a website with a chatbot, and I ended up creating a skill for Amazon Alexa.
The challenge was to build a conversational interface to help people make better cocktails at home. This is what I've learned and what I would have done differently with the knowledge I have now.
Look here for part one, with an intro to what Alexa is and how it works.
Adapting our workflow for voice
We formed a production team with Back End, Front End, Interaction Design and Copywriting.
My usual approach is: research, prototype, test and review. Then iterate.
- Research: as Design Lead I had to read all the available design documentation from Amazon, plus dig deep into chatbot design, conversational design, script writing and interactions with voice-activated devices.
- Prototype: the Alexa development platform is aimed mainly at developers: everything has to be written in JSON, and we had to build our API before we could properly prototype on the platform. So our initial prototype was mostly rehearsing conversations with people, to collect a set of utterances and understand how people would ask for cocktails (see the sketch after this list).
- Test: defining a test plan was particularly difficult because people are often unfamiliar with Alexa, or with voice-activated interfaces in general. So we had to rely on Amazon's internal testing and be prepared for a lot of introductions to voice-activated devices with our focus groups.
- Review: we had periodic reviews with stakeholders, the internal team and testing subjects, which helped us refine each feature very quickly. At the time of writing, this is still in progress.
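To make the prototyping step above a bit more concrete, here is a rough sketch of the kind of JSON interaction model Alexa expects, written out as a Python dict. The intent name, slot name and sample utterances are illustrative stand-ins, not what we shipped, and the exact schema Amazon accepts has changed over time.

```python
import json

# Illustrative only: a simplified interaction model with one custom intent,
# one slot, and the sample utterances we rehearsed with people.
interaction_model = {
    "intents": [
        {
            "name": "FindCocktailIntent",          # hypothetical intent name
            "slots": [{"name": "spirit", "type": "SPIRIT_TYPE"}],
            "samples": [
                "what can I make with {spirit}",
                "suggest a cocktail with {spirit}",
                "I have {spirit} what should I do",
            ],
        },
        {"name": "AMAZON.StopIntent", "slots": [], "samples": []},
    ]
}

# The platform ultimately consumes this as JSON.
print(json.dumps(interaction_model, indent=2))
```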
Tone of voice
On this kind of device, tone of voice and copywriting should clearly be treated as the main UI. Here are a few basic tips.
- Act out your scripts: also use props if necessary. You want to learn when to pause, when to emphasise, when to vary answers and when to be consistent in your script.
- Speak like your audience: this is pretty basic copywriting advice, but Alexa should speak with a tone of voice that resonates with your core audience. If you speak with them for research, also log the terms and tone they use.
- Take care of lexicon: if you have specific words with a peculiar pronunciation, or abbreviations, curate the way Alexa speaks them out. Keep in mind she may flag some lexicon as swearing, like rimming a glass, and bleep it out. Be ready for some synonym research.
- Make it accessible: voice-activated devices need to be designed for a broad audience, even though they are still early-adopter devices. Short, self-explanatory words and well-placed emphasis and pauses will make a lot of difference.
More is available in the Amazon documentation.
What we’ve done
To avoid long unnecessary conversations, one of our first solutions was to keep Alexa’s responses snappy and on point. There are moments when people listen, like when they get results and need to understand what to do and what they can pick, and moments when people don’t, for example when they are selecting which filters to use when refining their search.
Voice-activated devices shouldn't make your interaction longer and more difficult than a screen, otherwise they defeat their purpose.
After shortness, randomisation is your best friend: we got bored very quickly of listening to the same answers over and over, so we gave Alexa multiple sentences and randomised them to vary the answer to each intent. This may sound basic, but it makes a big difference to a returning customer.
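A minimal sketch of that randomisation, assuming a single intent with a handful of response variants (the copy below is made up, not our production script):

```python
import random

# Several ways to phrase the same answer for one intent, picked at random
# so a returning customer doesn't hear the exact same sentence every time.
WELCOME_VARIANTS = [
    "Welcome back. What are we mixing today?",
    "Hi again. Tell me a spirit and I'll find you a cocktail.",
    "Good to hear you. What do you have in your cabinet?",
]

def welcome_speech():
    return random.choice(WELCOME_VARIANTS)
```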
There is a learning curve with any product or service, but there is also a learning curve with new devices. Therefore people have to learn how to communicate with Alexa first, then learn how to communicate with the skill.
We accounted for this curve and tweak the way she answers after the 10th time you use our skill: by then we assume people know their way around the different functionalities, and testing confirmed it. Of course, help and hints will always be available.
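In code, that adjustment can be as simple as branching on a stored session count. The sketch below uses the 10-session threshold mentioned above; the in-memory dict stands in for our real database and the prompts are illustrative.

```python
# Hypothetical stand-in for our DB: user_id -> number of sessions opened.
session_counts = {}

EXPERIENCED_AFTER_SESSIONS = 10

def welcome_for(user_id):
    sessions = session_counts.get(user_id, 0) + 1
    session_counts[user_id] = sessions
    if sessions > EXPERIENCED_AFTER_SESSIONS:
        # Experienced users get a shorter, faster prompt.
        return "Welcome back. Spirit, flavour, or surprise me?"
    # New users still get the full guided welcome.
    return ("Welcome to WhatCocktail. You can ask for a drink by spirit, "
            "by flavour, or just say 'surprise me'. What would you like?")
```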
Alexa can be taught how to pronounce certain words: BBQ isn't automatically pronounced as "barbecue" or "B.B.Q."; by default it comes out as one word instead of three single letters. There are SSML markers to specify how certain words should be pronounced, or to give them a specific intonation.
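For example, the standard SSML tags below (both supported by Alexa) cover the two options for "BBQ": substitute a spoken alias, or spell the letters out. The copy is made up for illustration.

```python
# Hedged SSML example: <sub> swaps in a spoken alias, <say-as> spells out the letters.
ssml_response = (
    "<speak>"
    "Fire up the <sub alias='barbecue'>BBQ</sub>, "
    "or spell it out: <say-as interpret-as='spell-out'>BBQ</say-as>."
    "</speak>"
)
```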
A start on personalisation
There is no way to personalise the experience without retaining users' data, and Amazon is very strict about data. You won't know that Paul Taylor used the skill, you will only know that ID 389370 used it. But this didn't really stop us from personalising the experience. Even if we can't apply the usual market segmentations, from a UX perspective we could still serve a better experience every time a person accesses our skill.
How? Simply by saving in our DB how people use and interact with our skill. We know when they used it, which intents they launched (including Stop, which is our red light for problems in the flow), which items they viewed, how many times, and so on.
By retaining the number of sessions they opened, we were able to account for the learning curve I mentioned before and speed up the conversation when people were already familiar with the skill; for example, changing the welcome message and giving more accurate suggestions every time a customer came back.
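A minimal sketch of the kind of anonymous event we store per request. Amazon only exposes an opaque user ID, so that is all we keep; the field names and the in-memory list standing in for our database are illustrative.

```python
import datetime

events = []  # hypothetical stand-in for our DB table of usage events

def log_interaction(user_id, intent_name, viewed_item=None):
    events.append({
        "user_id": user_id,        # opaque Amazon-provided ID, never a real name
        "intent": intent_name,     # e.g. "AMAZON.StopIntent", our red light for flow problems
        "item": viewed_item,       # cocktail viewed, if any
        "timestamp": datetime.datetime.utcnow().isoformat(),
    })
```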
This is only the beginning: we want to push the technology, the market and personalisation further, not just within Alexa devices but across a whole ecosystem.
A focus on the Echo Show
We found out that there is a hard limit of 25 KB on the response payload Alexa can manage. This may seem like plenty, but when the Echo Show is involved, images and text are displayed as well, so you want to keep your information to a minimum; any extra data that won't be displayed immediately, any pre-load, will saturate the hard limit and throw an error.
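We ended up guarding against this ourselves. A minimal sketch of such a safeguard, measuring the serialised response before sending it (this is our own check, not an Amazon API):

```python
import json

MAX_RESPONSE_BYTES = 25 * 1024  # the hard limit described above

def check_response_size(response):
    """Fail fast, before Alexa rejects an oversized response."""
    size = len(json.dumps(response).encode("utf-8"))
    if size > MAX_RESPONSE_BYTES:
        raise ValueError(
            f"Response is {size} bytes, over the {MAX_RESPONSE_BYTES} byte limit: "
            "trim list items or image metadata before sending."
        )
```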
You will have very few templates to choose from: at the time of writing, there were about seven I could use, and some have known bugs that make them difficult to work with. On top of this, they don't always behave the way you would expect as a designer or as a customer.
Example: the ListTemplate2 template has multiple items, each with one centred image, a title and body copy. Because the image is centred within the text, if the image isn't square it will look misaligned. On top of this, if the text contains a particularly long word, the container of the whole tile grows, creating odd spacing between items.
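For context, this is roughly what a ListTemplate2 directive looks like in the response, written as a plain dict ready to be serialised into JSON. The structure follows the Display interface as we used it at the time; URLs and copy are placeholders and field names may differ in later SDK versions.

```python
# Sketch of a Display.RenderTemplate directive using ListTemplate2.
list_template_directive = {
    "type": "Display.RenderTemplate",
    "template": {
        "type": "ListTemplate2",
        "title": "Cocktails with gin",
        "listItems": [
            {
                "token": "negroni",
                # A square source image avoids the misalignment described above.
                "image": {"sources": [{"url": "https://example.com/negroni-square.png"}]},
                "textContent": {
                    "primaryText": {"type": "PlainText", "text": "Negroni"},
                    "secondaryText": {"type": "PlainText", "text": "Gin, Campari, sweet vermouth"},
                },
            }
        ],
    },
}
```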
The touch screen doesn't behave as you would expect a touch screen to behave in 2017. It's not a tablet: it doesn't zoom, you can't "pinch" it, it simply scrolls vertically or horizontally. One of the most surprising features is that it doesn't scroll both vertically and horizontally in the same template: you can either have a template that scrolls vertically (e.g. a text-and-image template with scrollable text) or one that scrolls horizontally (e.g. a list template with images and text). Trust me, your testers will be confused.
Amazon expects Alexa to stop speaking if the user touches the screen, but the session stays open and Alexa still expects a certain prompt from people, even though they have no way to know which one, since she stopped speaking. We challenged this as an unfriendly UX pattern.
When you are looking at a screen and the session is terminated, which happens automatically after 30 seconds, the screen returns to the Alexa home. This means that if you're in the middle of making a recipe, you will have to invoke the skill again, or keep touching the screen to keep it active. Considering the Echo Show was supposed to live in the kitchen, I find this highly unexpected behaviour.
Sure, Alexa is meant to be mainly voice-activated and she sends a card to your phone, but then what's the advantage of a screen I can control hands-free while I cook, if it turns off after 30 seconds?
The problems
Well, it turned out Alexa is much less flexible than it looks. Her ability to recognise what people say gets worse when ambient noise is high, and it also degrades when a slot has hundreds of possible values. Alexa will try to match what she hears to a possible slot value, and accents don't help her.
New technology means bugs, a lack of documentation, a lack of examples and a small development community around the technology. It also means people aren't familiar with it and get frustrated, just as a designer or developer would.
Sometimes there was simply no technical solution for a feature we needed. But, to be fair, Amazon couldn't foresee all the scenarios clients would come up with for Alexa. This created a lot of frustration, but it also made our creative muscles work very hard to find ways around a clearly immature technology.
Amazon's documentation is incomplete and far from exhaustive, the approach is very different from typical development, and even the code submissions and certification reviews don't really scrutinise the code or follow common software development practices.
This, I want to be clear, both is and isn't Amazon's responsibility. I do expect a development company to have a clear process and exhaustive documentation, but with the platform at such an early stage and the hardware itself essentially a prototype, we can't expect an easy win.
Why Alexa then?
Alexa still has, at the time of writing, the biggest market share, and Amazon is pioneering a lot of the features available. The marketplace is open and the documentation is available to everyone.
I personally prefer DialogFlow and Google Home, but when we started they just weren’t there yet. The day will come when we’re going to expand the ecosystem to multiple devices, platforms and interfaces.
It's a new and interesting field, with a lot of potential and not enough exploration yet, so problems should be considered part of your learning curve. With the expertise we have now, we would definitely approach the whole process differently. And that's why we are still refining the experience to make it as smooth and pleasant as we can.