When Robots Speak “Human”

Language considerations for designing Alexa Skills

Nikki Clark
thirteen23

--

I’ve learned quite a few interesting things while prepping materials for the SXSW workshop that Tom Hudson and I are running this year. After editing the data for 35 Alexa skills and listening to Alexa read each of their 1,600 lines aloud, I noticed some clear patterns in what type of language works best.

I was surprised to learn that language that works on the web can be virtually unusable in an audio interface. Some of this copy can be adapted with audio-friendly grammar patterns, but some problems have to be caught by a careful ear. In many cases, Alexa reads input very literally; if you miss a comma or mistype a letter, she will boldly push on through the mangled sentence, phonetically sounding out obvious typos. This means designers need to be very careful with the language they include in the interface, because even small mistakes are glaringly obvious when read aloud.

Sentence length

The first thing that became clear was that sentences sound really bizarre when the speaker doesn’t need to breathe. Commas usually serve as breath markers when a piece is read aloud, and even when they’re missing, a human reader will tend to pause in natural places anyway. Alexa won’t. To make her reading sound more “human,” break up sentences wherever possible. When writing for a voice interface, it’s better to err on the side of too many commas than too few.

Original: “Losses of managed honey bee colonies were 23.1 percent for the 2014–2015 winter but summer losses exceeded winter numbers for the first time, making annual losses for the year 42.1 percent.”

Voice UI: “Losses of managed honey bee colonies were 23.1 percent for the 2014–2015 winter. For the first time, summer losses exceeded winter numbers, making annual losses for the year 42.1 percent.”
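
If rewriting the copy isn’t an option, Alexa also supports explicit pause markup through SSML, which I’ll come back to at the end of this piece. As a rough sketch, with the pause length an arbitrary choice for illustration rather than a recommended value:

    <speak>
        Losses of managed honey bee colonies were 23.1 percent
        for the 2014 to 2015 winter.
        <!-- an explicit pause where a human reader would take a breath -->
        <break time="300ms"/>
        For the first time, summer losses exceeded winter numbers,
        making annual losses for the year 42.1 percent.
    </speak>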

It’s also likely that in a voice interface you’ll have to deal with short-term retention issues. Interface designers already face these frequently, for example when designing accommodations so that form field labels don’t disappear when a user clicks into a field. In an audio interface, retention problems mostly appear in long or convoluted sentences that introduce multiple ideas. Unfortunately, they’re also harder to track in an audio interface, because it’s easier for a user to become distracted or to multitask inefficiently. The best defense is to write in short, clear sentences with a single central theme. The second best defense is to test your interface’s effectiveness with real humans.

Numbers

It’s even easier to get lost in a long sentence when it contains multiple sets of numbers than in a similarly long sentence without them.

In a sentence like the following, it may be hard for a listener to remember which numbers are related:

“Of 160,000 e-mails and instant messenger conversations collected under Section 702 between 2009 and 2012, 90 percent of the communications the government had captured and retained were from online accounts not belonging to foreign surveillance targets.”

If you’re comparing two sets of numbers, try to place them close together in the sentence. If possible, put distance between dates and unrelated numbers. Revised, this copy could read:

“Between 2009 and 2012, 160,000 emails and instant messenger conversations were collected under Section 702. 90 percent of these communications were from online accounts not belonging to foreign surveillance targets.”

You should also consider rounding the numbers in your interface when possible. “More than 40%” may have just as much impact as “41.5%” in a sentence, and it’s much quicker to understand.

Also, be sure to drop unnecessary decimal points. Unless the decimal is making an intentional statement, 4.0% will be read as “four point zero percent,” and could be simplified to 4%.
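
SSML also gives you direct control over how Alexa interprets a number, via the say-as tag. A minimal sketch, with the surrounding copy invented for illustration:

    <speak>
        <!-- read as a single cardinal number: “twenty-three” -->
        Winter losses were <say-as interpret-as="cardinal">23</say-as> percent.
        <!-- read digit by digit: “seven, zero, two” -->
        The data was collected under Section
        <say-as interpret-as="digits">702</say-as>.
    </speak>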

Oxford commas

Oxford comma advocates may end up vindicated when we all ditch our phones for voice interfaces. While a minor grammatical issue, Oxford commas (otherwise known as serial commas) are common enough to warrant a discussion. Style guides differ on whether they advocate these commas, so both versions are technically correct in written language. But in an audio UI, a missing serial comma can make a sentence sound very awkward. Humans reading a series aloud will pause regardless of the final comma, but Alexa’s interpretation makes it sound like the last two items are grouped.

Original: “Almost half of the world’s timber and up to 70% of paper is consumed by Europe, United States and Japan alone.”

Voice UI: “Almost half of the world’s timber and up to 70% of paper is consumed by Europe, United States, and Japan.”

Other sloppy writing practices that will come back to digitally haunt you

There are a few additional things that might pass in a written piece but don’t translate well (or at all) to audio interfaces:

  1. Repeated words: You’ll need to be extra vigilant against these, as they are much more obvious when read aloud.
  2. Double hyphens: Typing “--” (two hyphens, no space) instead of an em dash “ — ” to indicate a pause is fairly commonplace, and some text editors even convert it into an em dash automatically. But note that Alexa reads it quickly, as a single hyphen, with no pause at all.
  3. “And/or”: It’s not unusual to see this particular wording in written or informal text conversations on the web, but if you use it for Alexa, it will be read literally as “…and or…” If you’d like Alexa to read the slash, you can enter it as “and / or”, but consider rephrasing to avoid it if possible.
  4. Caps ≠ spelled out: Writing LAX will cause Alexa to attempt a phonetic interpretation, but writing L.A.X. or using an SSML tag (sketched just below) will prompt her to spell it out.
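
For that last case, the spell-out tag might look like the sketch below, with the sentence itself invented for illustration:

    <speak>
        <!-- spoken letter by letter: “L. A. X.” -->
        The flight departs from
        <say-as interpret-as="spell-out">LAX</say-as>.
    </speak>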

It’s also worth noting some of the other cool features Amazon gives developers to control language. While we don’t have much control over Alexa’s inflection or rhythm, Amazon has given us wide-ranging control over pronunciation and interpretation. Developers can use SSML tags to spell out words by letter, add pauses, interpret numerical values in numerous ways, or even change pronunciation to match regional dialects or foreign words.
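
As a hedged sketch of a few of those tags together (the IPA string and pause length here are illustrative choices, not canonical values):

    <speak>
        <!-- override pronunciation with an IPA transcription -->
        You say tomato, I say
        <phoneme alphabet="ipa" ph="təˈmɑːtəʊ">tomato</phoneme>.
        <!-- insert a one-second dramatic pause -->
        <break time="1s"/>
        <!-- read 9/22/2011 as a month, day, and year -->
        The report came out on
        <say-as interpret-as="date" format="mdy">9/22/2011</say-as>.
    </speak>

In a skill response, markup like this goes in the outputSpeech field with its type set to SSML, wrapped in a single speak element.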

At the end of the day, while Alexa is surprisingly adept at interpreting most input, designers and developers can still do many small things to improve the experience for the end user. Designing for voice is a complex endeavor, but care with your language will get you a long way.

thirteen23 is a digital product studio in Austin, Texas. We are an award-winning team specializing in human-centered design. Together with our clients, we conceive, design, and develop intelligent software. Ready to build the next big thing? Let’s talk.

Find us on Facebook, Twitter, and Instagram or get in touch at thirteen23.com.
