Me: “Siri, skip to the next song please.”
Siri: “Playing ‘Next Please’ by Jan Garbarek from Spotify”
Me: “No no Siri stay on this album but play the next song on it”
Me: *bangs head repeatedly on countertop*
If you’ve used a personal assistant like Siri or Alexa, you’ve probably experienced something like the situation above. The user needs to retrace their steps, and the assistant can’t handle it. Machine learning experts often frame this as a problem of “context”, while software people might describe it in terms of the voice software’s “state management”. Both groups mean the same thing: voice software struggles to carry information from one interaction to the next, and so it stumbles on multi-step processes where earlier context is needed later on. Unfortunately for the PAs, human language is just such a process.
I prefer to view this challenge to audio software architecture through the lens of “ergodicity”, which, loosely speaking, concerns how many steps you can get through before disaster strikes. The example Nassim Taleb often uses is finding the average gain/loss across a hundred nights of playing cards in a casino. If you give a hundred people a hundred dollars each for the night, a few will go bust, a few will make $3,000, and most will land somewhere in between. If you then calculate the average result, you will find that the house has a slight edge and that the average loss was, say, $12. No player needs to know or care about any other player for this study to work; these hundred players don’t depend on each other in any way, and nothing one does affects the performance of the others.
However, if you instead give your favorite Uncle Leo ten thousand dollars and tell him to play a hundred nights of poker in a row at the casino, you will probably never complete the study, because he will most likely lose all his money long before night one hundred. Sooner or later, one evening he will go bust (much like a few people did in the first version), except this time that is the end: there are no more follow-up nights at the casino. For Uncle Leo, his performance on one night drastically affects all the following ones.
The first version of the casino problem uses “ensemble probability”, and it accounts for the challenge of ergodicity: a few players can go bust without affecting anyone else. Uncle Leo’s version uses “time probability”, and it does not account for ergodicity, since once Uncle Leo goes bust there is no money left for any of the hands that follow.
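The two versions of the casino study can be sketched as a quick simulation. This is an illustrative toy model, not anything from Taleb: the function names, the bet sizes, and the 49% win probability are all assumptions chosen just to give the house a slight edge.

```python
import random

def play_night(bankroll, rounds=100, bet=1, win_prob=0.49):
    """One night of even-money bets with a slight house edge.

    Returns the bankroll at the end of the night, or 0 on bust.
    (All parameters are made up for illustration.)
    """
    for _ in range(rounds):
        if bankroll < bet:
            return 0  # bust: can no longer cover the bet
        bankroll += bet if random.random() < win_prob else -bet
    return bankroll

def ensemble_average(players=10_000, stake=100):
    """Ensemble probability: many independent players, one night each.

    A few go bust, a few win big, and the average barely moves,
    because no player's result affects any other's.
    """
    results = [play_night(stake) for _ in range(players)]
    return sum(results) / players

def uncle_leo(stake=10_000, nights=100):
    """Time probability: one player, consecutive nights.

    One bust ends the entire sequence; there are no follow-up nights.
    Returns (nights survived, final bankroll).
    """
    bankroll = stake
    for night in range(1, nights + 1):
        bankroll = play_night(bankroll, rounds=500, bet=100)
        if bankroll == 0:
            return night, 0  # went bust; the study stops here
    return nights, bankroll
```

With these (assumed) numbers, the ensemble average stays close to the original $100 stake, while Uncle Leo's single time path almost always hits zero well before night one hundred: the ensemble average tells you nothing about his fate.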
Normal frontend software, such as the browser you are probably reading this article in right now, gives you many obvious ways to control your experience: go back to the last page, click the logo to return to the home page, start typing in the browser bar to search for a term, and so on. And if you screw something up while navigating a website, there’s a good chance a visual indicator like an “x” or a back button clearly tells you how to exit the problematic area. It’s rare to get “stuck” on a page of a well-designed website these days unless an actual technical failure is involved.
Audio software, such as Alexa’s skills or Siri’s commands, is much more like Uncle Leo: once one step goes wrong, everything goes wrong, because there is no clear way to fix it. The audio equivalents of the omnipresent escape hatches of traditional frontend software, such as “back”, “close”, or “quit”, are unclear in voice. Voice offers the user very low information density about how to actually use the thing: a website can show headlines, labels, and buttons, but a voice experience cannot show anything. If users haven’t memorized the commands, they may not even know the commands exist.
As a result, a lot of voice technology is either a) a poor experience, because it’s very easy to get stuck, or b) a good experience that only handles simple, single-step tasks. That is why it’s so hard to do something like scroll through a Spotify playlist by voice, as in the example above. Audio architecture is a fundamentally different beast from traditional user-facing frontend architecture, but there are ways to beat the ergodicity problem and design good voice software.
The first step is to keep it simple, with as few steps as possible; the second is to keep it intuitive. While all software should aspire to “simple”, audio software needs it more than most from the user’s perspective. Aggressively reduce user steps, minimize complications, and handhold the user much more than you might think you should. When doing anything complex, make voice prompts very specific, such as “Tell me the name of the song you want me to play, or name an artist and I will guide you through finding some of their best songs.”
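One way to give voice software the “back” and “quit” affordances it lacks is to check for a small set of global escape phrases at every step, before any step-specific parsing, and to remind the user of them whenever a step fails. The sketch below is entirely hypothetical (the `VoiceFlow` class and its phrase lists are assumptions, not any real assistant’s API), but it shows the state management idea: one misheard utterance re-prompts instead of stranding the user, and “go back” always works.

```python
class VoiceFlow:
    """A toy multi-step voice interaction with global escape commands."""

    def __init__(self, steps):
        self.steps = steps    # ordered list of (prompt, handler) pairs;
                              # each handler returns a value or None
        self.index = 0
        self.answers = []

    def prompt(self):
        if self.index >= len(self.steps):
            return "Done."
        return self.steps[self.index][0]

    def hear(self, utterance):
        text = utterance.strip().lower()
        # Global escapes are checked at EVERY step, before anything else,
        # so the user always has a way out.
        if text in ("quit", "cancel", "never mind"):
            self.index = len(self.steps)
            return "Okay, cancelled."
        if text in ("back", "go back"):
            if self.index > 0:
                self.index -= 1
                self.answers = self.answers[: self.index]
            return self.prompt()
        if text == "start over":
            self.index = 0
            self.answers = []
            return self.prompt()
        if self.index >= len(self.steps):
            return "Done."
        # Step-specific handling.
        prompt, handler = self.steps[self.index]
        result = handler(text)
        if result is None:
            # Didn't understand: re-prompt AND remind the user how to escape.
            return (f"Sorry, I didn't get that. {prompt} "
                    "(You can also say 'go back' or 'quit'.)")
        self.answers.append(result)
        self.index += 1
        return self.prompt()
```

A usage example, again with made-up steps: a flow that asks for an artist and then a song lets the user say “go back” after answering the first question and change their mind, instead of starting over from scratch.

```python
flow = VoiceFlow([
    ("Which artist?", lambda t: t or None),
    ("Which song?", lambda t: t or None),
])
flow.hear("jan garbarek")   # advances to "Which song?"
flow.hear("go back")        # returns to "Which artist?"
flow.hear("miles davis")
flow.hear("so what")        # flow is now done
```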
Intuitive is harder (it is, after all, quite subjective), but stick with the words and phrases you think most people would use, then test the hell out of them to be sure. If you make users perform multi-step actions, be very clear about what they need to do, and about what to do if something goes wrong. Have a bias towards wordiness while you are building out your software, then edit to streamline this content once your audio software is almost complete.