Talking with machines? Voice interfaces and conversation design
We — including Martin Porcheron and his PhD supervisors, Joel and Sarah — have been doing empirical studies of voice interfaces. Although Martin has also examined the use of Siri in informal social settings in the past, here I want to talk about our recent work looking at the Echo, a voice-based ‘personal assistant’ from Amazon that uses the Alexa Voice Service. (A much more in-depth presentation of this work can be found in our ACM CHI 2018 conference paper.)
This research obviously connects with broad interest in ‘conversational interfaces’ and ‘designing for conversation’. This encompasses voice UI but also things like chatbots. The use of the term ‘conversation’ has also been applied to situations that might previously have been referred to as ‘dialogues’. But here I want to just talk about voice.
We start with Amazon’s rather cringe-worthy commercial from 2014 in which they depicted a family using the (then new) Echo:
There’s a lot you could say about this depiction. It turns on the kinds of stereotypical depictions, limitations and erasures one might expect from a ‘vision’ (been there, done that). Amazon has since rowed back on this (by deleting the video and offering far more muted attempts), but in a way it remains useful for us because it’s not that different to the way voice interfaces are often presented and discussed.
The commercial offers the familiar confections that new technologies are often gilded with, i.e., naturalness and inhabitation. Devices like this are pitched as things which will naturally ‘understand’ humans, fit in ‘seamlessly’ to our lives, and come to ‘inhabit’ our social spaces. This is never the case, of course.
That’s how they’re sold to us. But what about how they are designed?
Current thinking for both voice-based interfaces and text-based chatbots seems to rely on conceptualising a conversation, and the design of conversations with machines, in terms of possible scripted conversational exchanges, and flowcharts or decision trees.
Obviously I am simplifying the process greatly. I understand why designers might be thinking about conversation in this way: it’s practical. But our research led us to be increasingly curious about what these supposed ‘conversations with machines’ actually look like.
Studying people using voice interfaces
I do studies that employ ethnomethodology and conversation analysis (EMCA) and our study on the Amazon Echo is no different. In short, doing EMCA studies means that we are interested in working out just what the practical, social accomplishments of people with technology really entail; finding that out means that we tend to do very detailed analyses of human (inter)action.
Martin has been leading this research, collecting hours of Amazon Echo use by deploying an Echo to five households for month-long periods. He captures what happens via a sensitive Conditional Voice Recorder (CVR) he built. This device records a minute before and after the Echo is in use. Members of the households could see when the CVR was recording and choose to turn it off with the press of a button. The broader data set contains hours and hours of captured fragments. Here we’ll just take one example.
We join a British family of four as they eat a meal together: Susan (mum), Carl (dad), Emma (daughter), and Liam (son). It happens to be Mother’s Day and the family is eating a takeaway, as per Susan’s preference. The family typically eat breakfast and dinner at the table together. The Echo deployed in their house as part of the study is sited on a sideboard near the dining table. They are about a week into the month-long deployment, so while they are familiar with it, they are also still exploring what it can do. Thus far they have often used the Echo for listening to music, but have also used Skills like Quiz Master (a trivia quiz) and Beat the Intro (which plays a few seconds of the start of a song, after which players must guess the song name).
The vignette presented here is quite long, but, I think, informative. It has been transcribed using a simplified version of Jefferson notation. We do transcription in this way because here it helps us show not just what is said (words), but also a bit more about how those things are said. This will become important later. In summary: pauses in seconds and fractions of a second are indicated in parentheses, e.g., ‘(1.5)’ is a 1.5 second gap; overlapped talk is indicated by square brackets ‘[’; micropauses which are less than 0.3s are indicated with ‘(.)’; inaudible bits of speech are shown in empty brackets ‘( )’; and double brackets indicate other kinds of things happening e.g., ‘((laughter))’.
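For readers who want to work with transcripts in this style, the conventions summarised above can be sketched as a small pattern matcher. This is only an illustration of the simplified notation as described here, not a tool used in the study; the function name `annotate` and the sample line are invented for the example.

```python
import re

# One alternative per convention described above. Double brackets are tried
# first so that "((laughter))" is not mistaken for an empty-bracket mark.
MARK_RE = re.compile(
    r"\(\((?P<comment>[^()]*)\)\)"       # ((laughter)): other things happening
    r"|\((?P<pause>\d+(?:\.\d+)?)\)"     # (1.5): timed pause in seconds
    r"|(?P<micropause>\(\.\))"           # (.): micropause, under 0.3s
    r"|(?P<inaudible>\(\s*\))"           # ( ): inaudible speech
    r"|(?P<overlap>\[)"                  # [: start of overlapped talk
)


def annotate(line):
    """Return (kind, value) pairs for each notation mark in a transcript line."""
    marks = []
    for match in MARK_RE.finditer(line):
        kind = match.lastgroup
        value = match.group(kind)
        if kind == "pause":
            value = float(value)  # keep timed pauses as numbers
        marks.append((kind, value))
    return marks


# Invented line for illustration only; it is not taken from the study data.
sample = "alexa (.) [play something (1.5) ((laughter)) ( )"
marks = annotate(sample)
```

Here `marks` lists the micropause, the overlap bracket, the 1.5 second pause, the ‘laughter’ comment, and the inaudible stretch, in the order they appear in the line.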
First, the most visible thing in this fragment is ‘failure’. We must significantly circumscribe the claim that “the problem of recognizing spoken input has largely been solved”. While individual dictation currently works well (note that dictation itself involves a whole host of quite particular speaking practices in order to get it working successfully), the same cannot be said for anything other than the most controlled of circumstances. The fragment confirms the caution we should feel about the simplistic claim that “human parity” transcription has been achieved. While we could criticise the way Alexa Skills have been designed, this is beside the point. In our study (and I note that these figures are most definitely provisional and in any case always subject to how you count ‘failure’) the Alexa logs alone indicated a 30% ‘failure’ rate across homes. Factoring in the parallel recordings from the CVR, which Martin designed to be much more sensitive to the “Alexa” wake-up word, we can also locate moments where the Echo didn’t detect the wake word at all. This raises the ‘failure’ rate to about 50%. In other words, for our study at least, about half the time participants couldn’t get something done with the device. The sense of ‘failure’ here is how participants themselves treated the outcome of the interaction with the Echo, rather than some measure that we externally impose.
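To make the relationship between the two rates concrete, here is a rough sketch of the arithmetic. It assumes that every undetected wake word counts as one extra failed attempt that is absent from the Alexa logs; that is one way of counting, not necessarily the one used in the study. The counts are hypothetical and chosen only to reproduce the approximate percentages quoted above.

```python
def failure_rate(logged_requests, logged_failures, missed_wake_words=0):
    """Share of attempts that failed.

    Assumption: each missed wake word is one additional failed attempt
    that never reached the device logs.
    """
    attempts = logged_requests + missed_wake_words
    if attempts == 0:
        raise ValueError("no attempts to count")
    return (logged_failures + missed_wake_words) / attempts


# Hypothetical counts, not study data.
logs_only = failure_rate(logged_requests=100, logged_failures=30)
with_cvr = failure_rate(logged_requests=100, logged_failures=30,
                        missed_wake_words=40)
```

Under this way of counting, moving from roughly 30% to roughly 50% implies about four missed wake words for every ten logged requests, which gives a sense of how much trouble never appears in the logs at all.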
That said, in a way we really aren’t that interested in whether things succeed or fail. It’s not our problem. We don’t have any stakes in the success or failure of voice interfaces. Further, this study is not out to bash Amazon about how ‘bad’ Alexa can be. Instead, as academic researchers, here we are interested in delving deeper into how participants in the study encountered and dealt with trouble. How they do this is quite revealing and, we hope, offers opportunities for conceptual development around notions of ‘conversational interfaces’ and the design of voice UI.
Before I go further, I need to briefly return to ethnomethodology and conversation analysis, so as to orient you, the reader, very cursorily to the kind of approach we employ when looking at our data. For us, a critical point is that talk is action. When we talk we are trying to get something ‘done’, ‘done’ together, with others. In our fragment this is things like ‘doing’ having a meal together, ‘doing’ playing a game together, ‘doing’ telling a joke, ‘doing’ parenting children, and so on. We put ‘doing’ in front of these things to underscore that we are interested in unpacking practically how someone, say, brings about the telling of a joke and how that telling is treated analytically by co-present others. In other words, we are not seeking to assess the joke itself from our position as researchers, but from the point of view of members of the setting.
Voice interface use is embedded into home life
Let’s focus on a portion of the larger vignette, which we’ll call Fragment 1:
Here Susan is trying to instruct Alexa amidst a range of other stuff being done:
- first, Susan offers a ‘preparatory account’ for the other family members (line 01), describing what action she is about to do: she would like to play Beat the Intro;
- Susan’s account here then occasions Liam’s initial “oh no” (line 02);
- nevertheless, Susan presses on with an initial instruction to Alexa: “beat the intro” (line 03);
- finally, Liam elaborates his ‘negative sentiment’ about the prospect of playing Beat the Intro together into a long, drawn-out “nooooo” (line 05).
About 0.6 seconds into Liam’s “nooooo”, Carl says “it’s mother’s day” (line 06). The placement of what Carl says, and the way it is employed during Liam’s drawn-out “nooooo”, suggests an attempt by Carl to get Liam’s compliance. This is both very broad, in that the reminder of it being Mother’s Day brings with it certain assumptions, and very specific, in that the timing of Carl’s utterance during Liam’s “nooooo” is used to align with what Susan is saying she is about to do (and then does) with the Echo. In other words, Carl is leveraging Liam’s normative obligations, and the rights that Susan is seen to have, particularly on Mother’s Day.
The point is that Susan’s instruction to Alexa here to “beat the intro” is not an isolated utterance but is thickly embedded amidst this conversation.
The second thing I would draw attention to is that on line 07, Susan then switches activity. She moves from addressing Alexa to addressing Liam, issuing a directive about him eating his food: “you need to keep on eating your orange stuff” (lines 07–08). Susan interleaves her interactional work with Alexa with such parenting activities.
Susan then returns again to addressing Alexa on line 11. This happens alongside the initiation of what turns into a collaborative joke between Carl, Emma, and Liam about Susan’s turn of phrase when she said “keep on eating your orange stuff”. Carl mentions “and your green stuff” “and your brown stuff”, then Emma has a go, “and the yellow stuff?”, and finally Liam completes this with “and the meat stuff”. This joke is interleaved between Susan’s attempts to get Alexa to initiate a game of Beat the Intro (lines 11 and 13).
What’s the point? A couple of things here.
Firstly, although our participants became highly sensitive to moments when someone was possibly about to address Alexa, the design of voice interfaces is predicated on one-at-a-time type interactions.
Secondly, real-world, complex, yet highly ordered multiactivity settings are the norm and remain a serious technical and design challenge for voice interfaces. This is the world that voice interfaces are going into.
Voice interface use is about request and response
Much of the data that Martin has collected suggests to us that thinking about interactions with voice interfaces as ‘conversation’ or ‘conversational’ can sometimes be misleading. We think that instead we can nuance this idea by more deeply considering the design of requests to, and particularly responses from, voice interfaces.
Let’s look at two ‘failures’ that go differently. First, here’s Fragment 2, which is a moment from elsewhere in this family’s meal where they are trying to invoke Alexa to start something they initially call a “family quiz”:
While they do eventually get the Quiz Master Skill opened some time later beyond the end of this fragment, I’m interested here in how Alexa responds, and how the family deals with trouble in the response.
Susan’s initial request to Alexa is an instruction, “set us a family quiz” (line 02). (Note a different kind of ‘preparatory action’ here, where Emma asks Susan to address the Echo, line 01.) Alexa’s response is “I can’t find the answer to the question I heard” (line 04). This response categorises Susan’s utterance as a question, not an instruction. Does this matter? Well, it’s an error message from Alexa but the problem is that it offers little in the way of ‘next actions’. By ‘next actions’ I mean the following. Conversation analysis offers strong evidence to suggest that when we talk, we are constantly working out how to make sure that our talk is sequentially organised. Sequentially organised means that one utterance follows another, and that present utterances set the stage for how future ones are heard / acted upon. This is what Harvey Sacks calls the machinery of interaction.
So what happens next? Emma has few places to go after this, so she repeats Susan’s instruction with a slight variation, “set a family quiz” (line 06). We see this kind of repetition and variation frequently when users are trying to deal with trouble in use. Alexa then responds with another similar miscategorisation, “I don’t have the answer to that question” (line 08). And then we get Liam with another variation that displays a recognition of the situation and transforms his attempt at an instruction to Alexa into something amusing: “Please set a family quiz” (line 10). Finally, there is another similar response from Alexa and another even more pared-down attempt from Carl, “Alexa, family quiz” (lines 13 and 16).
This is an example of collaborative repair by our participants. Collaboratively-produced, minutely varied repetitions of the request to “set a family quiz” seem to be closely aligned with the repeated unhelpfulness of the responses from Alexa.
In contrast, let’s look at an alternative way these kinds of designed requests / responses might play out. While this is also a ‘failure’, it turns out quite differently for our participants, I would argue. Here’s Fragment 3, a reminder from the family trying to play Beat the Intro:
How is this different? I want to draw attention to the response from Alexa (line 04) and what it lets Emma do next after having instructed Alexa to play Beat the Intro. The response from Alexa incorporates a transcription of what was heard, “b b intro”. Although it is actually a mistranscription in this case, the response nevertheless builds this transcription in and offers a candidate next action, i.e., to “hear a station”, formulated as a question, i.e., “right?”. The difference here is that this gives Emma a place to go, and she makes the next move, i.e., “no” (lines 05 and 07). The sequence then draws to a close with Alexa’s “alright” (line 09).
The point is that the response design here differs a lot from Fragment 2. Here, the response gives participants the interactional resources to move on, sequentially, to do the next action and to progress what they are trying to get done.
I think we can revise the concept of ‘conversational design’ by talking about sequentially organised ‘moves’ around request and response instead. I’d summarise this notion in the following way. Firstly, responses from Alexa are treated as resources by participants: resources for further action. So, responses like “interesting question” or “I didn’t understand the question” offer little purchase for further action. Secondly, it seems pretty important to consider how to explicitly design in those resources, and embed them in responses. Thirdly, responses enable certain kinds of possible next moves in the sequence but also shut down others. So it’s not necessarily about establishing ‘rapport’, ‘personality’, or some other abstract idea, perhaps, but instead concretely thinking about how responses enable progressivity for users.
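As a rough sketch of what ‘designing in resources’ could mean in practice, the contrast between the two kinds of response seen in Fragments 2 and 3 might be expressed as follows. Everything here is hypothetical: the `Response` type, the function names, and the exact wording of the confirming response are invented for illustration and do not describe how Alexa is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    text: str
    # Replies that the response makes relevant as a next move (hypothetical field).
    next_moves: list = field(default_factory=list)


def dead_end(heard):
    """A response like those in Fragment 2: it reports trouble but offers no next move."""
    return Response("I can't find the answer to the question I heard.")


def confirm_candidate(heard, candidate_action):
    """A response in the style of Fragment 3.

    It builds in what was heard and a candidate action, formulated as a
    question, so the user has somewhere to go next.
    """
    text = f"I heard '{heard}'. You want to {candidate_action}, right?"
    return Response(text, next_moves=["yes", "no"])


unhelpful = dead_end("set us a family quiz")
helpful = confirm_candidate("b b intro", "hear a station")
```

Even when the transcription is wrong, as in “b b intro”, the second response gives the user something to accept or reject, which is what lets the sequence progress.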
Two things. Firstly, we might worry that conversation is idiosyncratic, messy, and hard to understand. But, in line with what ethnomethodological and conversation analytic studies tell us about the organisation of social action in the world, talk is actually very highly organised, and indeed designedly so by conversationalists.
To illustrate this let’s go back to a portion of Fragment 3:
On initial inspection this seems thoroughly mundane. But let’s unpack what is happening here:
- Carl inserts a question for Emma mid-flight through her address to Alexa (line 02);
- Carl orients to the little micropause (line 01) that is presumably opened up as Emma checks that the light on top of the Echo is on, indicating it’s listening (we see this routinely in our data);
- Carl inserts his question at just the right moment;
- this then provides an opportunity for Emma to course correct her instruction;
- Emma suspends her talk precisely for the duration of Carl’s utterance;
- finally, Emma’s next move ultimately seems to reject this and carries on, incorporating “beat the intro” into her instruction.
The point is that this is a virtuoso demonstration of the fine-tuned organisation of talk that competent conversationalists routinely produce. For ethnomethodologists and conversation analysts, this is the stuff the social world is made of.
The second point I wanted to raise here is that looking at actual conversations around voice interfaces can be extremely revealing. I think it can move us beyond scripted dialogues. Looking at real instances of talk seems initially daunting but I think it can begin to nuance our thinking away from idealised ‘conversations’ to consider how talk is sequentially organised, continuously, by people, to mutually achieve concrete things — to get stuff done — with each other and in this case, with voice interfaces. It seems to me that this can be helpful for design.