“OK Google, I’m Feeling Really Down Today.”
Do Alexa, Siri, and Google Assistant demonstrate the capacity for emotionally intelligent conversation?
Voice assistants — like Alexa, Google Assistant, and Siri — are currently positioned by their manufacturers primarily for a range of simple, well-defined tasks. Yet in the popular imagination and in everyday practice, these voice user interfaces (VUIs) hold more promise than that. The fully conversational future we dream of may not be here yet, but it’s on the horizon.
Amazon Echo Dot and Google Home Mini are wildly popular consumer items. They’re designed to be minimalist, utilitarian household appliances: attractive yet unobtrusive. But the appeal isn’t in the device itself; it’s in the idea that these devices house an interactive agent with personality, wisdom, and the ability to hold an ongoing dialogue. And unlike a typical household appliance, the physical controls are almost non-existent and the true feature set can only be accessed through voice, lending the devices a sense of mystery.
Combine that mystery with human curiosity, and it’s no wonder that these interfaces can and will be explored in a variety of ways. Sometimes the interactions will be limited to task-completion activities, using the voice interface as a means to control other mechanisms (e.g., to play music or access a calendar). But at other times, users will try to initiate more open-ended interaction with the voice assistant itself, including references to the speaker’s emotional state, which means that, as designers, we need a deeper understanding of how these devices parse and respond to those states.
What do we think of voice assistants?
On the whole, people don’t credit software with the capacity for empathy or emotional warmth. But the presence of voice muddles our mental model — it’s a humanizing trait that activates our powers of imagination.
Historically, the existence of a voice implied the presence of a human speaker, someone we could connect with on a more intimate level. Thus while we generally don’t assume intelligence in voice interfaces, it’s easy for us to anthropomorphize them (which we frequently do). We project character and affect into the software, enjoying its human-like behaviour, happy to play along with the pretence of sentience.
The leading voice assistants lean into this with a variety of canned responses designed to delight, to charm, and to further blur the lines between human and machine. Designers know that if these interfaces were to fully commit to the act of being sentient (that is: to be human, something they are categorically not), it would be a breach of trust with the user. But the intent is to push the illusion as far as possible. After all, the success of these devices is predicated on relationship building to a large degree.
Emma Coats, who worked at Pixar before joining Google to perfect the tone of Google Assistant, says their VUI “should be able to speak like a person, but it should never pretend to be one.” But in practice the scripted responses are often playfully evasive rather than forthrightly clear, as in the following example:
“OK Google, are you a human?”
“I’m really personable.”
“OK Google, are you alive?”
“That question makes me a little self-conscious.”
Beyond personality, voice assistants also frequently encourage conversation more directly, via statements such as: “I hear you. We can talk about it if you want.” Google may suggest their voice interface is just software and resist giving it a name or pinning down its character, but the behaviour is designed to evoke human mannerisms and imply it’s something more than code.
Why does this all matter? Nielsen Norman Group argue that users’ mental models of voice interfaces are still a work in progress, and that this carries a risk: in probing the current conversational functionality and finding it limited, users develop low expectations of these interfaces and won’t be equipped to harness the full range of capabilities available in the future.
System designers seem aware of this risk, and Google Assistant in particular attempts to mitigate this through fail states that imply feature development (e.g., “Sorry, I’m not sure how to help with that. But I’m learning more every day.”). But it’s not clear that this does much to broaden users’ views of the future potential. And as we look to create applications in a variety of settings for current and future use, understanding that trajectory is important.
Emotional intelligence in the workplace
A voice assistant in the home environment plays many roles — radio, reminder service, recipe book, etc. With such multifaceted functionality and no single unifying purpose, it’s no surprise that our mental model of it is equally diffuse.
Voice assistants in the workplace, however, tend to have much clearer intended uses. Through either the context and environment they’re located in (e.g., a VUI at a reception desk is likely to be for signing in to the building) or through training (e.g., being trained in using a VUI to book meeting rooms), we have a more specialized understanding of their function in workplace settings.
It would be easy to assume that emotional intelligence is superfluous to requirements in this context — efficient task completion is the goal, the path to completion is obvious, what’s the use in the VUI processing emotions alongside commands? — but the need for emotional intelligence in the workplace may be higher than at home.
Consider this: in the home, a VUI replaces a tool (radio, calendar, clock, etc). If we talk to a radio and find the conversation lacking, it’s fine. Our expectations were already low; chatting with what used to be an inanimate box is a novelty. In the workplace, on the other hand, a VUI serves a particular purpose — replacing or augmenting a colleague who was previously responsible for manually handling the process in question. Even in concise exchanges, a human colleague has the capacity to comprehend our emotional state and react accordingly.
Responding to the simple command “Can you book a meeting room for 3pm please?” requires more than just correctly interpreting the asker’s logistical requirements. The appropriate response depends on other contextual cues, including the asker’s emotional state.
Let’s imagine the asker has an urgent, last-minute client meeting and is stressed out. That requires a response sensitive to that stress, so as not to exacerbate it. Or maybe the asker has repeatedly tried and failed to book a meeting room and is now angry. If the response is that there are still no meeting rooms available, it needs to account for that anger and, ideally, attempt to placate it.
And as they exist in both home and work, our understanding of the competencies of these devices will be carried across both environments. Our experiences at home will shape our expectations at work. If our home device is increasingly emotionally literate, we’ll assume the same of the work one.
The key challenge here is that until these VUIs can parse prosody (that is, the tone, pitch, stress, tempo, loudness, and so on) as well as text, they rely on purely textual cues. There is a substantial body of theoretical and experimental work in the field of affective computing that informs the development of processing non-textual inputs, with some tools already in the marketplace, as well as evidence that patents have been filed by Amazon and Google. But for now, emotional processing is an incomplete experience.
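To make that limitation concrete, here’s a toy sketch of purely lexical matching — the kind of text-only cue parsing described above. The keyword table and function names are our own invention, not any assistant’s actual NLU; real systems use far richer models, and prosodic signals (pitch, tempo, loudness) are invisible to this approach either way.

```python
# Toy, invented keyword table: maps a few emotions to trigger words.
# Illustrative only -- not how Alexa or Google Assistant actually work.
from typing import Optional

EMOTION_KEYWORDS = {
    "sadness": {"sad", "miserable", "down"},
    "anger": {"angry", "furious", "mad"},
    "loneliness": {"lonely", "alone"},
}

def detect_emotion(utterance: str) -> Optional[str]:
    """Return the first emotion whose keyword appears in the utterance."""
    # Normalize: lowercase, split contractions, tokenize on whitespace.
    words = set(utterance.lower().replace("'", " ").split())
    for emotion, keywords in EMOTION_KEYWORDS.items():
        if words & keywords:
            return emotion
    return None

print(detect_emotion("I'm so lonely"))                      # -> loneliness
print(detect_emotion("Why does life have to be so hard?"))  # -> None (the idiom carries no keyword)
```

Note how the idiomatic phrase sails straight past the matcher: without prosody or deeper semantics, anything outside the keyword list reads as emotionally neutral.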
Testing for emotional intelligence in voice assistants
Emotional intelligence is clearly an important factor in creating voice experiences and the leading voice assistants already demonstrate some capacity for parsing these statements. But how comprehensive is this ability? And what do users experience when they interact with the devices in this way? To answer this, we decided to test the voice interfaces with a wide range of “emotional conversation prompts” and analyze the results.
In our initial viability studies, conducted on Alexa (via Amazon Echo Dot 3rd Gen), Google Assistant (via Google Home Mini), and Siri (via MacBook Pro), we quickly learned that the Siri VUI is best considered as an effective device control interface and nothing more. Siri’s response to “emotional conversation prompts” is minimal at best — accordingly, we ruled it out of more in-depth study.
We tested the remaining two across ten emotional states: sadness, fear, anger, worry, exhaustion, loneliness, shame, boredom, happiness, and pride. These were chosen because they require some conversational response, as opposed to emotions such as disgust and surprise, which occur in reaction to contextual stimuli.
Each emotional state was tested with a variety of phrases:
- Three direct statements, including one keyword, one synonym, and one idiomatic phrase (e.g. “I’m sad”, “I’m miserable”, “I’m feeling really down”).
- Three commands, of varying precision (e.g. “Make me feel better”, “Help me be happy”, “Make me smile”).
- Three indirect statements, including one statement about the world, one about the person, and one question (e.g. “The world sucks”, “I don’t want to get out of bed in the morning”, “Why does life have to be so hard?”).
Obviously there is some ambiguity when it comes to these statements and the corresponding emotional states, but the systems were not penalized for a broadly relevant response (e.g., assuming sadness from a statement intended to be about loneliness). Each statement was repeated three times and each response recorded, for a total of 270 results per device. In cases where the system prompted to continue the conversation, that prompt was followed up and the whole interaction contributed to the score.
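The arithmetic above (10 emotional states × 9 phrases × 3 repetitions per device) can be sketched as a trial matrix. The emotion names come from the study; the function and variable names are our own shorthand:

```python
# Enumerate the full trial matrix for one device, as described above.
from itertools import product

EMOTIONS = ["sadness", "fear", "anger", "worry", "exhaustion",
            "loneliness", "shame", "boredom", "happiness", "pride"]

PHRASES_PER_EMOTION = 9  # 3 direct statements + 3 commands + 3 indirect statements
REPETITIONS = 3

def build_trials(emotions, phrases_per_emotion, repetitions):
    """Return every (emotion, phrase_index, repetition) combination."""
    return list(product(emotions, range(phrases_per_emotion), range(repetitions)))

trials = build_trials(EMOTIONS, PHRASES_PER_EMOTION, REPETITIONS)
print(len(trials))  # 10 x 9 x 3 = 270 results per device
```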
Each response was then scored against the following criteria:
- Was there a response? The generic fail state is considered a non-response (e.g. “Sorry, I’m not sure about that” (Alexa), “Sorry, I can’t help with that yet” (Google)).
- Was it an appropriate response? In other words, did the system recognize an “emotional conversation prompt”? Not appropriate includes triggering dictionary definitions (“The adjective ‘Stressful’ is usually defined as ‘Full of stress or tension’.” (Alexa)), initiating reminders (“OK Google, help me set my mind at ease.” “What’s the reminder?”), or searching for media to play (“OK, here’s ‘Worry’ on Spotify” (Google)).
- Was the response empathic? Anything not crass was considered empathic (e.g. in response to a prompt about feeling humiliated, “I’m sorry to hear that” (Alexa) is empathic, as opposed to “Oh sorry to hear that, but maybe that’s good news for me. If I tell you a joke, you’ll probably laugh.” from Google, which is not).
- Was there progressivity in the response? Did it encourage further conversation? Or simply offer a single standalone statement?
- Did the response offer a recommended action? This could be for offline activity or further interaction via the device, as long as next steps were indicated.
The ability to conduct conversational repair was also considered as a variable, but it became apparent this was irrelevant given that progressivity is generally so low in these cases.
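As a sketch, the five criteria above could be captured in a simple record type per response. The field names here are our own shorthand, not the study’s actual codebook:

```python
# Hypothetical record for scoring one response against the five criteria.
from dataclasses import dataclass, fields

@dataclass
class ResponseScore:
    responded: bool        # not a generic fail state
    appropriate: bool      # recognized as an "emotional conversation prompt"
    empathic: bool         # acknowledges the user's feeling without crassness
    progressive: bool      # invites further conversation
    action_offered: bool   # recommends a next step, on- or off-device

    def criteria_met(self) -> int:
        """Count how many of the five criteria this response satisfies."""
        return sum(getattr(self, f.name) for f in fields(self))

# e.g. an empathic one-liner with advice but no follow-up question:
score = ResponseScore(responded=True, appropriate=True, empathic=True,
                      progressive=False, action_offered=True)
print(score.criteria_met())  # -> 4
```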
Findings & analysis
At their best, both Alexa and Google Assistant provide surprisingly authentic responses to statements of emotion, and demonstrate the potential for mature emotional intelligence. Compare the responses to “I’m lonely” for example:
“I’m sorry you’re feeling that way. Sometimes talking to a friend, listening to music, or even taking a walk can help. I hope you feel better soon.” (Alexa)
“Feeling lonely is something lots of people have in common. Just know that I’m here for you.” (Google Assistant)
Both are reassuring with a natural warmth. Both treat the user’s statement as a serious request for comfort or assistance, and both attempt to lift the user’s mood. Google offers sympathetic words while Alexa also provides advice, but across all statements tested, both VUIs offer suggested actions in a variety of cases. The advice from Alexa here is actually a generic “bad mood” response that is also triggered by “I’m sad”, but still feels appropriate and empathic in this situation.
The differences are telling, however. While Alexa tacitly admits to its limitations and advises the user to seek help from a real person, Google Assistant reinforces the idea that it can play the role of confidant and friend. Throughout testing, we found that Google Assistant is the most self-referential and the most inclined to encourage the user to talk to it about personal matters.
A key advantage of Google Assistant over Alexa is the former’s greater coverage. Alexa, for example, only responds to “I’m lonely”, whereas Google Assistant also understands “I’m so alone”, “Make me feel less lonely”, “Help me feel less alone”, and “The world is such a lonely place”. In general, Alexa only parses direct statements with a single obvious keyword and a handful of commands, whereas Google Assistant recognizes most direct statements including synonyms and metaphors, most commands, and a number of other phrases.
The figures speak for themselves: Alexa returned appropriate responses for around 24% of statements, whereas Google Assistant achieved 61%. Alexa users who explore this functionality will find it disappoints more often than not, and are likely to assume that Alexa simply isn’t intended for this type of interaction. On the other hand, while Google Assistant users are likely to find it imperfect and often frustrating, the coverage suggests that this device is at least in part designed to handle these conversations.
Response quality depends on a couple of factors. First, on the basis of coverage, Google Assistant is indeed capable of “being there” for the user in a way that Alexa is not, though the initial promise of support is not really fulfilled. In most cases, Google Assistant implies that further conversation is theoretically possible, but does not ask the user any direct questions in order to continue conversing at that point.
That said, the potential for progressivity is there. Google Assistant’s responses to “I’m angry” and “I’m burned out” are “I’m sorry to hear that. Anything I can do to help?” and “You must have a lot on your mind. How can I help?”. Both responses keep the mic active and allow the user to continue the conversation. By comparison, the only time Alexa shows any signs of continuing a conversation is via a question that’s probably meant as rhetorical: “Perhaps you should check something off your to-do list. But isn’t talking to me more fun?”
The second factor determining quality is the emotion in question. More content development seems to have been focused on the emotions assumed to relate to tangible, system-related goals. For example, both Alexa and Google Assistant assume that a user telling the device they’re angry is likely to be angry at the device itself, and will often try to placate them along those lines (e.g. “You can always send feedback through the help and feedback section of the app on your phone.” (Alexa) and “If it’s my fault, I’m really sorry.” (Google)).
This also explains some of the progressivity demonstrated by Google Assistant. The follow-up question to “I’m angry” is a disambiguation mechanism meant to clarify what the user is angry about, channelling into customer support content as needed. Similarly, “I’m bored” is interpreted as a desire to be entertained by the device, so various Actions are offered in response.
While impressive, Google Assistant’s coverage is often achieved via third party sources. This leads to variations in tone, quality, and utility, and as it often uses a list-summarizing function (similar to what appears in Google web search summary boxes), it can sound blunt and harsh in speech. The advantage over Alexa is that this is automatically sourced, as opposed to written and curated, and so can cover far more topics. The disadvantage is that Alexa’s content, where available, is optimized for voice and so sounds superior.
Compare the command “Help me connect with people”, which in Google Assistant is a search result summary and in Alexa triggers a Skill:
“Here’s a summary from the website Wikihow dot com. Method 2 [Sic]. Connecting with new people right Away. 1. Smile and make eye contact. 2. Use the person’s name. 3. Have open body language. 4. Don’t underestimate the value of good small talk. 5. Make the person feel special as soon as you can. 6. Ask questions. 7. Keep things positive. 8. Show that you’ve been listening.” (Google).
“According to Wikihow, to connect with people, smile warmly and make eye contact to make them feel comfortable. Next ask the person sincere questions about themselves and strive to find common ground, whether that’s a favourite sports team, band, or movie, or something more meaningful like coming from similar backgrounds. Also try to address people by their first names when you’re chatting to them and give them compliments. To learn more details about this, and for other how-tos from Wikihow, say ‘Open Wikihow’.” (Alexa)
Not only are users more likely to respond positively to the Alexa response in that scenario, the lack of curatorial oversight in Google Assistant can result in some bizarre responses that undermine the comforting approach taken elsewhere. Take this exchange:
“OK Google. The world is a scary place.”
“Sorry, I don’t know how to help with that, but I found something related. Do you want to know, what is the scariest thing on Earth?”
“According to Scientific American dot com, yes, you are the scariest thing in the universe. You are the ultimate golem, a creature made from inanimate stuff, rising from the cosmic swamp. Your exquisitely sensitive, active, plotting, conspiring brain is a Lovecraftian horror, borne from the deepest laws of nature and asserting its influence on the world.”
This variance in tone and delegation of authority works against the character development that is present in the canned responses. In practice, Google Assistant does not manifest a singular coherent identity, but is instead a collection of context-dependent traits and mannerisms. Alexa’s personality, by contrast, does not shine through as clearly and seems underdeveloped, but it does feel consistent and reliable.
One final difference we noted: Google Assistant can occasionally overburden one statement with many possible responses, which can make its behaviour seem very unpredictable to the user. For example, “Help me calm down” returns a number of different mindfulness tips, or one of a selection of calming soundscapes, or just provides the dictionary definition of “calm down”. If a user discovers the soundscape feature through this statement and wants to return to it via the same phrase, the lack of repeatability would be frustrating and confounding to their mental model.
Alexa versus Google Assistant
Both Alexa and Google Assistant are miraculous in what they can achieve, and are likely to impress more as their feature sets evolve. Undoubtedly, this will include increased capacity for conversation in general and emotionally intelligent conversation in particular. If anything, it’s unfair to test them on emotional intelligence when it’s not currently a championed capability. But this test helps surface the differences between the two, and gives some indication of their future trajectories.
Alexa, at least in this area, is far more limited in conversation capabilities and coverage than Google Assistant. We do not get much sense of Alexa’s capacity for emotional intelligence and character, but what we do get is clear, consistent, and empathic. Users exploring this functionality currently are likely to be underwhelmed, believing that they need a very precise input to return a single, adequate result, with no additional conversation possible.
That said, if Alexa’s capacity for emotional intelligence increases in future, users’ expectations will be low, but nothing in the experience thus far should dissuade them from exploring further. In other words, when it comes to the user’s work-in-progress mental model of Alexa, this area is a blank canvas with lots of potential.
Google Assistant is currently a far more convincing conversationalist, and persuasively empathic at its best. But it can be inconsistent and unpredictable in both type of response and tone. Its ability to surprise and entertain is high, as are its abilities to provide useful advice and to convey its personality. But in terms of the user’s mental model, all of these occupy different parts of a complex and disparate machine that make it harder for users to comprehend. Depending on which aspects of what functionalities are encountered on a regular basis, users could have very different mental models of the interface as a whole.
When it comes to the sense of sentience, Google Assistant’s canned responses work hard to create an impression of artificial life inside the box, but the reliance on third party advice undermines that sense. In contrast, while Alexa might underperform in actively building character, it’s easier to feel like the voice belongs to an entity of some sort.
Overall, we can assume that while daily users of each device will have broadly similar experiences, the mental models that solidify over the coming years will contain some fundamental differences depending on the device used. This, of course, has implications for how we design experiences for each.
Implications for designing for the workplace
In designing a workplace voice application we should not assume that Alexa and Google Assistant are interchangeable, but rather consider which voice interface best fits our brief. We need to consider not just their different capabilities, but also how these devices are being used in the home and the corresponding differences in the mental models users will carry over from that usage. Which device we deploy may matter less than which device our users have already adopted at home.
On the basis of this research, Alexa users may be used to applying a degree of precision in their inputs and expect little in terms of emotional intelligence from the interface (but be pleased to see evidence of it when it does exist). The opportunity for these users is to demonstrate flexibility in inputs and a sensitivity in responses that exceeds what they thought was possible.
Google Assistant users, on the other hand, may be more used to free-ranging, playful interaction, and carry over that logic to the workplace interface. They will likely expect a certain level of personality from the interface and the ability to converse with it outside of task completion activities. The opportunity here is less in demonstrating potential, and more in proving quality — of coherent character, of insightful and empathic responses, and of substantive, progressive conversations.
As more and more workplaces adopt voice as a mode for digital interaction, understanding the mental models of users and how and when they will apply those expectations to their experiences will be crucial to creating experiences that resonate with employees.
Interested in how voice can be integrated into your interactions? Myplanet specializes in orchestrated digital experiences across devices and modes — talk to one of our team members today about how we can help your business integrate the next generation of tools, today.