Tama: A Gaze Aware Smart Speaker
The current generation of speech agents and smart speakers is a long way from the disembodied, seemingly omniscient intelligences popular in fiction.
While the interactions depicted with Star Trek’s ‘Computer’ or Samantha in the movie Her seem to have influenced those designing and building these systems, the interactions we have with Alexa, Siri, Cortana or the Google Assistant are ‘miles wide but inches deep’ [1].
This becomes obvious as soon as they are compared with the conversations we have with other people. Some smart speakers have recently started to react to people coming close to the device [2], but gaze, facial expressions, bodily orientation, body torque, gestures, and backchannels are all ignored.
Current systems are very much embodied in the hardware that provides the microphones and speakers for us to interact with them, even if the interaction tends towards presenting the agent as a disembodied voice. The simplest example of this is that if you are too far away from your smart speaker, or your phone is muffled in your bag or pocket, no amount of invoking the assistant with its ‘wake word’ — usually ‘Alexa’, ‘Ok Google’, or ‘Hey Siri’ — will be successful.
But we have the opportunity to take advantage of this situated physical manifestation, even if most of the system runs in the cloud in datacentres. In some ways, commercial agents are moving towards this. Televisions with Alexa built in provide command and control functions tied directly to that television set, in much the same way that agents embedded in cars provide access to functions that would not be available to an agent accessed through a phone that just happens to be in the car — but these focus on providing functionality tied to that specific device rather than on the interaction with the agent itself. The fact that the agent has a ‘body’ to focus our interaction towards allows us to revisit some of the parts of human-to-human conversational practice that are currently being ignored.
In our work we have started with eye gaze. Our prototype smart speaker Tama, built as part of a collaboration between Stockholm University, Sweden and the University of Tsukuba, Japan, knows when you are looking at it — and has the ability to look back.
As an initial change in the interaction, this lets us do away with the wake word. No longer do our users have to repeat ‘Ok, Google’ before querying the system — when testing Tama they were able to simply make eye contact with the device and, once it looked back at them, ask their question of the Google Assistant service.
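As a rough illustration of this interaction flow, the sketch below replaces the wake word with a check for sustained eye contact. It is not Tama’s actual code: the gaze detector, the ‘look back’ feedback, and the assistant call are all hypothetical stubs, and the half-second gaze threshold is an assumed value.

```python
import time

# All functions below are hypothetical placeholders standing in for the real
# camera, eye actuation, and Google Assistant components of a device like Tama.

GAZE_HOLD_SECONDS = 0.5   # how long mutual gaze must be held before activating (assumed)
FRAME_INTERVAL = 0.03     # roughly one camera frame at ~30 fps (assumed)

def capture_frame():
    """Grab a frame from the device's camera (stub)."""
    return None

def user_is_looking(frame):
    """Return True if the gaze detector judges the user to be looking at the device (stub)."""
    return False

def look_back_at_user():
    """Turn the device's 'eyes' towards the user as feedback that it is attending (stub)."""
    print("Tama looks back")

def send_query_to_assistant():
    """Open the microphone and stream the user's speech to the assistant service (stub)."""
    print("Listening for a query...")

def main():
    # Instead of waiting for a wake word, wait for sustained eye contact.
    gaze_started = None
    while True:
        frame = capture_frame()
        if user_is_looking(frame):
            if gaze_started is None:
                gaze_started = time.time()
            elif time.time() - gaze_started >= GAZE_HOLD_SECONDS:
                look_back_at_user()
                send_query_to_assistant()
                gaze_started = None
        else:
            gaze_started = None
        time.sleep(FRAME_INTERVAL)

if __name__ == "__main__":
    main()
```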
This draws on findings from the field of Conversation Analysis, which points out that the use of someone’s name in conversation is an unusual, and meaning-laden, act. Continually having to say ‘Hey, Siri!’ or ‘Alexa!’ to interact with a computer system not only feels unnatural but, when that system use is interwoven with conversation and activities involving others, has the potential to disrupt the natural ebb and flow of social interaction.
One way that we evaluated the use of Tama was to time the interaction. We took the time between the user starting a query and the system speaking its answer as an indication of how much trouble the user had during that interaction. What we found was an interesting split in the times. On average, using the wake word was faster than using the gaze interaction (7.8 seconds vs 10.9), yet when we looked only at interactions that didn’t have problems this gap narrowed to a tenth of a second (7.4 seconds vs 7.5). This suggests that when the gaze interaction worked, it worked just as well as the wake word.
The problems resulted from a mixture of technical and interactional factors, but all of them pointed to interesting ways in which our participants interacted with Tama.
The two most prevalent were times when the participants struggled to catch the eye of the robot and begin the query, and times when the query failed and they had to repeat it — sometimes more than once. To better understand what these problems meant for the interactions we carried out a qualitative analysis of the videos of the trials, drawing on conversation analysis and ethnomethodology.
Catching the eye of Tama is slightly more difficult than doing so with a human conversational partner. The two eye-gaze detection modules built into Tama are sensitive to lighting conditions, to hair styles or clothing that confuse the contour of the head, and to tilting of the head that moves the eyes out of alignment with each other. What this exposed, though, was the speed and fluency with which our participants learned to articulate their eye gaze to interact with the system. Participants whom the cameras struggled to detect shifted and tilted until they understood how best to convince Tama to make eye contact with them, while participants whom the cameras detected without problems learned to avert their gaze to allow their partner to use the system, only looking towards the device after their partner had completed their query or when they wanted to ask one of their own.
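To illustrate why lighting, hair, and head tilt cause this kind of trouble, here is a minimal mutual-gaze check of the sort such modules might apply. This is an assumed sketch rather than Tama’s detection code, and both thresholds are invented for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class EyeEstimate:
    """One detector's estimate for a single eye (hypothetical format):
    pixel position of the pupil and gaze angle relative to the camera axis, in degrees."""
    x: float
    y: float
    angle_to_camera: float

MAX_GAZE_ANGLE = 10.0   # degrees off the camera axis still counted as looking at the device (assumed)
MAX_EYE_ROLL = 15.0     # degrees of head roll before the eyes are 'out of alignment' (assumed)

def mutual_gaze(left: Optional[EyeEstimate], right: Optional[EyeEstimate]) -> bool:
    # Poor lighting or hair over the face can make a detector return no eye at all.
    if left is None or right is None:
        return False
    # A tilted head rolls the line between the eyes; reject estimates beyond the threshold.
    roll = math.degrees(math.atan2(right.y - left.y, right.x - left.x))
    if abs(roll) > MAX_EYE_ROLL:
        return False
    # Both eyes must be judged to be looking roughly towards the camera.
    return (abs(left.angle_to_camera) <= MAX_GAZE_ANGLE
            and abs(right.angle_to_camera) <= MAX_GAZE_ANGLE)

# A level head with both eyes turned towards the device counts as mutual gaze.
print(mutual_gaze(EyeEstimate(100, 200, 3.0), EyeEstimate(160, 202, 4.0)))   # True
# A strongly tilted head does not, even if the gaze angles look plausible.
print(mutual_gaze(EyeEstimate(100, 200, 3.0), EyeEstimate(140, 230, 4.0)))   # False
```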
The other type of interaction problem, that of repeated queries, sometimes resulted from the same issues with gaze detection — but it was often a result of the interaction design decisions we had taken around what counted as a ‘true’ attempt at asking a question and what was simply an incidental glance towards Tama. To stop the system interrupting the conversation by mistake, we decided that while the user was speaking they should look at the system at least once every couple of seconds. This reduced the number of false positive activations to almost zero, but caused its own problem: when users started to talk and then looked away to think, or to confirm something with the other participant, Tama would assume they had started the query by mistake and cancel it.
Although, as before, they learned very quickly that looking at Tama also meant keeping Tama listening to them, they still forgot occasionally and were forced to restart the interaction.
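A minimal sketch of that rule might look like the following. The two-second window comes from the ‘at least once every couple of seconds’ decision described above; the polling rate and the gaze and query hooks are hypothetical stand-ins for the real pipeline.

```python
import time

GAZE_REFRESH_WINDOW = 2.0   # seconds: the 'look at the system every couple of seconds' rule
POLL_INTERVAL = 0.1         # how often the gaze detector is polled while recording (assumed)

def user_is_looking():
    """Placeholder for the gaze detector's per-frame verdict (stub)."""
    return False

def query_in_progress():
    """Placeholder for whether the speech pipeline still considers the query open (stub)."""
    return True

def record_while_gazed_at():
    """Keep the query alive only while the speaker keeps glancing back at the device.

    Returns True if the query ran to completion, False if it was cancelled
    because the speaker looked away for too long."""
    last_gaze = time.time()
    while query_in_progress():
        if user_is_looking():
            last_gaze = time.time()
        elif time.time() - last_gaze > GAZE_REFRESH_WINDOW:
            # Treat the activation as accidental and cancel the query.
            return False
        time.sleep(POLL_INTERVAL)
    return True
```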
Going forward we hope to build on this work in two ways. First, to improve and expand the interactions with Tama by giving more nuanced feedback through gaze itself rather than through the colour of the eyes, and by detecting more nuanced patterns of looking towards and away from the device during a conversational turn. Second, we hope to expand the use of gaze interaction to Internet of Things devices more broadly.
For more details, see the full paper, published at CSCW 2019:
Donald McMillan, Barry Brown, Ikkaku Kawaguchi, Razan Jaber, Jordi Solsona Belenguer, and Hideaki Kuzuoka. 2019. Designing with Gaze: Tama — a Gaze Activated Smart-Speaker. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 176 (November 2019), 26 pages. DOI: https://doi.org/10.1145/3359278
[1] https://www.theverge.com/2019/11/6/20951178/amazon-alexa-echo-launch-anniversary-age-funtionality-not-changed-use-cases
[2] https://voicebot.ai/2019/11/07/google-nest-hub-adds-ultrasound-detection/