Trusting the Minds Behind the Voices (Full Paper)
Vocal Conversational Agents
This article was written as a full paper for our Interaction & Service Design Concepts: Principles, Perspectives & Practices 2016 class, which was taught by Molly W. Steenson at the Carnegie Mellon University School of Design. You can also read the short version.
“Hello, World!” is usually the first program that any programming-language beginner learns to print to the computer screen. The phrase signifies the birth of a machine that is rather like a kid with extraordinary intelligence for its age. In this example, its parent, the user, asks the computer to display a phrase, and the computer immediately recognizes what the user is talking about. Then it responds by mimicking and outputting the phrase to the screen. First impressions like these may be the reason that we humans tend to have positive expectations of machines that talk to us or mimic us, such as the chatbots ELIZA and PARRY or, more recently, voice conversational agents like Apple Siri, Google Now (Assistant), or Microsoft Cortana. Although we tend to trust these intelligent agents at first, our continued experience with them often ends in frustration or distrust, which makes us stop using them. It may be because we realize that their smartness does not last long, as the proverb says: “Early ripe, early rotten.”
In this article, I argue that in order to enhance users’ adoption of a conversational agent, it is essential to design a trustworthy experience. This paper therefore aims to highlight the recurring patterns of trust and distrust in conversational agents that may explain why humans are afraid to rely on their communication with intelligent machines, by examining incidents reported in online news sources. Along the way, I trace the assumptions of the creators (or designers) of these agents and analyze the implications of those assumptions in the context of trust-building. I will discuss concepts such as data collection, data accuracy, tone of personality, transparency, and perceived privacy, following a timeline of incidents involving notable conversational agents from the 1960s to 2016, viewed through best practices in the area. Before looking closer at recent examples of conversational agents, let’s start by defining what a conversational agent is and exploring its early examples.
ELIZA and PARRY: The Psychotherapist meets The Patient
A conversational agent, or dialog system, is a computer program that tries to understand and reply to a human in a one-on-one dialog. Its history dates back to 1966, when the first text-based chatbot, ELIZA, was introduced. ELIZA was a computer program for studying how machines can talk with humans in an expected, natural way. It is considered one of the first examples of natural language processing, using rule-based conversation structures. Unlike today’s mainstream conversational agents, which assist their users with everyday tasks, ELIZA was an experiment that was programmed to reply to its users by repeating their input back to them, just like a 1960s psychotherapist practicing Rogerian therapy. Figure 2 shows a sample conversation with ELIZA.
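This rule-based, reflective style of conversation can be illustrated with a minimal sketch. The rules, templates, and word lists below are hypothetical stand-ins, not Weizenbaum’s original script, but they follow the same Rogerian pattern: match a keyword rule, swap pronouns, and echo the input back as a question.

```python
import re

# Hypothetical pronoun swaps so the echoed fragment reads naturally.
REFLECTIONS = {"i": "you", "my": "your", "am": "are", "me": "you", "your": "my"}

# A few illustrative keyword rules in the spirit of ELIZA's script.
RULES = [
    (re.compile(r"i need (.*)", re.I), "Why do you need {0}?"),
    (re.compile(r"i am (.*)", re.I), "How long have you been {0}?"),
    (re.compile(r"my (.*)", re.I), "Tell me more about your {0}."),
]

def reflect(fragment: str) -> str:
    # Swap first- and second-person words in the captured fragment.
    return " ".join(REFLECTIONS.get(w.lower(), w) for w in fragment.split())

def eliza_reply(user_input: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please, go on."  # generic fallback when no rule matches
```

A call like `eliza_reply("I need a break")` yields “Why do you need a break?”, which shows how little machinery is needed to create the impression of a listener.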
Even with such a simple mimicking conversation pattern, ELIZA was very impactful. Weizenbaum, its creator, realized that participants without any technical background thought that ELIZA was a real psychotherapist and spent many hours in front of it, telling it their very personal, intimate problems. When he informed the participants that he could read all the chat logs, they complained about the invasion of their privacy, because they expected that their experience with a machine that listens to them like a doctor would be the same as with a real one: trustworthy, private, and dyadic. After his interviews with the first users of ELIZA, Weizenbaum acknowledged the potential negative implications of ELIZA for users who over-trust a chatbot and decided to shut down the entire project.
After ELIZA, another chatbot, PARRY, which modeled a paranoid schizophrenic, was created by the psychiatrist Kenneth Colby in 1972. The main difference between them was that PARRY incorporated a mental (emotional) model in its code. This means that it had variables such as anger, fear, and mistrust, which could be increased or decreased depending upon the input of its user. For example, if the user typed a word that could be associated with the mafia, such as police, drugs, guns, or jail, PARRY would increase its fear level and try to change the topic of the conversation. Likewise, if the user input something implying that PARRY was mentally ill, it would increase its mistrust and anger levels, which would result in more aggressive replies. Figure 3 shows a sample conversation with PARRY.
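The emotional model just described can be sketched as mutable state that keyword triggers push up or down. The word lists, increments, and canned replies below are hypothetical simplifications; Colby’s 1972 model was far more elaborate.

```python
# Hypothetical trigger vocabularies, following the examples in the text.
MAFIA_WORDS = {"police", "drugs", "guns", "jail"}
ILLNESS_WORDS = {"crazy", "insane", "ill", "paranoid"}

class ParryState:
    """PARRY-style emotional state: anger, fear, and mistrust variables."""

    def __init__(self):
        self.anger = 0
        self.fear = 0
        self.mistrust = 0

    def update(self, user_input: str) -> str:
        words = set(user_input.lower().split())
        if words & MAFIA_WORDS:
            # Mafia-related input raises fear; change the topic.
            self.fear = min(10, self.fear + 2)
            return "Let's talk about something else."
        if words & ILLNESS_WORDS:
            # Implying mental illness raises mistrust and anger.
            self.mistrust = min(10, self.mistrust + 2)
            self.anger = min(10, self.anger + 2)
            return "You have no right to say that!"
        return "Why do you want to know?"
```

Because the variables persist across turns, repeated provocations accumulate, which is what lets such a model shade its replies with the history of the conversation rather than reacting to each utterance in isolation.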
As you can see from PARRY’s responses to the user, it wasn’t only replying to a single inquiry. It was also capable of shaping its responses based on the history of the conversation, to give the impression of a human-like personality. This impression actually made PARRY one of the early systems to pass a version of the “Turing Test,” which examines machine intelligence by checking whether a machine’s responses are indistinguishable from a human’s. In PARRY’s test, the judges, a group of psychotherapists, were not able to distinguish PARRY’s answers from those of a real paranoid individual.
One year after PARRY’s introduction, Vint Cerf, one of the founders of the Internet, decided to arrange a public long-distance meeting between ELIZA and PARRY via ARPANET, the ancestor of the Internet. Figure 4 shows an excerpt from their conversation.
While the level of skepticism in PARRY’s words was evident, ELIZA’s neutral choice of words gave the impression that it was guiding PARRY on an inner journey to its (or his) childhood. Even ELIZA’s reaction when PARRY says something nonsensical makes us, the third party, believe that this may be a genuine conversation between two humans. We tend to believe this because we initially trust things that look familiar to us and appear risk-free at the surface level, especially in terms of visual appearance. Not only familiarity but also positive characteristics of the conversation, such as being good and honest, increase its trustworthiness. Then, simultaneously, we look for expertise in terms of intelligence and experience. In other words, most of us may find the conversation genuine because we think that it “looks” credible enough.
Since 1973, when ELIZA and PARRY met, the technology behind conversational agents has advanced a great deal. With the domestication of low-speed Internet and personal computing, it was inevitable that the early 2000s would be all about text-based instant communication and messaging. When IRC, one of the earliest mainstream chat platforms, took hold, task-specific conversational agents (or chatbots) appeared to help channel operators with housekeeping and moderating their channels. Besides task-specific agents, generalist agents continued to pop up on the next wave of instant messaging (IM) platforms, such as SmarterChild on AOL IM or Alice on MSN Messenger. These chatbots were capable of providing quick access to information, such as the news, the weather, and sports scores, and to tools such as a calculator, a memo pad, and a translator. The transition from these text-based chatbots to voice virtual assistants was somewhat delayed by slow network speeds, low computing power, and relatively slow progress in other emerging technological areas such as artificial intelligence.
Examining Today’s Agents: Smart, Women, Always-Listening
(TRYING HARD TO BE) SMARTER
In 2011, Apple, then the leader of the mobile smart-device market, introduced its virtual assistant, Siri. Because Apple was the market leader, Siri became the first conversational agent to make it to the mainstream, almost one year before Google Now and three years before Microsoft Cortana. From a user-value viewpoint, the main aim of Siri was to assist users with their daily tasks in a more natural and familiar way: by talking with them.
By using natural language processing algorithms, Siri was programmed to understand its user’s speech as if he or she were talking to another human, without requiring any additional effort. Siri set the expectation of social intelligence for all virtual assistants. For example, if you were an iPhone user in 2012 and verbally asked your phone “Who are you?”, getting an auditory reply from “a woman” was magical. Although you may have associated this woman’s tone with your mental model of “a personal assistant” at first impression, you may not yet have been fully aware of her capabilities. This period was actually the time when users got to know Siri, who was being “raised” by an already reputable parent, Apple Inc. During this period, users, who already found it credible enough and were ready to give it a try, assuming that Apple would not do something harmful, were also trying to decide whether they would continue to use Siri in their daily lives. Should they have trusted Siri to be their personal assistant?
The early user explorations of Siri showed that it was limited in vocabulary, which made it unpredictable in terms of its answers. Many complained about its speech-recognition accuracy and the fact that it wasn’t any better than a voice-controlled search engine, especially after Apple decided in 2013 to use Microsoft Bing instead of Google to answer search queries.
Having a very strict query/result pairing algorithm has often made Siri, and later Microsoft Cortana, be perceived as not useful, since they have been programmed to redirect users to a results page whenever they lack a predefined answer for a query. Google Now, on the contrary, has followed a different approach, answering a query by reading aloud the related section from the first Google result, without evaluating its reputation or credibility. To compare these approaches, imagine asking all three assistants in 2016, “Who is the king of the United States?”, a question that should not return any “correct” answer, since the United States is not ruled by a monarch. Unexpectedly, Google Now answers by reading aloud the phrase “King Barack Obama. According to website searchengineland.com, asking Google…”, while Siri answers by saying “The answer is about 1.57 billion ancient kings.” and showing a unit-conversion table, which makes no sense in this context. The closest to the expected answer comes from Cortana, which redirects the user to the search results page on Bing.com (Fig. 5). This extreme example outlines the boundaries of each assistant’s capabilities, but it also shows the level of credibility of the output each virtual agent can produce. While Google Now and Siri compete to be the most practical, Cortana’s answer is the most factually and politically correct one, which some users find not useful. From a communications lens, it at least does not steer the user toward questionable information. Having talked about the credibility of today’s virtual assistants, their level of understanding of context is also important; all three do poorly here, but Siri has had a “success” story that is worth mentioning.
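The contrast between the two answering strategies can be sketched in a few lines. The canned-answer table and function names below are hypothetical illustrations, not the assistants’ actual implementations: one dispatcher answers only known queries and otherwise redirects, while the other reads the top search result aloud without vetting it.

```python
# Hypothetical canned-answer table, standing in for a predefined intent set.
PREDEFINED_ANSWERS = {
    "what time is it": "It is 3:00 PM.",  # placeholder answer
}

def strict_assistant(query: str) -> str:
    # Siri/Cortana-style: answer only predefined queries,
    # otherwise fall back to redirecting the user to web results.
    answer = PREDEFINED_ANSWERS.get(query.lower().rstrip("?"))
    return answer if answer else f"Here are web results for '{query}'."

def read_aloud_assistant(query: str, first_result: str) -> str:
    # Google Now-style: read the top result aloud, credible or not.
    return f"According to the web: {first_result}"
```

The trade-off is visible in the king example from the text: the strict strategy produces an unhelpful but safe redirect, while the read-aloud strategy confidently voices whatever the first result happens to say.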
Even though today’s five-year-old Siri can handle longer conversations and understand context better, comparing her “infancy” to her present state can show how she (it) was able to gain the trust of some of its users. For instance, four-month-old Siri (2011) had been replying to the query “Siri, call me an ambulance” with “Okay, from now on I’ll call you ‘an ambulance’.” (Fig. 6), before her parent, Apple, “taught” her to interpret this query as an emergency. Although the initial intention behind asking this question is to see how she will react, a user may also seek Siri’s help in a life-threatening situation, which shows the level of trust and faith placed in it (or her). In fact, this scenario became a reality in 2016, when a user shouted at her iPhone to call for an ambulance while running to another room for her child, who had stopped breathing. This time, Siri worked as expected and gained the lifetime trust of its user, with the help of its “always listening” feature, which keeps Siri listening in the background to catch its wake-up phrase, “Hey, Siri!”, from its user.
Before discussing the implications of living with agents that always listen to us for the sake of hearing their names, it is more vital to discuss how having a gender and/or a personality affects the reliance (and trust) that we place in these agents.
BEING A WOMAN (MOST OF THE TIMES)
Along with improvements in data-response accuracy, Siri’s voice also gained “gender options” in 2013. Although its default “female adult” voice remained the same in many countries, users were able to change its voice to a male speaker. Since Siri’s release, the reasons Apple went with a female voice have been widely discussed. Behavioral science explains it simply: people tend to find “feminine” voices more likable than masculine ones. Similarly, Clifford Nass argued in a more biological way that the human brain is programmed to like female voices, starting with being a fetus that reacts to the mother’s voice but not the father’s. However, this is not totally correct. Research shows that a fetus can hear and react to both the mother’s and the father’s voices, but since the mother’s voice is closer to the womb and is transmitted through bone to the fetus’ ears, it is indeed the clearest. I partially agree that this may explain the tendency to favor a female voice over a male one, but more research is needed to pass judgment on how a female voice affects the trust of conversational agents’ users. A more socially oriented explanation, pointed out by Judy Edworthy, a professor of applied psychology, is that women’s voices were historically first adopted in airplane cockpits to keep them from interfering with the internal communications of pilots, who were mostly men. Despite its social and cultural validity in the majority of cultures, the female voice has not always worked as intended. For example, the German car manufacturer BMW once had to recall cars with a female voice in their navigation system because of male customers who refused to take commands from a woman. Indeed, Nass and Brave argue that users find embodied (physical) conversational agents with a male voice more authoritative, while agents with a female voice are perceived as more emotional.
Perhaps for a similar reason, Apple initially introduced a masculine adult voice as Siri’s first default voice in the UK, which was later changed to a more gender-neutral-sounding voice, a change that “made its users upset”.
In a broader sense, when we think about the voice of conversational agents, our discussion becomes deeply intertwined with social issues like personality and sexism, since these agents are developed to interact with humans socially. Their creators most often make some of the assumptions explained above to gain the trust of as many users as possible. This often works, but on some occasions it creates negative implications, such as enabling “lonely” individuals to abuse the agent’s personality, for example by talking to it in sexually explicit language, similar to the movie Her, in which a lonely scriptwriter becomes obsessed with and emotionally inseparable from his voice virtual assistant. Again, this is a sign of over-trusting an anthropomorphic machine, which shows human-like qualities such as a personality to get attention but also receives deep emotional responses for qualities that it does not have, such as a sex life.
Creators of these conversational agents should be aware that they are walking on eggshells if they decide to give their agents a personality, especially agents that handle general conversation such as chitchat. Knowing the implications of assigning a personality to an assistant, it is now easy to guess why Google Now (or Assistant) uses a female voice without a personality. In this way, Google Now feels like a “female” assistant but nothing more. Even its gender-neutral trigger phrase, “Okay, Google,” signals the affordance that you can interact with it. But the problem with Google Assistant is not actually its personality; it is how and when it collects and presents data, which brings us back to the discussion around “always-listening” conversational agents.
Although the idea of triggering a device’s functions by listening for certain keywords dates back to 1970, Google was the first company to use the idea of “waking up” an agent by constantly listening for a trigger phrase (also known as a hot word or wake-up word). Along with Microsoft, which introduced an always-listening feature on its video-game console Xbox One, Google was criticized for the implications of the always-listening feature for users’ privacy. Both companies responded to these privacy concerns by arguing that “always-listening does not mean always-recording” and made statements to assure their users that no recording is done and no data is sent to any external location (i.e., the companies’ cloud servers) before the trigger phrase is recognized. After Google and Microsoft, Apple also introduced an optional always-on feature to activate Siri at any time. More recently, Google Now, Siri, and Cortana have all been embedded in their companies’ desktop operating systems, which enables them to potentially listen to every conversation that takes place around billions of computers all over the world.
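The “always listening, not always recording” claim can be made concrete with a minimal sketch. The wake phrase, buffer size, and the `recognize` stand-in below are hypothetical: audio frames cycle through a small local buffer and are silently discarded, and only frames arriving after the trigger phrase is recognized ever leave the device.

```python
from collections import deque

WAKE_PHRASE = "hey siri"  # trigger phrase, per the example in the text
BUFFER_FRAMES = 16        # assumed size of the on-device ring buffer

def listen(frames, recognize):
    """frames: iterable of short audio chunks (plain strings here);
    recognize: a stand-in for a local, on-device wake-word recognizer."""
    buffer = deque(maxlen=BUFFER_FRAMES)  # oldest frames drop out silently
    sent = []
    awake = False
    for frame in frames:
        buffer.append(frame)  # held locally, never transmitted
        if not awake and recognize(frame) == WAKE_PHRASE:
            awake = True      # wake word heard: start forwarding audio
            continue
        if awake:
            sent.append(frame)  # only post-trigger audio is "sent" to servers
    return sent
```

In this sketch, everything heard before the wake word stays in the short-lived local buffer; whether real implementations honor that boundary is exactly the transparency question the privacy principles below address.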
A detailed, well-written review of the privacy implications of microphone-enabled devices can be found in Stacey Gray’s report for the Future of Privacy Forum. For these devices, she suggests the following principles for establishing consumer trust, which provide a good recap of the emerging privacy concerns about microphone-enabled devices. They are also ideal for concluding this section, along with agent-specific critiques.
First, she argues that the companies behind these devices should be transparent about the data they collect from users and their environments. In parallel, although it is not sufficient, all three companies (Apple, Google, and Microsoft) have accessible, easy-to-read privacy pages on their websites that touch on this issue. Second, she states that devices should not arrive with an always-on feature pre-enabled if a microphone is not their only interaction model. In line with this, none of the agents comes pre-enabled except in their desktop versions, and all let users opt out of the always-on feature in their 2016 versions. Third, she points out that devices should incorporate a way to disable always-listening features without turning the device off completely. Unlike with the first two principles, none of the three agents provides an interface to temporarily disable always-listening; to disable it, the user has to go to each agent’s settings and opt out of the feature, which is not within easy reach by design. Fourth, she argues that devices incorporating these agents should indicate clearly and visually whether they are listening or sending data somewhere; all three agents fail to provide feedback about the status of always-listening activity in passive use. Fifth, she argues that companies should not collect any personal data that can be linked to an individual, a principle that, again, all three agents are known to fail to meet. All of them are known to collect and use personal data, such as browsing history or advertising identifiers, to personalize the service, a privacy trade-off for a user who wants a more personalized experience. Finally, she puts forward that companies should enable individuals to delete stored audio files from their (cloud) servers; of the three, only Google provides an option to remove users’ voice signatures from its system.
This provides a level of transparency that Microsoft and Apple should also consider offering in order to improve their perceived trustworthiness.
On the whole, I have provided a trust-based outlook on conversational agents by studying the assumptions and implications behind the promised features of prominent real-world examples, from the first chatbot, ELIZA, to the first mainstream vocal virtual assistant, Apple Siri. To illustrate the intertwined relationship between users’ trust in and adoption of vocal conversational agents, I analyzed how these agents became smarter, sexist, and always-listening by examining incidents publicized through online news channels and media outlets. While these agents are slowly being pushed to become an indispensable part of our everyday lives, I argue that (designers in) these companies should work toward creating reliable experiences by building trustworthy interactions between their users and agents if they aim to reach more users and increase their products’ adoption rates.
 S. C. Johnson and B.W. Kernighan, “The Programming Language B,” 1997, https://www.bell-labs.com/usr/dmr/www/bintro.html.
 Justine Cassell et al., “Animated Conversation,” in Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques — SIGGRAPH ’94 (New York, New York, USA: ACM Press, 1994), 413–20, doi:10.1145/192161.192272.
 Dan Jurafsky, “CS 124/LINGUIST 180 From Languages to Information (Lecture Notes),” 2016, http://web.stanford.edu/class/cs124/.
 Weizenbaum, “ELIZA. A Computer Program for the Study of Natural Language Communication between Man and Machine.”
 Guven Guzeldere and Stefano Franchi, “Dialogues with Colorful Personalities of Early AI,” SEHR 4, no. 2 (1995), https://web.stanford.edu/group/SHR/4-2/text/dialogues.html.
 Jurafsky, “CS 124/LINGUIST 180 From Languages to Information (Lecture Notes).”
 Jason L. Hutchens, “How to Pass the Turing Test by Cheating,” School of Electrical, Electronic and Computer, 1996, 23, csee.umbc.edu.
 Vint Cerf, “PARRY Encounters the DOCTOR,” 1973, https://tools.ietf.org/html/rfc439.
 Megan Garber, “When PARRY Met ELIZA: A Ridiculous Chatbot Conversation From 1972,” The Atlantic, 2014, http://www.theatlantic.com/technology/archive/2014/06/when-parry-met-eliza-a-ridiculous-chatbot-conversation-from-1972/372428/.
 Ian Li et al., “My Agent as Myself or Another,” Proceedings of the 2007 Conference on Designing Pleasurable Products and Interfaces — DPPI ’07, 2007, 194, doi:10.1145/1314161.1314179.
 B. J. Fogg and Hsiang Tseng, “The Elements of Computer Credibility,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems the CHI Is the Limit — CHI ’99 (New York, New York, USA: ACM Press, 1999), 80–87, doi:10.1145/302979.303001.
 DiverGuy@EFnet, “Frequently Asked Questions about IRC Bot, Internet Relay Chat roBOTs,” 2003, http://www.irchelp.org/misc/botfaq.html.
 Robert Hoffer, “The Trouble with Bots: A Parent’s Musings on SmarterChild,” Venturebeat.com, 2016, http://venturebeat.com/2016/06/15/the-trouble-with-bots-a-parents-musings-on-smarterchild/.
 Zoe Kleinman, “Apple’s Siri Calls Ambulance for Baby,” BBC News, 2016, http://www.bbc.com/news/technology-36471180.
 Nathan Cheeley, “Siri! Call an Ambulance! | Nate Says,” Nathan Cheeley Personal Blog, accessed November 16, 2016, http://www.nathancheeley.com/siri-call-an-ambulance/.
 Wade J. Mitchell et al., “Does Social Desirability Bias Favor Humans? Explicit-Implicit Evaluations of Synthesized Speech Support a New HCI Model of Impression Management,” Computers in Human Behavior 27, no. 1 (2011): 402–12, doi:10.1016/j.chb.2010.09.002.
 Brandon Griggs, “Why Computer Voices Are Mostly Female — CNN.com,” 2011, http://www.cnn.com/2011/10/21/tech/innovation/female-computer-voices/.
 Kristin M. Voegtline et al., “Near-Term Fetal Response to Maternal Spoken Voice,” Infant Behavior and Development 36, no. 4 (2013): 526–33, doi:10.1016/j.infbeh.2013.05.002.
 Bruce Feiler, “Turn Right, My Love — The New York Times,” The New York Times, 2010, http://www.nytimes.com/2010/06/27/fashion/27FamilyMatters.html.
 Jessie Hewitson, “Siri and The Sex of Technology,” The Guardian, The Women’s Blog, 2011, https://www.theguardian.com/lifeandstyle/the-womens-blog-with-jane-martinson/2011/oct/21/siri-apple-prejudice-behind-digital-voices.
 Clifford Ivar Nass and Scott Brave, Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship (MIT Press, 2005).
 Salvador Rodriguez, “U.K. Users Upset after Apple Changes Voice of British Siri in iOS 7.1,” Los Angeles Times, 2014, http://articles.latimes.com/2014/mar/14/business/la-fi-tn-apple-british-siri-ios-71-20140314.
 Deborah Harrison, Cortana — RE•WORK Virtual Assistant Summit #reworkVA — YouTube, 2016, https://www.youtube.com/watch?v=-WcC9PNMuL0.
 T.C. Sottek, “The Moto X Will Always Be Listening for Your Voice Commands, Leaked Video Shows,” The Verge, 2013, http://www.theverge.com/2013/7/14/4522054/moto-x-always-listening-voice-commands-new-notifications.
 T.C. Sottek, “The Xbox One Will Always Be Listening to You, in Your Own Home (Update),” The Verge, 2013, http://www.theverge.com/2013/5/21/4352596/the-xbox-one-is-always-listening.
 Elizabeth Weise, “Hey, Siri and Alexa: Let’s Talk Privacy Practices,” USA Today, 2016, http://www.usatoday.com/story/tech/news/2016/03/02/voice-privacy-computers-listening-rsa-echo-siri-hey-google-cortana/81134864/.
 Stacey Gray, “Always on: Privacy Implications of Microphone-Enabled Devices,” 2016, https://fpf.org/wp-content/uploads/2016/04/FPF_Always_On_WP.pdf.