The Narrowing Rift: Voice UI and Conversational UI

If today’s voice-operated devices AREN’T conversational, what does “conversational” even mean?

This is the second in a series of posts inspired by my time as a workshop speaker and attendee at Interaction 17 (February 2017, New York City).

In my last post, we talked about the state of voice user interfaces (VUI) at this moment in time. Voice user interfaces have gone mainstream and are changing lives and increasing accessibility for many consumers.

At the same time, today’s voice user experiences (best known as Cortana, Alexa, and Google Home) remain rooted in a very simple, command-and-control methodology. We can only call the current experiences “conversational” in the broadest sense of the word — as spoken words exchanged.

A popular and sometimes contentious topic during the Interaction 17 proceedings was conversational UI (CUI). In general, this currently refers to chat bots and other written-input user interfaces. A frequent question raised: is Alexa conversational? How do these devices fall short of human standards?

Conversing with Alexa during my time on the Alexa voice design (VUI) team.

Facebook’s Messenger Bots are the most well-known example of conversational UI these days, although several public Twitter and Slack bots fit the CUI description. Notably, chat bots are almost universally implemented via graphical output and text input, rendering them still fundamentally different from voice UIs… for the time being.

Defining Conversation

How do we define conversation after taking it for granted our entire lives? Paul Grice published four maxims of conversation (quantity, quality, relation, and manner) in a fairly dense paper on the subject.

In his IxD17 talk “Conversation is More than Interface”, Paul Pangaro applied Gordon Pask’s Conversation Theory to define conversation as Context, shared Language, Exchange, Agreement, and Transaction.

Paul Pangaro sets a conversational tone during his #IxD17 CUI talk.

Further, the typical outcome of conversation is beyond direct action: it is often the building of a shared history and trust.

This is where we fall short in today’s voice systems: they are fairly ignorant of a shared history, and have no concept of how they might engender trust.

On the subject of trust, researcher Christina Xu shared an important insight regarding Chinese digital culture in her talk “Convenient Friction: Observations on Chinese UX in Practice.” In that environment, conversational interfaces are routinely used for commerce, since they are perceived as more trustworthy. And yet, those interactions in China are still generally run by actual people. What could we learn about trust in commercial conversational transactions in other cultures to inform conversational UI?

Christina Xu walks us through the extensive use of WeChat in Chinese culture for conversational transactions.

Back to Paul Pangaro’s talk: he further expanded on Gordon Pask and Hugh Dubberly’s work, describing four basic conversational frames. Two of those frames can be easily found in the current generation of VUI.

Controlling: specifying a goal with means of achieving it (“Play my Prince station on Pandora.”)

Delegating: asking for an outcome without specifying how to achieve it (“Play some uptempo music.”)

At the same time, two other conversational frames were described that go beyond most voice user interfaces today:

Guiding: discussing the means of achieving a goal (“I want to hear some music. How should I do it?”)

Collaborating: mutually deciding on goals between both participants (“What should we do?”)
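For the more code-minded, here is one way to read those four frames as a small decision table. This is only my own rough sketch, not anything presented in the talk: the `Frame` enum, the `classify_frame` function, and the three labels for "means" are all hypothetical names chosen for illustration.

```python
from enum import Enum


class Frame(Enum):
    CONTROLLING = "controlling"      # goal and means both specified
    DELEGATING = "delegating"        # goal specified, means left to the system
    GUIDING = "guiding"              # goal known, means under discussion
    COLLABORATING = "collaborating"  # the goal itself is decided together


def classify_frame(goal_is_set: bool, means: str) -> Frame:
    """Map a request onto one of the four frames.

    `means` uses hypothetical labels: "specified", "unspecified", or "discussed".
    """
    if not goal_is_set:
        # No fixed goal yet: both participants still have to agree on one.
        return Frame.COLLABORATING
    if means == "specified":
        return Frame.CONTROLLING
    if means == "unspecified":
        return Frame.DELEGATING
    if means == "discussed":
        return Frame.GUIDING
    raise ValueError(f"unknown means label: {means!r}")


# "Play my Prince station on Pandora." names both the goal and the means.
print(classify_frame(True, "specified").value)
# "What should we do?" has no goal yet.
print(classify_frame(False, "discussed").value)
```

Seen this way, today's voice systems mostly cover the first two branches; the last two are where the conversation has to carry more than a single command.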

These less-common frames would be more helpful in situations where the customer is less experienced with the system, and indeed training and onboarding are big hurdles for today’s systems. And what if the customer’s goal is simply to be entertained? There’s still a certain something missing.

Craftsmanship in Conversation

Once we’ve built a framework for conversation, we must paint in the details — writing the actual text delivered in the exchanges.

Later in the CUI session, researcher Elizabeth Allen walked us through how Shopify uses cross-channel bots to emulate a marketing employee’s exchanges back in North America. These bots reach out via text-based channels to offer to launch Facebook ad campaigns based on sales trends. Even though these were strictly graphical/text interactions, some customers began to reply to these bots as if they were actual people.

And yet, Elizabeth brought a few key cautions that can shatter this suspension of disbelief. In particular, customers can find these bots pushy if the timing and length of responses are not carefully tuned.

Our brains don’t give text-based conversational UI the anthropomorphizing “benefit of the doubt” that we apply to voice-delivered user interfaces. This puts greater pressure on CUI designers to be writers, keeping an eye towards creating the illusion of engagement. Voice UIs with good text-to-speech synthesizers sometimes get this illusion largely for free.

In a later talk in the CUI track, designer Whitney French called out 5 metrics for creating engaging conversational UI: intelligence, flow & cadence, helpfulness, personality, and utility. While these are all subjective metrics, the most difficult to emulate is personality; humor in particular is highly subjective. These metrics can also be applied to today’s voice UIs, but the burden of brevity is greater for spoken UI.

These metrics do give us a good framework for building what may be a (subjectively) engaging conversation. And it’s a fine line to walk. Most conversational UIs probably seek to be comfortable, but not fully anthropomorphized. Yet for spoken UI, it is extremely hard to prevent the brain from viewing the source of the conversation as human. What does this mean for the coming collision of conversational UI and voice UI?

Cautionary Creepy Dolls

Let’s take a VERY recent example: My Friend Cayla. This toy doll is now banned in Germany as illegal to sell, and the government has gone so far as to order parents to destroy the toy. What went wrong?

Cayla functions in a very similar way to other voice user interfaces on the market. To understand children’s speech, she transmits audio files over the Internet to a cloud service. Once she understands the speech, she generates a response in a synthetic voice using a text-to-speech system, and that audio file is sent back to the toy for playback.
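To make that round trip concrete, here is a deliberately toy sketch of the flow. Every function name is hypothetical, nothing here reflects how Cayla or any real assistant is actually implemented, and plain text stands in for audio so the example can run without a speech service.

```python
def recognize_speech(audio: bytes) -> str:
    """Stand-in for a cloud speech recognizer; the 'audio' here is just UTF-8 text."""
    return audio.decode("utf-8").strip().lower()


def choose_reply(utterance: str) -> str:
    """Stand-in for the service that decides what the toy should say."""
    replies = {"hello": "Hi there!"}
    return replies.get(utterance, "Sorry, I didn't catch that.")


def synthesize_speech(text: str) -> bytes:
    """Stand-in for a text-to-speech system; returns bytes in place of an audio file."""
    return text.encode("utf-8")


def cloud_round_trip(recorded_audio: bytes) -> bytes:
    """Recording goes up, speech is understood, and a spoken reply comes back."""
    utterance = recognize_speech(recorded_audio)
    reply_text = choose_reply(utterance)
    return synthesize_speech(reply_text)


# The toy records, sends the recording away, and plays whatever returns.
playback = cloud_round_trip(b"Hello ")
print(playback.decode("utf-8"))
```

The point of the sketch is simply that everything the child says leaves the device before anything is said back, so each hop is a place where the recording could be exposed.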

My Friend Cayla, a doll with a voice user interface that has come under fire at a governmental level as a tool enabling illegal espionage (image from the Google Play store)

Unfortunately, Cayla doesn’t seem to adhere to the same stringent security standards that Amazon, Microsoft, and (I hope and assume) Google apply to these conversations. Those companies intentionally do not market to children, since there are significant ethical issues when a child conducts conversations that can be recorded. Furthermore, the doll’s Bluetooth connection was found to be insecure, allowing attackers to use the toy to monitor or even communicate with the child.

Some of the lessons learned here are simple infosec lessons: be cautious when taking input from children, and make sure that any device equipped with live microphones or cameras CANNOT be controlled by third parties.

But there’s also an important lesson for CUI designers here: if we are too good at our jobs, could we put our customers at risk? Elizabeth Allen mentioned in her talk how Shopify observed their CUIs occasionally eliciting more information than is necessary. One presumes this is thanks to the successful illusion of a human conversation. Children are faster to suspend disbelief, so the ethical issue is more pronounced. What might they tell a doll (or a digital assistant) that they trusted? Their address? Financial information? Or worse?

With Great Power Comes Great Responsibility

To quote the fictional Ian Malcolm from one of my favorite films, Jurassic Park:

“…Your scientists were so preoccupied with whether or not they could that they didn’t stop to think if they should.”
Every important discussion is improved with a little bit of Jeff Goldblum.

As our voice-based digital assistants move beyond rudimentary voice exchanges and begin to move towards more conversational spoken UI, we as designers will confront more ethical considerations.

Just a few of the questions we’ll face as VUI and CUI collide:

  • When is it appropriate for these systems to behave in a “human” way, and what does that mean?
  • How up front should conversational systems be about their synthetic nature?
  • How much control should customers have over what information about them is tracked in a conversational context?
  • What damage could be done if a customer overdiscloses to a voice UI capable of surveillance, believing it to be human?
  • For privacy-minded customers who will not consent to long-term learning and tracking, how can conversational UIs still provide value?
  • Can anyone truly trust a conversation partner who is designed ultimately to drive sales?
  • Can an assistant that customizes its personality to suit the customer be trustworthy? Are we trusting the brand, or the adapted personality?

While I believe we should continue to pursue a more conversational world in voice UI, I also believe it should be done responsibly. In these challenging times, how can we use the great power of voice user interfaces and conversational understanding to do the most good?

Wading into Deeper Waters

In the last post, we talked about how empowering voice user interfaces are to a wide variety of customers underserved by visual/physical UI.

A key takeaway from Interaction 17 for me was a more formal taxonomy of the parlor tricks that can make our VUIs seem more conversational in nature in the short term. For voice UI designers looking to improve the craftsmanship in their system’s spoken replies, the conversational UI insights above provide a good starting point.

But in many ways, the sessions raised more questions than they answered. In my next post, we’ll dive even deeper into several of the blind spots that current voice user interfaces must address if they seek to become truly conversational beyond command and control.

May the voice be with you.

After several years focusing her design efforts on NUI and VUI at Microsoft and Amazon, Cheryl is currently Design Lead for the Azure Portal + Marketplaces at Microsoft. Find out more about her diverse background and portfolio at her blog or on LinkedIn. You can also follow her on Twitter.