Exploring voice interfaces for healthcare applications.

Matthew Harrison
Published in helixcentre
Apr 11, 2019

Advance care planning is an empowering but emotionally challenging activity. And the more ways we have to help people do it, the better. At Helix Centre, we created Amber Plans, a digital platform for creating Advance Care Plans, to ease the difficult process of planning end of life care. The platform, now being developed by spin-out Digital Care Planning Ltd, was shaped by our human-centred design approach; it offers an excellent web experience and we have had great feedback from our users. But we want to make it more accessible and find new ways to engage people, especially those who are less fluent with technology. Could voice interfaces provide an accessible, intuitive and compelling way to help people create plans?

Advance care planning is the process of documenting your wishes for care in the event that you lose the capacity to make decisions for yourself. It is an important thing to do, but few people do it, largely because the subject is perceived as difficult to talk about. Family members often shy away from it, and GP appointments are not long enough for a ‘proper’ conversation.

We took the opportunity to find out whether voice interfaces could play a valuable role in helping people consider and document their preferences for end of life care. To maximise the insight from a small number of user-testing sessions, we tried two different approaches, building two prototypes on a Google smart speaker.

The first prototype was a quiz on the subject of end of life care, intended for two people to interact with together. We hypothesised that the shared experience of doing the quiz would break down the perceived barriers to the subject, dispel some common misconceptions, and then lead to a discussion between the two users about their personal preferences. We do not believe that voice technology is currently sophisticated enough to conduct the full, nuanced conversation, but perhaps it could set that conversation up?

The second prototype was aimed at a solo user and was intended to help the person develop their own points of view for creating an Advance Decision to Refuse Treatment (ADRT). This is a legally binding document that dictates an individual’s choices for refusing treatments (such as CPR) under certain conditions. It is a powerful tool for people with terminal diseases who do not want their life to be sustained at the expense of comfort or dignity. This second prototype described opposing points of view relating to a series of hypothetical scenarios — dementia, coma, terminal disease. Each opinion was presented as the point of view of a fictional person. By asking the users whose opinions they felt closer to, we would be able to suggest a template ADRT that the user could then personalise through our website. The hypothesis was that by describing two opposing perspectives we could help the user consider and crystallise their own opinion.

In short, the first approach was general and open in nature, and the second was specific and private. Which would be the most effective use of the technology for our intended users, including people with a life-limiting diagnosis?

Prototyping

Designing for voice was a conceptual jump for someone used to designing visually for screens and physical devices. It’s an emerging medium, and the tools for visualising conversations are in their infancy. We started building up conversations in Dialogflow, but quickly wanted to relate these to a visual map to help us plan the conversation flow and strategy. Realtimeboard (now Miro) proved very useful, once we had developed a repeatable ‘symbol’ that reflected the construction of a conversational ‘intent’ in Dialogflow.

Fig 1. Masterplan for the ADRT conversation flow, developed collaboratively on Realtimeboard.

Despite my limited coding ability, I was able to make a workable prototype in Dialogflow. This enabled me to learn the characteristics and capabilities of the medium, before carrying out some preliminary user-testing on the prototype. I planned and represented the model for the conversation on Realtimeboard. As the requirements developed and the need for sophistication increased, I could then hand over the Dialogflow prototype to software and AI engineering colleagues. We kept the Realtimeboard as a central collaborative focus for communicating iterative changes to the prototype.

The software developers could then enhance the Dialogflow prototype for natural language processing and use custom-made webhooks to deliver the full functionality we wanted. This included some simple sentiment analysis to enhance the perceived emotional intelligence of the voice agent. Transferring some of the logic to a back-end also made the iterations of the prototype quicker to implement and deploy than going through the web-interface of Dialogflow.
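To make that structure concrete, here is a minimal sketch of what such a fulfillment webhook can look like, written in TypeScript with Express. This is illustrative only, not our production code: the Dialogflow ES request and response fields (queryResult.queryText, intent.displayName, fulfillmentText) are standard, but the distress-word list, intent names and prompts are hypothetical, and the keyword check merely stands in for the simple sentiment analysis mentioned above.

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Hypothetical list of words suggesting the user finds the topic distressing.
const DISTRESS_WORDS = ["scared", "worried", "afraid", "upset"];

// Hypothetical lookup from intent name to the next scripted prompt.
function nextPromptFor(intent: string): string {
  return intent === "adrt-scenario-dementia"
    ? "Whose view felt closer to your own, the first or the second?"
    : "Let's move on to the next question.";
}

app.post("/webhook", (req, res) => {
  // Dialogflow ES POSTs the matched intent and the user's raw utterance.
  const utterance: string = req.body.queryResult?.queryText ?? "";
  const intent: string = req.body.queryResult?.intent?.displayName ?? "";

  // Acknowledge apparent distress before moving on, so the agent feels more
  // emotionally intelligent than a bare scripted response would.
  const distressed = DISTRESS_WORDS.some((w) =>
    utterance.toLowerCase().includes(w)
  );
  const preamble = distressed
    ? "That's completely understandable, and many people feel the same. "
    : "";

  // Dialogflow reads the fulfillmentText back to the user.
  res.json({ fulfillmentText: preamble + nextPromptFor(intent) });
});

app.listen(3000);
```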

Fig. 2 An example ‘intent’ as we represented it within a Realtimeboard flow chart. This layout enables us to visualise the various elements needed for the conversation agent to function, including contexts, entities, responses and user inputs.
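For readers unfamiliar with Dialogflow, the elements listed in the caption can be modelled roughly as follows. This is our planning-level view of an intent, sketched in TypeScript; it is not Dialogflow's actual export format, and the field and intent names are invented for illustration.

```typescript
// A simplified model of a conversational "intent", mirroring the elements we
// laid out on the Realtimeboard flow chart (not Dialogflow's real schema).
interface ConversationIntent {
  name: string;
  inputContexts: string[]; // contexts that must be active for this intent to match
  trainingPhrases: string[]; // example user inputs the agent should recognise
  parameters: Record<string, string>; // entities extracted from the user's input
  responses: string[]; // what the agent says back (one is chosen)
  outputContexts: string[]; // contexts set afterwards, steering the next turn
}

// A hypothetical intent from the ADRT flow: the user says whose view they
// felt closer to after hearing the dementia scenario.
const chooseView: ConversationIntent = {
  name: "adrt.dementia.choose-view",
  inputContexts: ["adrt-dementia-scenario"],
  trainingPhrases: ["the first one", "I agree with the second person"],
  parameters: { chosenView: "@viewNumber" },
  responses: ["Thank you. Now imagine a different situation..."],
  outputContexts: ["adrt-coma-scenario"],
};
```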

Now with working prototypes, we began testing, first with colleagues and friends in our immediate orbit, and then with users representing our target group of people living with a serious or terminal illness. These sessions gave us a wealth of valuable insights, not only into the technology itself but into the user-testing experience, too.

What we learned

Accessibility. Voice interfaces proved approachable and accessible to a wide spectrum of users with varied experience and ability with computer technology. Once the technology is installed and activated, users relate to the interface with ease and have reasonable expectations of it. Our testers included people with minimal computer experience, including one who did not own a mobile phone, yet who found the interface intuitive and enjoyed the experience.

Voice only. Combining voice and text has advantages in helping users respond in a way that the voice agent can accept. But displaying large amounts of text, while also saying the text out loud, can cause the user to ignore the spoken word, or skip ahead.

Conversational pace. One characteristic of ‘voice only’ interfaces is that the pace of the interaction is controlled and steady, forcing a conversational cadence. This encourages users to think about their responses in a way we are accustomed to in natural conversation. The linear nature of a conversation can also serve to focus the mind on one question at a time. Offering text alternatives, or ‘suggestion chips’ for responses, encourages the user to skip ahead and treat the experience like a web form. Enforcing a natural conversational pace may be especially helpful when two people are engaging together with the interface.

Personal / impersonal. Machines have an advantage over humans when asking questions that would cause discomfort or embarrassment coming from another person. Just as some people find it easier to talk to a stranger about a personal issue, the same can apply to a voice agent. This makes a compelling case for a voice-based approach in this application of advance care planning. The tech can offer a human-like interaction, but without the discomfort of thinking you may be judged on your responses.

Not getting stuck. Once the user has repeated their answer a couple of times because the voice agent has failed to interpret it correctly, the voice agent tends to lose the context, get stuck and quit. It is therefore valuable to have fallbacks that ask the question in a different way, to avoid the user giving the same unintelligible answer* or getting frustrated with the technology. A next layer of fallback may involve an elegant way to skip the question and carry on. As an ultimate fallback, it may be necessary to build in a way to avoid starting from the beginning if the application does quit, allowing the user to easily rejoin the right part of the conversation. Reliving a whole conversation with a voice agent that doesn’t remember what you said in the previous attempt is a very frustrating experience.

* In human-to-human conversation we have a fairly effective habit of repeating our answers louder, and with more brevity, when we are not understood. This strategy, which we fall back on naturally, is not ideal for being understood by voice agents, which generally need more context rather than more volume or emotion!
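Below is a sketch of how this layered fallback strategy might be expressed in code. It is a hypothetical helper, not our exact implementation; the session state, rephrasings and storage call are all assumptions.

```typescript
// Escalating fallbacks: rephrase first, then offer to skip, then save a
// resume point so a restart does not begin the whole conversation again.
interface SessionState {
  fallbackCount: number;
  currentQuestionId: string;
}

function handleFallback(state: SessionState, rephrasings: string[]): string {
  state.fallbackCount += 1;

  if (state.fallbackCount <= rephrasings.length) {
    // Ask the same question in different words, so the user does not simply
    // repeat the same unintelligible answer more loudly.
    return rephrasings[state.fallbackCount - 1];
  }
  if (state.fallbackCount === rephrasings.length + 1) {
    // Next layer: an elegant way to skip the question and carry on.
    return "No problem, we can come back to that one later. Shall we move on?";
  }
  // Ultimate fallback: persist where we got to, so the user can rejoin here
  // instead of reliving the whole conversation.
  saveResumePoint(state.currentQuestionId);
  return "Sorry, I'm struggling to understand today. Say 'resume' later to pick up where we left off.";
}

function saveResumePoint(questionId: string): void {
  // Assumed: write to a session store keyed by user or device.
}
```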

Interruptions, or lack of. One key difference between conversation with current computer technologies and normal conversation between humans lies in interruption. Interruption is a normal part of human-to-human conversation; we finish each other’s sentences, and stop each other going down tangents. Even little ‘yeahs’ and ‘huhs’ can encourage the other speaker and let them know we are listening. However, the voice interface seems unable to listen while speaking. The voice agent’s reaction when it does detect perceived voice input is to go silent and listen. This exacerbates frustration when misunderstandings occur, and can result in failure of the app.

Unnatural pauses. There is a challenge in asking tricky questions that require the human to think about their answer. The voice agent is not able to wait patiently, but is prone to time out, or to mishear a background noise. Even if you can programme the voice bot to wait patiently for an answer, the human may not have confidence that the machine is ‘still on’ and still waiting, and may feel rushed into a response just to keep the dialogue active. Without the benefits of body language and encouraging murmurs, the eerie silence makes pauses uncomfortable.
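One partial mitigation, sketched below, is to catch silence explicitly and reassure the user rather than quitting. This assumes the Dialogflow + Actions on Google integration, where a silent user fires the actions_intent_NO_INPUT event; the intent name and wording here are hypothetical.

```typescript
import express from "express";

const app = express();
app.use(express.json());

app.post("/webhook", (req, res) => {
  const intent: string = req.body.queryResult?.intent?.displayName ?? "";

  // Matched via the actions_intent_NO_INPUT event rather than speech:
  // signal that the agent is still listening instead of timing out coldly.
  if (intent === "reassure-on-silence") {
    res.json({
      fulfillmentText: "Take your time, I'm still listening. There's no rush.",
    });
    return;
  }

  res.json({ fulfillmentText: "Sorry, could you put that another way?" });
});

app.listen(3000);
```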

Multi-user conversation. The voice interface cannot deal well with a three-way conversation involving two humans. The voice agent assumes everything said is intended solely for itself, and cannot reliably differentiate between two users. This means that even a throwaway side comment between two users (perhaps commenting on how good the voice agent is) can put the conversation off track as the voice agent tries to interpret the comment as a response to itself. This can make situations where more than one person is using the device unnatural and frustrating.

Email addresses and phone numbers. We explored ways to provide continuity between a voice experience and a second interface such as a website. The first is to use the device’s account information (e.g. Google or Alexa account) to access an email address associated with the device. This is simple to do, but restricts the ‘user’ to being the person who set up the account and the device. It therefore defeats the ability to have multiple users access the service through the same device, and undoes some of the accessibility benefits of voice over other technologies. Furthermore, there may be privacy implications if the user is not aware which account the device is signed into while they are using it.

The second approach is to ask for contact or follow-up information through the voice interface itself. This gets very difficult, as many email addresses are near impossible to say phonetically in a way that a voice bot could record (try dictating the email address “james_brown-eighty4@gmail.com” to a voice bot). Our experience of dictating phone numbers to Google Assistant was similarly problematic.

If the voice agent cannot collect either an email address or a phone number, the options are limited to providing the user with a redemption code that they must note down and remember to act on; a sketch of this approach follows below. This challenge is currently a major hurdle in taking our work to the next stage. GDPR also has implications for how you collect and process data: gaining consent invariably means visiting a website and ticking some checkboxes, and that process makes the simple voice interface less accessible.
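The sketch below shows the redemption-code idea: generate a short code the agent can read aloud and the user can type into the website later. This is a hypothetical illustration; the alphabet choice simply avoids characters that are easily confused when spoken or written down.

```typescript
import { randomInt } from "crypto";

// Characters that are hard to mishear or misread: no 0/O, 1/I/L, S/5, etc.
const SAFE_CHARS = "ACDEFHJKMNPRTWXY34679";

function generateRedemptionCode(length = 6): string {
  let code = "";
  for (let i = 0; i < length; i++) {
    code += SAFE_CHARS[randomInt(SAFE_CHARS.length)];
  }
  return code;
}

// The voice agent reads the code aloud, stores it against the session's
// answers, and the website redeems it to pre-fill the user's plan.
console.log(generateRedemptionCode()); // e.g. "H4KM7X"
```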

Personas. To use Alexa / Google’s persona, or create a new one? A lot of the guidance around voice interfaces talks about creating a personality for your service, and using stylised language to express the brand and user experience you wish for the product. This is a valid and important design consideration. But to complicate things, we found that despite our creating a persona (named ‘Amber’) for users to engage with, they did not conceptually differentiate Amber from the Google Assistant’s persona. Users referred to their conversation with Google rather than Amber. Similarly, users’ feedback on the experience blurred the distinction between our app and the voice platform itself.

An app on a smartphone is easier for the user to distinguish from the phone and the OS than the equivalent on a voice platform. For example, when using Spotify on an Alexa-powered smart speaker at home, it does not feel like you are speaking to Spotify itself; it feels like you are asking Alexa to control Spotify on your behalf. We see two approaches to this: embrace existing personas and make the transition from Alexa to Amber seamless (‘Alexa, help me make an Advance Care Plan’); or exaggerate the differences to help users understand the new context (‘Alexa, let me talk to “Zach” about Advance Care Plans’).

User-testing

When creating a voice agent, you can very quickly become desensitised to the quirks in the conversation as you repeatedly test it yourself. Unintentionally, you become expert in avoiding the pitfalls of your designed conversation. There is therefore great value in testing your agent with new users at every stage of development.

There are two levels of usability challenge with a voice prototype. The first is a matter of the mechanics of the way you have built it: does it make logical sense? Do people understand it? Can you physically have a conversation with it? The second is a matter of context and suitability to the desired user group. If access to the desired user group is scarce or expensive (as is often the case in healthcare), then you can maximise the value of this resource by doing extensive ‘mechanical’ testing on other users (colleagues, friends, family, general public). You may need to give them suggested answers to make the context accessible to general users, but you can then take a more refined and mechanically sound prototype to your valuable context-relevant users.

With our advance care planning prototypes, we tested their ‘mechanical’ functionality with colleagues before taking the prototypes to the homes of target users living with a life-limiting diagnosis. For example, we established with general users that listing multiple choice options as “1,2,3” was more effective than “A, B, C” — because the voice agent could identify the answers more easily. (C = ‘C’ or sea or see, A = Eh?, B = be or bee).

We also made sure the prototypes could understand positional answers. For example:

Device: Is your answer option 1, 2, or 3?
User: The middle one
Device: Yes. Option 2 was correct, well done.
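A sketch of how such positional answers can be normalised is below. This is an assumed helper rather than our exact implementation: it maps phrases like “the middle one” onto the numbered options the agent has just offered.

```typescript
// Resolve a positional answer ("first", "the middle one", "the last one",
// or a bare digit) to a 1-based option number, or null if unrecognised.
function resolvePositionalAnswer(
  utterance: string,
  optionCount: number
): number | null {
  const text = utterance.toLowerCase();

  if (/\bfirst\b/.test(text)) return 1;
  if (/\bmiddle\b/.test(text) && optionCount % 2 === 1) {
    return Math.ceil(optionCount / 2); // only meaningful for an odd count
  }
  if (/\blast\b/.test(text)) return optionCount;

  const digit = text.match(/\b([1-9])\b/);
  if (digit) return parseInt(digit[1], 10);

  return null; // trigger a fallback that rephrases the question
}

// resolvePositionalAnswer("the middle one", 3) === 2
```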

The prototype we put before ‘real users’ was therefore less likely to get stuck on such logical issues. We were also able to optimise the length of questions and information statements, and improve the logical flow of the conversation. As a result, our ‘real user’ sessions were more focussed on what our users thought about the applicability of the technology to their needs, and the suitability of the tools for the task at hand (making advance care plans).

Our sessions also demonstrated the value of being present during testing. It’s tempting to rely on remote or unsupervised testing for voice agents, on the basis that you can see the details of the conversation in the voice agent’s log files. But by witnessing the conversation only through the lens of the log files, you miss a wealth of insights that come from directly observing the frustrations and delights of the real interaction. Furthermore, as the technology can be unreliable, especially with early voice prototypes, some users may need assistance in accessing parts of the service, or getting unstuck.

Audio recording live conversations between humans and voice agents is also a very effective way to enable future analysis.

Fig 3. User-testing in the Helix studio. We used a Google Home Hub, but with the screen turned away from the user. This tested a ‘voice only’ experience, but gave us extra real-time insight into how the Google Assistant was responding to input. The touchscreen also allowed rapid recovery when the prototype failed.

The outcome of our two-pronged testing showed us that the approach of using a voice agent to help people with the more specific task of creating the Advance Decision to Refuse Treatment offered the most promise. The interactions offered by the current state of the art are better suited to a single user and a simple one-to-one conversation, although with the advance of AI, new possibilities for facilitated and complex conversations may well come. The overriding potential of the voice agent in this context lies in the human nature of the interaction, without the human awkwardness of embarrassment and judgement. We also found that taking users through the scenarios in the ADRT had the same effect that we hoped the quiz would have: getting users to think about the value of documenting their wishes.

So, what next?

This was a relatively quick and experimental dive into voice technology for a specific application. But the exploration was grounded in the question of the technology’s accessibility and appeal to an audience that other technologies perhaps leave behind. In that respect, we think we have learned a lot about the potential of this emerging tech. We will continue to experiment with voice technologies, and over time integrate them into our ongoing projects as alternative and auxiliary means of access.

If this has sparked an interest in advance care planning, you can see our web-based platform and make a free plan (currently without the benefit of a voice agent) at https://www.amberplans.com

More information

Amber Plans: https://www.amberplans.com
Digital Care Planning: https://www.digitalcareplanning.com
Helix Centre: https://helixcentre.com/project-advance-care-plans

Resources

Some of the resources and tools we used during this project:

Dialogflow: https://dialogflow.cloud.google.com
Miro (formerly Realtimeboard): https://miro.com
