Usability testing for Voice AI products

Qian Yu
Cisco Design Community
9 min read · May 12, 2019

Before diving into the details, a little background about my work and experience. I work on a workplace voice AI product called Webex Assistant. It's built on top of our collaboration hardware, and it isn't a voice-only AI: a GUI is part of the interaction.

Just like many others, I don't have answers for everything, and I'm mostly learning on the spot. But here's what I've learned from this project, and I hope you find it helpful.

1. Define your objectives

Why are you about to run this usability testing? What state is your product or feature in right now? Do you have a working product? Or are you still at an early stage and want to test your Minimum Lovable Product (a better alternative to Minimum Viable Product that I learned about recently)?

A well-defined list of testing objectives should also be discussed with your PM and engineering team. If you have an in-house NLP team, bring them into the planning phase too. These early discussions will not only help you make the most of the time and energy you're about to put into the testing, but also make your findings more digestible for the larger team.

We started by defining objectives in the following areas:

  • Interaction flow: Do people know how to initiate the task? Do they understand how to navigate and move forward? Are there any new interaction branches we missed, and how could we navigate the user back to the happy path? Do people agree that the product or feature is more efficient than their current workflow or solution?…
  • Natural Language Responses (NLR): Are the voice prompts and responses clear and informative? Is the AI giving the right info at the right moment? Do people understand the error messages?…
  • UI: Is the correlated UI clear and effective at guiding users through the interaction? Do people find the UI competing with the voice prompts?…
  • Automated speech recognition (ASR): In our scenario, we transcribe the user’s command on the UI. Are there mis-transcriptions? Any recurring error patterns?…
  • Natural Language Processing (NLP): Does the AI understand the ASR transcription both when it’s correct and when it’s wrong? Can we get the right intent?…

Choosing which areas you want to focus on and what your hypotheses are will make the rest of the testing a lot easier. It's also important to set the team's expectations up front: qualitative interview methods can't answer every question.
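To make that concrete, here's a minimal sketch of how I might seed the tracking spreadsheet that comes back in step 6. The column names and the csv-writing code are purely illustrative assumptions, not a prescribed format.

```python
import csv

# Illustrative columns only -- adapt them to your own objectives.
# Each row pairs one objective/hypothesis with what you actually observed.
FIELDS = ["area", "objective", "hypothesis", "participant", "observation", "severity"]

rows = [
    {
        "area": "Interaction flow",
        "objective": "Can people initiate the join-meeting task?",
        "hypothesis": "Most people will use the wake word plus 'join my meeting'.",
        "participant": "P1",
        "observation": "Tapped the touch panel first, then tried voice.",
        "severity": "medium",
    },
]

with open("usability_objectives.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```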

2. Define the target audience and recruiting plan


You should already have some high-level user personas defined. But if you're testing a specific feature that mostly applies to a subgroup of your primary users, think about what kind of participants can give you the most helpful or critical feedback.

For example, if you're trying to understand at a high level what people expect a voice AI to do in your product category, you might want a good mix of frequent voice AI users and people who rarely use one. In our case, we often test meeting room features for collaboration in a workplace environment, so we usually recruit participants who primarily work in an office, since they will be the primary users of the feature.

We usually recruit two types of participants: internal and external.

Internal participants usually fit our target users better. In our company, employees use our own products heavily, and their current workflow lives within the ecosystem of our own portfolio. These participants usually have more direct and relevant feedback on the product. The disadvantage is that they're used to how the ecosystem works, so they tend to think more “in the box”.

External participants can give you fresh perspectives that internal participants usually can't offer. In our case, we design collaboration products for enterprise users, and external participants give us a much better understanding of how other companies and industries work, how people interact with different kinds of tools, and how our product can make their work easier, or maybe not 🤷‍♂️ They also tend to break the interaction flow in the most unexpected and illuminating ways.

3. Define your testing flow

Now it's time to be creative. A lot of research methodologies for GUI products still apply to VUI products. For example, we recently used “card sorting” to test our help menu and find the most natural and logical way to present our command list.

You can find many other articles on how to define a testing flow, so I won't go into detail here. One thing worth pointing out, though, is the warm-up session.

You probably know why you need to warm up participants before getting into the meat of the testing. For VUI products especially, I find it very important to make clear to participants that we're testing the product, not them. Because the technology is still immature, participants will run into errors far more often than with a GUI product. Most of the time people feel embarrassed and self-conscious and blame themselves for the errors, which doesn't help you get the most genuine feedback.

I always ask whether they currently use any voice AI products, how often, and why. That's also a useful piece of info for uncovering potential patterns. Another thing I find useful is asking about their expectations before they actually interact with the product. This is the sky-is-the-limit moment, and I usually get some really wild, ambitious, yet inspirational ideas out of it.

4. Prototype and setup

Based on the objectives and test flow you defined, you now need to decide whether to use a prototype or a working product.

Depending on where you are with the product or feature you're developing, sometimes you don't really have a choice. Still, each approach has its own pros and cons.

Fake prototype:

First of all, when I say fake prototype, I mean a Keynote prototype. I know you might be shaking your head or rolling your eyes, since there are many tools for prototyping voice. However, I personally don't find those tools responsive or rich enough yet. Our product especially is VUI + GUI, so most of the time I need the corresponding comps presented at the same time.


Here's how I do it: I define the main screens with the GUI and voice prompts, use Amazon Polly 🦜 to generate and download the voice prompt audio files, and attach them to each slide. During testing, I project my screen to a bigger screen in the testing room, keep the presenter's view on my laptop with the navigation menu open, and manually navigate to the right screen based on the participant's behavior.
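If you'd rather not generate each prompt by hand in the Polly console, this step can be scripted. The sketch below uses the boto3 Polly client; the prompt text, file names, and voice choice are placeholder assumptions, and it expects AWS credentials to already be configured.

```python
import boto3

# Assumes AWS credentials are already configured locally.
polly = boto3.client("polly")

# Placeholder prompts -- one audio file per Keynote slide in the prototype.
prompts = {
    "join_meeting.mp3": "Sure, joining your two o'clock design review now.",
    "error_not_found.mp3": "Sorry, I couldn't find a meeting scheduled in this room.",
}

for filename, text in prompts.items():
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId="Joanna",  # pick whichever Polly voice best matches your assistant
    )
    with open(filename, "wb") as f:
        f.write(response["AudioStream"].read())
```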

  • Pros: A fake prototype is very helpful for testing interaction flows and NLRs. Since you manually control what's presented to participants, you can easily “fix” any unexpected errors or missed flows, so during testing participants still experience the best flow it can be. You just need to collect all those unforeseen behaviors and add support for the common ones afterwards.
  • Cons: Because it's fake, you won't get a chance to see how the ASR and NLP would actually behave. Can your voice AI understand all those commands correctly? You'll have to double-check afterwards (see the sketch below).
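One way I'd approach that double-check: log every command phrasing participants tried, then run the list through your NLP service afterwards and flag anything that resolves to the wrong intent. The classify_intent() wrapper below is hypothetical, a stand-in for whatever your NLP team exposes rather than a real Webex Assistant API.

```python
# Hypothetical stand-in for your real NLP call; swap in the actual service client.
def classify_intent(utterance: str) -> str:
    # Always guesses "join_meeting" here so the example runs end to end.
    return "join_meeting"

# Command phrasings collected during fake-prototype sessions, with expected intents.
collected = [
    ("join my two o'clock meeting", "join_meeting"),
    ("hop into the design review", "join_meeting"),
    ("cancel that", "cancel_action"),
]

def audit(commands):
    """Return phrasings the NLP classified incorrectly, for follow-up training."""
    misses = []
    for utterance, expected in commands:
        predicted = classify_intent(utterance)
        if predicted != expected:
            misses.append({"utterance": utterance, "expected": expected, "got": predicted})
    return misses

print(audit(collected))  # here, the "cancel that" phrasing would be flagged
```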

Working product:

Just as it sounds, you can also use a working product for testing. This means you've already designed the interactions, NLRs, and UIs, and the engineers have implemented them. This should mainly happen after some earlier rounds of validation, since developing a voice feature is a ton of work.

  • Pros: You get to observe how people interact with the real product and, most importantly, validate whether you've done a good job of discovering enough error paths and defining flows that guide your users back onto the right track. You can also validate whether the ASR and NLP are sufficient at understanding users' intents correctly.
  • Cons: Time and resources required. I also wouldn't recommend this during the early development stage: you've probably missed a lot of error scenarios, which will leave your participants stuck at the beginning of an interaction, and you'll miss the chance to get a bigger-picture understanding of how people will interact with the product or feature.

When you use a working product, I also recommend doing dry runs before the real testing. Due to the complexity of voice AI products, you may run into unexpected tech issues that would make the real testing far less effective.

5. Testing


You've already planned out the process in the previous stages, and the basic disciplines are very similar to how you test a GUI product, so I won't explain much here.

One note to call out, though: when you give the participants a task, always avoid any words or phrases they might echo directly in a voice command, which would bias your results.

For example, one of our use cases is joining a scheduled meeting in a meeting room. When we gave participants the task and asked them to initiate the interaction, if we said “Oh, so how would you join that meeting you just scheduled in this room?”, people would borrow those words directly. That isn't helpful for discovering command variations or for validating whether our AI can understand them all.

I also always ask the participants if they're okay with being recorded. I personally find that taking notes during the testing makes the participant less engaged. Plus, showing video clips to your team during the readout builds empathy more effectively and helps them understand users' real pain.

6. Consolidate insights

If you prepared a spreadsheet back in step 1, now is the time to take it out. If not, that's still totally okay: you've probably already noticed some patterns while running the tests, and you can also create one while re-watching the recordings.

In addition to the questions you had, it's also good to keep track of things like the following (a small sketch for turning these counts into rates comes after the list):

  • Wake word false positives and negatives: how many times did the wake word trigger when it wasn't supposed to, or fail when it was?
  • ASR mis-transcriptions: how many times did your AI transcribe a voice command incorrectly?
  • NLP issues: does your AI always understand users' intents correctly? Which reasonable commands should have worked a certain way but failed?
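The sketch below is one way to turn those tallies into rates. It's illustrative only: the session numbers are made up, and the word error rate (WER) function is a generic word-level edit distance, not anything specific to our product.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Made-up numbers from one hypothetical session, for illustration only.
wake_word_attempts = 40   # times participants tried to wake the assistant
false_rejects = 6         # the assistant ignored a real wake word
false_accepts = 2         # the assistant woke up when nobody addressed it

print(f"Wake word false reject rate: {false_rejects / wake_word_attempts:.0%}")
print(f"Wake word false accepts this session: {false_accepts}")
print(f"ASR WER: {word_error_rate('join my two o clock meeting', 'join my toucan meeting'):.0%}")
```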

Video is a powerful storytelling tool. While re-watching the recordings, don't forget to cut and save clips that demonstrate the issues.

Depending on the scope of the insights you've collected, you should also create some actionable items so that the team understands what comes next.

7. Readout presentations


In the beginning, I suggested discussing and defining the objectives with cross-functional team members. Now it's time to invite them back for a testing readout, and maybe more people from the product team as well. People are usually interested in how users interact with a product they're creating.

Again, don't forget to include some video clips, or even quotes, to make the whole readout more convincing.

8. Follow-ups


The readout presentation isn't the last step of a testing cycle. I usually find people very engaged during the presentation, and they start discussing potential solutions. However, time is limited, and we usually can't reach any conclusions within the presentation meeting, so I have to interrupt the discussion and move on to finish the readout. That means follow-up discussions, and even meetings, are needed.

We love hosting bug-filing parties with a few members of the team, which takes care of tracking the most obvious bugs. For any new feature or flow improvements, though, you might need to go back to the double diamond process.

Qian Yu

UX Lead for Webex Assistant, a multi-modal conversational AI at Cisco Collaboration. 👨🏻‍💻 www.thousandworks.com