Using “Wizard of Oz” testing for voice apps

How lo-fi voice prototyping helps teams iterate quickly.

H Locke
8 min read · Jul 6, 2021

My team started designing and testing conversational UI experiences in 2019. Two years later, Wizard of Oz (WOz) testing is an early-stage protocol we use on all voice products. This article explains why and how, and shares some of what we've learned.

Voice design needs research

The first thing I have to do on a voice design project is to state categorically that we MUST do user research and testing. There simply is no room to chop it out of the process for reasons of budget or timing. If a client or stakeholder insists on cutting out research, we need to insist on not designing a voice app.


Voice design is more complex than most standard UX design projects. In part, this is because it is (generally) a zero-UI product. Even if you design internally consistent voice responses, there is no initial visual affordance to trigger users’ mental models, and so user expectations (at least at the beginning of a voice experience) can be literally anything.

The other part of the complexity is the voice design iceberg. With voice apps, the more call/response interactions you build into the “happy path”, the bigger the iceberg of responses, logic, journey correction, and error states that sits beneath the waterline — five happy-path turns can easily imply a few dozen no-input, no-match, and correction states. It is essential that you are aware of this when dealing with stakeholder requests.

And as we all know, the more complexity, the more risk. And the more risk, the more reason to conduct user research.


What are we testing?

Voice experiences (usually) do not have a Graphical User Interface (GUI), so there is no “UI design” phase involving component libraries and the like.

The “design” phase of the UCD methodology is therefore almost entirely replaced by designing the product’s information architecture, task flows, and dialogue flows, with teams then moving straight into the “build” phase of either a prototype or the finished product.

However, whether you are designing flows to be built by a dev team, or building straight in a tool like Voiceflow to push live, you still need a thorough understanding of user needs and mental models to ensure product success, just as you would with a GUI.

The top two ways to do this are:

  1. Upfront user research — During the discovery phase you need to conduct generative and exploratory research to understand users’ mental models: literally, what words they use to think and talk about this type of product and service.
  2. User testing — Before investing in the final build, you need to test and iterate using evaluative research. Does the flow of call/response work? Are the utterances correct? Do the invocations and intents match what users are likely to say? Do error-state journeys successfully course-correct the user journey? (A small sketch of what checking intents against real utterances can look like follows this list.)
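
To make that second point concrete: even a few lines of script can turn “do the intents match what users say?” into a checkable question. A minimal sketch in Python — the intent names and utterances here are hypothetical, and the exact-string match is deliberately naive (real platforms do far fuzzier matching):

```python
# Hypothetical designed intents: the sample utterances the team expects.
designed_intents = {
    "CheckBalance": ["what's my balance", "how much money do i have"],
    "PayBill": ["pay my bill", "make a payment"],
}

# Utterances actually captured from test sessions (also hypothetical).
observed = ["how much is left in my account", "pay my bill"]

for utterance in observed:
    matches = [name for name, samples in designed_intents.items()
               if utterance.lower() in samples]
    if matches:
        print(f"'{utterance}' -> {matches[0]}")
    else:
        print(f"'{utterance}' -> NO MATCH (candidate new sample utterance)")
```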

In the second scenario of user testing, you could design every flow, build every interaction as a working prototype and then test it, but — you know what I’m going to say — this is risky. It’s risky because you could be wrong, and you could have walked down a long and expensive route to designing and building your wrongness.

Testing early and often on lo-fi prototypes is always* going to reduce risk. And testing early and often with voice is even more critical than with a traditional GUI.

The comparison I’m drawing here is:

  • GUI — You test lo-fi, greyscale wireframes, and protos before moving into finished UI because it’s quicker and cheaper to iterate. Wireframes are designed to be thrown away.

vs.

  • VUI — You test dialogue flows with WOz like they’re paper prototypes, before moving into QA’d working prototypes because it’s quicker and cheaper to iterate. Flows are designed to be thrown away.

What is WOz testing?

WOz (Wizard of Oz) is a type of experiment traditionally conducted in HCI studies, particularly when the user interacts in a conversational or bi-directional format with a system — especially a system that appears intelligent, mimics some level of AI, or even outright pretends to be human (which we know voice apps and chatbots should not attempt to do).

In this method, the participant talks to some kind of interface, experience, system or machine (the curtain) and perceives that the system responds. In reality, all responses are controlled by the researcher or “wizard” sitting behind the curtain.

How does it apply to voice design?

I first came across this method as applied to voice interfaces in Cathy Pearl’s book, Designing Voice User Interfaces (2016), where it was referenced as a method used in testing the original IVR interfaces.

In other studies, HCI academics have attempted to use WOz to test AI-based language apps, but this has still required an additional prototyping layer.

However, with modern voice apps, the quickest way to test early with WOz is just to use sketched-out dialogue flows — the equivalent of testing paper prototypes or very early-stage navigation structures.

How to do it?

Obviously, you could go to the lengths of pre-recording every response and manually triggering audio files, much as the original HCI researchers did, but even that is a pile of effort compared with just testing the flows.

Prepare your flows

So, as one would with paper prototyping — just draw out the dialogue flows. Do it with pen and paper, on a whiteboard, or, if working remotely, in Miro or OmniGraffle.

Make sure that the team has worked on them together so that a) everyone is familiar and b) the flows have been stress-tested by multiple brains.
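
If you want something slightly more structured than a whiteboard photo — for remote sessions, say — each node of a flow really only needs three things: the system’s prompt, the branches you expect, and a re-prompt for when the participant goes off-script. A hypothetical Python sketch (the wording and node names are illustrative, not from any real tool):

```python
# A dialogue flow as a tiny data structure. Each node holds the line the
# "app" speaks, the expected branches, and an error/repair line.
flow = {
    "welcome": {
        "prompt": "Hi! You can check your balance or hear our opening hours.",
        "branches": {"balance": "check_balance", "hours": "opening_hours"},
        "reprompt": "Sorry, I didn't catch that. Say 'balance' or 'hours'.",
    },
    "check_balance": {
        "prompt": "Your balance is fifty pounds. Goodbye!",
        "branches": {},  # terminal node
    },
    "opening_hours": {
        "prompt": "We're open nine to five, Monday to Friday. Goodbye!",
        "branches": {},  # terminal node
    },
}
```

Notice the iceberg even at this scale: a single happy-path turn already carries a repair line.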

In-person testing

Seat the researcher and the participant back-to-back. The participant will invoke “the app” and the researcher will answer according to the flow (or script) in front of them. If you’re a researcher reading this, I recommend literally following the paper flow with your fingertip.

Note: you could use a divider screen, but that’s probably going to make things even weirder for the participant. It’s much easier to meet the person first, then sit back-to-back: each of you loses sight of the other equally, and that equal loss of perceived control can reduce participant stress.
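
If the paper version starts to creak under the number of branches, the same fingertip protocol can be mimicked in a few lines of code: the script shows the wizard the line to read aloud, the wizard types whichever branch the participant’s reply best matched, and anything off-script triggers the repair line. Again a hypothetical Python sketch, using the flow structure sketched above:

```python
flow = {
    "welcome": {
        "prompt": "Hi! Would you like to check your balance, or pay a bill?",
        "branches": {"balance": "balance", "bill": "bill"},
        "reprompt": "Sorry, I didn't catch that. Say 'balance' or 'bill'.",
    },
    "balance": {"prompt": "Your balance is fifty pounds. Goodbye!", "branches": {}},
    "bill": {"prompt": "Okay, your bill is paid. Goodbye!", "branches": {}},
}

node = flow["welcome"]
while True:
    print("\nSAY:", node["prompt"])  # the wizard reads this aloud
    if not node["branches"]:
        break  # terminal node: conversation over
    while True:
        choice = input(f"Reply matched which branch? {list(node['branches'])} > ")
        if choice in node["branches"]:
            node = flow[node["branches"][choice]]
            break
        print("SAY:", node["reprompt"])  # off-script: read the repair line
```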

Remote testing

When working remotely, you can screenshare your digital whiteboard or OmniGraffle file with observers via Zoom/Teams, and dial your participant in by telephone. This way, your observers can follow the flows on screen, and you can indicate your focus area with your mouse cursor.

The importance of note-takers

Whether in person or remote, your researcher absolutely has to have a note-taker. It is simply too much for one human brain to follow a complex flow and respond to a participant in real time while also recording observations.

What are the benefits of this method?

As above, this method allows you to test early and reduce risk:

  • It’s relatively quick and lo-tech to set up
  • It’s no more expensive than standard user testing
  • It doesn’t require extensive prototype build, which is useful if you don’t have a voice prototyping tool yet, or a dev team on board
  • Early-stage feedback will highlight the biggest issues and errors in your task flows
  • Early-stage feedback on dialogue flows will help validate your discovery research
  • The work of designing and testing flows will potentially speed up your prototyping phase
  • It fits into a design sprint if it has to — just remember to keep the happy path as simple as possible, so that you can control that iceberg.
  • Because the first iteration doesn’t require software, it can be run with nothing beyond pen and paper, by people with no tool-specific skills (zero learning curve) — which of course includes stakeholders new to the voice design process. Essentially, it’s accessible.
  • It’s fun!

Findings so far

When using this in our team, we’ve learned a few things:

  • The researcher needs to have the architecture (flows) internalised — one does not simply pick up a ton of complex flows and “be the wizard”. So it’s important to a) involve the researcher early on and b) allow them time to run practice tests
  • You need to keep the “happy path” scope under control — especially for early-stage prototypes. Again, the app will become an iceberg later anyway, but for early-stage testing and for your researcher’s sanity, try to test one or two happy paths with all their error journeys.
  • Discovery research makes WOz testing more valuable — as with standard UCD projects, upfront research reduces the risk of designing something users don’t want or can’t use. In this case, upfront research helps you make dialogue flows that already follow users’ mental models of a task and use appropriate vocabulary. You really don’t want to be discovering that the words you’re using are wrong in an app that is essentially… made of words.
  • WOz adapts well to remote working — if anything, it’s maybe easier than in person, because you can dial a participant in by telephone while the researcher broadcasts the flows and architecture to observers over video. The observers can see what the researcher/wizard (the “voice app”) is doing, as well as the participant.
  • Users do weird stuff — as all researchers know, humans are somewhat unpredictable, and no matter how much research you do, participants will always surprise you. With voice, it’s the things users say and how their brains work — especially in voice apps, where there are usually no fixed or consistent UI affordances to guide them. Watching this kind of testing really helps sell research to stakeholders, because they learn it’s almost impossible to predict user behaviour.

WOz is fun

One of the biggest benefits of WOz testing is that it’s a collaborative method that brings design teams and stakeholders together. It’s a bit unusual compared to standard usability testing, but then so is voice design compared to your standard app-and-website projects.

Making it fun and hacking together scrappy ways to test early and often stops research being perceived as formal, expensive and time-consuming.

We sometimes underestimate the impact of research, even in small amounts, to bond a project team and open their minds to real user needs. And you’re definitely going to need an open mind when designing for voice products.

* “Always” = it depends. Almost always; in 99% of cases. I can think of edge cases where testing too early would be bad. But they are so very few, I’d have to be a bit creative and pedantic to document them here. So as always, #ItDepends.


References:

WOz method

Wikipedia: Wizard of Oz Experiment
Slideshare: Voice usability testing with WOZ methodology — UX SCOT 2019

Design best practices

Article: A chatbot should tell you it’s not human

Voice methodologies

Article: Voice UX process under 5 mins by Arun George
Book: Designing Voice User Interfaces, by Cathy Pearl
Article: A user-centred methodology for Voice Apps by H Locke

Research studies

Martelaro & Ju | 2017 | WoZ Way: Enabling Real-time Remote Interaction Prototyping & Observation in On-road Vehicles

Schlögl et al. | 2010 | WebWOZ — A Wizard of Oz prototyping framework

Voice prototyping tools mentioned in this article:

Voiceflow

