Prototyping for Conversational Interfaces

Designing a Low-Fidelity Prototyping Tool for Voice-UI

Steffen Blümm
8 min read · Dec 10, 2017

Over the last months we have seen a lot of discussion about conversational interfaces. Of course, this is also due to the availability of Amazon Echo and Google Home devices in more countries and of the underlying services in more languages.

At adorsys we started to look into designing conversational interfaces as well. First we took a look at designing an Alexa skill for banking scenarios. We started out with user requirements engineering tasks (surveys, interviews, personas and use cases) and conceived scenarios we wanted to evaluate in low-fidelity prototyping sessions. But wait… how do you prototype for conversational interfaces?

Why does prototyping for conversational interfaces differ that much from prototyping for graphical user interfaces?

Do we not want to be in a dialogue with our users when designing the flow of our mobile and desktop apps as well?
Yes, we do, but natural conversation is different. The user is more flexible in expressing her needs and demands. Conversation may be less focused, not strictly heading in a pre-determined direction (or one determined by the UI elements of a GUI). Conversation feels more personal because of the way one expresses oneself individually. And conversation is less formal. Thus we can expect a wide range of variations in how our future users might address our conversational service.
A graphical user interface tells the user where she has to click to do something. This can introduce friction into the interaction, as the user might not find the link to the action or information she is looking for.
A conversational interface, especially a voice interface, does not provide this strict guidance. Conversational UI is all about lowering the friction between the user and the interface.

On the other hand, we need to find a less formal register to express ourselves when we start the conversation with our users over a conversational interface. Spoken language differs from written language. We need to be sensitive to such differences, otherwise the conversation will not feel natural either.

Therefore, early user participation is very important, for example by engaging future users in low-fidelity prototyping sessions to test scenarios with a diverse set of potential users early and often.

A paper-prototype for conversational interfaces?

When developing smartphone apps we might build a paper prototype to test such scenarios easily and affordably, validating ideas and withdrawing those that do not work as expected. At least for Voice-UI (Alexa, Google Home) a paper prototype does not really work. Nevertheless, we can use this analogy to formulate what behaviour or features we expect from a low-fidelity prototyping approach:

  • it should be easy, fast and affordable to make changes
  • it should invite us to change and discard ideas without hesitation (low emotional involvement)
  • we want to be able to test early and to fail early
  • as we want to test a conversational interface, the conversational flow should be able to evolve during the test. This effectively means that ‘the prototyping tool’ has to be able to understand the test-user.

Prototyping approaches, tools and services

Looking around and following the discussions, one gets the impression that a diverse set of prototyping approaches is currently in use. There are Wizard of Oz (WOz) approaches, services such as Sayspring, and tools like DialogFlow (formerly Api.ai) or Wit.ai.

At the time we took a closer look at those services and technical solutions, they did not support the German language yet, so a natural conversational flow was not really possible. Furthermore, we had the feeling that with those technical solutions the system would only understand what we told (or worse: coded) it to understand.

As an example: Imagine we want to design a system which would allow you to look up train connections. If we tell the system to understand:

I want to travel from A to B.

in most cases it will be able to understand this. Now we also come up with I want to drive from A to B, I want to go from A to B, etc. However, what if one of our test-users says:

I want to travel to B via C.

Certainly, this is a valid instruction one can give to a human operator selling transportation tickets (the from-part can often be inferred from the current location anyway). However, those technical services will not be able to react to it correctly if we did not tell them about such a phrasing beforehand. Yet we want to evaluate an open conversation, not a conversational scenario that is restricted to a few specific, singular phrases.
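
To make this concrete, here is a deliberately naive sketch of what ‘the system only understands what we told it’ means in practice. It is written in TypeScript with made-up sample phrases and is not how DialogFlow, Wit.ai or Alexa work internally; real NLU services generalise somewhat, but still only within the phrasings they were given.

```typescript
// Hypothetical sample-phrase matcher, only to illustrate the limitation.
type Intent = { name: string; patterns: RegExp[] };

const intents: Intent[] = [
  {
    name: "FindConnection",
    patterns: [
      // covers "travel/drive/go from A to B", and nothing else
      /^i want to (travel|drive|go) from (?<from>.+) to (?<to>.+)$/i,
    ],
  },
];

function match(utterance: string): { intent: string; slots: Record<string, string> } | null {
  for (const intent of intents) {
    for (const pattern of intent.patterns) {
      const m = utterance.match(pattern);
      if (m) return { intent: intent.name, slots: m.groups ?? {} };
    }
  }
  return null; // "Sorry, I did not understand that."
}

console.log(match("I want to travel from Nuremberg to Berlin")); // matched, slots filled
console.log(match("I want to travel to Berlin via Leipzig"));    // null: phrasing was never anticipated
```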

Wizard of Oz to the rescue?

Thus we thought about using the WOz approach. The great advantage of this approach is that the WOz is a human: she will understand the needs and demands the test-user expresses and can formulate (or choose) an answer. Therefore, a natural conversational flow is within reach.

Another advantage of this setup is that we can start with relatively few phrasing variations of what our potential users might say, since the WOz understands our test-users anyway. This means we spend less time beforehand constructing collections of possible phrasings that will never be complete. Furthermore, we can collect those phrasings directly from our test-users.

Simple WOz setup

The simplest WOz setup would be to seat the test-user, the test-operator and the WOz at one table and evaluate the scenarios. However, this setup has some obvious disadvantages:

  • the WOz can see the test-user, including facial expressions and gestures. This is information that conversational interfaces will not have, at least at the moment.
  • the test-user sees the WOz, so it is not believable that she is interacting with a ubiquitous technology; the interaction thus relies on a totally different mental model.

Well, those two issues can be fixed: we can either position the WOz behind a curtain, or use VoIP (voice over IP) and place the WOz in a different location.

Technically more refined setup — different locations, transmitting audio via the network

However, further disadvantages remain (and they also apply to the simpler setup above):

  • the WOz can hear the whole conversation in the test lab, not only the requests addressed to the system, a capability the conversational interface will (and should) most likely not possess.
  • the WOz has to concentrate all the time and has to answer concisely and correctly. Nevertheless, a slip of the tongue can happen here and there, which might impact the test.
  • still, as the test-user can identify the WOz as human, the interaction relies on a different mental model.

Therefore we went a step further and conceived a system that avoids those disadvantages. On the one hand we use a wake-action (instead of a wake-word like ‘Alexa’), so audio is only transferred to the WOz when the test-user really addresses the system. On the other hand we use pre-rendered phrases (i.e. audio files) that are triggered by the WOz (or simulator-operator) to provide responses to the test-user’s requests.

Simulating the Wizard of Oz

What do we have here?
The test-user and the test-operator are located in the test lab; the simulator-operator sits in a different room. The test-user sits in front of a computer which exchanges audio (the spoken requests of the test-user and, in return, the responses of the WOz-system) with the computer running the WOz-interface. The audio is streamed over a network connection. Using a button (the ‘LISTEN’ button in the diagram) the test-user actively engages with the system; this is the wake-action. Only while this button is pressed is audio transmitted to the simulator-operator.
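
The exact implementation of our tool is beyond the scope of this post, but the wake-action idea can be sketched in a few lines. The following TypeScript snippet assumes a browser client and a WebSocket relay to the simulator-operator’s machine; the URL, the audio format and the chunk interval are illustrative choices, not a description of our actual tool.

```typescript
// Sketch: transmit microphone audio only while the LISTEN button is held down.
const socket = new WebSocket("wss://example.local/woz-audio"); // illustrative URL

async function setupListenButton(button: HTMLButtonElement) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  let recorder: MediaRecorder | null = null;

  button.addEventListener("mousedown", () => {
    // Pressing the button is the wake-action: start capturing and forwarding audio.
    recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
    recorder.ondataavailable = (event) => {
      if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
        socket.send(event.data); // forward the chunk to the simulator-operator
      }
    };
    recorder.start(250); // emit an audio chunk every 250 ms
  });

  button.addEventListener("mouseup", () => {
    // Releasing the button ends the wake-action; nothing else is transmitted.
    recorder?.stop();
    recorder = null;
  });
}
```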
The simulator-operator has a panel on which she can trigger pre-rendered phrases, a bit like a DJ. Different test scenarios can be selected and answered via tabs. The audio gets sent to the test lab so that the test-user hears the response.
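
The operator panel can be sketched in a similar way. Again, the scenario names, audio file paths and the WebSocket URL below are made up for illustration; they only show the idea of triggering pre-rendered phrases per scenario tab.

```typescript
// Sketch: one tab per test scenario, one button per pre-rendered phrase.
type Phrase = { label: string; file: string };

const scenarioTabs: Record<string, Phrase[]> = {
  "account-balance": [
    { label: "Greeting", file: "audio/greeting.mp3" },
    { label: "Balance answer", file: "audio/balance-answer.mp3" },
    { label: "Did not understand", file: "audio/fallback.mp3" },
  ],
};

const toTestLab = new WebSocket("wss://example.local/woz-responses"); // illustrative URL

function renderScenario(container: HTMLElement, scenario: string) {
  for (const phrase of scenarioTabs[scenario]) {
    const button = document.createElement("button");
    button.textContent = phrase.label;
    // Clicking a phrase tells the test-lab client which audio file to play back.
    button.addEventListener("click", () => {
      toTestLab.send(JSON.stringify({ play: phrase.file }));
    });
    container.appendChild(button);
  }
}

// On the test-lab side, the counterpart would roughly do:
//   new Audio(message.play).play();
```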

As we still have a human operator who responds by triggering phrases, we remain able to understand each and every request of the test-users.

One might argue that this approach is limited because the operator cannot respond freely to the test-user’s demands. However, there are three reasons why we consider this a minor concern:

  • firstly, we want to test scenarios. Therefore it seems reasonable to give the test-users clear instructions (in a way that avoids priming them!).
  • secondly, a test session would be quite stressful and error-prone if the operator had to construct concise answers to arbitrary requests. If we consider a banking scenario (asking about account balances, making transactions, updating those balances, etc.) we can imagine a plethora of possible options.
  • thirdly, we designed a prototyping tool-chain which allows us to make changes to the test scenarios affordably and quickly (this will be introduced in another post).

Conclusion

We discussed the approach and the tool we developed to evaluate test scenarios for conversational interfaces, more specifically for voice assistant technology. Our aim was to be able to test early and fail early. Thus, we think it is important to have a tool built specifically for low-fidelity prototyping.

We evaluated different approaches, tools and services but decided that they did not support us well enough. Therefore we formulated our expectations and conceived a tool that lets us test and validate the scenarios we constructed from the outcome of our user requirements engineering activities easily, quickly and affordably. We think these expectations are realised most effectively if the need for coding is minimal.

Besides allowing us to evaluate scenarios, the tool supports conversations that evolve almost naturally within the context of a scenario. It enables testing in the early stages of a project because we do not have to teach the technology arbitrary phrasings of intents beforehand. Finally, it helps us collect those phrasings from our test-users.

A couple of people contribute to the ‘research projects’ of the CUI team [conversational interfaces research team] @ adorsys:
Steffen Blümm | Technical Lead iOS / CUI
Julian Wölk | master student
Martin Bauer | master student
Isabella Thürauf | bachelor student

We presented this approach at MuC 2017 (Mensch und Computer 2017, Usability Professionals track) in Regensburg, Germany, at the Mobile Media Forum 2017 in Wiesbaden, Germany, and at the UX Congress 2017 in Frankfurt, Germany.
