9 Learnings from designing for the voice experience

Mohamed Hassan
Published in DENKWERK STORIES
8 min read · Jul 22, 2019

How challenging is it to build a voice assistant?

Nowadays, building a voice interface is not a difficult task. Many tools enable developers and designers to plan, design and develop voice interfaces, all the way through to the metrics and feedback stage. Google’s Dialogflow and Amazon’s Alexa Skills Kit are good examples of easy-to-use platforms for planning and conceiving voice user interfaces; there are also other platforms like IBM’s Watson Assistant and Snips.

Third parties also build on top of those platforms to make it even easier for almost anyone to deploy a simple skill/action. With these tools, content creators can use flow-based UIs to create their own chatbots or extend certain voice assistant abilities. Platforms like Sayspring, Voiceflow and Parloa offer this today, and many more will appear in the future, offering multi-modality.
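To give a sense of how approachable this level has become, here is a minimal sketch of a Dialogflow fulfillment webhook in Python using Flask; the intent name and reply text are illustrative assumptions, not part of any shipped skill:

```python
# A minimal sketch of a Dialogflow fulfillment webhook with Flask.
# The intent name and reply text are illustrative assumptions.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    body = request.get_json()
    # Dialogflow sends the matched intent under queryResult.
    intent = body["queryResult"]["intent"]["displayName"]
    if intent == "lights_on":
        reply = "Turning on the light."
    else:
        reply = "Sorry, I didn't get that."
    # The assistant speaks whatever is returned in fulfillmentText.
    return jsonify({"fulfillmentText": reply})

if __name__ == "__main__":
    app.run(port=5000)
```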

But building a whole new voice assistant is a different story: you would need to build an ecosystem, an architecture and a platform of services and components that together make a smart voice assistant.

In our journey of designing a new voice assistant, we faced many challenges, tried many solutions, had breakthroughs and made mistakes. It was, and still is, a quest of research and trial and error. We want to share some of our experiences from this journey. Here are nine of these learnings:

1. Voice is for everyone

  • Speech is the primary element of a voice assistant. Speaking is the most basic and natural form of communication between humans, which is why voice interfaces have, or at least should have, high usability compared to other interfaces. A voice interface is not an extension of the human body but builds on an integral part of it: speaking is a system already established in our own existence. Machines are finally learning to communicate with us through that system, so everyone can at last communicate freely with a computer.
  • Since a speech system is operated mainly by voice, it should be universal by nature and accessible to everyone. There are, however, specific user groups that depend mainly on voice interaction in their daily lives, for example visually impaired or blind people. Tailoring a voice assistant to their needs is a good starting point for research and development: their use of voice interfaces is different, and their use cases should be conceptualised differently from how current voice assistants on the market work.

2. Personality issues

  • Voice interaction with a machine is different from traditional ways of interacting, such as using GUIs. Speaking is a very human act: when you talk to a machine, or any non-human ‘object’, you assign human attributes to it (some users describe their voice assistant as ‘friendly’ or even ‘stupid’). It is no longer an abstract thing; it is an embodied object with feelings and an imposed identity. Most of the time, speaking to a computer makes it feel like a person with a personality.
  • A personality can be expressed in multiple ways, such as through character (tonality, pitch, attitude) and behaviours. Defining a personality, a ‘way of talking’, is a challenging task in creating a voice assistant. Personality traits and characteristic wording can give your voice assistant the right voice.
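One concrete lever for tonality and pitch is prosody markup. Below is a minimal sketch in Python that renders responses as SSML, which Alexa and Google Assistant text-to-speech both accept; the personality names and prosody values are illustrative assumptions:

```python
# A minimal sketch: expressing personality as prosody settings rendered
# to SSML. The personality names and values are illustrative assumptions.

PERSONALITIES = {
    "calm_guide":   {"pitch": "-5%",  "rate": "90%"},
    "upbeat_buddy": {"pitch": "+10%", "rate": "105%"},
}

def render_ssml(text: str, personality: str) -> str:
    """Wrap a response in prosody markup matching a personality."""
    p = PERSONALITIES[personality]
    return (
        f'<speak><prosody pitch="{p["pitch"]}" rate="{p["rate"]}">'
        f"{text}</prosody></speak>"
    )

print(render_ssml("Let's begin your meditation.", "calm_guide"))
```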

3. DTM?

  • Today’s design processes and workflows are better thought out than ever before. Working in an agile or lean workflow implies certain roles and tasks that are usually well defined, such as a scrum master or a product owner. However, in complex projects that involve creating new or not-yet-existing products, where there is not enough experience yet, it is advisable to have a DTM (Deep Team Member): someone who participates actively in their own tasks, but also keeps a cross-sectional view of the other team members’ tasks, and of all teams, in terms of technical knowledge.
  • The role of a DTM differs from a product owner’s or anyone else’s in that a designer (who, as a DTM, carries the end user’s concerns and is a T-shaped-skilled person) is required to constantly learn and engage with pieces of information that might otherwise be lost or never communicated. Understanding the technical limitations is essential for tackling the new challenges these kinds of projects bring.

4. Testing, testing, testing

  • Yes, testing, testing, testing. We are still in the bright era of trying out and training our smart machines, and nothing enriches our knowledge more than testing, especially when it comes to understanding the human side of interacting with a machine. Voice is a very natural interface to use and to design for, yet it comes with the whole package of having a conversation: linguistics, short-term memory, cognitive load, anthropomorphism and many more topics. Should your voice assistant have one voice, or a different voice for every skill/action? There are many ways to answer that, and the first is user testing. It also became obvious that VUIs need their own ways of doing user testing; you can read our article on that here.
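Alongside sessions with real users, part of this testing can be automated. Below is a minimal sketch of utterance-level regression tests for the NLU layer; classify() is a naive keyword stand-in for a real NLU model, and the intents, utterances and threshold are illustrative assumptions:

```python
# A minimal sketch of utterance-level NLU regression tests.
# classify() is a naive keyword stand-in for a real NLU model.

TEST_UTTERANCES = [
    ("turn on the light",            "lights_on"),
    ("switch the lamp on, please",   "lights_on"),
    ("add milk to my shopping list", "shopping_add"),
]

def classify(utterance: str) -> tuple[str, float]:
    """Return (intent, confidence). Stand-in for a real NLU call."""
    if "light" in utterance or "lamp" in utterance:
        return "lights_on", 0.9
    if "shopping list" in utterance:
        return "shopping_add", 0.9
    return "fallback", 0.0

def run_regression(min_confidence: float = 0.7) -> None:
    for utterance, expected in TEST_UTTERANCES:
        intent, confidence = classify(utterance)
        assert intent == expected, f"{utterance!r}: got {intent}"
        assert confidence >= min_confidence, f"{utterance!r}: low confidence"
    print("all utterances passed")

run_regression()
```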

5. Multi-modality is important

  • There is no doubt that today’s connected devices have enabled us to think about more than one touchpoint for a product or a service, and this is no different when designing a VUI, whether on the input side, such as giving voice commands, or on the output side, such as a voice response and an action. The web of connectivity shows, for example, when a user gives a voice command to turn on a light: they expect the action to be completed, more so than receiving a positive voice response. That is why it is advisable to design a voice response that delegates the confirmation to the action itself, saying ‘Turning on the light’ while the light actually turns on, instead of just saying ‘OK, light is turned on’ (see the sketch after this list).
  • It is a holistic experience after all, so it is very important to think about how this voice experience will connect to other interfaces, including screen interfaces, especially for error recovery and helping the user amend problems. Voice interactions should become invisible (they are already invisible to the eye) by being embedded in the user’s daily life: you no longer need to pick up your phone or go to your computer to complete a task.
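Here is the sketch mentioned above: a minimal example of pairing the device action with a progressive spoken confirmation, so the completed action itself confirms the command. StubLight and StubTTS are hypothetical stand-ins for a real device API and speech output client:

```python
# A minimal sketch of pairing the device action with a progressive
# spoken confirmation. StubLight and StubTTS are hypothetical stand-ins.
import asyncio

class StubLight:
    async def set_light(self, room: str, on: bool) -> None:
        print(f"[device] {room} light -> {'on' if on else 'off'}")

class StubTTS:
    async def say(self, text: str) -> None:
        print(f"[voice] {text}")

async def handle_lights_on(smart_home: StubLight, tts: StubTTS) -> None:
    # Fire the real-world action and the confirmation together: the
    # progressive phrasing "Turning on..." delegates the final
    # confirmation to the light itself coming on.
    await asyncio.gather(
        smart_home.set_light(room="living_room", on=True),
        tts.say("Turning on the light."),
    )

asyncio.run(handle_lights_on(StubLight(), StubTTS()))
```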

6. Conversation flows vs dialogue snippets

  • There are many tools out there that can help you visualise a dialogue structure in a tree format; this structure helps voice designers and developers put their ideas and thoughts into a controlled overview. Having a decision tree or a system architecture for a VUI is a good first step in explaining the written dialogues, and it is also very important to put the other touchpoints into an enlarged flow/tree to see the voice experience as a whole. You can achieve a lot at the level of one-turn dialogues, and you can define rules for multi-turn dialogues (double or triple trees), but sometimes users end up in a loop and have to start from the beginning. This is very limiting when it comes to turn-taking.
  • Natural conversation in the human world doesn’t usually take one path; it is a continuous, multi-turn conversation and sometimes open-ended. A good conversational interface should offer the possibility to handle almost any request from the user, from anywhere, without forcing them to follow a certain path. This is where flows fail, and why some chatbots or VUIs seem not smart enough to some users.
  • Dialogue snippets are requests/intents that reflect specific VUI features; each snippet fulfils one voice request. This method becomes useful when we can compile and build the snippets to work together as modules, while giving the user the freedom to communicate with the system in a natural way (see the sketch below).
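A minimal sketch of the snippet idea: each intent gets its own small, self-contained handler, registered in a flat dispatch table instead of holding a position in a tree. The intent names and handlers are illustrative assumptions:

```python
# A minimal sketch of dialogue snippets as independent, composable
# handlers dispatched by intent, instead of a fixed dialogue tree.
from typing import Callable

SNIPPETS: dict[str, Callable[[dict], str]] = {}

def snippet(intent: str):
    """Register a handler as the snippet fulfilling one intent."""
    def register(handler: Callable[[dict], str]) -> Callable[[dict], str]:
        SNIPPETS[intent] = handler
        return handler
    return register

@snippet("weather_today")
def weather_today(slots: dict) -> str:
    return f"Today in {slots.get('city', 'your city')} it is sunny."

@snippet("lights_on")
def lights_on(slots: dict) -> str:
    return "Turning on the light."

def respond(intent: str, slots: dict) -> str:
    # The user can jump to any snippet at any point in the conversation;
    # no fixed path through a tree is required.
    handler = SNIPPETS.get(intent)
    return handler(slots) if handler else "Sorry, I can't help with that yet."

print(respond("weather_today", {"city": "Cologne"}))
print(respond("lights_on", {}))
```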

7. Application tree syndrome (File → Browse → Open)

  • Due to the way today’s voice systems are implemented, users usually have to include an invocation name or service name in their voice commands (except in some sponsored cases), which makes them carry the mental model of ‘starting a computer application’ all the way into speaking a voice command: for example, “Alexa, open Yoga Guru”, rather than speaking out the intent behind the command, such as “Alexa, can you help me do some meditation?”.
  • That behaviour is reinforced by a familiar but not very convenient mental model: File → Browse → Open. Interfaces have long used hierarchical elements to display and access their content; we navigate through menus and submenus to get tasks done. Luckily, shortcuts were added later, but not for everything, and we still right-click to access more options. In a VUI, theoretically, we need none of that: users should be able to jump whenever and wherever they want to complete their request. But since the NLU needs high confidence about what users mean when they speak, we still somehow follow the same mental model, which we call ‘application tree syndrome’ (see the sketch below).
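A minimal sketch of that trade-off: route directly on intent when the NLU is confident, and fall back to asking for an explicit invocation name when it is not. score_intent() is a naive keyword stand-in for a real NLU model, and the intents, keywords and threshold are illustrative assumptions:

```python
# A minimal sketch of confidence-gated intent routing.
# score_intent() is a naive keyword stand-in for a real NLU model.

KEYWORDS = {
    "meditation_session": {"meditation", "meditate", "relax"},
    "weather_report": {"weather", "forecast", "rain"},
}

def score_intent(utterance: str, intent: str) -> float:
    words = set(utterance.lower().split())
    return 1.0 if words & KEYWORDS.get(intent, set()) else 0.0

def route(utterance: str, threshold: float = 0.8) -> str:
    scores = {intent: score_intent(utterance, intent) for intent in KEYWORDS}
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        # Confident: honour the intent directly, no invocation name needed.
        return best
    # Low confidence: fall back to the application-tree model and have
    # the user name the skill explicitly ("Alexa, open Yoga Guru").
    return "ask_for_invocation_name"

print(route("can you help me do some meditation"))  # meditation_session
print(route("open yoga guru"))                      # ask_for_invocation_name
```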

8. Is it a screen-less experience?

  • A VUI is an invisible interface by definition, but this is a tricky statement. A VUI is not just another type of interface that can stand alone: we cannot depend on voice as the only interaction, and visual communication is still an essential part of the whole system. This is due to how humans perceive, process and store information. Our sensory apparatus has limitations, whether auditory, visual or tactile: how many shopping list items can we hear, process and memorise? How many sentences of a story can we hear and understand in a row without a pause? How many news articles can we swipe through in a day? These and many other questions need to be looked at and assessed to decide what the user can do through voice commands (see the sketch below).
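As one small example of working within those limits, here is a minimal sketch of reading a list back in short chunks with pauses, instead of in one long utterance; the chunk size is an illustrative assumption, not an established constant:

```python
# A minimal sketch of respecting auditory short-term memory by
# splitting a spoken list into small chunks. The chunk size is
# an illustrative assumption.

def speak_list(items: list[str], chunk_size: int = 3) -> list[str]:
    """Split a spoken list into short utterances the listener can hold."""
    utterances = []
    for i in range(0, len(items), chunk_size):
        chunk = items[i:i + chunk_size]
        utterances.append(", ".join(chunk) + ".")
    return utterances  # play each with a pause, or wait for the user's "next"

print(speak_list(["milk", "eggs", "bread", "rice", "apples"]))
# ['milk, eggs, bread.', 'rice, apples.']
```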

9. Compromise for the user

  • This is the most important challenge, though not the last. Technology advances at a faster pace than our full understanding of it and of how it affects us. Hence we, as designers, get to tame technologies to users’ needs at market scale, but we don’t get enough time or resources to explore and discuss which technologies should be developed in the first place. In this taming process, however, we shouldn’t get lost or forget that, in the end, the voice experience is the most human-centred design experience there is. Make the user the centre of attention, and compromise for humanity.


Hassan is an interdisciplinary entity. Nowadays interested in designing for voice assistants, artificial intelligence and robotics. @whatabouthassan