Designing multimodal interfaces: Voice + Screen
Voice assistants are changing the way we interact with technology. However, if the experience is not seamless, users will be reluctant to adopt it. Many teams therefore create interfaces where voice and visual features enhance and complement each other. This article gives an overview of how to design the best experience when voice and screen work together.
Analysis
Voice experiences have unique characteristics and limitations, which means they should be crafted and refined in a different way. The most important aspect is to understand how voice and screen will coexist:
- Is voice essential?
- Is it only supportive?
- Is it an alternative?
It is important to refine each element, but also to analyze how everything connects in the overall user experience, ensuring that voice functions do not undermine screen functions, and vice versa.
An excellent example of this is the ChatGPT voice interface, which enhances the voice experience with a well-crafted visual design where animations play a crucial role. These animations not only make the assistant feel more human and its states easier to understand, but also make loading times feel shorter and the whole experience more fluid.
Prototyping
Here are some tools that will help you refine the experience quickly, from design to code, without implementing the full functionality. Before choosing a type of prototype, weigh the time, effort, and outcome each one can provide.
You can start by creating flows, iterating on ideas, or defining states in design software like Sketch or Figma. Then move on to a more realistic scenario with no-code prototypes, or program a simplified interface.
For example, ProtoPie is a no-code prototyping tool that offers a variety of audio- and voice-specific functions, such as:
- Command detection
- Text to speech
- Speech to text
- Sound playback
- API connect (on paid plans)
There are hundreds of APIs available to enhance voice interactions, test ideas, and build fast prototypes to determine what works best. Some examples:
- Whisper, which enables advanced speech recognition (see the sketch after this list)
- Eleven Labs for voice cloning
- Large language models like ChatGPT for natural conversations
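To give a sense of how little code a first test can take, here is a minimal sketch of transcribing a recording with Whisper through the official OpenAI Node SDK. The file name and surrounding setup are assumptions for illustration; it presumes OPENAI_API_KEY is set in the environment.

```ts
import fs from "node:fs";
import OpenAI from "openai";

// The client reads OPENAI_API_KEY from the environment by default.
const client = new OpenAI();

async function transcribe(path: string): Promise<string> {
  // Whisper accepts a readable stream of the audio file.
  const result = await client.audio.transcriptions.create({
    file: fs.createReadStream(path),
    model: "whisper-1",
  });
  return result.text;
}

// "recording.mp3" is a placeholder file name for this sketch.
transcribe("recording.mp3").then(console.log).catch(console.error);
```

A throwaway script like this is often enough to judge transcription quality on your own audio samples before committing to a full prototype.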
Variables
As you go deeper into the analysis of the product you are building, you will discover constraints and limitations that are not typical of purely visual interfaces. Here are some of them:
Listening: Commands are words or sentences that trigger actions instantly or once the user finishes speaking. Does the assistant listen automatically? Does the user need to tap a microphone button manually? Or is a long press required? There are countless ways to design the listening trigger, so it is important to give users instructions on how to interact if the behaviour is not intuitive enough.
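As a sketch of one such trigger, the snippet below distinguishes a tap (toggle listening) from a long press (push-to-talk) on a single microphone button. It assumes a browser prototype with a button whose id is "mic"; startListening and stopListening are placeholders you would wire to your speech-recognition layer.

```ts
// Threshold separating a tap from a long press (an assumed value to tune).
const LONG_PRESS_MS = 400;

let listening = false;
let pressTimer: number | undefined;
let longPress = false;

function startListening() { listening = true; /* show listening state */ }
function stopListening() { listening = false; /* return to idle state */ }

const mic = document.getElementById("mic")!;

mic.addEventListener("pointerdown", () => {
  longPress = false;
  // If the press lasts long enough, treat it as push-to-talk.
  pressTimer = window.setTimeout(() => {
    longPress = true;
    startListening();
  }, LONG_PRESS_MS);
});

mic.addEventListener("pointerup", () => {
  window.clearTimeout(pressTimer);
  if (longPress) {
    stopListening(); // push-to-talk: releasing ends the turn
  } else {
    listening ? stopListening() : startListening(); // tap: toggle
  }
});
```

Even a rough version of this logic lets you test in minutes whether tap-to-toggle or push-to-talk feels more natural for your users.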
Feedback: Users need to know whether they have been understood, through live transcription, sounds, states, or error messages. Background noise or long sentences, for example, can make accurate processing challenging, so it is important to take software limitations into account. Think about using the screen as support with visual indicators, or sounds, since the user might be listening to the voice response without looking at the screen. Moreover, information, sounds, and responses are not instant: communicate what is happening by defining different states for your elements.
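One way to keep this manageable in a prototype is to map every assistant state to both a visual and an audible cue up front. The states, cue names, and file names below are illustrative assumptions, not a fixed scheme.

```ts
type AssistantState = "idle" | "listening" | "processing" | "speaking" | "error";

// Each state pairs a visual indicator with an optional earcon, so users
// who are not looking at the screen still receive audible feedback.
const feedback: Record<AssistantState, { visual: string; sound?: string }> = {
  idle:       { visual: "static microphone icon" },
  listening:  { visual: "pulsing waveform",   sound: "start-chime.mp3" },
  processing: { visual: "animated loader",    sound: "thinking-loop.mp3" },
  speaking:   { visual: "speaking animation" },
  error:      { visual: "error banner",       sound: "error-tone.mp3" },
};

function setState(state: AssistantState) {
  const cue = feedback[state];
  console.log(`state: ${state} → show "${cue.visual}"`);
  if (cue.sound) console.log(`play ${cue.sound}`);
}

setState("listening");
```

Making the state table explicit like this also surfaces gaps early, such as a processing state that has a spinner but no sound for eyes-free use.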
Response: Is the answer going to be displayed, played, or both? What happens if it is too long? Does it have limitations? Test your prototype against different examples to make sure it supports all formats, and consider software limitations. A wide range of audio variables can be controlled; think about accessibility and user needs when evaluating whether the user should be able to adjust the volume or pause and resume playback.
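For a browser prototype, the standard HTMLAudioElement already covers the basic playback controls discussed above. Here, "response.mp3" is a placeholder for the synthesized voice response.

```ts
const audio = new Audio("response.mp3");

// play() returns a Promise; errors (e.g. autoplay blocked) surface there.
function playResponse() { audio.play().catch(console.error); }
function pauseResponse() { audio.pause(); }
function toggleResponse() { audio.paused ? playResponse() : pauseResponse(); }

// Clamp volume to the element's valid 0–1 range.
function setVolume(level: number) {
  audio.volume = Math.min(1, Math.max(0, level));
}
```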
Consider the entire experience: For example, while information is loading, some actions, such as sending or recording another message, may need to be blocked. Don't isolate functions that work together, and try to prototype the visual and voice functions side by side as soon as possible.
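A minimal sketch of that blocking behaviour might look like this. The button ids and the /api/respond endpoint are hypothetical names for illustration.

```ts
const send = document.getElementById("send") as HTMLButtonElement;
const record = document.getElementById("record") as HTMLButtonElement;

async function submitMessage(text: string) {
  // Block conflicting actions while the response is loading.
  send.disabled = true;
  record.disabled = true;
  try {
    await fetch("/api/respond", { method: "POST", body: text });
  } finally {
    // Re-enable the controls whether the request succeeded or failed.
    send.disabled = false;
    record.disabled = false;
  }
}
```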
Recommended
- Voice and LLMs: ChatGPT voice interface, AI Pin, Gemini, Siri
- ProtoPie example using GPT-4 and Eleven Labs to clone a voice
- ProtoPie course: ProtoPie Masterclass in advanced voice prototyping
- Conference talks: Looking Beyond Screens and The Secret Sauce of Conversational AI
- Book: The Best Interface Is No Interface
Conclusion
Voice and visual functions can enhance the user experience when they are well refined. It is not only about working on each function in isolation; it is also about integrating them correctly into the overall experience. A good approach is to start prototyping in near-real scenarios as soon as possible.