Designing multimodal interfaces: Voice + Screen
Voice assistants are changing the way we interact with technology. However, if the experience is not seamless, users will be reluctant to adopt it. Many teams therefore create interfaces where voice and visual features enhance and complement each other. This article gives an overview of how to design the best experience when voice and screen work together.
Analysis
Voice experiences have unique characteristics and limitations, which means they should be crafted and refined in a different way. The most important aspect is to understand how voice and screen will coexist:
- Is voice essential?
- Is it only supportive?
- Is it an alternative?
It is important to refine each element, but also to analyze how everything connects in the overall user experience, ensuring that voice functions do not undermine screen functions, and vice versa.
An excellent example of this is the ChatGPT voice interface, which enhances the voice experience with a well-crafted visual design where animations play a crucial role. These animations not only make the assistant feel more human and its states easier to understand, but also make loading times feel shorter and the whole experience more fluid.
Prototyping
Here are some tools that will help you refine the experience quickly, from design to code, without implementing the full functionality. Before choosing a type of prototype, weigh the time, effort, and outcome each one can provide.
You can start by creating flows, iterating on ideas, or defining states in design software like Sketch or Figma. Then move on to a more realistic scenario with no-code prototypes, or program a simplified interface.
For example, ProtoPie is a no-code prototyping tool that offers a variety of audio- and voice-specific functions, such as:
- Command detection
- Text to speech
- Speech to text
- Sound playback
- API connect (on paid plans)
There are hundreds of APIs available to enhance voice interactions, test ideas, and build fast prototypes to determine what works best. Some examples:
- Whisper, which enables advanced speech recognition (see the sketch after this list)
- Eleven Labs for voice cloning
- Large language models like ChatGPT for natural conversations
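To give a sense of how little code a first test can take, here is a minimal sketch of transcribing a recording with Whisper through the official OpenAI Node SDK. The file name and surrounding setup are assumptions for illustration; it presumes OPENAI_API_KEY is set in the environment.

```ts
import fs from "node:fs";
import OpenAI from "openai";

// The client reads OPENAI_API_KEY from the environment by default.
const client = new OpenAI();

async function transcribe(path: string): Promise<string> {
  // Whisper accepts a readable stream of the audio file.
  const result = await client.audio.transcriptions.create({
    file: fs.createReadStream(path),
    model: "whisper-1",
  });
  return result.text;
}

// "recording.mp3" is a placeholder file name for this sketch.
transcribe("recording.mp3").then(console.log).catch(console.error);
```

A throwaway script like this is often enough to judge transcription quality on your own audio samples before committing to a full prototype.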
Variables
As you go deeper into the analysis of the product you are building, you will discover constraints and limitations that are not typical of purely visual interfaces. Here are some of them:
Listening: Commands are words or sentences that trigger actions instantly or once the user finishes speaking. Does the assistant listen automatically? Does the user need to tap a microphone button manually? Or is a long press required? There are countless ways to design the listening trigger, so it is important to give users instructions on how to interact if the behaviour is not intuitive enough.
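As a sketch of one such trigger, the snippet below distinguishes a tap (toggle listening) from a long press (push-to-talk) on a single microphone button. It assumes a browser prototype with a button whose id is "mic"; startListening and stopListening are placeholders you would wire to your speech-recognition layer.

```ts
// Threshold separating a tap from a long press (an assumed value to tune).
const LONG_PRESS_MS = 400;

let listening = false;
let pressTimer: number | undefined;
let longPress = false;

function startListening() { listening = true; /* show listening state */ }
function stopListening() { listening = false; /* return to idle state */ }

const mic = document.getElementById("mic")!;

mic.addEventListener("pointerdown", () => {
  longPress = false;
  // If the press lasts long enough, treat it as push-to-talk.
  pressTimer = window.setTimeout(() => {
    longPress = true;
    startListening();
  }, LONG_PRESS_MS);
});

mic.addEventListener("pointerup", () => {
  window.clearTimeout(pressTimer);
  if (longPress) {
    stopListening(); // push-to-talk: releasing ends the turn
  } else {
    listening ? stopListening() : startListening(); // tap: toggle
  }
});
```

Even a rough version of this logic lets you test in minutes whether tap-to-toggle or push-to-talk feels more natural for your users.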
Feedback: Users need to know whether they have been understood, through live transcription, sounds, states, or error messages. Background noise or long sentences, for example, can make accurate processing challenging, so it is important to take software limitations into account. Think about using the screen as support with visual indicators, or sounds, since the user might be listening to the voice response without looking at the screen. Moreover, information, sounds, and responses are not instant: communicate what is happening by defining different states for your elements.
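One way to keep this manageable in a prototype is to map every assistant state to both a visual and an audible cue up front. The states, cue names, and file names below are illustrative assumptions, not a fixed scheme.

```ts
type AssistantState = "idle" | "listening" | "processing" | "speaking" | "error";

// Each state pairs a visual indicator with an optional earcon, so users
// who are not looking at the screen still receive audible feedback.
const feedback: Record<AssistantState, { visual: string; sound?: string }> = {
  idle:       { visual: "static microphone icon" },
  listening:  { visual: "pulsing waveform",   sound: "start-chime.mp3" },
  processing: { visual: "animated loader",    sound: "thinking-loop.mp3" },
  speaking:   { visual: "speaking animation" },
  error:      { visual: "error banner",       sound: "error-tone.mp3" },
};

function setState(state: AssistantState) {
  const cue = feedback[state];
  console.log(`state: ${state} → show "${cue.visual}"`);
  if (cue.sound) console.log(`play ${cue.sound}`);
}

setState("listening");
```

Making the state table explicit like this also surfaces gaps early, such as a processing state that has a spinner but no sound for eyes-free use.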
Response: Is the answer going to be displayed, played, or both? What happens if it is too long? Does it have limitations? Test your prototype against different examples to make sure it supports all formats, and consider software limitations. A wide range of audio variables can be controlled; think about accessibility and user needs when evaluating whether the user should be able to adjust the volume or pause and resume playback.
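For a browser prototype, the standard HTMLAudioElement already covers the basic playback controls discussed above. Here, "response.mp3" is a placeholder for the synthesized voice response.

```ts
const audio = new Audio("response.mp3");

// play() returns a Promise; errors (e.g. autoplay blocked) surface there.
function playResponse() { audio.play().catch(console.error); }
function pauseResponse() { audio.pause(); }
function toggleResponse() { audio.paused ? playResponse() : pauseResponse(); }

// Clamp volume to the element's valid 0–1 range.
function setVolume(level: number) {
  audio.volume = Math.min(1, Math.max(0, level));
}
```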
Consider the entire experience: For example, while information is loading, some actions, such as sending or recording another message, may need to be blocked. Don't isolate functions that work together, and try to prototype the visual and voice functions side by side as soon as possible.
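A minimal sketch of that blocking behaviour might look like this. The button ids and the /api/respond endpoint are hypothetical names for illustration.

```ts
const send = document.getElementById("send") as HTMLButtonElement;
const record = document.getElementById("record") as HTMLButtonElement;

async function submitMessage(text: string) {
  // Block conflicting actions while the response is loading.
  send.disabled = true;
  record.disabled = true;
  try {
    await fetch("/api/respond", { method: "POST", body: text });
  } finally {
    // Re-enable the controls whether the request succeeded or failed.
    send.disabled = false;
    record.disabled = false;
  }
}
```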
Recommended
- Voice and LLMs: ChatGPT voice interface, AI Pin, Gemini, Siri
- ProtoPie example using GPT-4 and Eleven Labs to clone a voice
- ProtoPie course: ProtoPie Masterclass in advanced voice prototyping
- Conference talks: Looking Beyond Screens and The Secret Sauce of Conversational AI
- Book: The Best Interface Is No Interface
Conclusion
Voice and visual functions can enhance the user experience when they are well refined. It is not only about working on each function in isolation; it is also about integrating them correctly into the overall experience. A good approach is to start prototyping in near-real scenarios as soon as possible.