Excerpt from a new book on multimodal interface design

Rosenfeld Media · Nov 30, 2020

Rosenfeld Media’s Design Beyond Devices: Creating Multimodal, Cross-Device Experiences by Cheryl Platz was released in December 2020. Below are the blurb and an excerpt of Chapter 7 so you can sample it for yourself. You can also check out the book site, which includes a table of contents, testimonials, and all the other good stuff you’d expect.

If you’d like to pick up a copy, you can purchase it directly from Rosenfeld Media, and you’ll get a free copy of the ebook when you purchase the paperback (or you can buy it from Amazon, if you must, but please consider supporting a small business instead).

OK, the blurb, followed by the excerpt; enjoy!

Your customer has five senses and a small universe of devices. Why aren’t you designing for all of them? Go beyond screens, keyboards, and touchscreens by letting your customer’s humanity drive the experience — not a specific device or input type. Learn the techniques you’ll need to build fluid, adaptive experiences for multiple inputs, multiple outputs, and multiple devices.

Chapter 7: The Spectrum of Multimodality

You’ve now explored the full spectrum of options (currently) available to you when breaking beyond the old-fashioned paradigm of a screen, mouse, and keyboard. But pulling those things together requires a broader perspective. A galactic perspective, perhaps?

In Star Trek: The Next Generation, the officers of the Starship Enterprise use technology to navigate through space, scan and catalog planets, communicate in short range and across solar systems, and even to entertain themselves. The crew may have ready access to a variety of touch panels and physical throttles, but they can also ask nearly any question in spoken language. The system can reply with language in kind, but when the answer isn’t well-suited for a spoken reply, they may move seamlessly back to a screen (or even the Holodeck). They might ask for a damage report and hear a spoken summary, while seeing a map of the damage. By tapping on a part of the ship, they can ask the computer to seal off that specific section; or they can ask the computer verbally to take action.

Starfleet Dreams

Yes, like many technologists I’m obsessed with Star Trek. I’ve immersed myself in that world for years as a member of an improvised parody of the original series called “Where No Man Has Gone Before.” The more I explore the world of Star Trek, the deeper my admiration becomes for its optimistic, (usually) inclusive futurism.

The bridge of the Starship Enterprise (Figure 7.1) is a shining example of what today’s designers often refer to as a multimodal interface. The Enterprise supports multiple input modalities, like touch and voice, and it supports multiple output modalities, like screens, VR, and speech. What’s most remarkable is how effortless it is for the crew to change the way they interact with the ship’s computer and its various modes of input at a moment’s notice.

Figure 7.1 The bridge of the Starship Enterprise in the TV series Star Trek: The Next Generation is the ultimate multimodal interface. Officers interact meaningfully with the system in different ways, depending upon their proximity to physical affordances. Star Trek and the Starship Enterprise are © ViacomCBS

Now that next-generation technologies like conversational user interfaces and computer vision have arrived, the most immediate challenge that today’s multimodal experiences face is the tension between visual interface elements and audio interface elements. No matter how much hype surrounds voice-user interfaces, voice is not necessarily best suited to all interactions.

Early voice-enabled interfaces like the Xbox Kinect relied on a “see it, say it” model where customers could speak the written name of an interface element to interact with it, as depicted in Figure 7.2. In contrast, early automotive systems treated voice and touch as separate ways of performing the same task with no synchronization or support for transitions. Mercifully, those experiences are fading away in favor of more integrated approaches to multimodality.

Figure 7.2 Some early voice-enabled multimodal interfaces like Microsoft’s Xbox Kinect relied heavily on direct reference of UI elements. Photo credit: Howcast.com
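To make the “see it, say it” pattern concrete, here is a minimal TypeScript sketch of the general idea — my own illustration with hypothetical names, not code from the book or from the Kinect itself. A spoken transcript is matched against the labels of currently visible UI elements, and the matching element is activated as if it had been selected physically.

```typescript
// A minimal "see it, say it" sketch: the speech recognizer returns a transcript,
// and we activate whichever currently visible element's label matches it.
// All names here are hypothetical, for illustration only.

interface VisibleElement {
  label: string;          // the on-screen text the customer can read aloud
  activate: () => void;   // the action normally triggered by touch or controller
}

function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/[^a-z0-9 ]/g, "");
}

function handleUtterance(transcript: string, visible: VisibleElement[]): boolean {
  const spoken = normalize(transcript);
  // Only elements currently on screen are valid targets; that constraint is
  // what defines the "see it, say it" model.
  const match = visible.find((el) => normalize(el.label) === spoken);
  if (match) {
    match.activate();
    return true;
  }
  return false; // no visible label matched; fall back to other handling
}

// Usage: saying "play movie" activates the on-screen "Play Movie" button.
handleUtterance("Play movie", [
  { label: "Play Movie", activate: () => console.log("Playing...") },
  { label: "Settings", activate: () => console.log("Opening settings...") },
]);
```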

Years later, the experience of designing for Amazon’s voice-enabled countertop tablet device, the Echo Show, put us in uncharted territory. As we worked on one of the first devices to attempt a tight pairing between interactive visuals and natural language speech, we faced a number of new interaction design questions. Could we use the same responses as the Echo itself? What level of display was appropriate? How did an interaction with the Echo Show differ from the Fire TV? How would we help customers switch between physical interactions and spoken interactions?

As we waded more deeply into the product design process for the Echo Show, it became evident that the introduction of the screen to what had previously been a voice-only interaction was far from cosmetic. This particular combination of screen and natural language interface was yet another new paradigm. It seemed silly to read a full forecast when the screen could display it so much more efficiently. Some requests, like “show me stock prices,” might not even warrant a spoken response due to the context.

Coping with this added complexity required us to make intentional choices about the relationship between our primary input and output modalities. Over time, those choices became patterns, and my first multimodal interaction model was born. But much of my thinking has evolved since those early days when I, like the industry, was still quite device-focused.

Before you begin designing specific interactions for your experience, you must first make a conscious choice about what multimodal interaction models your product will support.

Dimensions of Multimodal Experiences

Early attempts to chart these new voice-enabled experiences placed a heavy focus on the voice aspect of the interaction. At Google I/O 2018, Google presented a “Multimodality Spectrum” with a single axis: the level of support for voice interactions, ranging from voice only to no voice.

A single-axis “multimodality spectrum” depicted in “Design Actions for the Google Assistant beyond smart speakers” from Google I/O 2018.

While voice is undeniably a critical element of many multimodal interactions, simple support of voice features isn’t necessarily the defining feature of multimodal interactions. Furthermore, the terms voice forward and screen forward are inherently device-centric.

In order to create a more robust interaction model that will stand the test of time, take a step back and consider the human impacts of multimodal systems. While you’ve explored many input and output modalities thus far, there are two dimensions that have the greatest impact upon your customer experience:

  • Proximity: The typical or average distance between your customer and the device(s) involved in the interaction.
  • Information density: The amount of information presented to your customer in a typical interaction. In this case, information includes length and complexity of spoken prompts; any visual information like images, lights, or written text; and any tertiary information provided over channels like haptic feedback.

To understand what multimodal interaction model applies to your experience, begin by asking yourself these two questions about those critical dimensions of the experience:

  • Where will your customer be in relation to the device? (See Table 7.1.)
  • How much information is presented to your customer during a typical interaction? (See Table 7.2.)
Table 7.1

As discussed in Chapter 6, “Expressing Intent,” the nature of the input sensors you’ve selected will directly impact the assumptions you can make about the distance between your customer and the device they’re interacting with. Note that close proximity is determined by the location of the input sensor, which is not always on the primary device, as seen in streaming devices with microphone-enabled remote controls.

Table 7.2

High-density information is often provided on high-resolution screens. Multimodal rich output also typically includes robust voice and sound interactions. Low-information-density interactions usually include very constrained visuals: a small screen, an LCD readout, or an LED display of some sort. Figures 7.4 and 7.5 illustrate two ends of the information density spectrum.

Figure 7.4 Low information density: a Fitbit wristband.
Figure 7.5 High information density: Netflix on Amazon’s Fire TV

Dynamic Devices

Observant readers may have already noted that these dimensions aren’t necessarily fixed, even for a single device. A person can be near or far from their Amazon Echo. Some devices are more constrained — customers are unlikely to use streaming video devices out of eyeshot. But, in other cases, a single device might support multiple “stops” on both of these spectrums, depending upon customer context.

While the examples here tend to be device-based, in reality you could say that an adaptive device is technically switching between intangible and anchored experiences. A single device could support multiple interaction types. The difference is that an adaptive experience has affordances that allow customers to choose how to interact in the moment.
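As a rough illustration of that in-the-moment choice, consider the sketch below. It is my own hypothetical example (not an API from the book): an adaptive experience watches which input modality the customer actually used and shapes its response accordingly, effectively hopping between anchored-style and intangible-style behavior on a single device.

```typescript
// Hypothetical sketch: an adaptive experience tailors its response to the
// input modality the customer just chose, shifting between screen-centric
// and audio-centric behavior on the same device.

type InputModality = "touch" | "near-field-voice" | "far-field-voice";

interface Response {
  speech: string;              // what the device says aloud
  showDetailOnScreen: boolean; // whether to render a dense visual answer
}

function respondToWeatherRequest(modality: InputModality): Response {
  switch (modality) {
    case "touch":
      // Customer is at arm's length: keep speech minimal, lean on the screen.
      return { speech: "", showDetailOnScreen: true };
    case "near-field-voice":
      // Close enough to read: short spoken summary plus the full visual forecast.
      return { speech: "Here's today's forecast.", showDetailOnScreen: true };
    case "far-field-voice":
      // Possibly across the room: carry the key information in speech.
      return {
        speech: "Today will be 62 degrees and partly cloudy.",
        showDetailOnScreen: true,
      };
  }
}
```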

Mapping the Multimodal Quadrants

Once we’ve answered these two questions, we can begin to understand where our product fits in what was once a dizzying spectrum of experiences. What have our customers learned to expect from experiences like this one, and where have those experiences fallen short?

By placing relative customer proximity on the X axis and information density on the Y axis, you can place most multimodal scenarios firmly into one of four quadrants, as shown in Figure 7.6. Note that there is no superior quadrant: each has its own strengths and weaknesses.

To apply this multimodal interaction model to your designs:

  • For your experience, answer the two questions specified earlier:
    • Where will your customer be in relation to the sensors you’re using during the scenario?
    • How much information is presented to your customer during a typical interaction?
  • Use the answers to these questions to place your experience in one of the four multimodal interaction quadrants (a code sketch of this mapping follows Figure 7.6).
Figure 7.6 Multimodal interaction model spectrum.
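To make the mapping concrete, here is a small sketch of the classification, using my own hypothetical names and example placements rather than anything taken from the book. It simply turns answers to the two questions into one of the four quadrants.

```typescript
// Hypothetical helper: classify an experience into one of the four multimodal
// quadrants from its dominant proximity and information density.

type Proximity = "close" | "long-range"; // answer to question 1
type InfoDensity = "rich" | "limited";   // answer to question 2

type Quadrant = "Adaptive" | "Anchored" | "Direct" | "Intangible";

function classifyExperience(proximity: Proximity, density: InfoDensity): Quadrant {
  if (density === "rich") {
    // Rich output: long-range capability lands in Adaptive,
    // close-only interaction lands in Anchored.
    return proximity === "long-range" ? "Adaptive" : "Anchored";
  }
  // Limited output: close-only is Direct, long-range is Intangible.
  return proximity === "long-range" ? "Intangible" : "Direct";
}

// Illustrative placements (rough guesses, not definitive):
classifyExperience("long-range", "rich");    // "Adaptive"   (e.g., a countertop smart display)
classifyExperience("close", "rich");         // "Anchored"   (e.g., a TV streaming device with a remote)
classifyExperience("close", "limited");      // "Direct"     (e.g., a fitness wristband)
classifyExperience("long-range", "limited"); // "Intangible" (e.g., a voice-only smart speaker)
```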

Note: Cross-Channel Experiences

If you have multiple devices or apps in your broader experience, you might end up working within multiple quadrants for a single project.

Understanding what quadrant you’re designing for helps you understand what assumptions, constraints, and possibilities should inform your team’s work, and will help you make more consistent decisions.

Adaptive Interactions (Quadrant 1)

Rich output, long-range interactions

In the Adaptive quadrant, your experience is capable of both close-range and long-range interactions. You can leverage this potential in a few different ways, as explained in Table 7.3.

Deciding to inhabit this quadrant is a conscious choice and requires both robust hardware capabilities and a willingness to implement an adaptive methodology to allow customers flexibility of input in the moment. Once your customer makes a choice of input modality, your designs may feel a bit like designs from other quadrants — the differentiating factor here is the ability to choose.

Note: Choice and Consequence

If you decide you will only support hands-free interactions, you’re actually designing for the Intangible quadrant. If you decide you will only support close-range interactions like touch, you’re designing for the Anchored quadrant.

Table 7.3 Examples of Adaptive Methodologies

Note: Who’s on (Voice) First?

The #VoiceFirst hashtag on Twitter has long been a rallying ground for passionate voice designers eager to see voice overtake haptic input as the dominant interaction model. However, there are many other input modalities out there beyond voice, including hands-free technologies like gesture. Calling the Adaptive category “voice first” would ignore that future potential, but it certainly includes experiences perceived as “voice first.”

Anchored Interactions (Quadrant 2)

Rich output, close proximity

Anchored experiences include a rich physical presence, and the customer typically remains close to the device involved in the interaction.

Note: Screen-Forward

We initially used “screen forward” to refer to experiences in the Anchored quadrant during my time on the Alexa team at Amazon. However, this was in the context of the Echo Show and Fire TV, without considering the broader range of opportunities beyond the confines of a specific screen.

The assumption that your customer must be close to the device opens up many interaction opportunities:

  • Use of smaller fonts and higher density of written text.
  • Use of high-resolution photos, videos, or UI elements.
  • Reliance upon physical input devices like touchscreens, keyboards, or remote controls.
  • Use of voice as a shortcut for tasks that are time-consuming with physical input.

Anchored experiences that support voice often include near-field microphones, in which case you can assume that your customer is near the device and the screen when they are speaking. This enables you to use voice as a shortcut for your visual interface, as the Xbox did with Bing search.

However, this doesn’t necessarily mean you must support all tasks with voice. Navigational tasks like forward and back are particularly ill-suited to voice, and a remote control or touch control will yield a much better experience if available.
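As a sketch of that “voice as shortcut” idea, the snippet below is a hypothetical illustration (not the Xbox’s actual implementation): an anchored experience routes high-effort intents like search to voice, while simple navigation remains better served by the remote.

```typescript
// Hypothetical anchored-experience router: voice handles intents that would
// otherwise take many remote clicks (like typing a search query on an
// on-screen keyboard), while navigation stays primarily physical.

type VoiceIntent =
  | { kind: "search"; query: string }
  | { kind: "navigate"; direction: "back" | "forward" };

function handleVoiceIntent(intent: VoiceIntent): string {
  switch (intent.kind) {
    case "search":
      // A single utterance replaces an on-screen keyboard session.
      return `Showing results for "${intent.query}"`;
    case "navigate":
      // Supported, but the remote is usually the better tool for this.
      return "Going " + intent.direction;
  }
}

handleVoiceIntent({ kind: "search", query: "space documentaries" });
```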

An emerging category of anchored experiences is virtual reality headsets like the Oculus Quest. Because these devices must be worn on the customer’s head, you can always assume proximity to the device, and the output tends to be richly immersive. Most of these headsets support both voice input and some form of controller-based haptic input, and in some cases, gesture interactions.

Direct Interactions (Quadrant 3)

Limited output, close proximity interactions

Direct experiences are usually associated with small, self-contained form factors like a Fitbit or Nest, or with head-mounted mixed reality and augmented reality displays like Google Glass and Microsoft’s HoloLens.

The technical limitations associated with the form factors of these devices often lead to limited displays of information and associated context. It’s likely that these experiences require some form of training during the out-of-box experience.

The smaller these devices get, the more constrained their interactions become. Some devices in this category might support voice only via grammar-based interactions, due to processing power or internet access considerations. Others might support only haptic input, via one or two dynamic functional buttons.
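To illustrate what “grammar-based” can mean on such constrained hardware, here is a hypothetical sketch (not any vendor’s API): the device recognizes only utterances that exactly match a small, fixed set of phrases, which is far cheaper than open-ended natural language understanding.

```typescript
// Hypothetical grammar-based recognizer for a constrained Direct-quadrant
// device: instead of open-ended natural language understanding, it accepts
// only a small fixed set of phrases that can be matched on-device.

const GRAMMAR: Record<string, () => void> = {
  "start workout": () => console.log("Workout started"),
  "stop workout": () => console.log("Workout stopped"),
  "show heart rate": () => console.log("Heart rate: 72 bpm"),
};

function recognize(utterance: string): boolean {
  const phrase = utterance.trim().toLowerCase();
  const action = GRAMMAR[phrase];
  if (!action) {
    return false; // anything outside the grammar is simply not understood
  }
  action();
  return true;
}

recognize("Start workout"); // matches the grammar
recognize("Begin my run");  // not in the grammar, so it fails
```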

Intangible Interactions (Quadrant 4)

Limited output, long-range interactions

The wave of voice-only smart speakers launched a thousand multimodal ships. These devices proved once and for all to the mainstream market that voice interactions could work on their own merit, without the crutch of a screen.

Note: Intangible vs. Voice-only

While the dominant intangible interfaces at this time are certainly the ubiquitous voice-controlled smart speakers, calling this category “voice only” ignores the other hands-free input technologies on the market, like gesture and computer vision.

Most intangible experiences today are voice-only experiences, designed so that the entirety of their feature set is accessed via audio output (usually earcons and speech) and audio input (usually voice). These devices do not require (and sometimes do not support) visual or physical interaction with the device to complete any key task. Note that this category of experiences can present accessibility challenges for consumers with auditory disabilities.

Despite relying mostly on invisible interactions, these devices often provide alternative visual information, like limited LED status indicators. Intangible devices also often have a few physical controls for intents like volume adjustment and “cancel,” which can be hard to express by voice when audio is playing loudly.

Interested in reading more? Order your copy of Design Beyond Devices here.
