Multi-modality in #VoiceGames

Florian Hollandt · Published in #VoiceFirst Games · Aug 12, 2018 · 10 min read

Multi-modality is a hot topic these days, and Dustin recently devoted an entire article to the distinction between voice-only, voice-first and voice-added applications. In this article, we’re reviewing both the state of the art and the future potential of multi-modality in #VoiceGames.

Let’s first find some common terms that we can agree on. In a strict sense, there’s a distinction between modalities, by which users transfer information towards a system, and channels, by which a system transfers information towards the user: modalities are about input, channels about output. If you have a traditional TV set, it entertains you on the visual and audio channels, and is controlled by the touch modality, i.e. the pressing of buttons on the remote.

To simplify matters, I’ll talk about device types as modalities, and ditch the distinction between channel and modality. These are the modalities we’ll be encountering:

  • Voice interaction, as in smart speakers and screens, mobile phones (Google Assistant, the Alexa app, or Reverb) or exotic things like browsers (EchoSim.io) or TVs (FireCube). In the scope of this article, the visual capabilities of their light rings or dots are part of the voice interaction and don’t count as a separate modality.
  • Auxiliary screens, in the sense of the display interface and Skill cards of Alexa Skills (on Echo Show and Spot, and what contemporary Fire tablets and the FireCube display when a Skill is active)
  • ‘Classic’ computer-like devices such as PCs, laptops, tablets and smartphones (henceforth referred to as ‘computers’). Their screens can physically be the same as the auxiliary screens mentioned before, but have much higher flexibility in both their input (keyboard, mouse, touchscreen) and output (browsers, apps and video games) capabilities.
  • Gadgets, of which there are currently only Echo Buttons. But theoretically, anyone could build a custom device (like a Raspberry Pi with some fun peripherals) that acts as a gadget ‘modality’.
  • Physical artefacts — This really stretches the sense of what constitutes a modality… But let’s say you have a ‘musical chairs’ voice app, then the physical chairs are part of the experience similar to a screen or a gadget, except that the voice app might have to ask about its current state.

Equipped with this common understanding of modalities, let’s look at the spectrum of how dominant the voice modality is in different gaming experiences.

This diagram (familiar to those among you who attended my Voice Game Panel at the Voice Summit 2018) distinguishes voice and non-voice modalities, with the width of the blue ‘voice’ part being proportional to its dominance in the respective game. If a non-voice modality is optional, its width in the diagram is decreased accordingly.

Voice-only games

Such games have no other modality besides voice, and are by far the most common ones. For many games this is a good choice: especially when the cognitive load is low (such as in interactive stories), a screen provides little value.
Some games can be surprisingly complex without needing any screen, as long as the complexity can be approached in a linear way. One nice example of this is Lemonade Stand, where optimizing three variables in a changing environment is turned into a kids’ game with the magic of storytelling.

Voice games with auxiliary screens

This group of multi-modal games can be divided into three sub-groups, with increasing relevance of the auxiliary screen:

Different levels of multi-modality for games with auxiliary screens. Games: Jurassic World Revealed (upper left), ALIEN Offworld Colony Simulator (upper right), Panda Rescue (middle left), Six Swords (middle right), Deal or No Deal (lower left) and Song Quiz (lower right)

Screen-decorated voice games

Such games use static images for certain scenes or situations, with the benefit of creating atmosphere or stimulating the user’s imagination. Examples of this are Jurassic World: Revealed and ALIEN Offworld Colony Simulator.
Such ‘decorative’ screens are characterized by a lack of both fine-grained information about the user’s current game state and actionable elements. The latter can prove detrimental to the game, if the user is tempted to touch the screen and thereby deactivate the smart speaker’s microphone.

Echo Show display augmenting the Chessboard Skill by representing the very complex game state of chess

Screen-augmented voice games

The next level of relevance for a screen is to display dynamic data about the user’s game state. Particularly pronounced examples of this are Chessboard and Six Swords, both of which display ‘maps’ that take quite a bit of mental strain off the player.
Screen-augmented voice games share the risk that a session gets disrupted by the user’s tentative touch interactions.
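To make this concrete, here is a minimal sketch (TypeScript, ASK SDK v2 for Node.js) of how a skill could speak a turn result while rendering the updated game state on an Echo Show via the Display interface. The intent name, the toy movement logic and the map-image URL helper are made up for illustration; this is not how Six Swords or Chessboard are actually implemented.

```typescript
// Minimal sketch: speak the turn result while rendering the current game state
// on a device with a display, using a Display.RenderTemplate body template.
import * as Alexa from 'ask-sdk-core';
import type { interfaces } from 'ask-sdk-model';

// Hypothetical helper: turns the player's position into a pre-rendered map image URL.
const buildMapImageUrl = (x: number, y: number): string =>
  `https://example.com/maps/${x}-${y}.png`;

const MoveIntentHandler: Alexa.RequestHandler = {
  canHandle(handlerInput) {
    return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
      && Alexa.getIntentName(handlerInput.requestEnvelope) === 'MoveIntent';
  },
  handle(handlerInput) {
    const state = handlerInput.attributesManager.getSessionAttributes();
    state.x = (state.x || 0) + 1; // toy game logic: move one tile east
    handlerInput.attributesManager.setSessionAttributes(state);

    const template: interfaces.display.BodyTemplate2 = {
      type: 'BodyTemplate2',
      token: 'map',
      title: 'Your position',
      image: {
        sources: [{ url: buildMapImageUrl(state.x, state.y || 0) }],
      },
      textContent: {
        primaryText: { type: 'PlainText', text: `Tile ${state.x}, ${state.y || 0}` },
      },
    };

    return handlerInput.responseBuilder
      .speak('You move east. A river blocks the path ahead.')
      .addRenderTemplateDirective(template)
      .getResponse();
  },
};

export const handler = Alexa.SkillBuilders.custom()
  .addRequestHandlers(MoveIntentHandler)
  .lambda();
```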

Screen-assisted voice games

In this sub-group, the screen presents some of the choices that can be made in a scene and offers the user the option to select them by touch (currently in either a horizontal or a vertical list format, as with Deal or No Deal, though I imagine that at some point we’ll be able to use image maps to make arbitrary areas of the display clickable). Personally, I think this is the sub-group of games most deserving of the description ‘voice-first’, because it offers two fully developed modalities, of which voice is the dominant one.
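As an illustration, here is a sketch of what such a touch-selectable choice list could look like with the Display interface’s ListTemplate1, where each list item carries a token that comes back in a Display.ElementSelected request when tapped. The choice texts and the shared resolveChoice helper are hypothetical, not taken from Deal or No Deal.

```typescript
// Sketch: offer the scene's choices as a touch-selectable list, and treat a touch
// exactly like the equivalent spoken choice.
import * as Alexa from 'ask-sdk-core';
import type { interfaces } from 'ask-sdk-model';

const choices = ['Open case 7', 'Take the deal', 'Walk away'];

const listTemplate: interfaces.display.ListTemplate1 = {
  type: 'ListTemplate1',
  token: 'choices',
  title: 'What do you want to do?',
  listItems: choices.map((text, i) => ({
    token: `choice-${i}`, // comes back in the Display.ElementSelected request
    textContent: { primaryText: { type: 'PlainText' as const, text } },
  })),
};

// Fires when the user taps a list item instead of speaking.
const ElementSelectedHandler: Alexa.RequestHandler = {
  canHandle(handlerInput) {
    return Alexa.getRequestType(handlerInput.requestEnvelope) === 'Display.ElementSelected';
  },
  handle(handlerInput) {
    const request = handlerInput.requestEnvelope.request as interfaces.display.ElementSelectedRequest;
    const index = Number(request.token.replace('choice-', ''));
    return resolveChoice(handlerInput, index); // hypothetical shared game logic
  },
};

// Shared resolution used by both the touch handler and the matching voice intent.
function resolveChoice(handlerInput: Alexa.HandlerInput, index: number) {
  return handlerInput.responseBuilder
    .speak(`You chose to ${choices[index].toLowerCase()}.`)
    .addRenderTemplateDirective(listTemplate)
    .getResponse();
}
```

The key point is that the touch path and the voice path converge on the same game logic, so voice remains the dominant modality and touch is just a shortcut.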

Voice games combined with computers

Remember, within the context of this article, ‘computer’ refers to PCs, laptops, tablets and smartphones. So basically the ‘modality’ we’re discussing here is browsers and applications, without much of a distinction as to whether they are controlled by keyboard and mouse or by touchscreen.

Computer-extended voice games

Some voice games have promotional websites, and in most cases these are simply part of the funnel that leads users to the game, which I obviously don’t consider part of the gaming experience. But there are a few cases where companion websites (or social media presences) are created for retention and to extend the gaming experience by providing more information. Examples of this are Kids Court (with a website) and Question of the Day (with a Facebook and Twitter account).

Computer-augmented voice games

In this group, a voice game has a companion app or website, which offers either new information or a more visually oriented representation of the user’s state in the game. I’m not aware of many examples in this group, but two are Panda Rescue, which has a website with a global high score, and, to a higher degree, Rob McCauley’s State Games, in which the user has to build a chain of US states, and where the website shows a map with dynamically colored states and a list of viable options.

Screenshot of State Games’ companion website, which presents rich information on the game’s current state, but no input (i.e. the states on the map or in the list are not clickable).
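One plausible way to wire up such a companion website (an assumed architecture, not necessarily how State Games is actually built) is to persist the game state keyed by the Alexa user id and let the web backend read the same table. The intent, slot and table names below are made up.

```typescript
// Sketch: persist the game state per user so a companion web backend can read it.
// Requires the ask-sdk-dynamodb-persistence-adapter package.
import * as Alexa from 'ask-sdk-core';
import { DynamoDbPersistenceAdapter } from 'ask-sdk-dynamodb-persistence-adapter';

const NameStateIntentHandler: Alexa.RequestHandler = {
  canHandle(handlerInput) {
    return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
      && Alexa.getIntentName(handlerInput.requestEnvelope) === 'NameStateIntent';
  },
  async handle(handlerInput) {
    const newState = Alexa.getSlotValue(handlerInput.requestEnvelope, 'usState');

    // Append the named state to the persisted chain for this user.
    const attributes = await handlerInput.attributesManager.getPersistentAttributes();
    attributes.chain = [...(attributes.chain || []), newState];
    handlerInput.attributesManager.setPersistentAttributes(attributes);
    await handlerInput.attributesManager.savePersistentAttributes();

    // A companion web backend (not shown) can read the same DynamoDB table,
    // keyed by the Alexa user id, and color the map accordingly.
    return handlerInput.responseBuilder
      .speak(`${newState}, nice. Your chain is now ${attributes.chain.length} states long. Which state borders it?`)
      .reprompt('Name a bordering state.')
      .getResponse();
  },
};

export const handler = Alexa.SkillBuilders.custom()
  .addRequestHandlers(NameStateIntentHandler)
  .withPersistenceAdapter(new DynamoDbPersistenceAdapter({ tableName: 'StateGameSessions' }))
  .lambda();
```

Since the website only reads this state, it matches the ‘rich output, no input’ character of this sub-group.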

Computer-assisted voice games

This group is, to my current knowledge, only hypothetical, as I’m not aware of any such games. Such a game would have a companion app or website that both displays output from and sends input to the voice app. The aforementioned State Games would qualify as such a game if you could select neighboring states directly (with a click or touch) on the computer.

Voice-assisted computer games

Architecture of Liam Sorta’s VR game with MILO as its Alexa-enabled robot companion

Such games are stand-alone computer games where you can optionally control a (minor) aspect of the game via voice. Three examples are Destiny 2, where you can control part of your inventory and clan communications via the ‘Ghost’ Alexa Skill; StarCraft 2, with a (yet to be published) Captain Kate Blackwater Skill that lets you produce and send reinforcements and evacuate units while engaged in battle; and a demo VR game by Liam Sorta where Alexa is represented by the in-game robot companion character MILO.
Personally, I think the computer game industry has recognized this potential, and we will see more such gaming assistants in the coming months.
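The plumbing behind such assistants typically boils down to the skill translating an utterance into a command message that the game’s backend consumes. Here is a hedged sketch using an SQS queue purely for illustration; the intent, slot and queue are my assumptions, not how the Ghost Skill actually works.

```typescript
// Sketch: the skill queues a command; the game's backend polls the queue and
// applies it to the account linked to this Alexa user.
import * as Alexa from 'ask-sdk-core';
import * as AWS from 'aws-sdk';

const sqs = new AWS.SQS();
const COMMAND_QUEUE_URL = process.env.COMMAND_QUEUE_URL || ''; // hypothetical config

const TransferItemIntentHandler: Alexa.RequestHandler = {
  canHandle(handlerInput) {
    return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
      && Alexa.getIntentName(handlerInput.requestEnvelope) === 'TransferItemIntent';
  },
  async handle(handlerInput) {
    const item = Alexa.getSlotValue(handlerInput.requestEnvelope, 'item');

    // Queue the inventory command for the game backend to pick up.
    await sqs.sendMessage({
      QueueUrl: COMMAND_QUEUE_URL,
      MessageBody: JSON.stringify({
        userId: Alexa.getUserId(handlerInput.requestEnvelope),
        command: 'transferItem',
        item,
      }),
    }).promise();

    return handlerInput.responseBuilder
      .speak(`Okay, moving ${item} to your vault.`)
      .getResponse();
  },
};

export const handler = Alexa.SkillBuilders.custom()
  .addRequestHandlers(TransferItemIntentHandler)
  .lambda();
```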

Games with equivalent voice and computer modalities

This is a group I’m particularly intrigued by: games where you can choose the most convenient modality in a particular context, and still have a similar gaming experience. Right now I’m not aware of any such game, but there are some that are close, easily conceivable, or at a proof-of-concept stage:
Skyrim Very Special Edition is obviously a (brilliantly executed) parody of exactly this concept: imagine that in your car, you could actually continue your video game adventures from last night, but in the form of an interactive story / role-play game! The depth you lose on the visual side would be compensated for by dense storytelling elements… For a complex game like Skyrim, this is far from realistic at this point, but it’s a nice illustration of where things might go.

A more realistic example was demoed by Cami Williams at the Game Developers Conference 2018 (if I get hold of the video link, I’ll post it here): a virtual pet simulator where you can interact (i.e. feed, pet, play, …) with a cute blob either on your computer or via an Alexa Skill. On the computer you have a visual representation of the blob and its stats, whereas with voice (I’m getting a bit speculative here) you get the stats either proactively upon starting the Skill (Welcome back! Your blob is so happy to see you… It’s a bit hungry, though!) or by explicitly querying them (Is my blob hungry?).

It’s easy to imagine many browser or mobile app games having a voice modality like this. Thinking about it, the reason it’s not more prevalent right now is that the computer modality has the capacity to captivate players both longer and more deeply, whereas the voice modality makes the game more casual, to the potential detriment of the game producer’s revenue (whether from advertising or in-app purchases). I’m curious to see how this develops in the coming months!

Voice games with gadgets

This category of games is easily mapped, because buttons always augment voice games with both haptic input and visual output. It’s a very engaging type of interaction, both because of the strong physical element of touching the button and because of the strong visual stimulus of the big, brightly colored, animated buttons.
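For reference, this is roughly what the interaction looks like in code with the Gadgets Skill API: the skill animates the buttons with a GadgetController.SetLight directive, starts a GameEngine input handler that listens for a press, and reacts to the resulting GameEngine.InputHandlerEvent. The directive payloads are trimmed to the essentials, and the intent name and round logic are made up.

```typescript
// Sketch of a single 'buzz in' round with Echo Buttons.
import * as Alexa from 'ask-sdk-core';

const StartRoundHandler: Alexa.RequestHandler = {
  canHandle(handlerInput) {
    return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
      && Alexa.getIntentName(handlerInput.requestEnvelope) === 'StartRoundIntent';
  },
  handle(handlerInput) {
    return handlerInput.responseBuilder
      .speak('Buzz in when you know the answer!')
      // Light the paired buttons solid blue for the length of the round.
      .addDirective({
        type: 'GadgetController.SetLight',
        version: 1,
        targetGadgets: [],
        parameters: {
          triggerEvent: 'none',
          triggerEventTimeMs: 0,
          animations: [{
            repeat: 1,
            targetLights: ['1'],
            sequence: [{ durationMs: 10000, color: '0000FF', blend: false }],
          }],
        },
      })
      // Listen for the first button-down event for up to ten seconds.
      .addDirective({
        type: 'GameEngine.StartInputHandler',
        timeout: 10000,
        recognizers: {
          buttonDown: { type: 'match', fuzzy: false, anchor: 'end', pattern: [{ action: 'down' }] },
        },
        events: {
          buzzIn: { meets: ['buttonDown'], reports: 'matches', shouldEndInputHandler: true },
        },
      })
      .getResponse();
  },
};

// Simplified: a real game would inspect the event payload to see which button won.
const BuzzInHandler: Alexa.RequestHandler = {
  canHandle(handlerInput) {
    return Alexa.getRequestType(handlerInput.requestEnvelope) === 'GameEngine.InputHandlerEvent';
  },
  handle(handlerInput) {
    return handlerInput.responseBuilder
      .speak('You buzzed in first! What is your answer?')
      .reprompt('What is your answer?')
      .getResponse();
  },
};

export const handler = Alexa.SkillBuilders.custom()
  .addRequestHandlers(StartRoundHandler, BuzzInHandler)
  .lambda();
```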

Voice games with artefacts

There might be a better word for this that I’m not aware of (maybe ‘utensils’ or ‘props’), but what I mean are physical things (whose state Alexa can’t access, as it could with an IoT device) that play a role in the gaming experience, like dice, cards, game boards, chairs, pen and paper, and so on.

Artefact-augmented voice games

Such games can be played like voice-only games, but playing them with artefacts enhances the user experience. Real-world examples of this sub-group are ALIEN Offworld Colony Simulator (again!) and Mein Auftrag, both of which suggest using pen and paper to take notes (the ALIEN game even has a character sheet to download and print out). A fictional example of this sub-group is a dice game where either Alexa throws the dice for the user, or the user rolls their own dice and tells Alexa the result. In this case, the voice app is in a good position to control and track the user’s stats.
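Sticking with that fictional dice example, the voice app’s bookkeeping role could look like this sketch, where the player rolls physical dice and reports the result via a number slot. The intent and slot names are hypothetical.

```typescript
// Sketch: the player rolls real dice and tells Alexa the result; the skill
// validates the value and keeps the running score.
import * as Alexa from 'ask-sdk-core';

const ReportRollIntentHandler: Alexa.RequestHandler = {
  canHandle(handlerInput) {
    return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
      && Alexa.getIntentName(handlerInput.requestEnvelope) === 'ReportRollIntent';
  },
  handle(handlerInput) {
    const roll = Number(Alexa.getSlotValue(handlerInput.requestEnvelope, 'rollResult'));

    if (!Number.isInteger(roll) || roll < 1 || roll > 6) {
      return handlerInput.responseBuilder
        .speak('That does not sound like a six-sided die. What did you roll?')
        .reprompt('What did you roll?')
        .getResponse();
    }

    // Track the running score in session attributes; the physical dice stay with the player.
    const state = handlerInput.attributesManager.getSessionAttributes();
    state.score = (state.score || 0) + roll;
    handlerInput.attributesManager.setSessionAttributes(state);

    return handlerInput.responseBuilder
      .speak(`A ${roll}! Your total is now ${state.score}. Roll again when you are ready.`)
      .reprompt('Tell me your next roll.')
      .getResponse();
  },
};

export const handler = Alexa.SkillBuilders.custom()
  .addRequestHandlers(ReportRollIntentHandler)
  .lambda();
```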

Voice-assisted artefact games

The ‘One Night Ultimate Werewolf’ game and its unofficial assistant Skill ‘Werewolf Announcer’

These are games which can be played by themselves, but where moderation through a voice app potentially improves the user experience. A simple example of this category is Musical Chairs, and a more complex example is One Night Ultimate Werewolf: each player has a hidden role card, and moderation is so algorithmic that it can easily be done by a voice assistant instead of a human.
In such cases, there is typically too much going on for the voice app to try and keep track of, so its role is more restricted to leading the players through the stages of the game.
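To illustrate why this kind of moderation is so ‘algorithmic’, here is a simplified sketch in which the night phase is just an ordered list of prompts the skill walks through, with no knowledge of who holds which card. This is an illustration only, not the actual Werewolf Announcer Skill; the real game would advance stages on a timer rather than on command.

```typescript
// Sketch: moderation as a fixed sequence of stages, advanced one step at a time.
import * as Alexa from 'ask-sdk-core';

const NIGHT_STAGES = [
  'Everyone, close your eyes.',
  'Werewolves, open your eyes and look for each other. Then close your eyes.',
  'Seer, open your eyes and inspect one card. Then close your eyes.',
  'Robber, you may swap your card with another player. Then close your eyes.',
  'Everyone, open your eyes. Discussion starts now!',
];

const NextStageIntentHandler: Alexa.RequestHandler = {
  canHandle(handlerInput) {
    return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
      && Alexa.getIntentName(handlerInput.requestEnvelope) === 'NextStageIntent';
  },
  handle(handlerInput) {
    // Only the position in the script is tracked, not any player's hidden role.
    const state = handlerInput.attributesManager.getSessionAttributes();
    const stage = state.stage || 0;
    state.stage = stage + 1;
    handlerInput.attributesManager.setSessionAttributes(state);

    const prompt = NIGHT_STAGES[Math.min(stage, NIGHT_STAGES.length - 1)];
    return handlerInput.responseBuilder
      .speak(prompt)
      .reprompt('Say next when you are ready for the next stage.')
      .getResponse();
  },
};

export const handler = Alexa.SkillBuilders.custom()
  .addRequestHandlers(NextStageIntentHandler)
  .lambda();
```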

Voice-enabled artefact games

These are artefact-based games that can only be played with a voice app (short of severely modifying their mechanics). Currently there are not many such games… Personally, I’m only aware of the fabulous When In Rome, and a German game called Schau Schau, where Alexa asks you to find which among several picture cards contains a given combination of items. In these games, the voice app is in a good position to monitor the game state closely and take a very active role in it.

Games with more than two modalities

In theory, games could use several modalities at once… If we go wild, we could imagine a battle Skill where you see your opponent with their stats on the screen, and can make your moves with both voice and buttons. Or a crossover game of Destiny 2 and Fallout 4, where you control your inventory with a combination of a mobile app and voice.
At this point, I’m not aware of voice games where multiple non-voice modalities can be used in combination with each other, even though there are games like Panda Rescue and the ALIEN Skill whose multiple non-voice modalities are independent of each other.

Conclusion

Multi-modality in voice games is a multi-faceted topic, and my device-oriented discussion is only one perspective on it. I’m sure that as the voice game ecosystem becomes more saturated, multi-modality is a feature with the potential to make games stand out.
On the other hand, voice offers a nice way to complement established genres like video, mobile and board games. I’m curious to see interesting things blooming where these genres and voice games meet!

What’s your favorite multi-modal gaming experience with voice? Did you find my distinction of modalities by device types useful? Which aspects did you miss or would you have loved to hear more about? Where do you disagree? I look forward to extending the discussion here or on Twitter! :)
