The role of visual media in conversational interfaces

When discussing the future of human-computer interaction, many advocate for conversational computing experiences. And that future is fast approaching. Amazon’s Alexa is in more devices than ever, the Google Assistant is expanding as well, and even Spotify is rumored to be launching an in-car, voice-controlled device as its first hardware product.

Natural language as input

In that future, conversation first comes into play as a replacement for our input devices. The keyboard and mouse have served us for a very long time, mostly because computers simply didn’t understand natural language input. Today, speech-to-text algorithms are highly accurate in all but the most adverse conditions. That leaves understanding commands as the next challenge before natural language can replace other input devices in more advanced computing tasks.

Distribution of devices for Amazon Echo’s 40 million users compared to about 16 million for Google Home

However, intent detection is already good enough to be used in narrow domains. The perfect examples are the conversational assistants millions of people use every day. Users need to learn the limits of what these assistants can do, but once they do, intent is recognized with surprising accuracy. Devices like the Amazon Echo, Google Home and every Windows 10 device can tell you about the weather, set timers, look up basic information and command other devices to perform set actions. In a way, conversational computing is already immensely powerful.

However, I think that what we call “conversational experiences” should really be rephrased as conversationally driven experiences. Hear me out.

Natural language as the output?

All the above devices take spoken natural language as input and respond to the user in spoken, machine-generated voices. However, both Google and Amazon have already recognized that a purely voice-driven computing experience has its challenges, and both have released voice assistants that sport screens.

The first challenge is that there is no indication of what the machine understands while the user is giving it a command. When talking to a fellow human, there are visual cues that the information is being received. This metadata is a very important part of communication; it is at least partly why people prefer video calls.

The other is actually more of an opportunity: some types of information are simply better presented visually. A great example is directions on a map. A map with the route overlaid will always be superior to a long description of the route and every turn that needs to be taken.

Another good example is a weekly weather forecast. It contains information with multiple dimensions, all of which need to be conveyed. In a weekly forecast, the machine needs to describe the weather on each of the seven days of the week. Furthermore, weather can change throughout each day. On top of this, weather has multiple qualities, such as precipitation, temperature and wind, not to mention UV levels or air quality. Think about it: the next time someone asks you about the weather for a longer period, wouldn’t you rather show them a nicely laid out card with the information on it than explain each day? We humans lack the resources to pull that off, but computers don’t.

How we do it

An example where I have done some work personally is conveying information about upcoming events. I am involved in building Super Izzy, a conversational assistant about reproductive health for women. Super Izzy works on Facebook Messenger today, but we have big plans to expand to other conversational platforms. One part of our proposition is that we keep track of users’ periods and notify them when the next one is expected based on their past cycles.

The job-to-be-done (JTBD) here is to give the user an indication of how far away their period is and which days it will fall on. Will it come on the weekend I am planning to go to the beach? We could simply tell the user: “Your period will be starting on the 28th of March and will end on the 1st of April.” But that puts unnecessary cognitive strain on the user. Better would be: “Your period will start in 12 days and will last 5 days, including a weekend.” Not bad, but we can do better.
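The reformulation above is straightforward to compute from the prediction. A minimal sketch in Python (the function name and arguments are my own; the real assistant would derive `start` and `duration_days` from the user’s cycle history):

```python
from datetime import date, timedelta

def period_message(today: date, start: date, duration_days: int) -> str:
    """Build a low-cognitive-load summary of the upcoming period.

    Illustrative sketch: `start` and `duration_days` would come from
    the cycle-prediction model, not be passed in directly.
    """
    days_until = (start - today).days
    period_days = [start + timedelta(days=i) for i in range(duration_days)]
    # weekday() >= 5 means Saturday or Sunday
    includes_weekend = any(d.weekday() >= 5 for d in period_days)
    weekend_note = ", including a weekend" if includes_weekend else ""
    return (f"Your period will start in {days_until} days "
            f"and will last {duration_days} days{weekend_note}.")
```

Relative phrasing (“in 12 days”) spares the user the date arithmetic, and the weekend check answers the beach-trip question before it is asked.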

Personalized, dynamically generated images for period prediction

The conversational platform we use also allows us to send images, so we do just that. When the user asks for their next period, we generate a calendar image from the applicable data and also tell them in text how far away the period is. On top of this, we highlight the current date so the user can see where they are in relation to the event. One more thing we do is organize the days and weeks of the month in the familiar format where the weekend falls at the end of each row. The result is crucial information at a glance and secondary information ready to be absorbed if necessary. We could not have delivered such a large amount of information at this granularity using natural language alone.
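The layout logic behind that image can be sketched independently of the drawing code. Here is a text-mode approximation (our production version renders an image, and these markers and function names are purely illustrative): Monday-first rows put the weekend at the end of each row, `[dd]` marks a predicted period day, and `*dd*` marks the current date.

```python
import calendar
from datetime import date

def render_month(today: date, period_days: set, year: int, month: int) -> str:
    """Text sketch of the calendar layout we generate as an image."""
    cal = calendar.Calendar(firstweekday=calendar.MONDAY)

    def cell(day: date) -> str:
        if day.month != month:
            return "    "                # padding from adjacent months
        if day in period_days:
            return f"[{day.day:2d}]"     # predicted period day
        if day == today:
            return f"*{day.day:2d}*"     # highlight "you are here"
        return f" {day.day:2d} "

    lines = [" Mo  Tu  We  Th  Fr  Sa  Su"]
    for week in cal.monthdatescalendar(year, month):
        lines.append("".join(cell(d) for d in week))
    return "\n".join(lines)
```

Starting each row on Monday is the design choice that matters: the weekend lands in the two rightmost columns, so the “does it hit my weekend?” question is answered by a glance at the right edge of the grid.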

What’s next?

I mentioned above that the JTBD is to know what effect the upcoming period will have on the user’s plans. So what could be useful (and very cool) is access to the user’s calendar: we could classify their upcoming events and show those categories on our period calendar visualization as well.

And yes, we do have big plans to bring Super Izzy to platforms beyond Facebook Messenger. This includes platforms that support text input, text output and visual media, like Kik, Viber, WhatsApp, iMessage or the Google Assistant, but also devices like the Google Home Hub or Amazon Echo Show, where the input is speech and the output is a combination of speech and visuals. And we would like to be there for users on voice-only interfaces like the Google Home and Amazon Echo, too. So how can we make sure our solution serves all of these devices and platforms?

The answer is developing a User Interaction Gateway (UIG) that can handle all of these types of inputs and outputs from the same back end. After all, the job is the same everywhere: receive commands, recognize intent, and present the relevant information on the platform of the user’s choosing. To achieve this, we store responses as strings; any response that could use an image also needs a “translation” into a text string so that voice interfaces can read it out loud. Something like this:
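A minimal sketch of what such a dual-form response might look like (the class, field names and URL below are my own invention for illustration, not Super Izzy’s actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Response:
    """A platform-agnostic response as the UIG might store it."""
    text: str                        # always present: the voice/text fallback
    image_url: Optional[str] = None  # rich clients render this instead

def render(response: Response, supports_images: bool) -> dict:
    """Pick the right representation for the target platform."""
    if supports_images and response.image_url:
        return {"type": "image", "url": response.image_url,
                "caption": response.text}
    # Voice-only (or text-only) surfaces get the textual "translation"
    return {"type": "text", "text": response.text}

# The same stored response could serve Messenger and a voice-only Echo:
r = Response(text="Your period will start in 12 days and will last 5 days.",
             image_url="https://example.com/calendar/march.png")
```

The key property is that the text form is mandatory and the image is optional, so a voice-only device never receives a response it cannot speak.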

What do you think? Should conversational assistants make use of visual media when applicable? How would you determine what is the right type of information to present visually? Leave a comment below, let’s discuss!