What are Voice Augmented eXperiences?

Kumar Rangarajan · slanglabs · Mar 10, 2019 · 9 min read

VAX, or Voice Augmented eXperience, is the concept of augmenting or enhancing an existing experience (usually a GUI and/or touch-based app) with voice.

Why Voice?

Human-computer interfaces have come a long way since the early days of punch cards.

If you are reading this in 2019, chances are you have already experienced voice recognition in some form or have read articles touting its imminent arrival. Here are some of the advantages that make Voice one of the best interfaces out there.

Reduced Thought-to-Action Latency

You think of something and you can get it done almost immediately. It’s magical when you can just speak out what you want without worrying about how to convey your intent through a series of individual clicks. “Play raakamma kaiyatattu” (a popular Tamil song) is so much faster than “find the search icon, click, type r.. a.. a.. k.. a.. m.. m.. a” (hopefully auto-complete will kick in, else) “.. k.. a.. (are we there yet) i.. y.. a.. t.. (are we there yet).. t.. u..”.

Breaks barriers of entry

While most of us (the folks reading this blog) might feel comfortable with the modern UI paradigm — menus, buttons, text boxes, etc. — for many others (folks who are mobile-first, not so tech-savvy, elderly, etc.), these are actually strange concepts, and quite intimidating ones. A button does not naturally mean “click on it”. A menu icon does not naturally mean “discover or navigate to other capabilities”.

Thought experiment — how many of our parents can use the apps that we build?

Voice, on the other hand, does not intimidate. It does not need any additional training. You don’t need to teach your parents how to use Voice. They are already pros, just as you are.

Language of choice

The user’s language is another key constraint that Voice handles better. Even a user who is familiar with the UI elements and able to navigate the app can be intimidated by the language on the screen.

Here is another thought experiment — change the language setting of your phone to a language you don’t know. Reboot your phone. Now try to change the setting back to English. Do you think you will be able to do this easily? If that phone had a Voice interface that understood English (“Change language to English”), would you have used it, or still tried to use touch to carry out this action?

It’s Natural

We (well, most of us) are born with the ability to communicate using our voice. Imagine doing something as simple as getting your bank statement in your banking app. If you are like me, I assume this would be your journey — “hmm.. which menu item should I choose?.. okay let me try that.. nice.. a date selector.. click.. click.. shit.. how do I change the year?.. click.. click.. click.. click.. one done.. submit.. damn.. forgot to set the end date.. another yummy date selector.. today’s date (duh.. shouldn’t you have already done that).. submit (finally).. oh.. did I really spend another Rs 2000 on Swiggy last week? Wonder if I can stake a claim as an investor?”.

Contrast this with being able to do the same via “Show my statement from Jan 1st”.

Unless you are trying to exercise your fingers, the latter presumably feels like “Hey, why was it not like this all the time?”

How is Voice used today?

For most of us, Voice is associated with the notion of generic, intelligent assistants that users can speak to (or optionally type to). Brands or services that users want are placed behind these assistants. The assistant understands the intent of the user, connects (explicitly or implicitly) with a service provider, and produces an output. Typically, the assistant prefers to keep both the input (what the user is asking — “Set an alarm for 6 am”) and the response (what the service provider generates — the actual setting of the alarm and the response text “Sure. Alarm set for 6 am”) inside its own surface. In some cases, the response could also be delivered by deep linking into the service provider’s app and opening it up.
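
To make that loop concrete, here is a minimal sketch in Kotlin. Every name in it (UserIntent, understand, AlarmService) is a hypothetical stand-in rather than any real assistant’s SDK, and the toy string matching stands in for the trained language-understanding models real assistants use.

```kotlin
// Hypothetical sketch of the assistant loop: utterance -> intent -> fulfilment.
// None of these types belong to a real assistant SDK.

data class UserIntent(val action: String, val slots: Map<String, String>)

// The assistant's NLU turns the raw utterance into a structured intent.
// A toy string match stands in for a real language-understanding model.
fun understand(utterance: String): UserIntent {
    val text = utterance.lowercase()
    return if (text.startsWith("set an alarm for ")) {
        UserIntent("SetAlarm", mapOf("time" to text.removePrefix("set an alarm for ")))
    } else {
        UserIntent("Unknown", emptyMap())
    }
}

// Hypothetical service provider sitting behind the assistant.
object AlarmService {
    fun schedule(time: String) { /* provider-side side effect */ }
}

// The assistant routes the intent to the provider and keeps the response
// on its own surface as spoken/displayed text.
fun fulfill(intent: UserIntent): String = when (intent.action) {
    "SetAlarm" -> {
        AlarmService.schedule(intent.slots.getValue("time"))
        "Sure. Alarm set for ${intent.slots["time"]}."
    }
    else -> "Sorry, I didn't understand that."
}

fun main() {
    println(fulfill(understand("Set an alarm for 6 am")))  // Sure. Alarm set for 6 am.
}
```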

So is this the end of apps?

So if generic assistants, with their own interfaces (voice-first or in some cases voice-only) and surfaces, are on the rise, does that mean the end of special-purpose, brand-specific surfaces and of other interfaces like touch and GUI?

I think not. Let me try and articulate my thoughts on why.

Why do stand-alone and touch/GUI based apps still make sense?

Purpose-built, custom, direct-to-brand experiences, delivered as mobile apps, web apps/sites, PWAs, etc., are also important and are in fact growing in usage.

ComScore Mobile Report — 2017
MindSea’s consolidated report

Another thought experiment to match the data shown above — think about your usage of your phone through the day. Do you spend most of your time with an assistant, or directly with one or more apps?

But what advantages do mobile and purpose-built touch/GUI surfaces actually provide?

Here are some of the key advantages that mobile apps (and the like) provide, as compared to generic voice assistants.

Discoverability

App discovery was a challenge during the early days of apps, but app stores caught on quickly and standardized it. Discovering and adding capabilities (skills or actions) to the assistants is currently a challenge. There are 30K-plus Voice apps, but most people are oblivious to their existence. Maybe over time, as brands start advertising their skills/actions and the assistants come up with an easy way to discover and add them, it will become less of a problem, but today it is a major one.

“Limited knowledge of the full breadth of capabilities (of the voice assistant)” — PWC report

When using a mobile app, the context and the capabilities of the app are typically well understood by the user. The visual structure (buttons, menus, etc.) also gives enough clues as to what is possible. Contrast that with the experience of talking to a specific Alexa skill and then struggling to know what it’s capable of.

UI-only functionality

There are lots of use-cases where visuals beat voice outright. One such example is anything that needs a list. Imagine a user trying to book a flight or order a pizza. He or she needs to see the available choices in order to actually perform the transaction. Imagine doing that with Voice:

You: “I want to order a pizza”

Assistant: “Sure. What type of pizza would you like?”

You: “What are my vegetarian choices?”

Assistant: “You can order a garden veggie, a farm fresh, a Mediterranean…”

You have probably zoned out and no longer remember the first one by the time the third one is being spoken.

This is where UI lists make perfect sense. Imagine if the response to the request above was instead a tappable list of the choices, rendered right inside the app.
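
For illustration, here is a minimal Kotlin sketch of that multi-modal response, with PizzaRepo and showListScreen as hypothetical stand-ins for the app’s own data and UI layers:

```kotlin
// Hypothetical sketch: the same voice query answered with a rendered list
// instead of a spoken enumeration.

data class Pizza(val name: String, val isVegetarian: Boolean)

// Stand-in for the app's own data layer.
object PizzaRepo {
    private val menu = listOf(
        Pizza("Garden Veggie", true),
        Pizza("Farm Fresh", true),
        Pizza("Mediterranean", true),
        Pizza("Pepperoni", false),
    )
    fun vegetarian() = menu.filter { it.isVegetarian }
}

// Stand-in for the app's list UI; in a real app this would open a screen
// the user can scan and tap at leisure.
fun showListScreen(title: String, items: List<Pizza>) {
    println(title)
    items.forEach { println("  - ${it.name}") }
}

fun onVoiceQuery(query: String) {
    if (query.contains("vegetarian", ignoreCase = true)) {
        // Voice carries the intent in; the eyes take the choices out.
        showListScreen("Vegetarian pizzas", PizzaRepo.vegetarian())
    }
}
```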

Engineering maturity

When building apps, the pieces of the puzzle needed to make them happen are all well understood. How does someone log in? How do you personalize? How do you store the details of the user? But when building voice-only experiences under assistants, many of these ingredients need to be rediscovered or rebuilt: how do you log in to a voice-only assistant? How do you maintain context across sessions?

Privacy

Last but not least is the notion of privacy, both for the customer of the app/brand and for the brand itself. When a customer is interacting with an app, they are placing their trust in the app. Both their input (their intent) and the response from the brand rest on a trusted relationship between the two. But if there is an assistant in the middle, which processes both the input and the output and also knows who you are (because you are logged into the assistant), it triggers privacy concerns. For the user, because they are entrusting their data to someone who understands a lot of details about them (across brands); and for the brand, because they are sharing details about their customer with someone else.

So both Voice and Apps have their individual strengths. What are you getting at?

Voice is great and overcomes the traditional problems of Apps. And Apps are great and don’t have the newer problems of Voice.

What if you could have them both in the same medium? What if we could augment the strength of Apps using the power of Voice instead of trying to replace apps with voice?

VAX

VAX or Voice Augmented eXperience is what you get when you add a multi-modal and multi-lingual voice interface to mobile apps, thereby allowing the user to interact with them via voice or touch.
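
To ground that, here is a minimal Kotlin sketch of one way to wire voice input into an existing Android app using the platform’s built-in SpeechRecognizer. The handleVoiceCommand() hook is hypothetical, and a real app would also request the RECORD_AUDIO permission and put a proper intent-understanding layer (like the multi-lingual one described here) on top of the raw transcript.

```kotlin
import android.app.Activity
import android.content.Intent
import android.os.Bundle
import android.speech.RecognitionListener
import android.speech.RecognizerIntent
import android.speech.SpeechRecognizer

class VoiceEnabledActivity : Activity() {
    private lateinit var recognizer: SpeechRecognizer

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        recognizer = SpeechRecognizer.createSpeechRecognizer(this)
        recognizer.setRecognitionListener(object : RecognitionListener {
            override fun onResults(results: Bundle) {
                // Candidate transcripts, best match first.
                results.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
                    ?.firstOrNull()
                    ?.let { handleVoiceCommand(it) }
            }
            // The remaining callbacks are required by the interface but unused here.
            override fun onReadyForSpeech(params: Bundle?) {}
            override fun onBeginningOfSpeech() {}
            override fun onRmsChanged(rmsdB: Float) {}
            override fun onBufferReceived(buffer: ByteArray?) {}
            override fun onEndOfSpeech() {}
            override fun onError(error: Int) {}
            override fun onPartialResults(partialResults: Bundle?) {}
            override fun onEvent(eventType: Int, params: Bundle?) {}
        })
    }

    // Called from a mic button in the existing touch UI, so voice sits
    // alongside touch rather than replacing it.
    fun startListening() {
        val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
            putExtra(
                RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM
            )
        }
        recognizer.startListening(intent)
    }

    // Hypothetical: map the transcript to the same action a tap would trigger.
    private fun handleVoiceCommand(transcript: String) {
        // e.g. "show my statement from Jan 1st" -> open the statement screen
    }

    override fun onDestroy() {
        recognizer.destroy()
        super.onDestroy()
    }
}
```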

A concept demo of VAX on banking apps

Imagine being able to do things like the following —

  • Open a travel app and say “Show me flights from Bangalore to Chennai two days from now”, and the app responds by showing you a list of flights. You can then continue speaking to the app and say “How about morning flights?” instead of having to figure out how to apply visual filters. And at the same time, you keep the ability to apply more complex filters via the regular UI experience, filters that are hard to put into words.
  • Open a camera app and, instead of struggling to find the timer settings, just say “take a selfie after 8 seconds”, and the app does it, and does it only for this shot. The next shot you take (via touch or voice) does not apply the timer, which is what you would expect. Transient operations are a lot easier via voice than via touch (which is typically sticky).
  • Open a banking app and say “Transfer 1000 Rs to my friend Kumar Rangarajan”, and the app takes you deep into the transfer operation with all details filled in, for you to quickly review and complete the actual transaction via touch (a routing sketch for this follows the list).
  • Open a telecom app and say “मेरा बैलेन्स क्या है” (“What is my balance?” in Hindi), and without the app being localized, it still understands you and takes you to the page that shows your balance. Or, where appropriate, even speaks the balance aloud.
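
Under the hood, each of these boils down to mapping a parsed voice intent onto a screen or action the touch UI already has. Here is a minimal, hypothetical sketch for the banking and telecom examples; the VoiceIntent structure and deepLinkTo() helper are invented for illustration:

```kotlin
// Hypothetical sketch: routing parsed voice intents to existing app screens.

data class VoiceIntent(val action: String, val slots: Map<String, String>)

// Stand-in for the app's existing navigation / deep-link handler.
fun deepLinkTo(uri: String) { /* navigate within the app */ }

fun route(intent: VoiceIntent) = when (intent.action) {
    // "Transfer 1000 Rs to my friend Kumar Rangarajan" -> pre-filled
    // transfer screen; the user reviews and confirms by touch.
    "TransferMoney" -> deepLinkTo(
        "app://transfer?to=${intent.slots["payee"]}&amount=${intent.slots["amount"]}"
    )
    // "What is my balance?" -> the same balance screen, regardless of the
    // language the utterance arrived in.
    "ShowBalance" -> deepLinkTo("app://balance")
    else -> Unit
}
```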

The possibilities are quite vast, limited only by how a brand wants to use the idea of having access to a smart microphone (one that *understands* the user’s intent, as opposed to just capturing voice signals).

How do I add VAX to my apps?

This article was written to explain the rationale behind the idea of adding voice experiences inside apps: the What and the Why. As for the How, there are potentially multiple ways to do it. My company built an entire platform to simplify the process. The platform takes care of all the elements needed to add a smart microphone to your app and lets you focus on your core business logic.


Obsessive Dictator @ Slang Labs. Earlier co-founded Little Eye Labs, which was acquired by Facebook. A die-hard Illaiyaraja fan and still believes in AAP.