End-to-end audio: The bridge to the next big thing
Consensus says the voice interface is the next wave, but audio files are the missing medium.
I’m willing to accept that the voice interface is the next big thing. Interoperable, ubiquitous voice assistants; hands-free, screenless user experiences; it all sounds good to me! It’s a logical progression too, considering how Moore’s Law (¯\_(ツ)_/¯) encourages our graduation from data-lite text into ever richer media formats, like images, audio, and video. But how do we get from here to the voice interface?
First, let’s discuss where we are today: stuck in limbo between the text-based and voice-based regimes. When we produce content, for example, we either type or dictate our inputs, which are transmuted into text, then distributed to consumers, who either read the text themselves or have it read aloud to them by a voice assistant. It’s rather inefficient to convert from audio to text and back again throughout the process.
For the optimal user experience, the process should begin and end with audio. At least in theory, voice is the most frictionless input and output, as opposed to typing and reading. Yet text remains the chosen medium, because in practice, execution foibles add a lot of friction to our use of voice. Those frictions include…
- Accuracy: voice recognition/natural language processing still makes mistakes, e.g. having to re-read and correct transcriptions eliminates all potential convenience;
- Privacy: e.g. we don’t always want to audibly dictate or listen to messages in public;
- Habit: e.g. we’re not used to thinking voice-first, and verbally dictating to a machine doesn’t feel natural.
There is a way to fix all of this, thereby encouraging the transition to that optimal, frictionless, voice-based user experience. I call the solution “end-to-end audio.” Developers should focus on forming user habits around voice as the primary medium: not only should voice dictation be prominently featured as the default for all inputs, but that input should remain in audio format, end-to-end, start-to-finish, and everywhere in between. In other words, messaging should be voice dictated, captured as an audio recording, and delivered as an audio file. Similarly, an article on a webpage should be a simple play button on an audio file. (Sure, add a pretty picture as cover artwork too.) In both cases, there is no text. The text is abstracted away by default, and it’s succeeded by audio.
Changing user habits is hard, right? Of course it is, but think about those frictions listed above: our reservations with using voice/audio today are all obviated by end-to-end audio…
- Voice recognition/natural language processing accuracy:
If inputs remain audio recordings by default, they sidestep the problem of transcription errors entirely, by avoiding text transcription altogether. In fact, this enables a truly hands-free, screenless experience.
As for our reservations about dictating and listening in public, and about thinking voice-first: my first instinct is to state the obvious. We’ll get used to it, and functional and aesthetic advances in headphone technology keep expanding audio’s use cases. But there’s more to it than that, because some situations structurally require a different approach. To wit, I chose my words carefully above: “voice as the primary medium.” Primary doesn’t mean exclusive. Text is the primary means of media interaction today, for example, yet audio and voice are almost always available as secondary alternatives. Driving in the car and can’t read your phone? Your voice assistant reads it aloud to you. The prevalence of these backup options means media is already fungible, allowing us to consume or produce anything, in any format (text, audio, and even video), jointly and severally, with relative ease. We can already choose how we want to interact with every piece of content, depending on situational context. When we invert users’ habits by prioritizing audio over text, text will still be there as a backup when necessary.
If developers start featuring audio as a default, they will accelerate user adoption and the formation of new habits — especially since the value proposition (reduced friction) is so strong for the majority of use cases. For example, the friction added by voice recognition/NLP inaccuracies will be enough of a disincentive to deter users from manually seeking the secondary solution, text, in lieu of the seamless default, audio. (Why suffer or risk a transcription processor’s errors in its audio-to-text conversions when it’s so much easier to just interact with the original audio?)
End-to-end audio is a huge opportunity for first movers. The movement is still in its infancy, but it’s already somewhat manifest in familiar vehicles like podcasting. It now needs to subsume more of our media diet — from entertainment to information, gaming to communications, storytelling to journalism, personal to professional. It also needs the fungibility to accommodate use across all of our situational demands.
End-to-end audio isn’t change for change’s sake, like chatbots for news. Nor is it a square peg in a round hole, like VR for current events. It is a natural, native format that eliminates friction, making the user experience materially easier and better. That’s the hurdle any new medium must clear. With digital media still trying to find its footing in the age of abundance, end-to-end audio can be the regime change that truly fulfills tech’s promise to facilitate and enhance our interactions rather than encumber them.
Anthony Bardaro is the CEO of Annotote, an app that lets you highlight and take notes on any media; your annotations then help summarize that content for everyone else in the network. Don’t waste your time — get straight to the point! For content worth keeping, try Annotote today: http://annotote.launchrock.com