Designing for an audio web

Inventions like Siri, Evi and Indigo are beginning to give us a glimpse of what an audio web could be like, but we’re currently held back by the browser paradigm

Peter Sigrist · Published in I. M. H. O. · 7 min read · May 20, 2013

I’m walking to a mid-afternoon client meeting in North London and listening to music as I go. Thinking further ahead, I can’t recall whether my meeting after this one is at 3:30 p.m. or 4:00 p.m. I hold down the button on my earphones.

The music fades and a customary bleep followed by silence prompts me to ask out loud:

When is my meeting, Team Catch Up?

The robot voice responds:

Checking your calendar. Your meeting is at 4 p.m. today.

The music fades back in, but I press the button again and, after the fade-out and the bleep, I ask:

Where is my meeting, Team Catch Up? Checking your calendar. Your meeting is at Dose Espresso.

I’ve arrived at my first meeting now and I head into my client’s airy hilltop offices. Through every four-metre window I get a spine-tinglingly good view of London’s iconic skyline. Once we’re settled in, we discuss the design principles we’re employing to make it more likely that people will spread the word about their next report. It’s our job to ensure that the people who read the report are inspired to share it with their friends.

In locating these buttons, we’re attempting to predict where the reader’s eye will fall in 20 milliseconds.

This is it, our job. We persuade people to share our clients’ stories with the people they interact with. In the past, our tools were press releases and the people we tried to influence were journalists. Now, we are as likely to use interactive content painstakingly designed for a niche audience, embedded with snippets of code. In this case, we’re using code to zoom in, to draw a visitor’s attention to certain parts of the report. We represent these as expressive visualisations or quotes, and carefully locate sharing buttons next to them.

In locating these buttons, we’re attempting to predict where the reader’s eye will fall in the 20 milliseconds it takes for the impact of a chart or a quote to hit. The button in effect says: tell everyone what you’re experiencing right now. If we’ve done our job properly, the button invites an instant tap, and the reader immediately sees, popped up and ready to send, almost exactly what they’re thinking in that moment. The message will be short-form and will contain something that may trigger someone reading it to tap a link for more information, something like:

It’s reassuring to see that our holiday choices remain one of the best indicators of wealth, even after the financial crisis.
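To make the mechanics concrete, here is a minimal sketch of how a contextual share button like this might be wired up. It is an illustration only, not the code we actually ship: the element ID, message and report URL are invented for the example, and it uses Twitter’s public web intent endpoint to pre-fill the message.

```typescript
// Hypothetical sketch: place a pre-filled "share this" link next to a chart.
// The element ID, message and report URL below are invented for illustration.
const message =
  "It's reassuring to see that our holiday choices remain one of the best " +
  "indicators of wealth, even after the financial crisis.";
const reportUrl = "https://example.com/holiday-report#wealth-chart";

// Twitter's web intent endpoint accepts pre-populated text and a link,
// so the reader sees the message popped up and ready to send.
const shareHref =
  "https://twitter.com/intent/tweet" +
  `?text=${encodeURIComponent(message)}` +
  `&url=${encodeURIComponent(reportUrl)}`;

const shareLink = document.createElement("a");
shareLink.href = shareHref;
shareLink.textContent = "Share this";

// Attach the link directly beside the chart that provokes the reaction.
document.getElementById("wealth-chart")?.appendChild(shareLink);
```

The design point is less the code than its placement: the link sits immediately next to the chart or quote that provokes the reaction it captures.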

Design in context like this plays into the immediacy of the internet, and the ability to share opinions and information with specific groups using social media. It plays into mobile access to information, which is quite literally at your fingertips, no matter where you are. Such design costs little and can have a profound effect on the way a story spreads.

Yet, strikingly, it has taken 20 years of the web, and at least five years in which share buttons have been commonplace, for this type of thinking to become normal. Until recently, the social web has been limited by the inexperience of those who know how to embed a share button but have never used one in anger. As my colleague Luke Murphy recently put it so well:

On an article, having a ‘share this’ button at the top of your article isn’t conducive to the user journey. The user is most likely on that page to read the article, and by putting a call to action at the top of the article, you are asking them to disrupt their intended journey to do something else.

Contextual social design, like the tweetable summaries used at the top of Gigaom’s articles, considers the frame of mind of the user at the point they encounter the button, and it appears to have taken hold only within the past year. Until now, it has been lurking just out of sight. What other design paradigms are sitting out there, waiting to take hold?

Controlling a computer without looking

So now my earphones are back in and I’m walking out through the revolving doors of my client’s offices, getting my bearings. Where on earth is Dose Espresso from here?

Hold. Fade. Bleep. I ask: “How do I get to Dose Espresso?” Getting directions to Dose Espresso. Starting route to Dose Espresso. Head south on St John Street… The music fades back in and I start walking. Occasionally, where I have to take a turn, the music fades out and I’m directed towards Dose Espresso.

Using my voice immediately makes my interfacing with a set of databases far more like dealing with a human or a pet.

This is a relatively new experience, outside of 2001: A Space Odyssey or, if you like, The Hitchhiker’s Guide to the Galaxy. Using my ears and my voice, instead of my fingers and eyes connected to a screen and keyboard, is different. Not just a little bit different, but really different. Using my ears is more intimate than staring at a screen. And using my voice immediately makes my interfacing with a set of databases far more like dealing with a human or a pet than anything else.

I’ve no doubt that researchers at Microsoft (this is an amazing demo), Apple and Google are spending vast sums to unearth the ideal paradigm for an audio web interface. Start-ups such as Ubi are vying for a place in the home. So it seems as though we’re close to a point where voice activation might become a viable alternative to the screen and manual input we’ve been used to in recent decades.

Being productive with no hands

As I pass Smithfield Market, approaching Dose Espresso for my team catch-up, I suddenly realise I’ve not looked at the minutes from our last catch-up. Now, where are they, I wonder? I hold down the button again.

Fade.

Bleep.

Silence…

Of course, there’s no immediate way for me to start accessing data in the cloud or to surf websites using the audio controls that have helped me manage my day. As I think about it, with contextual social design still buzzing around my mind, I suddenly begin to see what’s going on behind the scenes in apps like Dropbox, Google Drive or even a vanilla web browser. The metatagging, alt text, tabular data, images, downloadable files, clickable links and the means of connecting instantly with people I know depending on their availability, in fact everything that has emerged through the browser paradigm in the past 20 years, are all unavailable to someone controlling them by voice and ears alone.

The more I think about it, the more I realise the browser paradigm, with its emphasis on the graphical user interface, is just no good at all for what I’m trying to do here. Some things, like connecting with other people and using search, are far closer to the natural process of audio control. Others, such as hierarchical file structures, quickly scanning detailed documents or browsing images, are alien to it. I worry that efforts to develop a standardised semantic web architecture will be tripped up if designers simply take what emerges and apply it to the spatial web we’re used to.

The language used by the W3C on its Semantic Web project isn’t reassuring:

but can I see my photos in a calendar to see what I was doing when I took them? Can I see bank statement lines in a calendar?

It’s very visual. What about: can I hear, by voice command, a forecast of my bank balance at some future date while I’m using a project management app to plan my wedding? Can I search by voice for a venue, check availability on a certain date and call to book, without literally lifting a finger? Now that would be neat.

Has anyone yet mastered audio user interface design for tasks more complex than using urgent bleeps to help park a car? There are some fantastic guides to the design thinking in this area. There are also a number of intriguing experiments into the relationship between information and sound, such as Peter Gregson and Daniel Jones’s “The Listening Machine,” or “Cybraphon” by Tommy Perman, Simon Kirby and Ziggy Campbell. Yet it looks like it could be decades before a useful, voice-driven audio user interface (AUI) becomes a genuine alternative to the GUI we’ve become dependent on.

My hope is that it won’t take another 20 years before the design principles for an audio web are settled. Applying the lessons of all the web design culs-de-sac of the past should allow designers to take shortcuts in developing the ideal AUI.

But then, with hindsight, putting share buttons in the right place sounds pretty easy too, and that simple thing alone took five years.

I use the older, more robust form of audio user interface to order a double espresso.
