Understanding Audio Augmented Reality
Augmented reality is about adding virtual content to the physical world. So far the emphasis has been on augmenting what we see. With the growing popularity of wireless earbuds and voice assistants, let’s explore what happens when we shift the emphasis to augmenting what people hear, or how they experience sound in the first place.
Rather than offer a prescriptive definition, I propose that what we call augmented reality is a combination of form factor, user interface, and the actual experience itself. Let’s apply that framework and see how the components of the Audio AR ecosystem are shaping up.
Form Factors for Audio AR
The ideal form factor is one that lasts all day long and is so small and unobtrusive that the user can almost forget they’re wearing it.
Wireless Earbuds (e.g. AirPods, Pixel Buds) With a small form factor and effectively all-day battery life, wireless earbuds now have transparency and noise cancellation modes that mix digital and physical content, making them the ultimate form factor for audio augmented reality.
Over Ear Headphones (e.g. Beats Studio 3, Bose QC35/700) The bulky form factor of over-ear headphones precludes them from being an all-day wearable device. They are great at isolating you from the real world and immersing you in digital content; in this way I think they are more comparable to VR headsets than AR devices.
Audio Glasses (e.g. Bose Frames, Amazon Echo Frames) Audio glasses use speakers embedded in the earpieces to let a user listen to music and notifications while still hearing the outside world. In my experience, the outside world can very quickly drown out the content, so this form factor might be best suited to notifications and single-action responses, similar to North Focals.
User Interface for Audio AR
Augmented reality is associated with a more natural UI layer — one that allows a user to engage with digital content in a way that keeps their hands free to focus on other tasks. For Audio AR, the hands free voice interface is well established, and there is potential in a UI around recognizing one’s head gestures as well.
VOICE UI/ALWAYS ON ASSISTANT
An “always-on wake word” describes a device that is always listening for you to say its wake word (e.g. “hey Siri”) and doesn’t require you to push a button or touch anything to start the interaction. It’s almost completely friction free. Many earbuds will connect you with a voice assistant, but they require you to press and hold a button or perform some equivalent action. AirPods Pro were the first to offer this hands-free feature, and an always-on wake word seems to be the new normal moving forward. Touching a button to cue a virtual assistant might seem similar in many ways, but only the always-on wake word is a true hands-free interface, which is a defining characteristic of augmented reality experiences and hardware.
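To make the idea concrete, here is a minimal sketch of the always-listening loop, simplified to operate on a stream of already-recognized words rather than raw audio. The function name and wake phrase are invented for illustration; a real device runs this kind of detector continuously on a low-power DSP over audio frames.

```python
from collections import deque

WAKE_PHRASE = ("hey", "assistant")  # illustrative wake phrase, not a real product's

def watch_for_wake_word(token_stream, wake_phrase=WAKE_PHRASE):
    """Scan an always-on stream of recognized words and return the index
    at which the wake phrase completes, or -1 if it never occurs.
    No button press is involved: the loop simply never stops listening."""
    window = deque(maxlen=len(wake_phrase))
    for i, token in enumerate(token_stream):
        window.append(token.lower())
        if tuple(window) == wake_phrase:
            return i  # hand off to the full assistant from here
    return -1
```

The friction-free quality comes entirely from the fact that this loop is always running; the user's only "input" is speech.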
HEAD TRACKING/GESTURE CONTROL
Motion sensors in earbuds can identify gestures and let you control content accordingly. One example is the “remove to pause” function, where the earbuds automatically detect when they’re removed from your ears and pause the music. Bose was trying to take gesture recognition to the next level by identifying when a user was nodding their head yes or no, allowing them to respond to content without needing to use their voice or a screen, and was even looking into measuring physical activity that phones and watches struggle to measure, like push-ups. With the now-defunct BoseAR SDK, Bose had also given developers access to the on-board sensors to come up with their own ideas for head tracking and gesture control.
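A nod gesture like the one Bose was exploring can be detected from the pitch-axis gyroscope alone: a "yes" shows up as alternating down/up swings in angular velocity. The sketch below is a toy classifier under that assumption; the threshold and swing count are invented, not values from any real SDK.

```python
def detect_nod(pitch_rate, threshold=0.5, min_swings=2):
    """Classify a head nod ("yes") from a sequence of pitch-axis
    angular-velocity samples (rad/s). A nod produces direction
    reversals above a noise threshold; we count those reversals."""
    swings = 0
    last_sign = 0
    for rate in pitch_rate:
        if abs(rate) < threshold:
            continue  # ignore small movements as sensor noise
        sign = 1 if rate > 0 else -1
        if sign != last_sign:
            if last_sign != 0:
                swings += 1  # the head changed direction
            last_sign = sign
    return swings >= min_swings
```

A head-shake ("no") detector would be the same logic applied to the yaw axis, which is one reason exposing raw sensor access to developers, as the BoseAR SDK did, was an appealing idea.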
Experiences for Audio AR
There are two kinds of experiences I’ll explore: augmented hearing, and spatial audio content.
AUGMENTED HEARING
Change how we perceive the real world by adding a digital layer
Augmented reality adds or subtracts from what naturally exists, so for audio we can consider the blocking or amplifying of real world sounds.
Hearing aids — Truly the first and most obvious devices for augmenting what we hear. Not only do hearing aids “increase the volume” of the world around you, many have advanced equalizer settings programmed by audiologists and can switch between presets based on the kind of aural environment you find yourself in (outside, a crowded restaurant, the subway, etc.).
Earplugs & Active Noise Cancellation (ANC) — Moving to the other end of the spectrum, earplugs are effectively the opposite of hearing aids, and ANC is comparable in that it too looks to block or cancel out external sounds. AirPods Pro were among the first wireless buds to have this feature, and that plus their hands-free voice assistant is why I call them the first proper augmented reality earbuds. It’s interesting that of the new generation of earbuds from Google (Pixel Buds), Amazon (Echo Buds), and Microsoft (Surface Earbuds), none are noise canceling.
Transparency Mode — Apple’s transparency mode has similarities with hearing aids and feels like the opposite of noise canceling: this setting effectively dampens the digital content and amplifies sounds from the real world, increasing how aware and present you are with your surroundings.
Adaptive Sound — Google has what at first glance feels like a feature similar to transparency mode, but from what I understand they actually take the opposite approach: instead of keeping your content at the same volume and working to dampen external sounds, they automatically make your content louder to block them out (e.g. a blender in a coffee shop turning on). I haven’t tried this myself, but from a UX perspective I wonder: (1) does this make the music too loud in compensating for the blender? Is there a dB cutoff? (2) Does it return to normal volume once the blender turns off?
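Both UX questions above have a natural answer in the control logic itself. The sketch below is my guess at a sensible design, not how Google's Adaptive Sound actually works: compensation is proportional to noise above a quiet baseline, a hard cap answers question (1), and making the output a pure function of the current ambient level answers question (2), since volume falls back to the base value the moment the noise stops. Every constant here is invented for illustration.

```python
def adaptive_volume(base_volume, ambient_db, quiet_db=40.0,
                    gain_per_db=0.01, max_volume=0.85):
    """Raise playback volume (0.0-1.0) in proportion to ambient noise
    above a quiet baseline, capped so compensation can't push levels
    into harmful territory. Stateless: when ambient_db drops back to
    the baseline, the returned volume is simply base_volume again."""
    excess = max(0.0, ambient_db - quiet_db)
    return min(base_volume + excess * gain_per_db, max_volume)
```

So with these made-up numbers, a blender at 88 dB would push a 0.5 base volume up to the 0.85 cap, and the volume would return to 0.5 as soon as the room quiets down.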
Audio Filters — As mentioned, hearing aids have equalizer presets that change depending on the kind of sound environment you find yourself in. Adaptive equalizers feel like something consumers of all kinds would want access to, and they will start to blur the line between earbuds and hearing aids.
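The hearing-aid-style preset switching described above can be sketched as a simple lookup from a detected sound environment to per-band gains. The environments and dB values below are hypothetical, loosely modeled on the audiologist-programmed programs mentioned earlier, not taken from any real device.

```python
# Hypothetical per-environment equalizer presets: gain in dB for
# low / mid / high frequency bands.
EQ_PRESETS = {
    "outdoors":   {"low": -2.0, "mid": 0.0, "high": 3.0},
    "restaurant": {"low": -4.0, "mid": 2.0, "high": 4.0},
    "subway":     {"low": -6.0, "mid": 1.0, "high": 2.0},
    "default":    {"low": 0.0, "mid": 0.0, "high": 0.0},
}

def select_preset(environment):
    """Pick the EQ preset for a detected sound environment, falling
    back to a flat response when the scene is unrecognized."""
    return EQ_PRESETS.get(environment, EQ_PRESETS["default"])
```

The hard part in practice is not this lookup but the scene classification feeding it, which is exactly where earbuds with always-on microphones start to resemble hearing aids.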
Another class of audio filters could be specific to voice. Imagine if your earbuds could autotune the voice of everyone around you. While live autotuning everyone around you might not be feasible quite yet, TikTok has many voice filters you can apply to a video in post-processing, and their popularity shows there is an appetite for this kind of audio manipulation. I would not be surprised to find out that some podcasters have been “photoshopping” their voices (please let me know if you are familiar with this). Now that Twitter is testing voice tweets, the sound of your voice will become as important to your identity as the way you look, so there’s no reason to think filters won’t get involved. For the purposes of this article, I would also put the Google Pixel Buds translation features into this bucket of “audio filters.”
SPATIAL AUDIO CONTENT
New kinds of responsive content
Many spatial audio experiences exist and will continue to be developed. There are two kinds I’d like to explore: creating an incredibly immersive listening experience so a stationary user feels like the music is coming from all around them, and creating a sonic landscape that a user moves through.
Immersive Listening Experience (Sony 360 Reality Audio) — If you haven’t tried 8D audio, put on headphones and check it out. When it truly feels like the music is moving around you, it is an incredibly immersive experience. As I understand it, Sony 360 is meant to be levels beyond this: it is a new way to mix music, with Sony using 128 points around the listener from which the music can be generated. It supposedly creates a level of immersion well beyond the 8D example. I’ll have to check back on this after trying it myself.
Sonic Landscape for Discovery (BoseAR) — Imagine being in a room with a musician in each corner. As you walked closer to the singer and further from the guitar, the vocals would get louder and the guitar would sound quieter. If you stood still and turned in a circle, the music would sound different as you spun around. By putting motion sensors on their headsets, Bose was trying to enable this kind of content with their BoseAR initiative. By knowing how a user is moving through space and where their head is pointed, Bose was also working on tools so developers and storytellers could create audio tours and stories with much more interactivity than currently exists. While the concepts made sense, Bose was unable to fully execute on these ideas and ended up shutting their BoseAR program down.
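The musicians-in-a-room thought experiment reduces to a per-source gain computed from the listener's position. The sketch below uses a simple inverse-distance rolloff; the source layout and rolloff model are illustrative only, not how BoseAR or any shipping spatial-audio engine actually mixed content (a real engine would also use head orientation for binaural panning).

```python
import math

def mix_levels(listener, sources, rolloff=1.0):
    """Compute a relative gain (0.0-1.0) for each fixed audio source
    given the listener's (x, y) position, using inverse-distance
    attenuation: sources you walk toward get louder, sources you walk
    away from get quieter. `sources` maps a name to an (x, y) position."""
    lx, ly = listener
    gains = {}
    for name, (sx, sy) in sources.items():
        distance = math.hypot(sx - lx, sy - ly)
        gains[name] = 1.0 / (1.0 + rolloff * distance)
    return gains
```

Standing one unit from the singer and nine from the guitarist, the vocals would come through at five times the guitar's level; take a few steps across the room and the mix inverts, which is exactly the interactivity an audio tour or story could build on.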
What’s Next For Audio AR
Rather than try to define Audio AR, I have explored the various components of augmented reality that are appearing in the audio space with the framework:
Audio AR = Form Factor + UI + Experiences
We have passed a tipping point with regard to Form Factor (wireless earbuds) and UI (voice), and I believe we will see more emphasis on Experiences moving forward. Specifically, I believe the time is ripe for earbuds to act more and more like hearing aids, so I anticipate seeing more features around being able to better hear the world around you. The earbud form factor can continue to improve, and it may need additional sensors to enable new experiences in the future, like biometric sensors or ultra-wideband chips for spatial awareness. Perhaps these sensors will also end up enabling new UIs.
In this post, I’ve focused on how Audio AR might change the way we experience sound through augmentation or spatial audio experiences. In a follow up, I will explain how I believe Apple’s approach will be more focused on changing the way we communicate and share audio experiences together.