Audio AR: An Introduction

Part 1: Towards Hands-Free Eyes-Free Interaction

Manaswi Saha
Labs Notebook
6 min read · Jun 6, 2023


Co-authored by Wendy Ju, Mike Kuniavsky, and David Goedicke

Your most awaited Ikea purchase has just arrived! You open the package with excitement, only to be met by a very long instruction booklet. You feel the pressure slowly rising, until you see the Audio AR compatible hardware logo. You immediately pick up your AR-enabled headset (glasses with head-mounted cameras and audio sensors), scan the QR code on the package, and get to work! The glasses use spatial audio and other non-speech audio cues to guide you every step of the way: choosing the right components in the correct order, tracking how well you are doing, and correcting you when you are about to make a mistake. For example, pairing your headset with a sensor-mounted screwdriver, the system uses continuous audio tones to stop you from turning a screw more times than necessary, before you split the wood.
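To make that feedback loop concrete, here is a minimal Python sketch of how rotation counts might be mapped to a warning tone. Everything in it (the turn threshold, the frequencies, and the simulated sensor reading and audio output) is hypothetical, standing in for a real tool’s sensors and the headset’s audio synthesis.

```python
import math

# Minimal sketch of the smart-screwdriver feedback loop from the scenario.
# Sensor readings and audio output are simulated here; a real system would
# read rotation counts from the tool and synthesize the tone on the headset.

TARGET_TURNS = 6.0  # illustrative: turns at which the screw is fully seated
WARN_MARGIN = 1.0   # start warning one full turn before the limit
BASE_HZ, MAX_HZ = 440.0, 1760.0  # calm tone, up to an urgent tone two octaves higher

def warning_pitch(turns: float) -> float:
    """Map progress to a tone: steady while safe, rising near the limit."""
    if turns < TARGET_TURNS - WARN_MARGIN:
        return BASE_HZ
    t = min((turns - (TARGET_TURNS - WARN_MARGIN)) / WARN_MARGIN, 1.0)
    return BASE_HZ * math.pow(MAX_HZ / BASE_HZ, t)  # exponential sweep

def play_tone(freq_hz: float) -> None:
    print(f"tone: {freq_hz:7.1f} Hz")  # stand-in for real audio synthesis

# Simulated tightening: the pitch stays flat, then sweeps up as a warning.
for tenth_turns in range(0, 65, 5):
    turns = tenth_turns / 10.0
    play_tone(warning_pitch(turns))
    if turns >= TARGET_TURNS:
        print("stop: screw fully seated")
```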

That’s the power and potential of Audio AR!

Furniture assembly is just one of many possible tasks. Think of firefighters in training, pilots learning to operate controls, DIY enthusiasts building computers or repairing cars; the list is endless. Audio AR has the potential to guide you without pulling your eyes away from the task at hand to a booklet or a video. A new future is unfolding in which vision-focused training may become a thing of the past. By tapping into the power of sound, we can unlock a realm of sensory exploration and narrative richness that has so far gone largely unexplored.

Let’s dive into this technology with our blog series on Audio AR; this first article explores the basics!

[Image: a grid of four scenes: (1) a person working on a machine; (2) a child wearing glasses, holding a small robot; (3) a woman listening to music on headphones; (4) a woman repairing an aircraft]
By combining the benefits of spatial audio and other sound-manipulation techniques with visual task tracking, future Audio AR technology will pave the way toward revolutionary guidance systems for physical tasks such as repairing, building, and cooking, among others

What is Audio AR?

Audio-only Augmented Reality (audio AR or AAR) superimposes audio information onto people’s physical 3D environments for the purpose of providing information, steering attention, and directing action. Audio AR is ushering in a new wave of interactive experiences for situations where your visual channel is either busy with intensive work (e.g., painting, cooking, rock climbing) or limited by little or no vision (e.g., for people who are blind or have low vision). The former is known as a ‘situational impairment’: a person performing a physical task has their hands and eyes occupied (e.g., the distracted driver in the figure below). In these scenarios, presenting critical guidance through the auditory channel frees up the visual channel to focus on the task at hand.

[Image: a grid showing different types of permanent, temporary, and situational impairments across four senses (touch, see, hear, speak); for example, “see” + permanent = icon of a blind person, while “see” + situational = icon of a distracted driver looking away from the road]
Examples of different types of permanent, temporary, and situational impairments across the four senses: touch, see, hear, and speak. Source: Microsoft’s Inclusive Design 101 Guidebook

Audio AR is an interesting facet of augmented reality at large. AR’s defining characteristic is the correspondence between the physical environment and computational augmentation that responds to a user’s local activity and movement. Visual AR overlays holographic images or information onto the user’s physical 3D world, viewable through a smartphone or AR glasses. In contrast, Audio AR augments the user’s acoustic environment with audio objects, creating a virtual soundscape for anyone wearing audio AR-enabled hardware. Current advancements in networking, sensing, machine perception, and modeling will be instrumental in enabling augmented audio experiences that benefit people who are physically or visually engaged, by having the audio information adapt to the person, location, or situation.
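To give a rough sense of what “augmenting the acoustic environment with audio objects” means computationally, here is a toy Python sketch that anchors one virtual sound source at a fixed world position and derives stereo gains from the listener’s pose. Real audio AR systems use head tracking with HRTF-based binaural rendering; the constant-power pan and inverse-distance attenuation below are deliberate simplifications for illustration.

```python
import math

# Toy renderer for one world-anchored audio object. Real audio AR relies on
# HRTF-based binaural rendering and continuous head tracking; this sketch
# only shows the core idea: output gains depend on the listener's pose.

def render_object(source_xy, listener_xy, listener_yaw_rad):
    """Return (left_gain, right_gain) for a source fixed in the world."""
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    distance = math.hypot(dx, dy)
    # Angle of the source relative to where the listener is facing
    # (positive = counterclockwise, i.e., toward the listener's left).
    rel = math.atan2(dy, dx) - listener_yaw_rad
    # Constant-power pan: -1 = full left, +1 = full right.
    pan = max(-1.0, min(1.0, -math.sin(rel)))
    left = math.cos((pan + 1.0) * math.pi / 4.0)
    right = math.sin((pan + 1.0) * math.pi / 4.0)
    # Simple inverse-distance attenuation, clamped near the source.
    gain = 1.0 / max(distance, 1.0)
    return left * gain, right * gain

# The source stays put while the listener turns, so the sound feels
# anchored in the environment rather than inside the headphones.
for yaw_deg in (0, 45, 90):
    l, r = render_object((2.0, 0.0), (0.0, 0.0), math.radians(yaw_deg))
    print(f"yaw {yaw_deg:3d} deg  L={l:.2f}  R={r:.2f}")
```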

Audio AR Experiences and Hardware: Past & Present

Audio AR experiences have taken numerous shapes and forms, with early examples dating back to the mid-90s and early 2000s. They range from immersive experiences in museums and theaters (e.g., Audium) to geo-locative navigation experiences (e.g., Microsoft Soundscape) to audio-only and mixed-reality audio games. However, most of these experiences were limited in scope. With new hardware capabilities, we can start to think beyond momentary immersive experiences to full-scale, real-world physical tasks.

In recent years, AR hardware has seen several bursts of investment. The recently launched Apple Vision Pro headset features an advanced ambient spatial audio system. Toward the ambitious goal of audio-only interactions, in March 2018 Bose announced the “world’s first audio augmented reality platform”: a suite of audio AR-enabled products with spatial audio capabilities (Noise Cancelling Headphones 700, QC35 Headphones II, and Bose Frames) along with audio AR SDKs. Other commercial audio AR-compatible technologies include smart earbuds (e.g., Nuheara), hearables, and Personal Sound Amplification Products (PSAPs). These devices offer features such as noise cancellation, directional focus, and speech-in-noise control that, in Nuheara’s words, “gives you the control to hear what you want to hear in the world around you.”

[Image, two parts: the upper half shows Apple’s ambient spatial audio system; the bottom half shows the Bose AR logo with a sidebar highlighting three components: Aware Audio Devices; Head movement sensor array; Developer SDK, libraries, and tools]
Two prominent examples of major investments in AR tech from hardware giants: Apple and Bose. Source: Apple Vision Pro | Bose 2019 Hackathon Guide

New Interaction Dimensions via New Technological Advances

While Audio AR has been around for some time, particularly for geo-locative storytelling, recent technological advancements have enabled new interaction dimensions that rely on more intelligence and adaptation in the pairing of sound, activity, and task. Computer vision with audio guidance, for example, can auditorily warn a person with a vision impairment about dogs, cars, or tripping hazards with the assistance of a head-mounted or body-mounted camera. Similarly, advancements in ultra-wideband geolocation make it possible to track moving objects in indoor and outdoor spaces; Audio AR can draw attention to salient objects or help people understand second-order factors, such as whether a movement is accelerating or decelerating. Physiological sensing and modeling have advanced to the point that adaptive alerts can be customized to tell people what they need to know without adding unnecessarily to their cognitive load. This makes it possible to profile and augment people’s situational awareness, for example while they are operating heavy machinery or driving.
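As a sketch of the vision-plus-audio pipeline just described, the Python below scores detected hazards by priority, distance, and closing speed, then raises a directional cue for only the most urgent one, keeping cognitive load in check. The detection structure, hazard priorities, and alert call are illustrative stand-ins, not a real detector or audio renderer.

```python
from dataclasses import dataclass

# Sketch of the camera-to-audio-alert pipeline. Detections and the
# spatialized cue are stand-ins; a real system would run an object
# detector on the head-mounted camera feed and pass the chosen cue
# to a binaural renderer.

@dataclass
class Detection:
    label: str          # e.g., "dog", "car", "curb"
    bearing_deg: float  # direction relative to the wearer's heading
    distance_m: float
    closing_mps: float  # closing speed; > 0 means approaching

HAZARD_PRIORITY = {"dog": 1, "curb": 2, "car": 3}  # illustrative weights

def urgency(d: Detection) -> float:
    """Higher when the hazard is high priority, close, and approaching."""
    priority = HAZARD_PRIORITY.get(d.label, 0)
    return priority * (1.0 + max(d.closing_mps, 0.0)) / max(d.distance_m, 0.5)

def alert(d: Detection) -> None:
    # Stand-in for spatialized playback: the cue would be rendered so it
    # appears to come from the hazard's direction.
    print(f"cue '{d.label}' at {d.bearing_deg:+.0f} deg, urgency {urgency(d):.2f}")

detections = [
    Detection("dog", -30.0, 4.0, 0.0),
    Detection("car", 70.0, 12.0, 5.0),  # fast-approaching: should win
    Detection("curb", 5.0, 1.5, 0.0),
]
# Alert on only the most urgent hazard to avoid overloading the listener.
alert(max(detections, key=urgency))
```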

Advancements in computational editing can also influence the sounds used in Audio AR. For example, recent tools in computational video editing help editors automatically match sounds to the environment or mood of a video clip. In an AR environment, this sound-generation capability can help scale the audio augmentation of diverse situations, as it removes the need to explicitly program an audio track for every situation.
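As a toy stand-in for that matching step, the sketch below picks the ambient clip whose tags best overlap a detected scene. Real computational editing tools use learned audio-visual representations rather than a hand-built tag lookup; the library, tags, and filenames here are invented for illustration.

```python
# Toy scene-to-sound matching. Real tools learn audio-visual associations;
# this lookup merely illustrates the scaling idea: new situations get a
# plausible soundtrack without anyone authoring a track per situation.

SOUND_LIBRARY = {
    frozenset({"kitchen", "cooking"}): "sizzle_ambience.wav",
    frozenset({"workshop", "tools"}): "workshop_hum.wav",
    frozenset({"outdoors", "traffic"}): "street_ambience.wav",
}

def match_sound(scene_tags: set) -> str:
    """Pick the clip whose tags overlap the detected scene tags the most."""
    best = max(SOUND_LIBRARY, key=lambda tags: len(tags & scene_tags))
    if not best & scene_tags:
        return "neutral_room_tone.wav"  # fallback when nothing fits
    return SOUND_LIBRARY[best]

print(match_sound({"workshop", "tools", "dust"}))  # -> workshop_hum.wav
```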

These advancements contribute to the possibility of using Audio AR, for example, to actively guide people performing skilled work with a hands-free interface: alerting them to salient features in the environment (or, conversely, muting distractions) and helping them identify and remedy errors in their task execution.

Audio-only interactions can be a cognitively less taxing alternative when visual channels are engaged or saturated. However, to use the auditory channel as the primary feedback mechanism while streamlining the cognitive experience, we have to rethink how humans interact with these systems.

Where are we now?

Despite the promise of audio AR-enabled applications, the audio AR vision has not been fully realized to date. Instead, we have seen companies like Bose abandon the audio AR ship: while the concept of audio-only interactions was well received, the lack of compelling and interesting applications for everyday use led Bose to shut the platform down. In this blog series, we will examine the elements as well as the research needed to realize this ambitious vision, while emphasizing the complexities involved in doing so. We hope to reinvigorate the excitement around this technology, given current advancements in software, hardware, and research efforts.

Next up in the Series

We will first delve into the acoustic environment: what sound as a channel provides, how we sense the existing audio space, and how we build on top of it. Next, we will cover acoustic digital twins and the promise of a holistic acoustic experience, with examples of the applications they enable. Finally, we will conclude with a few calls to action and potential directions for the academic and commercial tech communities to invest in research and development efforts furthering the audio AR vision.

Note on Accenture Labs:
This blog series is part of our task guidance research on the Digital Experiences team at Accenture Labs and our collaboration with Wendy Ju’s lab at Cornell Tech. This work is also a continuation of our past efforts around guidance for home healthcare tasks (CARE). To learn more about the ongoing work at Labs more broadly, check out our Accenture Labs Notebook Medium publication!


Manaswi Saha
HCI Research Scientist at Accenture Labs | Comp. Sci. PhD from University of Washington | Interests: HCI, AR, visualization, urban informatics, and social good