Audio AR — Part 2: Acoustic Sensing

Manaswi Saha · Published in Labs Notebook · Aug 4, 2023

Co-authored by David Goedicke, Wendy Ju, and Mike Kuniavsky

Sound is consequential — most actions that take place in the physical world produce sound. From dropping a pen to an air conditioner starting up, sound tells us something about what’s happening in our environment, and does so without a direct line of sight.

The avant-garde composers of the 1950s and 1960s experimented with the very fundamentals of music and sound, breaking down traditional rules and schemas. In particular, John Cage, an American composer and music theorist, focused a part of his work on the use of consequential sounds and the absence of deliberate sound generation. One of his most famous and controversial pieces was the 1952 composition 4’33” with deliberate absence of intentional sounds for a total of 4 minutes and 33 seconds. The declared goal was to sharpen the audience’s attention to the many environmental and consequential sounds around us.

While this original piece was rather controversial, the intended lessons from the performances still stand to this day. We are immersed in a “loud” environment, and we draw information about the acoustic context constantly. Being conscious of the value of sound can help us better interpret the environments we find ourselves in and make better choices about how to acoustically augment them.

In this second article of the Audio AR blog series, we will explore what role acoustic sensing plays within the larger context of Audio Augmented Reality (Audio AR).

  • We will first understand how to extract the value of sound through state-of-the-art acoustic sensing techniques.
  • We will then explore how we can create a fully immersive audio AR experience by transforming sounds and integrating them with other data streams and the latest audio AR hardware, setting an outlook towards acoustic digital twins.
Sound signal shown in sound-editing software: sounds embed information about the world | Photo by Peter Stumpf on Unsplash

Listen before you Talk: Detecting and Processing Sounds

As we attempt to augment people's acoustic experience, we first want to understand what information is already present in the acoustic environment. This matters both for using that information to construct context and for questions of acoustic interface design.

Besides its pure information content, sound has characteristics that make it useful, and in some cases preferable to visual information, for detection and monitoring tasks. First, sound is omnidirectional: a generated sound travels in all directions equally¹, much as an omnidirectional microphone picks up sound from all directions. Second, sound is easily reflected, especially by flat and hard surfaces, which lets it travel around corners.

¹ Obstacles, resonances, and reflections notwithstanding.

Both of these features, however, bring challenges for sound classification. First, identifying the sound source can be particularly difficult, especially if multiple similar noise generators are in the environment, such as similar alarm sounds from machines in hospital rooms. Second, the mixing of different sources and reflected sounds complicates the analysis. Thus, we require simplified application contexts to deploy reliable sound classification algorithms, such as picking only a few classes of sound to detect or restricting the system to one specific room. The sound classification app Merlin Bird ID by the Cornell Lab is an example of such a constrained classification system "in the wild". It lets a user record a snippet of audio, specify the region they are in, and then attempts to identify the bird responsible for the recorded call.
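To make this concrete, here is a minimal sketch of such a constrained classifier in Python, assuming a small, fixed set of hypothetical sound classes and labeled recordings collected in the target environment. The feature choice (log-mel statistics) and the classifier are illustrative, not a prescription.

```python
# A minimal sketch of a constrained sound classifier: a handful of target
# classes and a simple feature + classifier pipeline. Class names and file
# paths are hypothetical placeholders.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

CLASSES = ["alarm", "door_close", "background"]  # deliberately small set

def embed(path, sr=16000):
    """Summarize a clip as the mean/std of its log-mel spectrogram bands."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    logmel = librosa.power_to_db(mel, ref=np.max)
    return np.concatenate([logmel.mean(axis=1), logmel.std(axis=1)])

def train(train_files):
    """train_files: list of (wav_path, class_name) pairs recorded in the target room."""
    X = np.stack([embed(p) for p, _ in train_files])
    y = [CLASSES.index(label) for _, label in train_files]
    return RandomForestClassifier(n_estimators=200).fit(X, y)

def classify(clf, path):
    probs = clf.predict_proba(embed(path)[None, :])[0]
    return CLASSES[int(np.argmax(probs))], float(probs.max())
```

Constraining the problem this way trades generality for reliability: with few classes and a known room, even simple features and classifiers can perform acceptably.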

Other specialized forms of sound classification also exist. For example, contact microphones attached directly to a machine or motor can pick up changes in vibration. This can be useful to monitor usage and raise early maintenance alarms for the monitored device. In general, algorithms that extract information from sound cover a wide range of areas: from detecting singular events and continuous performance monitoring to a full contextual understanding of how operational changes take effect in a larger environment.
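As a rough illustration of this kind of condition monitoring, the sketch below learns a spectral baseline from contact-microphone frames recorded during normal operation and flags frames that deviate strongly from it. The frame size and thresholds are assumptions for illustration, not tuned values.

```python
# A rough sketch of condition monitoring from a contact microphone: learn the
# spectral "fingerprint" of normal operation, then flag frames that drift away
# from it.
import numpy as np

FRAME = 4096  # samples per analysis frame (illustrative)

def spectrum(frame):
    window = np.hanning(len(frame))
    return np.abs(np.fft.rfft(frame * window))

def fit_baseline(normal_frames):
    """Mean and std of the magnitude spectrum during known-good operation."""
    specs = np.stack([spectrum(f) for f in normal_frames])
    return specs.mean(axis=0), specs.std(axis=0) + 1e-9

def is_anomalous(frame, baseline, z_threshold=4.0):
    mean, std = baseline
    z = np.abs(spectrum(frame) - mean) / std
    # A large deviation across many frequency bins suggests the machine's
    # vibration signature has changed.
    return np.mean(z > z_threshold) > 0.1
```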

Furthermore, acoustic sensing and classification, much like computer vision in its early stages, has attracted sustained research attention, establishing it as a computational task in its own right alongside visual information processing. An example is the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), which hosts a variety of dataset-based challenges for audio detection.

Multi-modal design space for creating an immersive Audio AR experience

To achieve a comprehensive, fully immersive, acoustically augmented experience, it is crucial to integrate the acoustic context information from the sound classification algorithms with additional sensor data streams and audio AR hardware. For example, the results of acoustic event detection and localization can be combined with features from other information systems such as spatial tracking, computer vision, thermal imaging, and the like.
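One way to picture such a combination: if an acoustic event detector supplies a direction-of-arrival and a rough range for a sound, and the spatial-tracking system supplies the device pose, the event can be placed in world coordinates. The sketch below assumes the pose is given as a position vector and a device-to-world rotation matrix; the interfaces are hypothetical.

```python
# A sketch of fusing an acoustic event with spatial tracking: place a detected
# sound in world coordinates using the device pose.
import numpy as np

def doa_to_unit_vector(azimuth_rad, elevation_rad):
    """Device-frame unit vector for a direction-of-arrival estimate."""
    return np.array([
        np.cos(elevation_rad) * np.cos(azimuth_rad),
        np.cos(elevation_rad) * np.sin(azimuth_rad),
        np.sin(elevation_rad),
    ])

def event_world_position(doa_az, doa_el, range_m, device_pos, device_rot):
    """device_pos: (3,) world position; device_rot: (3,3) device-to-world rotation."""
    direction_world = device_rot @ doa_to_unit_vector(doa_az, doa_el)
    return device_pos + range_m * direction_world

# Example: an alarm detected 30 degrees to the left, roughly 5 m away,
# while the headset sits at the origin facing along +x.
pos = event_world_position(np.radians(30), 0.0, 5.0,
                           device_pos=np.zeros(3), device_rot=np.eye(3))
```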

Modern AR devices already include the hardware required to detect objects with computer vision and to spatially reference themselves, as this is part of producing visual AR content. For example, Apple's recently unveiled AR headset, Vision Pro, incorporates advanced audio processing capabilities to localize the headset and detect the acoustic qualities of the environment. These kinds of technological capabilities and advancements are crucial to localize, detect, and augment sounds, and to realize the Audio AR vision.

A major challenge in the field remains integrating and combining these sensor streams to generate situational awareness in specific application environments. Open questions include: in what ways can we affect what a user hears, how does that influence their perception, and how can we use this to guide attention and support task guidance and completion?

Hide-Amplify-Augment Framework: Designing Audio AR

Detecting sounds and understanding environmental context is just half the game. The more interesting part is what we can do with the acoustic data already in the environment. From a first ideation, we have three options. First, we can attempt to hide sounds from the environment, leaving room, and attention, for other, more important sounds. Second, important sounds could be amplified, for example through band-pass filtering that emphasizes certain tones. Finally, certain elements could be augmented by additional sounds played through the audio AR device.
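As a small sketch of the "amplify" option, the snippet below isolates a frequency band with a band-pass filter and mixes it back in louder while attenuating everything else. The band edges and gains are illustrative assumptions (say, a siren centred around 1 kHz), not values from the original article.

```python
# A minimal sketch of band-pass amplification: boost one frequency band,
# attenuate the residual, and keep the result within full scale.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def emphasize_band(audio, sr, low_hz=800, high_hz=1500,
                   band_gain=4.0, residual_gain=0.3):
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sr, output="sos")
    band = sosfiltfilt(sos, audio)          # the sounds we want to stand out
    out = band_gain * band + residual_gain * (audio - band)
    return np.clip(out, -1.0, 1.0)          # avoid clipping beyond full scale
```

A real device would do this with low-latency streaming filters rather than offline filtering, but the idea is the same: reshape the balance between bands rather than the whole signal.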

To illustrate how these three approaches could play a role in various settings:

Hiding. The ability to hide certain sounds coming from the environment could be essential, especially when the wearer finds themselves on a noisy factory floor with constant background noise from machines and warning sirens. An audio AR device could be employed to identify the origin of these noises and to suppress the distracting sounds, allowing the wearer to work with more focus. The headset itself can take over the monitoring task, detecting critical conditions through sound and turning off the acoustic damping when a critical situation occurs.

Amplifying. Let’s consider a construction-site scenario where safety is a top priority. Important warning signals, such as alarms or sirens, need to be clearly perceived by all workers around to ensure their safety. Other candidates for amplification could be acoustic stress indicators of materials or equipment. Detecting and amplifying some of these key sounds would make it easier to recognize them amidst a confusing construction environment, leading to more subconscious safety signaling and a more aware workforce.

Augmenting. Sound could find its way into training simulations for medical professionals, helping trainees learn certain acoustic signifiers more quickly through augmented training scenarios. With the headset’s ability to track task progress, it could provide real-time audio augmentation to guide the trainee’s attention. Additional audio cues could also include guidance or instructions to augment the task. By augmenting the training environment with relevant audio cues, the audio AR device enhances the trainee’s learning process and helps them improve their skills and proficiency in a controlled and immersive setting.

An additional example of augmenting human listening through a headset would be sonifying otherwise inaudible data streams, for instance by shifting ultrasound into the audible range. Other ideas could revolve around sonifying information from other sensor probes.
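A minimal sketch of that idea, borrowing the heterodyning trick used in simple bat detectors: mix the signal down by a fixed frequency offset and low-pass the result so that ultrasonic content lands in the audible range. The sample rate and offset below are assumptions (a microphone sampling well above the ultrasonic band of interest).

```python
# A rough sketch of making an ultrasonic signal audible via heterodyning:
# multiply by a carrier near the band of interest, then low-pass so only the
# down-shifted (audible) component remains.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def heterodyne_down(audio, sr, shift_hz=38000, audible_cutoff_hz=8000):
    t = np.arange(len(audio)) / sr
    mixed = audio * np.cos(2 * np.pi * shift_hz * t)  # shifts content near shift_hz towards 0 Hz
    sos = butter(6, audible_cutoff_hz, btype="lowpass", fs=sr, output="sos")
    return sosfiltfilt(sos, mixed)

# e.g. a 40 kHz sensor tone recorded at 96 kHz comes out as a 2 kHz audible tone
```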

Next in the Series

In this article, we explored the idea of extracting information out of sounds, the relevant techniques, and the underlying design approaches for audio AR applications, namely hiding, amplifying, and augmenting. However, to make these applications work reliably for different tasks, it is crucial to have a consistent and functional digital twin of the application environment. A digital twin is a virtual representation that mirrors the physical world, allowing for design, tracking, and system analysis. In our next article, we will delve deeper into the concept of acoustic digital twins, examining their significance for Audio AR and outlining the requirements for building an acoustic digital twin.

Note on Accenture Labs

For an introduction to Audio AR, check out the first article of this blog series. This blog series is part of our task guidance research at the Digital Experiences team at Accenture Labs and our collaboration with Wendy Ju’s lab at Cornell Tech. This work is also a continuation of our past efforts around guidance for home healthcare tasks (CARE). To learn more about the ongoing work at Labs broadly, check out our Accenture Labs Notebook Medium publication.


Manaswi Saha is an HCI Research Scientist at Accenture Labs, with a Comp. Sci. PhD from the University of Washington. Interests: HCI, AR, visualization, urban informatics, and social good.