Audio AR — Part 3: Acoustic Digital Twin

Sonically modeling the physical environment

Manaswi Saha
Labs Notebook
6 min read · Sep 18, 2023


Co-authored by Wendy Ju, Mike Kuniavsky, and David Goedicke

Avery, the autonomous vehicle, is driving through Manhattan, New York City. Avery is equipped with cameras, but the city is busy, crowded, and the halal carts are hard to see through. Still, Avery is able to anticipate what is coming down the road through the use of GPS and maps, and by listening carefully to the environment. Even though there is a school nearby, Avery can tell there are no kids around — maybe it’s a holiday? — so Avery can drive more quietly, without having to do as much to draw attention to his presence. On the other hand, Avery’s acoustic sensors can hear there’s a hubbub coming down 2nd Avenue, so Avery’s navigation system pilots him down 1st Avenue instead.

What is an Acoustic Digital Twin?

In this fictional scenario, the autonomous vehicle is using an acoustic digital twin to augment its other sensors to help navigate New York City. If you recall from our previous blog article, a digital twin is a virtual representation that mirrors the physical world, allowing for design, tracking, and system analysis.

An Acoustic Digital Twin (ADT) includes a sonic profile of the environment that supplements the model of the objects and people in a physical space and the activities that might take place there. More importantly, an ADT semantically interprets and spatially maps sounds to help the autonomous car understand the dynamic state of the space, such as where activity is currently happening, or what the people in the space are currently doing.

The acoustic digital twin also helps the autonomous car predict the sonic effect of its own actions, so that it can adapt, for example, its horn sound and volume to be clearly distinguishable from background sounds without being too loud.

ADTs of workspaces, cities, or other locations include not only the fixed and transient sources of audible phenomena — traffic, playgrounds, crowds — but also information about the acoustic surroundings that affect how new sounds are amplified or echoed in that environment. This enables not only modeling and simulation of sonic phenomena for design, but also staging experiments and planning.

Acoustic digital twins (ADTs) sonically model the physical environment and enable seamless augmented audio experiences to support tasks such as navigation | Photo by Henry Be on Unsplash

The concept of “twinning” dates back to the early NASA era, when replicas of spacecraft on Earth helped NASA scientists study, test, and predict the actions of spacecraft in space. Digital twins started to pop up in the early 2000s as a means to simulate and test new products using virtualized rather than physical replicas. Digital twin projects have largely focused on digital models of the buildings and large-scale structures in the physical world, to the exclusion of many other important factors: even persistent environmental objects such as fire hydrants, stop signs, or trees are often left out. Due to the temporal nature of sound, ADTs necessarily include more dynamic phenomena and some model of their temporal-spatial dimensions. As digital twins are increasingly applied to processes and to generating synthetic data for AI applications, the inclusion and understanding of auditory information will play a growing role.

What are the enabling technologies?

Acoustic digital twins are enabled by the synthesis of a number of digital technologies: large-scale 3D modeling, localization and mapping technologies, spatial audio input, acoustic inference, feature extraction, semantic modeling, and activity modeling.

Large-scale 3D modeling: Large-scale 3D modeling incorporates 3D models not only of large objects (cars, ships, or skyscrapers, for example) but also of entire urban landscapes. The physical spatial landmarks provided by large-scale 3D modeling often form the viewable and inspectable foundation of the digital twin, within which its acoustic and data layers are situated.

Localization and Mapping: Localization and mapping technologies align the digital model with the physical environment, enabling accurate tracking of physical objects and placement of digital objects in the shared digital-physical space. Because audio sources and displays correspond to physical locations in space, this mapping is an essential part of any ADT, even when the audio has no physical presence.
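As a minimal illustration of such placement, the sketch below anchors a virtual sound source at a fixed world position and converts it into the listener’s local frame using the listener’s tracked pose. The function name, 2D coordinates, and sign conventions are our own assumptions for illustration, not part of any particular ADT toolkit.

```python
import numpy as np

def world_to_listener(source_pos, listener_pos, listener_yaw):
    """Transform a world-frame source position into the listener's
    local frame (x = right, y = forward), given the listener's
    position and heading in radians. A 2D sketch; a real system
    would use full 3D poses from SLAM or GPS."""
    offset = np.asarray(source_pos) - np.asarray(listener_pos)
    c, s = np.cos(-listener_yaw), np.sin(-listener_yaw)
    rotation = np.array([[c, -s], [s, c]])
    return rotation @ offset

# A virtual sound landmark anchored 10 m "north" of the origin, heard
# by a listener standing 2 m "east" with a 45-degree heading.
local = world_to_listener([0.0, 10.0], [2.0, 0.0], np.pi / 4)
print(local)  # direction of the source relative to the listener's head
```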

Spatial Audio Input: Spatial audio models each audio source at a location in the digital space, so that the perceptual rendering of the audio environment accounts for the listener’s body position and head orientation with respect to each sound source.
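As a deliberately simplified sketch of this idea (a crude stand-in for the HRTF-based binaural rendering real spatializers use; all names here are illustrative), the snippet below renders a mono signal in stereo with constant-power panning driven by the source’s azimuth relative to the listener’s head:

```python
import numpy as np

def pan_by_azimuth(mono, azimuth_rad):
    """Constant-power stereo panning from a source azimuth
    (0 = straight ahead, +pi/2 = hard right). A crude stand-in
    for full HRTF-based binaural rendering."""
    # Map azimuth in [-pi/2, pi/2] to a pan angle in [0, pi/2].
    pan = (np.clip(azimuth_rad, -np.pi / 2, np.pi / 2) + np.pi / 2) / 2
    left = mono * np.cos(pan)
    right = mono * np.sin(pan)
    return np.stack([left, right], axis=-1)

# One second of a 440 Hz tone, heard 30 degrees to the listener's right.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
stereo = pan_by_azimuth(tone, np.deg2rad(30))
```

With constant-power panning, the left and right gains always sum in power to the original signal, so the source keeps a stable perceived loudness as the listener turns their head.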

Acoustic Inference: Acoustic inference takes audio input from the actual physical environment and uses it to detect and model objects, activities, and features of the physical space.
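A common first step, sketched below, is to convert raw microphone audio into log-mel spectrogram features that a sound-event classifier can consume. The file name is a placeholder, and `classify_events` is a hypothetical stand-in for whatever detection model the twin actually uses.

```python
import numpy as np
import librosa  # audio analysis library

def logmel_features(path, sr=16000, n_mels=64):
    """Load an audio clip and compute log-mel spectrogram features,
    a typical input representation for sound-event detection."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

features = logmel_features("street_corner.wav")  # placeholder recording
# events = classify_events(features)  # hypothetical model call, e.g.
#                                     # [("dog_bark", t_start, t_end), ...]
```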

Semantic Modeling: Semantic modeling translates physical and digital objects and activities into human-meaningful names or identifiers, enabling discourse and interaction about objects in the augmented space. For the ADT, the semantic model is the difference between a dog-bark sound and the recognition and labeling of that sound as a dog bark.
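In code, the semantic layer can be as simple as a structured record that attaches a human-meaningful label, a location, and a time span to each detected sound. The fields below are illustrative assumptions rather than a standard ADT schema:

```python
from dataclasses import dataclass

@dataclass
class SoundEvent:
    """A semantically labeled, spatially mapped sound event in the ADT."""
    label: str          # human-meaningful name, e.g. "dog_bark"
    confidence: float   # classifier confidence in [0, 1]
    position: tuple     # estimated (x, y, z) in twin coordinates
    start_s: float      # onset time in seconds
    end_s: float        # offset time in seconds

# The difference between "a sound occurred" and "a dog barked near the gate":
event = SoundEvent("dog_bark", 0.92, (12.5, 3.0, 0.0), 41.2, 42.0)
```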

What are useful applications?

Acoustic digital twins can enrich models of a physical digital twin, giving people a fuller sensory experience while interacting with the simulated model. However, ADTs are particularly useful in contexts and applications where sound can help a digital system differentiate between critical states, and in situations where a machine needs to behave differently depending on the sonic landscape. For example:

  • Robot Adaptation. Robots communicate their intended actions through sonic cues, but the appropriate volume and character of those cues depends on what else is going on in the soundscape. A model of the sonic landscape can help robots communicate more clearly with the people near them; a minimal sketch of such an ambient-aware volume policy follows this list. Conversely, artificial agents might proactively leave sounds in the acoustic digital twin to let people know, for example, where their travel lanes are or where they are encountering obstacles.
  • Sound landmarks. We hear in 3D. Just as we have visual objects as landmarks, we can have virtual sonic landmarks that are fully 3D objects with variation and nuance. People traversing in a physical space might hear sonic events and landmarks to help guide their traversal — e.g., they might be cued that there is a waterfall before they can see or even hear the real thing. The sonic landmarks might also pertain to historical landmarks or past events which do not currently have a physical presence in the space. These virtual sound landmarks would be presented relative to a person’s goals or interests, and may also require some muting or masking of the current sounds in the actual space.
  • Sonified data layers. Following delivery instructions while riding a bicycle through the city requires the rider to constantly shift visual attention between traffic, the map, and the delivery app. ADTs enable an audio AR experience that removes these visual attention shifts by maintaining a constant dialogue with the rider about salient information, whether that’s the next turn or the truck coming up behind on the left. To a large extent, this is the kind of experience that mobile phones with earbuds already provide, but with contextual AI models and spatial audio we expect the sophistication of the instructions, how they are presented, and people’s ability to interact with them to increase greatly.
As an example, ADTs enable delivery robots to navigate the physical world using sonified data layers and sound landmarks | Photo by Bill Nino on Unsplash
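Returning to the robot-adaptation example above, one minimal and entirely illustrative policy is to measure the ambient level from a microphone buffer and play communication cues a fixed margin above it, capped so that a cue is never uncomfortably loud:

```python
import numpy as np

def cue_level_db(ambient_buffer, margin_db=10.0, ceiling_db=-3.0):
    """Choose a playback level for a robot's sonic cue: a fixed margin
    above the measured ambient RMS level, capped at a ceiling.
    Levels are in dBFS; the margin and ceiling are illustrative."""
    rms = np.sqrt(np.mean(np.square(ambient_buffer)))
    ambient_db = 20 * np.log10(max(rms, 1e-9))  # avoid log(0) on silence
    return min(ambient_db + margin_db, ceiling_db)

# Quiet street: a soft cue. Noisy avenue: louder, but never above -3 dBFS.
quiet = 0.01 * np.random.randn(16000)  # stand-in for one second of mic input
print(cue_level_db(quiet))
```

A real robot would also adapt a cue’s spectral content so it stays distinguishable from the background, which is exactly the kind of self-prediction an ADT supports.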

What does this mean for Audio AR?

With an acoustic digital twin in place, a truly audio-based augmented reality experience can be created. By mapping the virtual digital soundscape to real-world physical locations, ADTs can not only make the digital twin environment more complete but also enrich the virtual landscape with data that is interpretable in a non-focal manner. This in turn can support richer interactions, improved guidance and support, and greater potential for the digital twin to serve as a prototype for learning about and planning for our physical surroundings. ADTs can also help shape and improve soundscapes by guiding the design of physical spaces to create desired auditory experiences.

The Audio AR blog series has covered the basics of this emerging technology. In the near future, we will write more on Audio AR and its role in spatial computing. We will delve deeper into the future of Audio AR technology and how we are using it to shape novel auditory experiences in our research group at Accenture Labs.

Note on Accenture Labs

This series is part of our task guidance research at the Digital Experiences team at Accenture Labs and our collaboration with Wendy Ju’s lab at Cornell Tech. This work is also a continuation of our past efforts around guidance for home healthcare tasks (CARE). To learn more about the ongoing work at Labs broadly, check out our Accenture Labs Notebook Medium publication.
