We Need to Take Advantage of Auditory Perception — Part 1: Ear vs Eye

Shahid Karim Mallick
6 min read · Jun 17, 2016


One of the phrases I like to use a lot when talking about interface design is “engaging more of our sensory spectrum.” This multimodal approach (using multiple sensory modalities, or senses) is crucial if we want to interact with digital information the way we interact with objects in the real world. I believe natural, fluid interfaces will establish a better connection between mind and computer and allow us to better express and perceive our thoughts and ideas.


A big part of our sensory engagement is auditory perception.

Sound is a hugely important source of information — however, it is massively underutilized. According to Adam Somers, lead audio engineer at Jaunt VR, “Audio sometimes takes a back seat in content production.” We typically see audio as supplemental to visual information, and only give it attention when using it for dialogue or music. We talk about data and content visualization, but almost never about data and content sonification.
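To make "sonification" concrete: at its simplest, it means mapping data values onto sound parameters such as pitch, the auditory analogue of mapping values onto bar heights. Here is a minimal sketch in Python (my own illustration, not from the post; it assumes NumPy and the standard-library wave module, and the mapping choices are arbitrary):

```python
import wave
import numpy as np

def sonify(values, out_path="sonified.wav", sample_rate=44100,
           note_dur=0.25, f_lo=220.0, f_hi=880.0):
    """Map each data point to a pitch between f_lo and f_hi (Hz)
    and render the sequence as a mono 16-bit WAV file."""
    values = np.asarray(values, dtype=float)
    # Normalize the data to 0..1, then map linearly onto the pitch range.
    span = values.max() - values.min()
    norm = (values - values.min()) / span if span else np.zeros_like(values)
    freqs = f_lo + norm * (f_hi - f_lo)

    t = np.linspace(0, note_dur, int(sample_rate * note_dur), endpoint=False)
    tones = [np.sin(2 * np.pi * f * t) for f in freqs]
    signal = np.concatenate(tones)
    pcm = (signal * 0.5 * 32767).astype(np.int16)  # 16-bit PCM, half volume

    with wave.open(out_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(pcm.tobytes())

# Example: "listen" to a week of temperatures instead of plotting them.
sonify([18, 21, 25, 24, 19, 16, 22])
```

A rising or falling melody makes a trend audible at a glance (or rather, a listen), which is the basic promise of data sonification.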

Why is auditory engagement important? How does it compare to visual perception? How can we use it to better understand complex ideas and data sets? Why does this matter?

Well, in a nutshell: our brains are incredibly good at processing sound. We can hear frequencies from roughly 15 Hz to 20,000 Hz, and can differentiate over 1,300 distinct frequencies within that range. We can comfortably hear across a dynamic range of 120 dB. Auditory perception even has advantages over vision when it comes to timing, spatial range, and reaction speed.
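Those numbers are worth unpacking. Decibels are logarithmic, so a 120 dB comfortable listening range corresponds to a trillion-fold span in sound intensity. A quick back-of-the-envelope calculation (mine, not from the sources above) makes that concrete:

```python
# Decibels compare sound intensities on a log scale:
# dB = 10 * log10(I / I_ref), with I_ref the threshold of hearing.
db_range = 120
intensity_ratio = 10 ** (db_range / 10)
print(f"{db_range} dB spans a {intensity_ratio:.0e}x range in intensity")
# -> 120 dB spans a 1e+12x range in intensity (a trillion to one)

# Pressure amplitude scales as the square root of intensity:
pressure_ratio = 10 ** (db_range / 20)
print(f"...and a {pressure_ratio:.0e}x range in sound pressure")  # 1e+06x
```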

These features make audition a highly robust way to deliver information — one that we are just beginning to take advantage of.

Why does auditory processing matter?

I’m not going to delve too much into HOW our nervous system processes sound: here’s a good resource for that. Instead, I’m going to talk about WHY auditory processing is incredibly valuable, the features that make auditory perception effective, the advantages it has over vision, and how it could help us design better communication tools.

Attentional Selectivity

Our ears are remarkably good at filtering out noise and focusing on a particular soundstream. In crowded rooms, we’re able to block out surrounding conversations and pay attention to a single person — dubbed the “cocktail party effect.”


Test how well your ears can distinguish voices here (courtesy of UCSF).

Or the next time you're at a restaurant, try this experiment (it works better if you close your eyes): focus on the conversations around you, making your way from table to table. See how clearly you can discern each conversation and how easily you can switch to the next one. You may be surprised by how powerfully selective your ears really are.

At the same time, this selectivity is surprisingly flexible, allowing us to distribute our attention and listen to multiple soundstreams at once, aka parallel listening (Fitch & Kramer, 1994). Have you ever caught your name in a nearby conversation while talking to someone else? Try the experiment again, but this time see how many simultaneous conversations you can comprehend.

Temporal Resolution

Interestingly, our auditory system is actually better than our visual system at timing. We can distinguish between clicks that are only tens of microseconds apart (Krumbholz et al., 2003; Leshowitz, 1971). Our visual system, on the other hand, has much poorer temporal discrimination: the fastest flicker rate it can discern is at most 40–50 Hz, i.e. images that are about two hundredths of a second apart (Bruce, Green and Georgeson, 1996). For reference, mains electricity cycles 50–60 times per second, so a lightbulb may flicker faster than we can resolve; because we can't distinguish the individual flickers, we perceive a continuous stream of light, just as we perceive successive film frames as fluid motion.
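To appreciate those scales: even a 96 kHz recording spaces its samples about 10 microseconds apart, the same order as the click gaps those studies describe. A hypothetical test-signal generator (a sketch using NumPy; all parameter values are made up for illustration) might look like this:

```python
import numpy as np

SAMPLE_RATE = 96_000  # Hz; one sample every ~10.4 microseconds

def click_pair(gap_us, click_len=4, pad_ms=50.0):
    """Two brief clicks separated by gap_us microseconds,
    padded with silence, for use in a gap-detection test."""
    gap_samples = round(gap_us * 1e-6 * SAMPLE_RATE)
    pad = np.zeros(int(pad_ms * 1e-3 * SAMPLE_RATE))
    click = np.ones(click_len)  # a few samples at full amplitude
    return np.concatenate([pad, click, np.zeros(gap_samples), click, pad])

# A 20 us gap is under two samples even at 96 kHz -- near the limit of
# what standard audio hardware can represent, yet still within the
# auditory system's reported discrimination range.
signal = click_pair(gap_us=20)
print(f"gap = {round(20e-6 * SAMPLE_RATE)} samples")  # -> 2 samples
```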

Spatial Resolution and Range

A simple explanation of how our ears localize sound (link)

On the other hand, our visual system has significantly better spatial discrimination, i.e. visual acuity. We can distinguish points that are only 1/30th of a degree apart in our visual field, as opposed to the far coarser spatial resolution of our ears, which is approximately 1 degree.

Nevertheless, our sense of hearing has a distinct spatial advantage over our sense of sight: it is completely spherical. We can localize sounds originating from anywhere around us, provided they are in the right frequency/intensity range. In contrast, we can only see what is in front of us, and can only resolve fine detail in what falls on our fovea (the center of the visual field). So although vision has higher spatial resolution, audition has a much wider spatial range.
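These two abilities are connected: localizing a sound in azimuth rests largely on interaural time differences (ITDs), which the fine temporal resolution described above makes readable. A rough calculation using Woodworth's classic spherical-head approximation (the head radius and angles below are illustrative assumptions, not figures from the original post) shows why a resolution of about 1 degree is plausible:

```python
import math

HEAD_RADIUS = 0.0875    # meters, a typical adult value
SPEED_OF_SOUND = 343.0  # m/s in air

def itd_us(azimuth_deg):
    """Interaural time difference (microseconds) for a source at the
    given azimuth, using Woodworth's spherical-head approximation."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta)) * 1e6

# Moving a source from straight ahead to 1 degree off-center shifts the
# arrival-time difference between the ears by only ~9 microseconds,
# comfortably within the auditory system's temporal acuity.
print(f"ITD at 1 degree:   {itd_us(1):.1f} us")
print(f"ITD at 90 degrees: {itd_us(90):.0f} us")  # ~655 us, the maximum
```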

Reaction/Response Time

Usain Bolt actually has one of the slowest reaction times at the Olympic level (link)

Multiple studies have shown that the auditory system also reacts faster than the visual system (though tactile responses are faster still). Studies have also found that men reacted faster than women, athletes faster than non-athletes, and, interestingly, computer-savvy people faster than those without similar experience.

Try for yourself!

Visual: http://cognitivefun.net/test/1

Auditory: http://cognitivefun.net/test/16

The difference is mainly due to signal transduction (the conversion of sensory stimuli into electrical signals): hair cells in the cochlea transduce sound mechanically within about a millisecond, while phototransduction in the retina relies on a slower chemical cascade. Sound travels much slower than light, but as long as the stimulus isn't coming from too far away (more than ~27 m), our auditory system will react and generate a response faster than our visual system can.
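The ~27 m figure falls out of simple arithmetic: sound's slower travel time gradually eats up audition's processing head start. A quick back-of-the-envelope check (a sketch; the latency gap below is inferred from the article's own figure, not independently measured):

```python
SPEED_OF_SOUND = 343.0  # m/s in air at room temperature

# If audition's transduction/processing head start over vision is dt
# seconds, sound's slower travel cancels it beyond d = v * dt meters.
def break_even_distance(head_start_s):
    return SPEED_OF_SOUND * head_start_s

# Working backward from the ~27 m figure implies a head start of
# roughly 80 ms (27 / 343 = 0.079 s).
print(f"{27 / SPEED_OF_SOUND * 1000:.0f} ms head start implied")  # ~79 ms
print(f"40 ms head start -> {break_even_distance(0.040):.1f} m")  # ~13.7 m
```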

Why Not Both?

Studies on audio-visual interactions show that people react to combined stimuli faster than to either an auditory or a visual stimulus alone. This improvement is especially pronounced when the two signals are spatially co-located.
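One common account of this speedup is statistical facilitation, often called the race model: if two independent sensory channels race to trigger the same response, the winner's time is on average faster than either channel alone. A toy simulation (the distribution parameters below are invented purely for illustration) demonstrates the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Hypothetical reaction-time distributions (ms); the means loosely
# reflect audition being faster than vision, the spreads are made up.
auditory = rng.normal(loc=160, scale=30, size=N)
visual = rng.normal(loc=190, scale=30, size=N)

# Redundant audio-visual trial: whichever channel finishes first wins.
combined = np.minimum(auditory, visual)

print(f"auditory alone: {auditory.mean():.0f} ms")
print(f"visual alone:   {visual.mean():.0f} ms")
print(f"combined:       {combined.mean():.0f} ms  (faster than either)")
```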

Spanish footballer David Silva performing a reaction test (link)

This synergistic effect also shows up in sensory memory. Researchers have long asked how well we remember information arriving through different sensory modalities. For instance, are we better at recalling visual or auditory information? What about information that has both visual and auditory components, and does it matter whether those components are redundant or unique? Do the effects differ with the type of information being perceived? Are they different for short-term vs. long-term memory?

We are still searching for definitive answers to these questions, but a clear trend has been observed: redundancy is good. The more sensory modalities that are used to encode information, the better that information is detected and remembered.

Motion represents yet another way that our brains benefit from audio-visual interactions. Motion is concurrently processed by visual and auditory streams, even using shared neural real estate, implying that both visual and auditory systems need to be engaged to fully conceptualize motion.

There are many more examples of the advantages of audio-visual engagement in the brain, such as in attention and object recognition. All these studies underscore the concrete benefits of multimodal interfaces. By creating computing interfaces that are as sonically beautiful, complex, and appealing as they are visually, we can deliver information far more efficiently.

Read Part 2 of this post here (HOW to take advantage of our auditory perception)

Research for this post came mostly from the book Helmet-Mounted Displays: Sensation, Perception, and Cognition Issues, fully available here, and the articles and papers cited/linked throughout.


Shahid Karim Mallick

I build natural interfaces (see my latest work at smallick.com). Studied neuroscience @BrownUniversity. Product @NeoSensory, previously @CalaHealth