Locating Moving Objects Using Stereo Sound Instead of Visual Input

Nov 4, 2019 · 3 min read

Object localization involves predicting the location of a moving object within a scene. Not surprisingly, researchers have tended to rely on visual data as input, which, together with some understanding of physics, generally enables a machine to perform the task. This camera-based approach, however, can be compromised by low light, fog, occlusions and similar conditions.

In a bid to improve object localization in such less-than-ideal circumstances, an MIT and IBM research group has proposed a cross-modal auditory localization framework that can effectively locate objects using stereo sound.

Although vision is humans’ go-to sense for understanding environments, we instinctively draw on additional senses when vision is insufficient. Auditory cues can play a huge role, for example, in localizing an approaching ambulance on a busy street or a meowing cat in a dark room. Sound localization and cross-modal learning are research directions that aim to augment machines’ abilities in this regard.

Sound localization uses microphone arrays and beam-forming to analyze the delays with which a sound reaches differently positioned microphones, and from those delays estimates the location of the object emitting the sound. Because audio-visual data offers rich material for transferring knowledge between modalities, cross-modal learning is also a growing research area.
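As a rough illustration of the delay analysis described above, the sketch below estimates the time difference of arrival between two microphones using plain cross-correlation and converts it to a bearing. It is a deliberately simplified two-microphone, far-field stand-in rather than a full beam-forming pipeline. The function names, the 343 m/s speed of sound and the microphone spacing parameter are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def estimate_delay_samples(left, right):
    """Return the lag, in samples, by which `right` trails `left`.

    A positive value means the sound reached the left microphone first.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    # Full cross-correlation: entry k holds sum_n right[n + k] * left[n].
    corr = np.correlate(right, left, mode="full")
    # Lags run from -(len(left) - 1) up to len(right) - 1.
    lags = np.arange(-(len(left) - 1), len(right))
    return int(lags[np.argmax(corr)])


def delay_to_angle(delay_samples, sample_rate, mic_spacing_m, speed_of_sound=343.0):
    """Convert a delay to a bearing in degrees (far-field assumption).

    0 degrees is straight ahead; positive angles point toward the left microphone.
    """
    tau = delay_samples / sample_rate  # delay in seconds
    # sin(theta) = c * tau / d, clipped to guard against noisy estimates.
    sin_theta = np.clip(speed_of_sound * tau / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```

Calling `estimate_delay_samples` on the two channels of a stereo recording and passing the result to `delay_to_angle` gives a coarse left-right bearing. Real systems typically use more microphones and more robust correlation weighting.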

The MIT and IBM paper proposes a framework comprising a “teacher” vision network and “student” stereo sound network. The student network attempts to mimic teacher network outputs by transferring object detection knowledge across modalities during training. The vision network detects an object in a video and marks it with a bounding box, then the stereo sound network learns to map audio signals to the predicted bounding box coordinates. In the final inference mode, the student network directly predicts an object’s location using sound, without any visual inputs.
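The teacher-student idea can be sketched with a toy example. Here the "teacher" boxes are simply given as arrays, standing in for the bounding boxes a vision detector would produce, and the "student" is a single linear layer trained by gradient descent on a mean squared error loss. The linear model, the loss, the feature shapes and all function names are assumptions made for illustration; the paper's actual networks and training objective are not reproduced here.

```python
import numpy as np


def train_student(audio_feats, teacher_boxes, lr=0.1, steps=500):
    """Fit a linear student so that audio_feats @ weights mimics teacher_boxes.

    audio_feats:   (n_clips, n_features) array of audio features (hypothetical).
    teacher_boxes: (n_clips, 4) array of boxes predicted by the vision teacher.
    """
    n_clips, n_features = audio_feats.shape
    weights = np.zeros((n_features, teacher_boxes.shape[1]))
    for _ in range(steps):
        pred = audio_feats @ weights
        # Gradient of the mean squared error between student and teacher boxes.
        grad = 2.0 * audio_feats.T @ (pred - teacher_boxes) / n_clips
        weights -= lr * grad
    return weights


def predict_boxes(audio_feats, weights):
    """Inference: predict boxes from audio alone, with no visual input."""
    return audio_feats @ weights
```

The point of the sketch is the data flow: no human labels appear anywhere, since the only supervision the student receives is the teacher's output on the paired video frames.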

Figure: Network structure and training and testing procedure for cross-modal auditory localization
Figure: Average Precision (AP) and Center Distances (CD) results for cross-modal auditory localization
Figure: Cross-modal auditory localization improves object localization under poor lighting conditions

The researchers’ proposed algorithm outperformed all audio-only baselines in experiments on more than 3,000 video clips. Particularly under poor lighting conditions such as nighttime scenes, where traditional visual tracking systems struggle, cross-modal auditory localization appears to have the potential to contribute significantly to object localization techniques and visual tracking systems.

The paper Self-supervised Moving Vehicle Tracking with Stereo Sound is on arXiv.

Author: Hecate He | Editor: Michael Sarazen


