Approaching Sound Event Detection as a Multiple Instance Learning Problem

CDS’s Brian McFee, Moore-Sloan Data Science Fellow, and Juan P. Bello, Associate Professor of Music and Music Education, develop more efficient SED methods with support from the Moore-Sloan Data Science Environment at NYU

Sound event detection (SED) is the task of labeling audio segments according to the presence of specific sounds. For example, a SED system could be tasked with labeling YouTube clips for the presence of car horns honking. SED can be a dynamic or static task; dynamic SED identifies sounds at precise moments in time within a recording, while static SED refers to the binary task of determining whether or not a specific sound is present within a clip.

While researchers typically approach SED as a supervised machine learning problem, this requires detailed annotations for the presence or absence of each sound at specific time instances — a laborious, expensive task. To reduce the human labor involved with SED and improve scalability, CDS’s Brian McFee and Juan P. Bello, along with senior researcher Justin Salamon, decided to treat SED as a multiple instance learning (MIL) problem. For this approach, training labels are static: they indicate the presence of sounds within short segments but not precise timestamps.
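To make the difference between the two kinds of label concrete, the sketch below collapses dynamic (frame-by-frame) annotations into the static, clip-level labels that an MIL model is trained on. The function name `to_static_labels`, the array layout, and the two example sound classes are illustrative assumptions, not taken from the researchers' code.

```python
import numpy as np


def to_static_labels(frame_labels):
    """Collapse frame-level labels of shape (n_frames, n_classes)
    into clip-level labels of shape (n_classes,).

    A class is marked present in the clip if it is active in any frame.
    """
    frame_labels = np.asarray(frame_labels, dtype=bool)
    return frame_labels.any(axis=0)


# Hypothetical clip with 5 time frames and 2 classes (car horn, siren).
# The car horn is active in frames 1 and 2; the siren never occurs.
strong = np.array([[0, 0],
                   [1, 0],
                   [1, 0],
                   [0, 0],
                   [0, 0]])

weak = to_static_labels(strong)  # timing information is discarded
```

The static label records only that a car horn occurs somewhere in the clip, which is far cheaper for an annotator to provide than start and end times.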

The model, however, must still produce dynamic results with precise timestamps. To achieve this, the researchers built a family of adaptive pooling operators, which they call auto-pool, that aggregate the model's moment-by-moment predictions into a single clip-level prediction that can be compared with the static training labels. Because the pooling behavior adapts to the characteristics of each sound, the model can learn from static labels and still make dynamic predictions.
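One way to write an adaptive pooling operator of this kind is as a softmax-weighted average over time, with a sharpness parameter per sound class. The sketch below assumes that form; the article itself does not give the formula, and the name `auto_pool` and its arguments are chosen here for illustration. In a trained model the parameter `alpha` would be learned along with the rest of the network, whereas here it is simply passed in.

```python
import numpy as np


def auto_pool(frame_probs, alpha):
    """Softmax-weighted pooling of frame-level probabilities over time.

    frame_probs: array of shape (n_frames, n_classes), values in [0, 1]
    alpha: scalar or array of shape (n_classes,), the pooling sharpness
           (assumed to be a learned parameter in a real model)

    alpha = 0 gives the mean over frames; large alpha approaches the max.
    """
    p = np.asarray(frame_probs, dtype=float)
    scaled = alpha * p
    # Subtract the per-class maximum before exponentiating, for stability.
    scaled = scaled - scaled.max(axis=0, keepdims=True)
    weights = np.exp(scaled)
    weights = weights / weights.sum(axis=0, keepdims=True)
    # Weighted average of the frame probabilities for each class.
    return (weights * p).sum(axis=0)


# Hypothetical frame-level predictions: 3 frames, 2 classes.
p = np.array([[0.1, 0.9],
              [0.8, 0.2],
              [0.3, 0.4]])

mean_like = auto_pool(p, 0.0)      # behaves like mean pooling
max_like = auto_pool(p, 1000.0)    # behaves like max pooling
per_class = auto_pool(p, np.array([1.0, 5.0]))  # a different alpha per class
```

Under this assumption, a single parameter lets each sound class settle somewhere between averaging over the whole clip (suited to long, sustained sounds) and picking out the single strongest frame (suited to brief events).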

They trained and evaluated their auto-pool SED method with three types of sound: urban soundscapes, traffic noises, and musical instruments. Their datasets included URBAN-SED for urban soundscapes (10,000 ten-second synthetic soundscapes), DCASE 2017 for traffic noises (50,000 ten-second YouTube clips from a challenge at DCASE 2017), and MedleyDB for musical instruments (531 multitrack recordings).

The researchers evaluated the performance of their auto-pool method by comparing it to three non-adaptive methods and two other adaptive methods: a constrained auto-pool method and a regularized one. All models were trained on “mini-batches” of 16 ten-second patches for each dataset. For the urban soundscape and musical instrument datasets, they were also able to compare the performance of auto-pool to a model trained with strongly labeled sound events, but strong labels were unavailable for the traffic dataset.

Based on the comparative evaluation, the unregularized, unconstrained auto-pool method consistently outperformed other methods for the static SED prediction task and performed as well as a model trained with strongly labeled data. For the dynamic task, however, the auto-pool method did not match the performance of alternative methods across the three datasets. Instead, the regularized auto-pool method was among the best performing across all datasets due to its ability to adapt to different sound characteristics with mean-like behavior.

The researchers note that the most important outcome of their study, regardless of any one model’s performance, is that sound event detection can be approached as a multiple instance learning problem. As SED research advances, the MIL approach will continue to produce methods that reduce the human effort required for SED tasks.

by Paul Oliver