How We Achieved Robust Sound Event Detection in Data-Scarce Scenarios

Sagy Harpaz
Gong Tech Blog
5 min read · Oct 4, 2023

Introduction

Sound Event Detection (SED) is crucial in speech recognition, environmental monitoring, and surveillance systems, among other applications. Accurately identifying and classifying different sounds in real-world environments lets these systems provide valuable insights and improve decision-making. For instance, in speech recognition, SED can help identify background noise or music, while in environmental monitoring it can detect specific events such as wildlife activity or industrial machinery malfunctions. At Gong, we rely on SED to capture speech events in sales conversations and give sales teams AI-driven insights and learnings they can put into action for future deals.

However, detecting a variety of sound events remains a real challenge in data-scarce scenarios where labeled data is limited. To address this, we at Gong have developed a novel deep learning-based SED system that leverages self-supervised contrastive learning and offers significantly improved accuracy, efficiency, and robustness.

Speech process pipeline

Overcoming Limitations

Traditional SED systems rely heavily on labeled data, which is scarce and expensive to obtain. Supervised SED systems work by training a model on a dataset containing audio samples with corresponding labels, such as the type of sound event or its location. The model learns to recognize patterns in the audio data that correspond to specific sound events. However, obtaining a large amount of labeled data for various sound events is challenging, as it requires manual annotation by experts, which is both labor-intensive and time-consuming.

Self-supervised contrastive learning is a machine-learning technique that allows models to learn useful representations from data without relying on explicit labels. Instead, the model learns by contrasting different views of the same data, such as different augmentations of an audio sample. This approach enables the model to learn meaningful features from the data, which can then be used for downstream tasks, such as sound event classification.
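To make this concrete, a common instantiation of the idea is the NT-Xent loss popularized by SimCLR. The sketch below (PyTorch, not our production code) pulls together the embeddings of two augmented views of the same samples while pushing apart all other pairs in the batch; the temperature value and batch layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """SimCLR-style contrastive loss between two views of the same batch.

    z_a, z_b: (batch, dim) embeddings of two augmentations of the same samples.
    """
    batch = z_a.size(0)
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)   # (2B, dim), unit norm
    sim = z @ z.t() / temperature                           # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                       # a sample is never its own negative
    # For row i, the positive is the other view of the same sample.
    positives = torch.cat([torch.arange(batch, 2 * batch),
                           torch.arange(0, batch)]).to(z.device)
    return F.cross_entropy(sim, positives)
```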

To overcome the limitations of traditional SED systems, we built a system that operates at the frame level, allowing for more precise event detection. Through self-supervised contrastive learning, inspired by SimCLR (Chen et al., 2020), it learns sound event representations by contrasting augmented views of sound events. This unsupervised pre-training helps mitigate the impact of data scarcity and enhances robustness against noisy labels.
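This post doesn't detail the exact augmentations we use, so the snippet below is only a hedged illustration of the recipe: create two randomly augmented views of the same audio, embed both, and contrast them with a loss like the one sketched above. The specific augmentations shown (additive noise and a random gain) are assumptions for illustration.

```python
import torch

def two_views(wave: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Create two stochastically augmented views of the same waveform.

    wave: (batch, samples) mono audio. The augmentations here (light additive
    noise, random per-clip gain) are illustrative choices, not Gong's exact ones.
    """
    def augment(x: torch.Tensor) -> torch.Tensor:
        noise = 0.005 * torch.randn_like(x)                                  # light additive noise
        gain = torch.empty(x.size(0), 1, device=x.device).uniform_(0.8, 1.2) # random per-clip gain
        return gain * x + noise

    return augment(wave), augment(wave)

# The views are then encoded and contrasted, e.g.:
# z_a, z_b = encoder(view_a), encoder(view_b)
# loss = nt_xent_loss(z_a, z_b)
```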

Embedding block

Our SED system consists of two main stages: audio representation learning and supervised sound event classification. In the audio representation learning stage, we employ Mel spectrogram analysis and an encoder with Depthwise Separable Conv2D layers, LSTM layers, and a fully connected layer.
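For readers who think in code, here is a deliberately simplified PyTorch sketch of such an encoder; the number of blocks, channel counts, Mel-bin count, and embedding size are illustrative assumptions, not Gong's production configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv over each channel, then a 1x1 pointwise conv to mix channels."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class FrameEncoder(nn.Module):
    """Mel-spectrogram frames -> per-frame embeddings (all sizes are illustrative)."""
    def __init__(self, n_mels: int = 64, hidden: int = 128, embed_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            DepthwiseSeparableConv(1, 32), nn.ReLU(),
            DepthwiseSeparableConv(32, 64), nn.ReLU(),
            nn.AvgPool2d(kernel_size=(2, 1)),            # pool over frequency only
        )
        self.lstm = nn.LSTM(input_size=64 * (n_mels // 2), hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, embed_dim)

    def forward(self, mel):                              # mel: (batch, 1, n_mels, frames)
        h = self.conv(mel)                               # (batch, 64, n_mels//2, frames)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (batch, frames, features)
        h, _ = self.lstm(h)
        return self.fc(h)                                # (batch, frames, embed_dim)
```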

Mel spectrogram analysis

Mel spectrogram analysis is a widely used technique for extracting features from audio signals. It represents the power spectrum of an audio signal on a Mel scale, which is a perceptually motivated frequency scale that better reflects human auditory perception. This analysis allows our system to capture relevant information about the sound events while reducing the dimensionality of the input data.
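With torchaudio, for example, a log-Mel spectrogram can be computed in a few lines; the sample rate, window, hop, and Mel-bin parameters below are typical values chosen for illustration, not necessarily the ones used in production.

```python
import torch
import torchaudio

# Illustrative parameters: 16 kHz audio, 25 ms windows, 10 ms hop, 64 Mel bins.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=400, hop_length=160, n_mels=64)

waveform = torch.randn(1, 16_000)     # one second of (dummy) mono audio
mel = mel_transform(waveform)         # (1, 64, frames)
log_mel = torch.log(mel + 1e-6)       # log compression, closer to perceived loudness
```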

Depthwise Separable Conv2D layers

Depthwise Separable Conv2D layers are a type of convolutional layer that reduces the number of parameters and computational complexity compared to standard convolutional layers. This is achieved by separating the spatial and channel-wise convolutions, resulting in a more efficient and lightweight model architecture. In simpler terms, these layers perform convolution operations on individual channels of the input data, followed by a pointwise convolution that combines the outputs from the previous step. This approach reduces the computational cost while maintaining the ability to capture spatial and channel-wise information.
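The savings are easy to verify. The small check below compares parameter counts for a standard 3x3 convolution and its depthwise separable equivalent (the channel counts are arbitrary examples):

```python
import torch.nn as nn

in_ch, out_ch, k = 64, 128, 3

standard = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1)
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=k, padding=1, groups=in_ch),  # depthwise
    nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # pointwise
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))   # 73856 vs 8960 parameters
```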

SED Architecture

Semi-supervised learning combines the benefits of both supervised and unsupervised learning, allowing models to leverage weakly labeled and unlabeled data for supervised learning tasks in data-scarce scenarios. This approach can significantly improve the model’s performance and generalization ability, as it can learn from a larger amount of data without requiring extensive manual annotation.

In semi-supervised learning, the model is first trained on a small amount of labeled data and then fine-tuned using a combination of labeled and unlabeled data. The idea is that the model can learn useful features from the unlabeled data, which can then be used to improve its performance on the labeled data. In our system, an online teacher generates reliable pseudo-labels for weakly labeled and unlabeled data, significantly improving the system’s performance and generalization ability.
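This post doesn't spell out the teacher's internals, but a common pattern for an online teacher is an exponential-moving-average (mean-teacher) copy of the student whose confident predictions on unlabeled frames become pseudo-labels. The sketch below illustrates that general pattern; the EMA decay and confidence threshold are arbitrary choices, not Gong's settings.

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, ema_decay: float = 0.999):
    """Keep the teacher as an exponential moving average of the student weights."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(ema_decay).add_(s_param, alpha=1.0 - ema_decay)

@torch.no_grad()
def pseudo_labels(teacher, unlabeled_mel, threshold: float = 0.9):
    """Per-frame pseudo-labels from the teacher, keeping only confident frames."""
    probs = torch.sigmoid(teacher(unlabeled_mel))             # (batch, frames, classes)
    confident = (probs > threshold) | (probs < 1.0 - threshold)
    targets = (probs > 0.5).float()
    return targets, confident                                 # mask selects trusted frames
```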

Results and Conclusion

The new SED system surpasses the previous system in multiple aspects:

- 11% relative improvement in speech event detection F1 score

- 47% relative improvement in music event detection F1 score

- A sevenfold improvement in runtimes

- A 2.1% relative reduction in Word Error Rate (WER) in the Automatic Speech Recognition (ASR) task

SED performances comparison
SED Execution times comparison

At Gong, SED is the first stage in the speech recognition and transcription pipeline. It is responsible for speech event segmentation, so that non-speech segments are not processed further. During this stage, Gong also identifies other events such as music, Interactive Voice Response (IVR), and "garbage" (background noise) to give customers insights and statistics about the call itself, such as how much speech or music (hold/waiting music) it contained. It is especially important to catch IVR events and avoid processing them, because the speech in those segments is synthetic and unnatural.
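As a small illustration of how such call-level statistics can be derived from frame-level decisions (the label set and frame hop below are assumptions), counting frames per class is enough:

```python
from collections import Counter

FRAME_SECONDS = 0.01                             # assumed 10 ms hop between frame decisions
LABELS = ["speech", "music", "ivr", "garbage"]   # illustrative label set

def call_statistics(frame_labels: list[str]) -> dict[str, float]:
    """Seconds spent in each event class over a call, from per-frame labels."""
    counts = Counter(frame_labels)
    return {label: counts[label] * FRAME_SECONDS for label in LABELS}

print(call_statistics(["speech"] * 500 + ["music"] * 200 + ["ivr"] * 100))
# {'speech': 5.0, 'music': 2.0, 'ivr': 1.0, 'garbage': 0.0}
```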

Our work on harnessing self-supervised contrastive learning for robust sound event detection in data-scarce scenarios demonstrates the potential of combining deep-learning techniques with innovative methodologies to overcome the challenges faced by traditional SED systems. By using self-supervised learning, semi-supervised learning, and efficient model architectures, we have developed a system that significantly improves accuracy, efficiency, and robustness in various sound event detection tasks. This not only helps to advance SED technology but also serves as a valuable resource for software engineers and researchers looking to enhance their understanding of machine-learning concepts and apply them in real-world applications.

At Gong, this has led to an improvement in both ASR and speaker diarization performance, as well as a significant improvement in execution time. We’re excited about the additional research opportunities these findings may reveal and how they can ultimately lead to more advanced and reliable systems for speech recognition, environmental monitoring, and surveillance applications, and particularly better results for our customers.

If you have questions or are interested in learning more about how we have implemented this at Gong, please feel free to reach out to Sagy Harpaz, Senior Speech Researcher, at sagy.harpaz@gong.io.
