Enhanced Environmental Sound Classification with a CNN

End-to-End Environmental Sound Classification using a 1D CNN

Christopher Dossman
May 3, 2019 · 3 min read

This research summary is just one of many that are distributed weekly on the AI scholar newsletter. To start receiving the weekly newsletter, sign up here.

Convolutional Neural Networks (CNNs) are widely used for image recognition and classification. More recently, they have also had a significant impact on environmental sound classification, which is critical in applications such as crime detection, IoT, and environmental context-aware processing.

Typical approaches to environmental sound classification, however, rely on handcrafted features or learn from mid-level representations such as spectro-temporal features. They first convert the audio signal into a 2D representation (a spectrogram) and then apply 2D CNN architectures originally designed for object recognition, such as AlexNet and VGG.
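
To make the "audio to 2D representation" step concrete, here is a minimal sketch of a magnitude spectrogram computed with a naive short-time Fourier transform in plain Python. The frame size, hop length, Hann window, and the toy sine wave are all assumptions chosen for illustration; they are not values from the paper, and a real pipeline would use an optimized FFT library.

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrogram(signal, frame_size=8, hop=4):
    """Return a list of frames, each holding frame_size // 2 + 1 magnitudes.

    frame_size and hop are illustrative assumptions, not values from the paper.
    """
    window = hann(frame_size)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        # Taper the frame to reduce spectral leakage.
        chunk = [signal[start + i] * window[i] for i in range(frame_size)]
        # Naive DFT over the non-negative frequency bins.
        bins = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            bins.append(abs(acc))
        frames.append(bins)
    return frames


# Toy input: a sine wave completing 2 cycles per 8-sample frame.
tone = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = magnitude_spectrogram(tone)
print(len(spec), "frames x", len(spec[0]), "frequency bins")
```

Each row of `spec` is one time step and each column one frequency bin, which is the image-like grid a 2D CNN would then consume.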

A VGG-style 2D CNN has achieved good results on environmental sound classification. The challenge with 2D CNNs here is that their modeling capacity depends on the availability of massive training datasets to learn kernel parameters without overfitting. What's more, labeled environmental sound data is limited.

New End-to-End Environmental Sound Classification Model

Classification accuracy of the proposed 1D CNN as well as the results obtained by other state-of-the-art approaches

When evaluated on a dataset of 8732 audio samples, the new approach learned several relevant filter representations that allow it to outperform existing state-of-the-art methods based on 2D representations and 2D CNNs.
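
For readers who have not seen a 1D CNN before, the sketch below shows roughly what one 1D convolutional stage does to a raw waveform: slide a short filter along the samples, apply a nonlinearity, then pool. The filter values, the ReLU, and the pooling size are hypothetical choices for illustration and are not the architecture from the paper; in a trained network the filter weights would be learned from data.

```python
def conv1d(signal, kernel, stride=1):
    """Valid 1D cross-correlation, the operation used in CNN layers."""
    out = []
    for start in range(0, len(signal) - len(kernel) + 1, stride):
        out.append(sum(signal[start + i] * kernel[i] for i in range(len(kernel))))
    return out


def relu(xs):
    """Zero out negative activations."""
    return [max(0.0, x) for x in xs]


def max_pool1d(xs, size=2):
    """Non-overlapping max pooling; a trailing partial group is dropped."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]


# Toy waveform and a hand-picked filter (assumption: a real CNN learns these).
waveform = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
edge_filter = [1.0, 0.0, -1.0]

features = max_pool1d(relu(conv1d(waveform, edge_filter)))
print(features)
```

Stacking several such stages and ending with a classifier is the general shape of a raw-waveform model; no spectrogram is computed at any point.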

Additionally, the model has fewer parameters than most other CNN architectures for environmental sound classification and achieves a mean accuracy between 11% and 27% higher than conventional 2D architectures.

Potential Uses and Effects

For starters, it is much more efficient and requires far less training data than conventional 2D CNNs, which have millions of trainable parameters. It also achieves state-of-the-art performance and can handle audio signals of any length by using a sliding window. Finally, its compact architecture greatly reduces computation costs.
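
The sliding-window idea can be sketched in a few lines: cut the clip into fixed-length, overlapping windows, score each window, and combine the scores into one prediction for the whole clip. Everything specific here is an assumption for illustration: the window size and hop, the zero-padding of the last window, averaging as the combination rule, and `toy_classifier`, which is a hypothetical stand-in for a trained network.

```python
def sliding_windows(signal, window_size, hop):
    """Split a list of samples into fixed-length windows, zero-padding the last."""
    if len(signal) <= window_size:
        return [signal + [0.0] * (window_size - len(signal))]
    windows = []
    for start in range(0, len(signal), hop):
        chunk = signal[start:start + window_size]
        if len(chunk) < window_size:
            chunk = chunk + [0.0] * (window_size - len(chunk))
        windows.append(chunk)
        # Stop once a window has reached the end of the signal.
        if start + window_size >= len(signal):
            break
    return windows


def classify_clip(signal, window_classifier, window_size, hop):
    """Average per-window class scores; return (best class index, mean scores)."""
    scores = [window_classifier(w) for w in sliding_windows(signal, window_size, hop)]
    n_classes = len(scores[0])
    mean = [sum(s[c] for s in scores) / len(scores) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__), mean


def toy_classifier(window):
    """Hypothetical stand-in for a trained model: scores [quiet, loud] by energy."""
    energy = sum(x * x for x in window) / len(window)
    loud = min(1.0, energy)
    return [1.0 - loud, loud]


# A 25-sample clip: silence followed by a loud segment.
clip = [0.0] * 10 + [1.0] * 15
label, mean_scores = classify_clip(clip, toy_classifier, window_size=8, hop=4)
print(label, mean_scores)
```

Because the network only ever sees windows of one fixed size, the clip itself can be any length; only the number of windows changes.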

Read more:

Thanks for reading. Please comment, share and remember to subscribe to our weekly newsletter for the most recent and interesting research papers! You can also follow me on Twitter and LinkedIn. Remember to 👏 if you enjoyed this article. Cheers!

AI³ | Theory, Practice, Business

The AI revolution is here!
