Enhanced Environmental Sound Classification with a CNN

End-to-End Environmental Sound Classification using a 1D CNN

Christopher Dossman
AI³ | Theory, Practice, Business
May 3, 2019



Convolutional neural networks (CNNs) are widely used for image recognition and classification. More recently, they have also had a significant impact on environmental sound classification, which is critical in applications such as crime detection, IoT, and environmental context-aware processing.

Typical approaches to environmental sound classification, however, rely on handcrafted features or learn from mid-level representations such as spectro-temporal features. They first convert the audio signal into a 2D representation (a spectrogram) and then apply 2D CNN architectures originally designed for object recognition, such as AlexNet and VGG.
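To make the "audio to 2D representation" step concrete, here is a minimal sketch of a magnitude spectrogram in plain Python. The function name `magnitude_spectrogram`, the frame size, and the hop length are assumptions chosen for illustration. A real pipeline would typically add a window function, a mel filterbank, and log scaling, and would use an FFT library rather than this naive DFT.

```python
import cmath
import math


def magnitude_spectrogram(signal, frame_size=8, hop=4):
    """Return a 2D list with one row of DFT magnitudes per frame."""
    rows = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        row = []
        # Keep only the non-negative frequency bins (0 .. frame_size // 2).
        for k in range(frame_size // 2 + 1):
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            row.append(abs(coeff))
        rows.append(row)
    return rows


# Illustrative input: a pure tone that completes 2 cycles per 8-sample frame.
tone = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = magnitude_spectrogram(tone)

# Each row is one time step and each column is one frequency bin,
# so the result can be treated like a single-channel image by a 2D CNN.
print(len(spec), "frames x", len(spec[0]), "frequency bins")
```

For this tone, the energy in every frame concentrates in frequency bin 2, which is the kind of time-frequency pattern a 2D CNN would then be trained on.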

VGG-style 2D CNNs have achieved good results on environmental sound classification. The challenge with 2D CNNs in this setting, however, is that their modeling capacity depends on the availability of massive training datasets to learn kernel parameters without overfitting. What's more, labeled environmental sound data is limited.

New End-to-End Environmental Sound Classification Model

A group of Canadian researchers recently proposed an end-to-end 1D CNN for environmental sound classification. The model comprises three to five convolutional layers, depending on the length of the audio signal. Instead of relying on 2D representations as many conventional approaches do, the proposed 1D CNN learns its filters directly from the audio waveform.
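As a rough illustration of what "learning filters directly from the waveform" operates on, the sketch below runs one 1D convolution, a ReLU, and a max-pooling step over raw samples in plain Python. It is not the paper's architecture: the helper names, the toy waveform, and the hand-written filter are all assumptions, and in a trained network the filter weights would be learned rather than fixed.

```python
def conv1d(signal, kernel, stride=1):
    """Valid (no padding) 1D cross-correlation, as used in CNN layers."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def relu(values):
    """Zero out negative activations."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Non-overlapping max pooling over windows of `size` samples."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, size)
    ]


# Toy waveform (raw samples); no spectrogram is computed at any point.
waveform = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0, 1.0]

# Hypothetical filter that responds to rising slopes; a trained
# 1D CNN would learn such weights from data.
rising_slope_filter = [-1.0, 0.0, 1.0]

feature_map = max_pool1d(relu(conv1d(waveform, rising_slope_filter)), size=2)
print(feature_map)
```

Stacking a few such layers, each with many filters, and finishing with a classifier is the general shape of an end-to-end 1D CNN.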

Classification accuracy of the proposed 1D CNN as well as the results obtained by other state-of-the-art approaches

Evaluated on a dataset of 8,732 audio samples, the new approach learned several relevant filter representations, which allowed it to outperform existing state-of-the-art methods based on 2D representations and 2D CNNs.

Additionally, the model has fewer parameters than most other CNN architectures for environmental sound classification and achieves a mean accuracy 11% to 27% higher than conventional 2D architectures.
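One way to get a feel for the parameter difference is to count the weights in a single convolutional layer. The sketch below uses the standard formula (one weight per input channel and kernel position, plus one bias, per output channel). The channel counts and kernel sizes are hypothetical and are not taken from the paper, whose savings depend on the whole architecture rather than on any one layer.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases in one 1D convolutional layer."""
    return out_channels * (in_channels * kernel_size + 1)


def conv2d_params(in_channels, out_channels, kernel_height, kernel_width):
    """Weights plus biases in one 2D convolutional layer."""
    return out_channels * (in_channels * kernel_height * kernel_width + 1)


# Hypothetical layer sizes, chosen only to compare like with like.
params_1d = conv1d_params(in_channels=64, out_channels=128, kernel_size=3)
params_2d = conv2d_params(in_channels=64, out_channels=128,
                          kernel_height=3, kernel_width=3)

print("1D layer:", params_1d)  # 24704
print("2D layer:", params_2d)  # 73856
```

With the same channel counts, a kernel of length 3 needs roughly a third of the weights of a 3×3 kernel, which hints at how 1D models can stay compact.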

Potential Uses and Effects

Going by the evaluation results demonstrated in this research paper, the proposed approach has great potential to deliver robust environmental sound classification applications.

For starters, it is much more efficient and requires less training data than conventional 2D CNNs, which have millions of trainable parameters. It also achieves state-of-the-art performance and can handle audio signals of any length by using a sliding window. Finally, its compact architecture greatly reduces computation costs.
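The sliding-window idea can be sketched as follows: cut a clip of any length into fixed-size frames, score each frame, then combine the per-frame scores into one clip-level decision. Everything here is an assumption for illustration: the function names, the zero-padding of the last frame, the averaging rule, and `toy_classifier`, which is only a stand-in for a trained 1D CNN.

```python
def sliding_windows(signal, window_size, hop):
    """Split a signal into fixed-size frames, zero-padding the last one."""
    frames = []
    start = 0
    while True:
        frame = signal[start:start + window_size]
        frame = frame + [0.0] * (window_size - len(frame))
        frames.append(frame)
        if start + window_size >= len(signal):
            break
        start += hop
    return frames


def classify_clip(signal, frame_classifier, window_size, hop):
    """Average per-frame class scores and return (label, mean_scores)."""
    frames = sliding_windows(signal, window_size, hop)
    scores = [frame_classifier(frame) for frame in frames]
    n_classes = len(scores[0])
    mean_scores = [
        sum(s[c] for s in scores) / len(scores) for c in range(n_classes)
    ]
    label = max(range(n_classes), key=mean_scores.__getitem__)
    return label, mean_scores


def toy_classifier(frame):
    """Stand-in for a trained model: scores [quiet, loud] from mean energy."""
    energy = sum(x * x for x in frame) / len(frame)
    p_loud = energy / (1.0 + energy)
    return [1.0 - p_loud, p_loud]


# Clips of different lengths go through the same fixed-size model input.
short_clip = [3.0] * 11
long_clip = [0.0] * 50
print(classify_clip(short_clip, toy_classifier, window_size=4, hop=2)[0])
print(classify_clip(long_clip, toy_classifier, window_size=4, hop=2)[0])
```

Averaging is only one possible aggregation rule. Majority voting over the per-frame labels would be another reasonable choice.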

Read more: https://arxiv.org/abs/1904.08990v1
