Comparison between simple audio classification methods

David Puiggròs
Sep 19, 2021
Photo by Matt Botsford on Unsplash

In this post I want to show a comparison, neither exhaustive nor particularly complete, of various methods for classifying sounds using deep learning. I have long wanted to review some of these methods in order to compare their performance and how easy it is to apply them.

The exercise I propose is quite simple. The main objective is to get a basic comparison of these methods and pick up some key ideas.

Using a dataset of 1,500 audio files labeled with 5 categories, I will train two classifiers, each using a different technique, to predict the category of each audio file.

I will explore two options based on neural networks: using the audio signal directly (one dimension) or through an image (two dimensions) generated from a spectral analysis of the original signal.

Dataset

The dataset used in this experiment can be found at https://www.kaggle.com/c/freesound-audio-tagging/data and was published in: Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline. In Proceedings of DCASE2018 Workshop, 2018. URL: https://arxiv.org/abs/1807.09902.

All 9,473 audio files in the dataset are gathered from Freesound and are provided as uncompressed PCM 16-bit, 44.1 kHz, mono audio files. The ground truth provided with this dataset was obtained through a labeling process using 41 categories of the AudioSet Ontology. Only 40% of the data has been manually verified.

In this experiment I only use the sounds associated with 5 categories: Bark, Cough, Knock, Laughter and Trumpet.

The distribution between the training, validation and test sets is shown in the following table.

Sample distribution between categories and datasets

There are between 239 and 300 samples per category. I split the data as follows: 20% test, 16% validation and 64% training.
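The 64/16/20 proportions are what you get by holding out 20% of the samples for testing and then 20% of the remainder for validation. The following is only a minimal sketch of that arithmetic in plain Python; the function name `split_indices`, the two-step procedure and the fixed seed are assumptions for illustration, not the code used in the experiment.

```python
import random

def split_indices(n_samples, test_frac=0.20, valid_frac_of_rest=0.20, seed=42):
    """Shuffle sample indices and split them into train/valid/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)            # reproducible shuffle
    n_test = round(n_samples * test_frac)           # 20% of everything
    n_valid = round((n_samples - n_test) * valid_frac_of_rest)  # 20% of the remaining 80%
    test = indices[:n_test]
    valid = indices[n_test:n_test + n_valid]
    train = indices[n_test + n_valid:]
    return train, valid, test

# A category with 300 samples
train, valid, test = split_indices(300)
print(len(train), len(valid), len(test))  # 192 48 60
```

Applying the function separately to each category would keep the class proportions similar across the three sets.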

Dataset regularization

The audio files (in WAV format) in the dataset have different durations, which requires some regularization that I apply in the same way for each option analyzed. To accomplish this I used the fastaudio library.

I will only use 5 seconds of audio data per file. When the audio is longer, I crop a 5-second segment of the signal. If it is shorter, I repeat it until the duration reaches 5 seconds.

Before cropping I remove the silence, so that I use the most informative part of the audio signal.
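To give an idea of what silence removal can look like, here is a deliberately simplified NumPy sketch that drops fixed-size frames whose peak amplitude falls below a threshold. The function name, the threshold and the frame length are assumptions chosen for illustration; this is not how fastaudio implements it.

```python
import numpy as np

def remove_silence(signal, threshold=0.01, frame_len=1024):
    """Drop fixed-size frames whose peak absolute amplitude is below threshold."""
    signal = np.asarray(signal, dtype=np.float32)
    kept = [
        signal[start:start + frame_len]
        for start in range(0, len(signal), frame_len)
        if np.max(np.abs(signal[start:start + frame_len])) >= threshold
    ]
    if not kept:                 # everything was silent: return the input unchanged
        return signal
    return np.concatenate(kept)

# Toy signal: 1 s of silence, 1 s of a 440 Hz tone, 1 s of silence (16 kHz assumed)
sr = 16000
t = np.arange(sr) / sr
tone = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
audio = np.concatenate([np.zeros(sr, np.float32), tone, np.zeros(sr, np.float32)])
trimmed = remove_silence(audio)
print(len(audio), len(trimmed))  # the trimmed signal is much shorter
```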

The signal cropping is also an augmentation, because when the signal lasts more than 5 seconds I crop a different part of the signal in each training epoch.
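The crop-or-repeat rule can be sketched in a few lines of NumPy. This is an illustrative version under my own assumptions (the name `crop_or_repeat` and the uniform random offset are hypothetical), not the fastaudio transform itself.

```python
import numpy as np

def crop_or_repeat(signal, target_len, rng=None):
    """Return exactly target_len samples: random crop if longer, repeat if shorter."""
    signal = np.asarray(signal)
    if len(signal) == 0:
        raise ValueError("cannot resize an empty signal")
    if len(signal) >= target_len:
        if rng is None:
            rng = np.random.default_rng()
        # A different offset on every call is what makes this an augmentation
        start = rng.integers(0, len(signal) - target_len + 1)
        return signal[start:start + target_len]
    repeats = -(-target_len // len(signal))      # ceiling division
    return np.tile(signal, repeats)[:target_len]

# A short "signal" of 3 samples repeated up to 7 samples
print(crop_or_repeat(np.arange(3), 7))  # [0 1 2 0 1 2 0]
```

For 5-second clips, `target_len` would be 5 times the sample rate of the audio.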

Some improvement could be made here by using all the available audio data or by determining a better window for each audio domain (5 seconds is a quite arbitrary decision).

The next images show one audio sample, the version with the silence removed and a cropped/repeated version.

Original audio
Same audio without silence
Same audio repeated until 5 seconds duration reached

Strategy to solve the problem

A possible strategy (a very interesting one), which I will not work on in this post, is to apply feature engineering to the signal. For example, it’s possible that some classes have characteristic frequencies, or some typical duration… Then it would be possible to generate custom attributes to facilitate classification.

I will not use this alternative as I don’t know this domain well enough, nor is it my intention to work on this part of the analysis. What I’m looking for is an approach that spares me from acquiring this knowledge without a significant cost in accuracy.

The first alternative, the raw classifier, is to build a deep learning model that uses the audio signal as is. This neural network will have two components:

  • A convolutional layer that is responsible for generating interesting features from the raw data
  • A fully connected layer that classifies using those features

The second alternative, the image classifier, consists of representing the audio data as an image, using a spectrogram, and then trying to classify it with a pre-existing deep learning model specialized in object recognition.

Raw classifier

For the raw classifier I use a 1-dimensional convolutional layer. It’s not as common as its 2D counterpart, but there are examples of it in audio classification problems, such as: https://medium.com/@oknagg/gender-classification-from-raw-audio-with-1d-convolutions-969c82e6b3d1

After some tests (not too many since the aim of this experiment is to compare the two strategies without applying too much effort or refinement) I arrived at the following structure, which gave the best results.

Conv1dModule(
  (m): Sequential(
    (0): Sequential(
      (0): ConvLayer(
        (0): Conv1d(1, 8, kernel_size=(4096,), stride=(1000,), padding=(2047,), bias=False)
        (1): BatchNorm1d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU()
      )
    )
    (1): Flatten(full=False)
    (2): Linear(in_features=640, out_features=500, bias=True)
    (3): ReLU()
    (4): Dropout(p=0.4, inplace=False)
    (5): Linear(in_features=500, out_features=50, bias=True)
    (6): ReLU()
    (7): Dropout(p=0.4, inplace=False)
    (8): Linear(in_features=50, out_features=25, bias=True)
    (9): ReLU()
    (10): Dropout(p=0.4, inplace=False)
    (11): Linear(in_features=25, out_features=5, bias=True)
  )
)
Total trainable params: 2,718,091
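As a sanity check on the shapes, the 640 input features of the first linear layer can be reproduced with the standard output-length formula for a 1D convolution. The sketch below assumes the 5-second clips are resampled to 16 kHz (80,000 samples); that sample rate is my assumption, chosen because it is consistent with the printed layer sizes, and it is not stated in the post.

```python
def conv1d_output_length(n_in, kernel_size, stride, padding):
    """Output length of a 1D convolution with dilation 1."""
    return (n_in + 2 * padding - kernel_size) // stride + 1

sample_rate = 16_000           # assumption: not stated in the post
n_samples = 5 * sample_rate    # 5-second clips -> 80,000 samples

out_len = conv1d_output_length(n_samples, kernel_size=4096, stride=1000, padding=2047)
flat_features = 8 * out_len    # 8 output channels, flattened into one vector
print(out_len, flat_features)  # 80 640
```

With a stride of 1000 samples, each of the 80 output positions summarizes a window of 4096 samples, roughly a quarter of a second at the assumed sample rate.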

The results, after some training:

These results are not really good. Some classes are predicted well (Trumpet), but the model easily confuses Bark, Cough and Laughter.

I’m sure that this model has a lot of room for improvement. It’s possible to use a better architecture, add augmentations, or tune parameters like the convolution kernel size or stride.
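Confusions like these are usually read from a confusion matrix, which counts how often each true class is predicted as each other class. Here is a minimal sketch in plain Python; the labels below are made up purely to show the mechanics and are not results from this experiment.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how many times each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels, for illustration only
y_true = ["Bark", "Bark", "Cough", "Laughter", "Trumpet", "Trumpet"]
y_pred = ["Bark", "Cough", "Cough", "Cough", "Trumpet", "Trumpet"]

counts = confusion_counts(y_true, y_pred)
print(counts[("Bark", "Cough")])      # 1: one Bark sample predicted as Cough
print(accuracy(y_true, y_pred))       # 4 of the 6 predictions are correct
```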

Image classifier

The image classifier requires converting the 1-dimensional audio data into 2-dimensional data. I use the AudioToSpec class in fastaudio to generate the spectrogram image. The spectrogram is a representation that shows the power of the sound at each frequency over time.

Spectrogram of the audio file shown previously
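To make the idea concrete, a power spectrogram can be computed by slicing the signal into overlapping windows and taking one FFT per window. The NumPy sketch below is a bare-bones illustration under my own assumptions (function name, window size and hop length are hypothetical); fastaudio's own transform may additionally apply a mel scale and a decibel conversion, which this sketch omits.

```python
import numpy as np

def power_spectrogram(signal, n_fft=1024, hop_length=256):
    """Return an array of shape (n_fft // 2 + 1, n_frames): power per frequency bin and frame.

    Assumes len(signal) >= n_fft.
    """
    signal = np.asarray(signal, dtype=np.float64)
    window = np.hanning(n_fft)                       # taper each frame to reduce leakage
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack([
        signal[i * hop_length:i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)           # one FFT per frame
    return (np.abs(spectrum) ** 2).T                 # frequency on rows, time on columns

# 1 second of a 1 kHz test tone at an assumed 16 kHz sample rate
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
spec = power_spectrogram(tone)
peak_bin = int(spec.mean(axis=1).argmax())
print(spec.shape, peak_bin * sr / 1024)              # (513, 59) 1000.0
```

The strongest row of the resulting image sits at the frequency of the tone, which is exactly the kind of structure the image classifier can pick up.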

To improve the performance of this model we can use a really simple augmentation technique: covering some parts of the spectrogram in each training sample. This technique improves the model’s generalization.

Augmentation covering some random time interval and frequencies
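This kind of masking, similar in spirit to SpecAugment, is easy to sketch with NumPy: pick one random band of frequencies and one random time interval and overwrite them with a constant. The function below is an illustrative version under my own assumptions (names, maximum widths and fill value are hypothetical), not the fastaudio transform.

```python
import numpy as np

def mask_spectrogram(spec, max_freq_width=20, max_time_width=20, fill_value=0.0, rng=None):
    """Return a copy of spec with one random frequency band and one random time band covered."""
    if rng is None:
        rng = np.random.default_rng()
    masked = np.array(spec, dtype=np.float64, copy=True)   # never modify the input
    n_freq, n_time = masked.shape

    # Cover a horizontal band (a range of frequencies over the whole duration)
    f_width = rng.integers(0, min(max_freq_width, n_freq) + 1)
    f_start = rng.integers(0, n_freq - f_width + 1)
    masked[f_start:f_start + f_width, :] = fill_value

    # Cover a vertical band (a time interval across all frequencies)
    t_width = rng.integers(0, min(max_time_width, n_time) + 1)
    t_start = rng.integers(0, n_time - t_width + 1)
    masked[:, t_start:t_start + t_width] = fill_value
    return masked

spec = np.ones((64, 100))                        # dummy spectrogram: 64 bins, 100 frames
augmented = mask_spectrogram(spec, rng=np.random.default_rng(0))
print(augmented.shape, int((augmented == 0).sum()))  # same shape, some cells covered
```

Because the bands are drawn again on every call, the network sees a slightly different image for the same audio in each epoch.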

Here we can see some examples of the original audio data and the spectrogram used to train the model:

I use transfer learning from a pretrained xresnet18 model to classify the spectrogram images. The network has more than 11 million parameters available for this classification task.

Results are far better. There is some confusion between Bark, Cough and Laughter, but the 86% accuracy is pretty nice.

I think it’s really interesting that a vision model specialized in recognizing common objects can be fine-tuned to work with something as particular as spectrograms.

Conclusions

It seems pretty clear from this experiment that using images as a first approach is not a bad idea.

Probably the most important features in audio data are the power of the signal at each frequency, and the spectrogram provides this information ready to be consumed. The one-dimensional model has to derive it with no clues, and it surely requires more examples and more fine-tuning.

The number of trainable parameters in each model is substantially different, and that can also explain the better performance of the image model.

Next steps

  • Improve the raw classifier using augmentation, or using transfer learning techniques
  • Train the first layer of the raw classifier using autoencoders, to build some kind of “language model” (audio model) that improves the classification of the fully connected layer
  • Use an LSTM or some other method that allows the use of the whole signal (including silences), regardless of the duration
