Freesound Audio Tagging — Recognizing Sounds of Various Natures

A Machine Learning (Multilabel Classification) task based on Freesound Audio Tagging 2019 Dataset on Kaggle

Divyansh Jain
The Startup
Dec 21, 2020


Table of Contents

  1. Business Task
  2. Use of Machine Learning/Deep Learning in this task
  3. Performance Metric
  4. Exploratory Data Analysis
  5. Preprocessing and Featurizations
  6. Existing Solutions
  7. Modeling
  8. Error Analysis
  9. Summary
  10. Future Work
  11. References
Photo by Aaron Burden on Unsplash

1. Business Task

The objective of this case study is to develop a general-purpose audio tagging system, i.e., when the algorithm is given an audio clip as input, it detects which sound(s) are present in it.

This case study is based on the famous Kaggle competition: Freesound Audio Tagging 2019.

Source: https://www.kaggle.com/c/freesound-audio-tagging-2019

These sounds are represented as categories/labels and there are 80 of these in this competition.

For example, in this sound clip, it can clearly be identified as a baby’s laughter. However, a single clip can contain multiple sound labels; for example, this clip is from a restaurant during busy hours, where there are many distinct sounds such as people talking, keys rattling, and cutlery clinking. We have to identify all of these sounds.

2. Use of Machine Learning/Deep Learning in this task

As a machine learning solution to this problem, we need to develop a model that takes the audio clip as input and predicts all the categories/labels of sounds present in it.

This can be modeled as a multilabel classification problem in machine learning. Here, a given clip has multiple correct labels (as there can be multiple sounds present in one clip), and the model should identify as many of them as possible.

There are 80 predefined labels/categories like Guitar and other Musical instruments, Water, Respiratory sounds, Human voice, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds.

Machine Learning is particularly helpful here to reduce the human effort needed to manually label these sounds for tasks like annotating sound collections and providing captions for non-speech events in audiovisual content. It can also be used to automatically tag video content or recognize sound events happening in real-time.

Dataset Source and Description:

The dataset used for training the model is the Kaggle competition’s publicly available dataset.

The training dataset is composed of two subsets:

  1. Curated subset:

This contains around 5,000 manually labeled clips. The noise is minimal, so the various sounds can be distinctly heard in each clip.

Example datapoint is:

Example datapoint (curated)

In the sound clip for the above datapoint, the labels “Crowd” and “Cheering” can be distinctly heard.

2. Noisy subset:

This contains around 20,000 clips of noisy web audio data, where the background noise is significant and not all sounds can be heard distinctly.

We are also given two mapping files, train_curated.csv, and train_noisy.csv, which contain the filename of the .wav file along with the actual class labels for curated and noisy subsets respectively.
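As an illustration, here is a minimal sketch of loading these mapping files with pandas (assuming, as in the competition files, that the labels column holds comma-separated label strings):

import pandas as pd

# Hypothetical paths; point these at wherever the competition files are stored.
train_curated = pd.read_csv("train_curated.csv")   # columns: fname, labels
train_noisy = pd.read_csv("train_noisy.csv")

# The labels column holds comma-separated strings, e.g. "Crowd,Cheering";
# split them into Python lists for easier handling later on.
train_curated["label_list"] = train_curated["labels"].str.split(",")
train_noisy["label_list"] = train_noisy["labels"].str.split(",")

print(train_curated.head())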

Example datapoint is:

Example datapoint (noisy)

In the sound clip for the above datapoint, the label “Crowd” can’t be identified distinctly. Instead, sounds related to “Guitar” are more prominent in this clip.

Thus, we have 80% noisy - 20% curated data at the clip level.

3. Performance Metric

The performance metric used here is Label Weighted Label Ranking Average Precision (LWLRAP).

This measures the average precision of retrieving a ranked list of relevant labels for each test clip (i.e., the system ranks all the available labels, then the precisions of the ranked lists down to each true label are averaged).

The novel “label-weighted” part means that the overall score is the average over all the labels in the test set, where each label receives equal weight (by contrast, plain LRAP gives each test item equal weight, thereby discounting the contribution of individual labels when they appear on the same item as multiple other labels).

Some key features of LWLRAP are:

  1. It ranges between 0 and 1; higher is better.
  2. An LWLRAP score of 1 means that if there are ‘k’ ground-truth labels for a given clip and we sort the model’s predicted probabilities (over the 80 classes) in descending order, the top ‘k’ predictions are exactly the ground-truth labels.
  3. The LWLRAP decreases when non-ground-truth labels receive higher probabilities than ground-truth labels, i.e., they are ranked above the ground-truth labels when the probabilities are sorted in descending order.
  4. It is calculated from the relative ranking of the labels, not the actual probability scores.
  5. Instead of giving each clip equal weight (plain LRAP), the label-weighted version gives each ground-truth label in the test set equal weight, so labels that appear on multi-label clips are not discounted and each label’s contribution reflects how often it actually occurs.

LWLRAP is explained very well in this Kaggle Kernel.
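As a quick illustration, here is a minimal sketch of how LWLRAP can be computed. It relies on the known trick of weighting each clip by its number of ground-truth labels, which turns scikit-learn’s plain LRAP into the label-weighted version (the reference implementation in the kernel above is pure NumPy):

import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

def lwlrap(truth, scores):
    """truth: binary array (n_clips, n_labels); scores: predicted probabilities."""
    # Weighting each clip by its number of true labels makes every
    # ground-truth label contribute equally to the overall average.
    sample_weight = np.sum(truth > 0, axis=1)
    nonzero = sample_weight > 0          # clips without labels carry no weight
    return label_ranking_average_precision_score(
        (truth[nonzero] > 0).astype(int),
        scores[nonzero],
        sample_weight=sample_weight[nonzero],
    )

# Toy example: 2 clips, 3 labels; true labels are ranked on top, so the score is 1.0.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]])
print(lwlrap(y_true, y_pred))   # 1.0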

4. Exploratory Data Analysis (EDA)

4.1 EDA on number of clips per label

This is an important analysis to know if the dataset is balanced or not.

Just a quick recap of the basic statistics:-

  • Curated training dataset: 5,000 clips
  • Noisy training dataset: 20,000 clips
  • Number of class labels: 80
Bar plot of the count of clips per label (Curated dataset)
Bar plot of the count of clips per label (Noisy dataset)

Observation: The dataset is balanced: we have roughly the same number of clips for each of the 80 class labels (in both subsets).

4.2 EDA on number of labels per clip

Since this is a multilabel classification task, a given clip can contain more than one type of sound, and hence one clip can have multiple labels simultaneously. For example, this clip is taken from the curated dataset and contains 3 labels: “Crowd”, “Race_car_and_auto_racing”, and “Accelerating_and_revving_and_vroom”.

Bar plot of the number of labels per clip for the curated dataset
Bar plot of the number of labels per clip for the noisy dataset

Observation: A majority of the clips (~84%) have a single label, a few clips (~13%) have 2 labels, and very few (~2%) have 3 labels. Together, these account for about 99% of the total clips.

4.3 EDA on the duration of clips

This is an important analysis because if the clip is 1–2 seconds long, then even a human will have difficulty in finding the correct label. Also, if a clip is very long, then there are chances of too many different sounds getting mixed in it, thus, increasing the complexity of the task.

Get clip length (in seconds)
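A minimal sketch of how the clip length can be computed (using the standard library’s wave module here; librosa.get_duration works just as well):

import wave

def clip_length_seconds(path):
    """Return the duration of a .wav file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / float(w.getframerate())

# Example (hypothetical path):
# print(clip_length_seconds("train_curated/0006ae4e.wav"))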
  1. Curated dataset
Statistics of clip duration for the curated training dataset
PDF of the duration of clips for the curated dataset

Observations:

1. Audio clips from the curated dataset range from 0.3–30 seconds with a median length of 4.7 seconds.

2. The distribution is heavily skewed, with many clips shorter than 3 seconds while a few approach the 30-second maximum.

3. There is a lot of variance in the lengths of clips from the curated dataset.

We randomly select 5 labels and then check the variation of duration across these labels.

Label-wise distribution of audio lengths among 5 randomly selected labels from the curated training dataset

Observation: Some labels (e.g., hi-hat) show very little variance in audio length, while others (e.g., bus) show high variance. Overall, the variance in audio length differs across labels in the curated dataset.

2. Noisy Dataset

Statistics of clip duration for Noisy training dataset
Histogram of clip durations (Noisy dataset)

Observation: Clips here range between 1 and 16 seconds, with most clips (~90%) close to 15 seconds in length. The variance here is much smaller than in the curated dataset.

4.4 EDA on Amplitude of clips

This is an important analysis because sound waves with large amplitude carry more energy and therefore sound louder, and louder sounds are easier to classify than softer ones.

Code snippet to obtain peak and average amplitudes
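A minimal sketch of how the peak and average amplitudes can be obtained (assuming scipy and mono 16-bit PCM clips):

import numpy as np
from scipy.io import wavfile

def amplitudes(path):
    """Return the (peak, average) absolute amplitude of a .wav file."""
    _, samples = wavfile.read(path)                 # int16 samples for 16-bit PCM audio
    abs_samples = np.abs(samples.astype(np.int64))  # avoid int16 overflow on abs()
    return abs_samples.max(), abs_samples.mean()

# peak, avg = amplitudes("train_curated/0006ae4e.wav")   # hypothetical path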
PDF of peak amplitudes for Curated training dataset
PDF of peak amplitudes for Noisy training dataset

Observation: Most of the clips (in both datasets) have a peak amplitude of around 30,000.

PDF of average amplitudes for Curated training dataset
PDF of average amplitudes for Noisy training dataset

Observation: The average amplitude follows a distribution close to a power-law distribution: most clips have an average amplitude close to 0, and the count drops off quickly for higher average amplitudes.

So, the overall conclusions from EDA are as follows:

  1. We have roughly the same quantity of data per label, so the dataset can be considered as balanced.
  2. The curated dataset consists of clearly audible clips, but we have only around 60 clips per class. On the other hand, the noisy dataset’s clips contain a lot of background noise and are not clear, but we have them in abundance (around 250 per class). So either we develop a model that can predict well with less data, or we have to somehow clean the noisy data so that the labels become clearly distinguishable.
  3. Most of the clips contain only one label per clip, which is a good thing for our models.
  4. The majority of the clips in the curated dataset are short (<5 seconds); this can be challenging because predicting the type of sound from a very short clip (1–2 seconds) is difficult even for a human. In the noisy dataset, almost all clips are 15 seconds long, but they contain a lot of background noise.
  5. Most of the clips have a similar average amplitude, which indicates roughly the same average loudness level across clips.

5. Preprocessing and Featurizations

Here, we transform the data into a suitable format, so that it can be fed to a machine learning model. It is analogous to bathing (cleaning) a baby (data), dressing him up in school uniform (featurization), and then sending him to school (model).

Source: https://www.alamy.com/

We have all our data in the format of audio files only, hence, we perform the following featurizations:

  1. Removing leading and trailing silences

If we listen to this clip (taken from the curated training dataset) casually, we would not be able to find any distinct sound throughout the clip. However, on listening carefully, we will find some distinct sounds in the middle part of the clip.

There are many such clips where the start and end contain only very quiet audio (more than 60 dB below the clip’s peak), which we can trim so that the clip becomes more focused.
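A minimal sketch of this trimming step (assuming librosa; top_db=60 drops leading/trailing audio that is more than 60 dB quieter than the clip’s peak):

import librosa

def trim_silence(path, top_db=60):
    """Load a clip and strip leading/trailing segments quieter than top_db (relative to the peak)."""
    y, sr = librosa.load(path, sr=None)                   # keep the original sampling rate
    y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)
    return y_trimmed, sr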

2. Resampling the audio clip

The sampling rate (SR) of a sound is the number of samples collected per second while recording the sound. A higher SR means more data collected per second of the clip and hence better sound quality.

The current SR is 44.1 kHz, i.e., 44,100 samples per second, which is the standard SR for digital recordings, and the clips in the curated dataset are of fairly good quality. One issue that can arise later while modeling is that the dimensionality of the data becomes very high if we keep the SR at 44.1 kHz. Although deep neural networks don’t have a problem with high-dimensional data, this significantly increases the time required to train the model.

One idea which is inspired by this Kaggle kernel is to reduce the SR from 44.1 kHz to 16 kHz. This led to a 2.75x decrease in the model’s training time while the performance was not affected that much. Hence, we made this tradeoff to cater to real-world time constraints.
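A minimal sketch of the resampling step (assuming librosa, which can resample while loading):

import librosa

TARGET_SR = 16000   # 16 kHz, down from the original 44.1 kHz

def load_resampled(path, sr=TARGET_SR):
    """Load a .wav file and resample it to the target sampling rate."""
    y, _ = librosa.load(path, sr=sr)   # librosa resamples while loading
    return y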

3. Random offsetting and padding

After removing leading and trailing silences, the lengths of the audio clips vary from 0 to 30 seconds. We decided to make all clips the same length, t seconds: for clips longer than that, we take a random crop of length t seconds, and for shorter clips, we pad symmetrically with zeros on both sides until the length becomes t seconds. Through hyperparameter tuning, we found 15 seconds to be an appropriate value of t.

Random offsetting and padding serve the following two purposes respectively:

  1. Selecting a random sample helps in reducing overfitting. This is similar to dropping random neurons (Dropout) while training deep neural networks, or to selecting a random subset of features/datapoints for each base learner in a Random Forest. The objective of all these methods is to introduce randomization so that the model becomes robust to the noise present in the data and does not overfit.
  2. Padding is necessary to make all the data of uniform dimensions so that it can be fed to any model. The simplest form of padding is zero-padding where the original data is padded by zeros symmetrically on both the starting and trailing ends.
Config class: Used to share global parameters
Preprocessing function — 1
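The Config class and preprocessing function referenced above are in the notebook; a minimal sketch along the same lines (the field values and function name here are illustrative, not the exact ones used) could look like this:

import numpy as np
import librosa

class Config:
    """Global parameters shared across preprocessing and modeling (illustrative values)."""
    sampling_rate = 16000                     # 16 kHz after resampling
    duration = 15                             # target clip length in seconds
    top_db = 60                               # silence-trimming threshold
    audio_length = sampling_rate * duration   # 240,000 samples per clip

def preprocess_clip(path, config=Config):
    """Trim silences, resample, then random-offset or zero-pad to a fixed length."""
    y, _ = librosa.load(path, sr=config.sampling_rate)
    y, _ = librosa.effects.trim(y, top_db=config.top_db)

    if len(y) > config.audio_length:
        # Random offset: keep a random window of the target length.
        start = np.random.randint(0, len(y) - config.audio_length)
        y = y[start:start + config.audio_length]
    else:
        # Symmetric zero-padding on both ends up to the target length.
        pad_total = config.audio_length - len(y)
        left = pad_total // 2
        y = np.pad(y, (left, pad_total - left), mode="constant")
    return y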

6. Existing solutions

The following Kaggle kernels/research papers were quite good:

  1. Beginner’s Guide to Audio Data 2: They used good featurizations like using a lower sampling rate, random offsetting, padding, etc. On top of that, they built 2 models: a 1-D CNN model on raw time series and a 2-D CNN model on log Mel-spectrogram features.
  2. Audio Tagging With Noisy Labels And Minimal Supervision: They used good featurizations like log Mel-spectrogram features, handled the curated and noisy datasets in different ways, and then built a 2-D CNN model that gave an LWLRAP score of 0.546 on the test set.

7. Modeling

Initially, we build our models using the curated training dataset only, as these clips are quite clear and audible. Also, the curated set is smaller (~5,000 clips) than the noisy set (~20,000 clips), so models train faster on it.

Later on, we will try training the models on the complete dataset (Curated + Noisy) and check if there’s any significant increase in the model’s performance.

Since we have time-series data, we’ll try deep learning models as these are quite strong in extracting hidden features from raw data, given there is enough data.

7.1 Train-Validation split

We perform a 70–30 train-validation split that is not purely random but stratified, such that the train and validation sets have approximately the same distribution of the number of labels per clip. This strategy helps reduce differences between the train and validation sets.

Stratified Train-Validation Split
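A minimal sketch of such a split (assuming scikit-learn and the dataframe layout from the loading sketch earlier; stratification is done on the number of labels per clip):

from sklearn.model_selection import train_test_split

# Stratify on the number of labels per clip so that both splits end up with
# roughly the same proportion of 1-, 2- and 3-label clips.
train_curated["num_labels"] = train_curated["label_list"].apply(len)

# Note: extremely rare label counts (e.g., a single clip with 6 labels) may need
# to be merged into one bucket for stratification to succeed.
train_df, val_df = train_test_split(
    train_curated,
    test_size=0.3,
    stratify=train_curated["num_labels"],
    random_state=42,
)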

We can verify if the split works as expected:

Get the percentage of clips having a given number of labels in the input dataframe
Distribution of the number of labels in train and validation dataframes
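A minimal sketch of this check (pandas, reusing the num_labels column created for the split):

def label_count_percentages(df):
    """Percentage of clips having each number of labels in the given dataframe."""
    return (df["num_labels"].value_counts(normalize=True) * 100).round(2)

print(label_count_percentages(train_df))
print(label_count_percentages(val_df))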

As we observe, the train and validation datasets have the same distribution of the number of labels per clip.

Since this is a Kaggle competition, we have a separate test dataset (~3,300 clips) which we will use to evaluate each model’s performance.

7.2 Data preparation for Model-1

After this, we read the contents of the preprocessed .wav files produced in Step 5 using scipy.io.wavfile.read(). Each clip can have multiple ground-truth labels, which we convert into multi-hot-encoded vectors.

All the data is converted into TensorFlow Datasets format. This allows proper utilization of both CPU and GPU so that the idle time for each of these is minimum (this link can be referred to for more advantages of TF Datasets).

The detailed code for conversion of the .wav files into TF datasets can be found on my Github Profile.

As mentioned in Step 5, we used a sampling rate of 16 kHz (16,000 samples per second) and sampled/padded each clip to the same length of 15 seconds. So, each clip/datapoint now contains 16,000 samples/second * 15 seconds = 240,000 samples in total. Each sample becomes a feature for the ML model.
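A minimal sketch of this data preparation (multi-hot encoding via scikit-learn’s MultiLabelBinarizer and a tf.data pipeline; the list of 80 class names and the preprocessed-file column are assumptions):

import numpy as np
import tensorflow as tf
from scipy.io import wavfile
from sklearn.preprocessing import MultiLabelBinarizer

# all_80_labels: assumed list of the 80 class names (e.g., taken from the sample submission).
mlb = MultiLabelBinarizer(classes=sorted(all_80_labels))
y_multi_hot = mlb.fit_transform(train_df["label_list"])     # shape: (n_clips, 80)

def read_clip(path):
    _, samples = wavfile.read(path)          # preprocessed clip: 240,000 samples
    return samples.astype(np.float32)

# "preprocessed_path" is an assumed column holding paths to the preprocessed clips.
# Loading all clips into memory takes a few GB here; larger sets would be
# streamed instead (e.g., via a generator or TFRecords).
X = np.stack([read_clip(p) for p in train_df["preprocessed_path"]])

train_ds = (
    tf.data.Dataset.from_tensor_slices((X, y_multi_hot.astype(np.float32)))
    .shuffle(1024)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)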

Example datapoint

7.3 Model-1 (1-D CNN)

We use a 1-dimensional convolutional model (the architecture is inspired by this kernel). It performs temporal convolution (along one axis only) on the raw time-series features. We prefer a 1-D CNN over an RNN/LSTM because it is much faster to train and gives better results here. Max-pooling layers are used to downsample and reduce overfitting, and appropriate dropouts are used for the same purpose. The ReLU activation function is used everywhere except the last layer, where we use the sigmoid activation function since we want independent probabilities for each class.

Since this is a multilabel classification task, where labels such as “Crowd”, “Cheering”, and “People” can be present in the same clip, we use sigmoid rather than softmax. The main difference is that sigmoid predicts the probability of the clip belonging to each label independently (the clip can have a probability of 0.7 for “Crowd” and, at the same time, 0.6 for “Cheering”), whereas softmax assumes each clip belongs to exactly one label, with all probabilities coupled so that they sum to 1. This is a very nice article explaining the differences between sigmoid and softmax.

We use the categorical cross-entropy loss function and a batch size of 64, along with the Adam optimizer, to train the model.

Model-1 (1-D CNN)
Architecture of Model-1
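The exact architecture is in the snippet and diagram referenced above (and in the linked kernel); a minimal Keras sketch in the same spirit, with illustrative layer sizes, is:

import tensorflow as tf
from tensorflow.keras import layers, models

def build_1d_cnn(input_length=240000, n_classes=80):
    model = models.Sequential([
        layers.Input(shape=(input_length, 1)),
        layers.Conv1D(16, kernel_size=9, activation="relu"),
        layers.MaxPooling1D(pool_size=16),
        layers.Dropout(0.1),
        layers.Conv1D(32, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=4),
        layers.Dropout(0.1),
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="sigmoid"),   # independent per-label probabilities
    ])
    # Loss named in the post; binary cross-entropy is another common choice for
    # multilabel sigmoid outputs.
    model.compile(optimizer=tf.keras.optimizers.Adam(), loss="categorical_crossentropy")
    return model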

We train the model for 50 epochs, store the weights after each epoch, and observe the train and validation curves to be as follows:

Epoch vs Loss (Model-1)
Epoch vs LWLRAP (Model-1)

From the curves, 20 epochs seem enough, as after that the loss curve starts flattening and the gap between the train and validation loss starts increasing.

We load the weights from the 20th epoch and compute predictions on the test data. The final results are as follows:

Train LWLRAP: 0.546
Validation LWLRAP: 0.492
Test LWLRAP: 0.368

Test score for Model-1

The test LWLRAP is decent, which means the model has reasonable predictive power. However, it is noticeably lower than the train and validation LWLRAP, which could be because the test set differs from the curated training data.

7.4 Data preparation for Model-2

In Model-1, we used the raw time-series data read directly from the input .wav files with slight preprocessing (removing silences, padding, etc.). Although deep learning models can automatically extract features from the input data to some extent, they work better if we feed them properly featurized data instead of raw data.

The ideas for these featurizations are inspired by this research paper.

Before the Deep Learning era, people developed techniques to extract features from audio signals. It turns out that these techniques are still useful. One such technique is computing the MFCC (Mel Frequency Cepstral Coefficients) from the raw audio. Before we jump to MFCC, let’s talk about extracting features from the sound.

If we just want to classify some sound, we should build features that are speaker-independent. Any feature that only gives information about the speaker (like the pitch of their voice) will not be helpful for classification. In other words, we should extract features that depend on the “content” of the audio rather than the nature of the speaker. Also, a good feature extraction technique should mimic human speech perception. We don’t hear loudness on a linear scale. If we want to double the perceived loudness of a sound, we have to put 8 times as much energy into it. Instead of a linear scale, our perception system uses a log scale.

Taking these things into account, Davis and Mermelstein came up with MFCC in the 1980s. MFCC mimics the logarithmic perception of loudness and pitch of the human auditory system and tries to eliminate speaker-dependent characteristics by excluding the fundamental frequency and its harmonics.

Here is a very nice blog explaining the Fourier transforms, Spectrograms, and the Mel Scale. I would strongly recommend everyone to give this beautiful blog a read.

We perform the following steps to get the Mel frequency features from the raw data:

  1. Sample the input into different overlapping windows. Since most of the audio clips are short, windowing the sound clip makes sure we don’t miss any useful information.
  2. Compute the STFT (Short Time Fourier Transform) for each window to convert the time domain into frequency domain.
  3. Generate a Mel scale. (The Mel scale is a non-linear transformation of the frequency scale, constructed so that sounds an equal distance apart on the Mel scale also sound equally far apart to humans. On the Hz scale, in contrast, the difference between 500 and 1000 Hz is obvious, whereas the difference between 7500 and 8000 Hz is barely noticeable.)
  4. Generate the Mel spectrogram: combine all the STFTs computed in step 2 into a spectrogram whose frequency axis uses the Mel scale instead of the linear frequency scale (with amplitudes shown on a log/decibel scale).
Example Spectrogram (Ref: https://en.wikipedia.org/wiki/Spectrogram)

In a spectrogram, we have both the frequency and the amplitude (loudness) of the signal, with the small windows laid out along the x-axis as time. A Mel-spectrogram has a similar representation, except that the frequency axis uses the Mel scale instead of linear Hz (and the amplitude is typically shown on a log/decibel scale). This is because most sounds humans hear are concentrated in a small frequency range, which is better represented using the Mel scale.

The code for the above transformations can be found on my GitHub page.
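In the meantime, a minimal sketch of the transformation (assuming librosa; the window, hop, and Mel-bin sizes are illustrative):

import numpy as np
import librosa

def log_mel_spectrogram(y, sr=16000, n_fft=1024, hop_length=512, n_mels=128):
    """Windowed STFT -> Mel filterbank -> log scaling, i.e. a log-Mel spectrogram."""
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, n_frames)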

7.5 Model-2 (2-D CNN on Mel Spectrogram features)

Since we generate spectrograms (i.e., images) from the above featurization, the natural first idea is to use a Convolutional Neural Network (CNN), which works very well on images.

We use the MobileNet model for the classification task.

Model-2 (2-D CNN)
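The exact model is in the snippet referenced above; a minimal Keras sketch of a MobileNet-based classifier on the spectrogram “images” (the input shape and the three-channel stacking are illustrative assumptions) could look like this:

import tensorflow as tf
from tensorflow.keras import layers, models

def build_mobilenet(input_shape=(128, 469, 3), n_classes=80):
    # MobileNet expects 3-channel input; a single-channel spectrogram can be
    # repeated across three channels (e.g., with np.repeat) before being fed in.
    base = tf.keras.applications.MobileNet(
        include_top=False, weights=None, input_shape=input_shape
    )
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(n_classes, activation="sigmoid"),   # independent per-label probabilities
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model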

We train the above model for 10 epochs, and the train and validation loss curves are as follows:

Epoch vs Loss (Model-2)
Epoch vs LWLRAP (Model-2)

We get sufficiently good validation performance after 9 epochs. The gap between the train and validation scores is not too large, so the model is not overfitting badly. We compute the test predictions using the best weights.

Final results:

Training LWLRAP = 0.743
Validation LWLRAP = 0.52
Testing LWLRAP = 0.514

Test score for Model-2

The 2-D CNN (MobileNet) model performs much better than the previous model. The gap between the training and validation LWLRAP is acceptable, and the validation and test scores are very similar, which indicates the model also performs well on unseen data.

Hence, we consider this as our final model.

8. Error Analysis

This is a very important and often overlooked part of the ML pipeline. We should know the cases where our model works and where it doesn’t. As a great little man once said, “Once you’ve accepted your flaws, no one can use them against you”.

8.1 Analysis of loss

We plot the Probability Density Function (PDF) of the categorical cross-entropy loss between the ground truth labels and predicted labels for the complete training dataset.

PDF of categorical-cross entropy loss for the training dataset

Observation: The distribution is similar to a normal distribution but with fat tails, which means most of the errors are close to the average error while a few points have quite high error.

We divide all the datapoints into 3 categories:

  • Best performance (least loss)
  • Average performance (average loss)
  • Worst performance (highest loss)

We divide the dataframe such that each of the categories has an equal number of datapoints.

Dividing the dataframe into 3 parts based on the loss
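A minimal sketch of the split (pandas qcut produces three equal-sized bins by loss; analysis_df and its loss column are illustrative names for a per-clip results dataframe):

import pandas as pd

# analysis_df is an illustrative per-clip dataframe with a "loss" column holding
# the categorical cross-entropy between ground truth and predictions.
analysis_df["loss_category"] = pd.qcut(
    analysis_df["loss"], q=3, labels=["best", "average", "worst"]
)

best_perf = analysis_df[analysis_df["loss_category"] == "best"]
avg_perf = analysis_df[analysis_df["loss_category"] == "average"]
worst_perf = analysis_df[analysis_df["loss_category"] == "worst"]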

We plot the PDF of loss for clips having the highest loss:

PDF for clips with the highest loss

Observation: This is a highly positively skewed distribution: most of the values are small, but a few are very large. This indicates there are some clips for which the model is not able to predict the labels correctly at all.

8.2 Does the number of labels per clip play a key role in determining the loss?

If there are 5 labels in a given clip, it would be tough even for a human to identify all of them correctly, compared to a clip with a single label. Hence, this seems to be an important factor.

Impact of the number of labels per clip on loss

Observation: Among the clips with high loss, a significant percentage have more than one label (count_labels > 1), which indicates that this plays an important role in the model’s performance, as expected.

As the number of labels per clip increases, the performance decreases.

8.3 Does the duration (in seconds) of the clip have an impact on the loss?

PDF of the duration of the clip for each of the categories of loss (low, medium, high loss)

Observation: All the long-duration clips have high loss. This may be because a very long clip can contain many different sounds, and the model cannot predict all of them accurately, leading to a higher loss.

8.4 Top 10 labels that occur most frequently in the highest loss category

This is important so that we know which labels the model is not able to predict properly.

Counting the number of times each label occurs in the worst performance category

Printing the first 10 items of the worst_perf_labels_count_dict:

10 most freq. occurring labels in the worst performance category
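A minimal sketch of how this count can be produced (collections.Counter over the label lists of the worst-performing clips; the dataframe and column names follow the earlier sketches):

from collections import Counter

# Count how often each label appears among the highest-loss ("worst performance") clips.
worst_perf_labels_count = Counter(
    label for labels in worst_perf["label_list"] for label in labels
)

# 10 most frequently occurring labels in the worst performance category.
for label, count in worst_perf_labels_count.most_common(10):
    print(label, count)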

For labels like “squeak”, the sound is very quiet and even humans have difficulty hearing it, so this is understandable. Labels like “sink”, “bathtub”, and “water_tap_and_faucet” all sound very similar, which might be the cause of the misclassification.

Further error analysis can be viewed on my GitHub page.

8.5 Conclusions:

  1. The model performs well on the majority of the points. There are a few points for which the loss is high (an observation made from the fat tails of the normal distribution of the losses).
  2. The model performs well on clips with a single label per clip and where the clip has low to moderate background noise.
  3. The model fails to identify all the correct labels when the number of labels per clip is high (>2) and the clips contain various very similar sounds like water_tap_and_faucet, toilet_flush, drip, fill_(with_liquid), etc.
  4. The model also can’t perform well if the clip is too short (1–2 seconds), as even a human can’t detect the environmental sound present in such a short clip. It also sometimes struggles with very long clips (>15 seconds) that contain many labels as background noise: it predicts only the labels that can be distinctly heard and fails to identify the more nuanced sounds. This behavior is similar to how a human would perform.
  5. Ideally, the model performs best when the clip is of moderate duration, has a single label, and contains little background noise. Increasing the number of labels does not cause a problem unless the background noise is so high that the quieter labels’ sounds cannot be heard. Also, since all of these are environmental sounds, it is tough even for humans to correctly classify very similar sounds. The model is almost comparable to a human in identifying most of the clips.

9. Summary

  1. The business problem we had was to find all the different labels present in a given sound clip. We converted this to a multilabel classification problem (from an ML perspective).
  2. We had two training subsets: curated and noisy. The curated dataset was smaller, but its clips were much clearer, with very little background noise. The noisy dataset was larger but contained a lot of background noise and unreliable labels. Because of this, we only used the curated dataset for modeling.
  3. In the training dataset, we had nearly the same number of clips for each label (balanced dataset), most of the clips had a single label, were less than 5 seconds long, and had nearly the same loudness.
  4. The first model we built was a 1-D CNN model trained on the time series data with some basic preprocessing like trimming leading and trailing silences, reducing the sampling rate, random offsetting, and padding. It performed decently with an LWLRAP of 0.368 on the test set.
  5. The second and final model we built was a 2-D CNN trained on featurized data, i.e., Mel spectrogram features. These were obtained by dividing each clip into small windows, applying the Fourier transform to each window, and combining the results into a spectrogram whose frequency axis uses the Mel scale. This model performed quite well, with an LWLRAP of 0.514 on the test set.
  6. The final model performs well on clips with low to moderate background noise, a single label per clip, average duration, and labels that can be clearly heard and are distinct from each other. It has some difficulty with clips that have high background noise, more than 3 labels per clip, very short or very long durations, very quiet labels (like “squeak”), or similar-sounding labels (“tap”, “sink”, “bathtub”).

10. Future Work

We can utilize the noisy dataset as an initial training set: the model is first trained on the noisy dataset, and those weights are then used as the initial weights for training on the curated dataset. This two-step “warm-start” process can boost performance; however, due to computational constraints, it was not tried here.

Link to my profile:

The complete code can be found at this GitHub link. You can connect with me on LinkedIn. I can also be reached at divyanshjain.19@gmail.com.

Thank you for reading through this blog. I hope you have a great day :)
