Effects of spectrogram pre-processing for audio classification

Lahiru Nuwan Wijayasingha
Using CNN to classify audio
Jul 2, 2019

This article describes pre-processing techniques that can affect the performance of machine learning models used to classify emotions from voice.

Emotion classification is important for many real-world applications. It is the problem of observing a human being and trying to understand whether the person is happy, sad, angry, etc. For example, a mobile robot that is talking to you may try to understand how you feel and then adapt its behavior accordingly: it could try to cheer you up if you are sad.

A common approach to classifying emotions is to listen to someone's voice and use a machine learning model to classify it. It is very common to convert the voice clips to spectrograms and then classify them with a CNN (Convolutional Neural Network); (Hershey, 2017) discusses using CNNs this way. Although CNNs are traditionally used to classify images, they have recently seen increasing use in audio classification, and they perform quite well there too. A spectrogram can be represented as a matrix, similar to a grayscale image, so we can apply the same methods we use to classify images to spectrograms.

We can use various pre-processing techniques on the inputs to a neural network, which can potentially increase the performance of a model. The intuition is that if we convert (pre-process) the input into a form that is easy for the NN to understand, it becomes easier for the NN to assign the input to the appropriate class. We will discuss three such methods for pre-processing a spectrogram and their effect on a machine learning model.

Method 1: min-max scaling

scaled_data = [data - min(data)] / [max(data) - min(data)]

This equation is applied element-wise to the spectrogram.
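As a sketch in NumPy, assuming the spectrogram is a 2-D array of amplitudes (the array shape and value range here are illustrative, not from the article):

```python
import numpy as np

def min_max_scale(spec):
    """Rescale a spectrogram to the [0, 1] range element-wise."""
    s_min, s_max = spec.min(), spec.max()
    return (spec - s_min) / (s_max - s_min)

rng = np.random.default_rng(0)
spec = rng.random((48, 48)) * 80.0   # stand-in for a real spectrogram
scaled = min_max_scale(spec)
print(scaled.min(), scaled.max())    # smallest value maps to 0.0, largest to 1.0
```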

[Figures: min-max-scaled spectrogram and its histogram]

Method 2: Z-score scaling

scaled_data = [data - mean(data)] / std(data)

This equation is applied element-wise to the spectrogram.
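The same idea in NumPy, again with an illustrative random array standing in for a real spectrogram:

```python
import numpy as np

def z_score_scale(spec):
    """Standardize a spectrogram to zero mean and unit standard deviation."""
    return (spec - spec.mean()) / spec.std()

rng = np.random.default_rng(0)
spec = rng.random((48, 48)) * 80.0
scaled = z_score_scale(spec)
print(scaled.mean(), scaled.std())   # approximately 0 and 1
```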

[Figures: Z-score-scaled spectrogram and its histogram]

Method 3: log scaling

scaled_data = log(data + 1e-10)

This equation is applied element-wise to the spectrogram.

Here a small value (1e-10) is added to the data to avoid numerical issues with the logarithm (log(x) is undefined for x ≤ 0). The result varies slightly with this value, so you should experiment with it if you train your own models.
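In NumPy this looks like the following; the input values are non-negative here (as for a power spectrogram), so the epsilon only guards against exact zeros:

```python
import numpy as np

EPS = 1e-10  # small offset to avoid log(0); worth tuning for your own models

def log_scale(spec):
    """Apply an element-wise natural log; EPS keeps silent bins finite."""
    return np.log(spec + EPS)

rng = np.random.default_rng(0)
spec = rng.random((48, 48)) * 80.0   # non-negative stand-in values
scaled = log_scale(spec)
print(np.isfinite(scaled).all())     # no NaN or inf values
```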

[Figures: log-scaled spectrogram and its histogram]

After each scaling, the data is rescaled to [0, 1] (except for min-max scaling, which already produces that range) so that the input to the CNN is similar across the different scaling methods.
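Putting the two steps together for the log-scaling case, a minimal sketch (with an illustrative random array) is:

```python
import numpy as np

def rescale_01(spec):
    """Final min-max rescale to [0, 1] after Z-score or log scaling."""
    return (spec - spec.min()) / (spec.max() - spec.min())

rng = np.random.default_rng(0)
spec = rng.random((48, 48)) * 80.0
log_spec = np.log(spec + 1e-10)   # log scaling (Method 3)
cnn_input = rescale_01(log_spec)  # bring it back to [0, 1] for the CNN
print(cnn_input.min(), cnn_input.max())
```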

Observing the plots above, we can see that log scaling emphasizes more of the features in the spectrogram. A related plot is the histogram of the spectrogram, which shows how often various amplitudes occur. Log scaling gives the most Gaussian-like and spread-out histogram.

The next step is to observe the effect of these normalization methods on the performance of machine learning models.

Details of the methods used

CNN architecture: mini-Xception

Batch size = 5

Dataset = SAVEE (as found on http://kahlan.eps.surrey.ac.uk/savee/)

Emotions in SAVEE = [angry, disgust, fear, happy, neutral, sad, surprised]

Spectrogram window length = 1 s

Input to the CNN = 48 x 48 ‘images’ of spectrograms
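A sketch of how a 1 s audio window can become a 48 x 48 spectrogram "image". The sample rate, FFT window length, and the use of `scipy.ndimage.zoom` for resizing are assumptions for illustration; the article does not specify its exact pipeline:

```python
import numpy as np
from scipy import signal
from scipy.ndimage import zoom

FS = 16_000  # assumed sample rate

def audio_to_cnn_input(clip_1s):
    """Turn a 1 s audio window into a 48 x 48 spectrogram 'image'."""
    _, _, spec = signal.spectrogram(clip_1s, fs=FS, nperseg=256)
    # Resize the (freq x time) matrix to the CNN input size.
    return zoom(spec, (48 / spec.shape[0], 48 / spec.shape[1]))

clip = np.sin(2 * np.pi * 440 * np.arange(FS) / FS)  # 1 s synthetic 440 Hz tone
img = audio_to_cnn_input(clip)
print(img.shape)  # (48, 48)
```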

The SAVEE dataset contains emotional video clips acted by 4 native English speakers expressing 7 emotions.

The mini-Xception CNN was introduced by (Arriaga, 2017) to classify emotions from facial images. Here we use the same CNN to classify spectrograms. There is no particular reason for this choice other than that it is a fast CNN that performs quite well on facial images.

Choosing a good batch size for training

Batch size is an important hyperparameter when training a NN. To choose a good batch size, we need to run some experiments, varying the batch size and observing the outcome. Here we use the min-max scaled spectrograms, train the CNN with different batch sizes, and observe the accuracy on the validation set.
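The sweep itself is just a loop. In this sketch, `train_and_validate` is a hypothetical stand-in for training mini-Xception on the min-max scaled spectrograms and returning validation accuracy; the candidate batch sizes and the accuracy numbers are placeholders for illustration, not real results:

```python
def train_and_validate(batch_size):
    """Hypothetical stand-in: would train the CNN with this batch size
    and return validation accuracy. Values below are placeholders only."""
    placeholder_accuracy = {1: 0.20, 5: 0.27, 16: 0.24, 32: 0.22}
    return placeholder_accuracy[batch_size]

candidates = [1, 5, 16, 32]                         # assumed sweep values
results = {bs: train_and_validate(bs) for bs in candidates}
best = max(results, key=results.get)                # highest validation accuracy
print(best)
```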

From these experiments, a batch size of 5 was found to perform well.

Training the CNN

Next we train 3 different mini-Xception CNNs, one for each normalization technique, and record the accuracy over training epochs:

1. With min-max scaled spectrograms

2. With Z-score scaled spectrograms

3. With log scaled spectrograms

Conclusion

From the three results above, we can see that the CNN trained with log-scaled spectrograms reaches an accuracy of around 40%, while the other two methods settle at around 25%. So, at least for this example, log scaling is the better method for normalizing your spectrograms.

You can find the code used to produce these results, along with some outputs, in my GitHub:

Follow me on Twitter https://twitter.com/lahirunuwan

Linkedin : www.linkedin.com/in/lahiru-nuwan-59568a88

References

Arriaga, O., Valdenegro-Toro, M., & Plöger, P. (2017). Real-time convolutional neural networks for emotion and gender classification. arXiv preprint arXiv:1710.07557.

Hershey, S., Chaudhuri, S., Ellis, D. P., Gemmeke, J. F., Jansen, A., Moore, R. C., … & Slaney, M. (2017, March). CNN architectures for large-scale audio classification. In 2017 ieee international conference on acoustics, speech and signal processing (icassp) (pp. 131–135). IEEE.
