Week 5 — Audio Emotion Recognition System (Part III)

Ece Omurtay
Published in bbm406f19
3 min read · Jan 3, 2020

Hello everyone! This is our fifth blog post about our machine learning project. Last week, we gave some details of our project; this week, we share the results of the methods we used.

Let’s start!

animated rnn & lstm from Google

First, we trained our model using an LSTM. We used 9 LSTM layers, each with 50 units and return_sequences=True, Adam as the optimizer, sparse categorical cross-entropy as the loss function, 100 epochs, and softmax as the activation function in the last layer. In the dense layer, we used 9 as the units parameter for our 8 different labels. The average result is 0.35 (with MFCC features as input).
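A minimal Keras sketch of this setup is shown below. The MFCC frame count and coefficient count (NUM_TIMESTEPS, NUM_MFCC) are assumptions, not values from our pipeline, and the ninth LSTM layer drops return_sequences so the sequence collapses to a single vector before the dense layer:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_TIMESTEPS = 216  # assumed number of MFCC frames per clip
NUM_MFCC = 40        # assumed number of MFCC coefficients
NUM_CLASSES = 9      # units of the dense output layer, as in the post

model = models.Sequential([tf.keras.Input(shape=(NUM_TIMESTEPS, NUM_MFCC))])
for _ in range(8):
    model.add(layers.LSTM(50, return_sequences=True))
model.add(layers.LSTM(50))  # 9th LSTM layer; collapses the time dimension
model.add(layers.Dense(NUM_CLASSES, activation="softmax"))
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Training would then be a standard `model.fit(X, y, epochs=100)` call with integer labels, which is what sparse categorical cross-entropy expects.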

Second, we created CNN models with both sound-amplitude images and MFCC features as input. The average result of the CNN model with MFCCs is 0.67. We used 4 Conv1D layers, 4 activation layers with ReLU as the activation function, 1 max-pooling layer, 1 flatten layer, 1 dense layer, and a softmax activation in the last layer. The learning rate is 0.00001 and we trained for 100 epochs.
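This architecture can be sketched as follows. The input length, filter counts, kernel size, and the choice of Adam as the optimizer are assumptions for illustration; only the layer counts, learning rate, and activations come from the description above:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_MFCC = 40    # assumed length of the per-clip MFCC input
NUM_CLASSES = 9

model = models.Sequential([tf.keras.Input(shape=(NUM_MFCC, 1))])
for filters in (64, 64, 128, 128):  # 4 Conv1D layers; filter counts assumed
    model.add(layers.Conv1D(filters, kernel_size=5, padding="same"))
    model.add(layers.Activation("relu"))  # separate activation layer, as described
model.add(layers.MaxPooling1D(pool_size=2))
model.add(layers.Flatten())
model.add(layers.Dense(NUM_CLASSES, activation="softmax"))
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```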

CNN with MFCC features

Then, we created a CNN model with sound-amplitude images as input. The average result of the CNN model with images is 0.35. We used 3 Conv2D layers, 2 max-pooling layers, 1 flatten layer, 1 dense layer, and a softmax activation in the last layer, trained for 50 epochs.
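A sketch of this image model is below. The image resolution, grayscale input, filter counts, kernel size, and Adam optimizer are assumptions; the layer counts match the description above:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

IMG_SIZE = 64    # assumed resolution of the amplitude images
NUM_CLASSES = 9

model = models.Sequential([
    tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 1)),  # assumed grayscale input
    layers.Conv2D(16, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```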

Sample sound-amplitude image

Then, we used Random Forest, Gradient Boosting, and CatBoost as classifiers. To use these machine learning methods, we took the mean values of the “mfcc”, “chroma_stft”, “chroma_cqt”, “chroma_cens”, “rms”, “spectral_contrast”, “spectral_bandwidth”, “tonnetz”, and “zcr” features. Our accuracy results are:

Random Forest: 0.43

Gradient Boosting: 0.34

CatBoost: 0.42
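With one mean-feature vector per clip (e.g. `librosa.feature.mfcc(...).mean(axis=1)` concatenated with the other feature means), these classifiers all share the scikit-learn fit/score interface. The sketch below uses random stand-in features with an assumed dimensionality instead of real extracted ones, and shows Random Forest and Gradient Boosting; CatBoost follows the same pattern via `catboost.CatBoostClassifier`:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Stand-in for the real per-clip vectors of mean MFCC, chroma, RMS,
# spectral-contrast, spectral-bandwidth, tonnetz, and ZCR values.
X = rng.normal(size=(200, 40))    # 200 clips, 40 features (sizes assumed)
y = rng.integers(0, 8, size=200)  # labels 0..7, one per emotion
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

classifiers = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50,
                                                    random_state=42),
}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    print(name, "accuracy:", clf.score(X_te, y_te))
```

The random_state=42 matches the footnote below; on random features the scores are near chance, unlike the real results above.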

0: “Neutral”, 1: “Calm”, 2: “Happy”, 3: “Sad”, 4: “Angry”, 5: “Fearful”, 6: “Disgust”, 7: “Surprised”

* 42 = “the answer to life, the universe and everything”, and also our model’s random state :)

Thank you for reading! See you next week!

