Week 4 — Audio Emotion Recognition System (Part II)

Ece Omurtay
Published in bbm406f19
Dec 27, 2019

Hi everyone!

This is our fourth blog post about our machine learning project. This week, we want to share some more details of our project.

Let’s start!

A brief recap

The main goal of this project is to recognize emotions from audio files. The data consists of 1440 audio samples recorded by 24 different actors, half of them male and half female, each contributing 60 samples. A different emotion is emphasized in each sample, and there are 8 different emotions in total.

What is LSTM?

LSTM stands for Long Short-Term Memory. LSTM networks are a modified version of recurrent neural networks (RNNs) and give good results especially in speech recognition. They are trained with backpropagation and were developed to deal with the vanishing gradient* problem: they preserve the gradient signal that flows back through the layers during backpropagation. An LSTM cell consists of several structures (the yellow boxes in the figure) that follow each other; each yellow box is a learned neural network layer. The gate layers produce outputs between 0 and 1, where 0 means “let nothing through” and 1 means “let everything through”.

an LSTM network

*Vanishing gradient: in backpropagation, as the weights are updated, the gradient can shrink toward zero as it is propagated back through the layers, which makes it hard to learn the correct values.
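As a rough sketch of how an LSTM could be applied to our task, the snippet below builds a small Keras classifier. The feature shape (40 MFCCs over 173 time steps), the layer sizes and the use of Keras are illustrative assumptions here, not our final model:

```python
# Minimal sketch, NOT the final model: assumes each clip is already
# converted to a (time_steps, n_features) MFCC matrix.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

NUM_EMOTIONS = 8                      # 8 emotion classes in the dataset
TIME_STEPS, N_FEATURES = 173, 40      # assumed shape of one sample

model = Sequential([
    # the LSTM reads the feature sequence step by step, keeping a memory cell
    LSTM(128, input_shape=(TIME_STEPS, N_FEATURES)),
    Dropout(0.3),
    # softmax layer outputs one probability per emotion class
    Dense(NUM_EMOTIONS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# dummy batch just to show the expected input/output shapes
x = np.random.randn(4, TIME_STEPS, N_FEATURES).astype("float32")
print(model.predict(x).shape)         # (4, 8): one probability vector per clip
```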

What is an RNN?

In an RNN, the output is produced by a combination of the current input and past inputs. Unlike a feedforward network, it uses its output as part of the next step’s input. It stores the past input information that is meaningful for the output: the output of the hidden layer is written to units called context units. A plain feedforward network is not enough for this kind of storing. The following equation is used for the calculation:

h_t = f(W·x_t + U·h_(t-1))

→ where h_t is the result of the hidden layer at time step t, W·x_t is the dot product of the weight matrix W and the current input x_t, and U·h_(t-1) is the dot product of the weight matrix U and h_(t-1), which is stored in the context unit. The sum is passed through an activation function f such as sigmoid or tanh.

a sample RNN
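To make the recurrence above concrete, here is a tiny NumPy sketch of the hidden-state update. The dimensions and the random weights are arbitrary illustrative choices; it only shows how h_t is computed from x_t and the stored h_(t-1):

```python
# Minimal sketch of the recurrence h_t = tanh(W·x_t + U·h_(t-1))
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5                          # arbitrary illustrative sizes

W = rng.standard_normal((n_hidden, n_in))      # input-to-hidden weights
U = rng.standard_normal((n_hidden, n_hidden))  # hidden-to-hidden weights
h = np.zeros(n_hidden)                         # context unit, h_0

sequence = rng.standard_normal((4, n_in))      # 4 time steps of input
for x_t in sequence:
    # new hidden state overwrites the context unit at every step
    h = np.tanh(W @ x_t + U @ h)
    print(h)
```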

Next week, we plan to use this method, and we think it will help us get better results.

See you next week!
