Implement a Speech Emotion Recognition (SER) Model

Njeri Gachago · Published in The Startup · Jan 31, 2021

Data derived from speech has several applications, such as improving service at call centers and personalizing the user experience on speech-enabled devices. Detecting emotion from speech is critical to drawing meaning from this data. For this project, we’ll start with DataFlair’s implementation of the Multi-layer Perceptron (MLP) classifier, an Artificial Neural Network (ANN). The dataset comprises 768 audio clips obtained from RAVDESS, each of an actor portraying an emotion; the project homes in on clips that convey calm, happiness, fear, or disgust. I will focus here on my additions to the initial project. My fully implemented Jupyter notebook is available on my GitHub page.

To kick things off, let’s establish a baseline score using a dummy classifier. A baseline provides a measure against which to evaluate subsequent scores. Since this is a classification task, models are scored on accuracy. A stratified strategy means that our dummy predictions are based on the distribution of classes in the data.
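
A minimal sketch of that baseline, assuming the audio features and emotion labels have already been extracted and split (x_train, x_test, y_train, and y_test are placeholder names from that step, not variables defined here):

```python
from sklearn.dummy import DummyClassifier

# "stratified" draws each prediction from the class distribution seen in training
dummy = DummyClassifier(strategy="stratified", random_state=42)
dummy.fit(x_train, y_train)
print(f"Baseline accuracy: {dummy.score(x_test, y_test):.2%}")
```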

The dummy classifier gives us a baseline accuracy score of 22%. When you implement this project, note that specific scores may vary slightly due to data sampling variations.

The DataFlair version of the model scored 62% in my implementation. While this score is much better than the baseline, let’s see if we can improve it (and therefore our predictions) by tuning the model hyperparameters.
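
For context, a comparable starting model can be sketched as below: one hidden layer of 300 neurons, with the remaining settings as illustrative assumptions rather than DataFlair’s exact values:

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer of 300 neurons, matching the initial architecture;
# the activation, solver, and max_iter values here are assumed
model = MLPClassifier(hidden_layer_sizes=(300,), activation="relu",
                      solver="adam", max_iter=500, random_state=42)
model.fit(x_train, y_train)
print(f"Initial model accuracy: {model.score(x_test, y_test):.2%}")
```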

Intuition and experience are very helpful when setting initial parameters. However, tuning a model is largely an exercise in trial and error. Thankfully, there are tools like scikit-learn’s GridSearchCV to help with this. We will start by tuning the following parameters:

  1. Hidden layer sizes — this tells the model how many hidden layers to place between the input and output layers, as well as how many neurons to include in each layer. It is defined as a tuple, with each value representing one layer and its neuron count. The initial model had one hidden layer with 300 neurons.
  2. Activation — the activation function adds non-linearity to each neuron’s output. ReLU is the default.
  3. Solver — the solver is the optimization algorithm the model uses to minimize the loss function (Adam is the default). As an example, the loss curves plotted in the sketch after this list were obtained from the initial model and the highest-scoring (best) model. The best model reaches lower loss values and converges more smoothly, although it takes more iterations to converge.
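
Plotting the curves is straightforward, since a fitted MLPClassifier records its training loss at each iteration in the loss_curve_ attribute. A minimal sketch (model and best_model are assumed names for the initial and tuned classifiers):

```python
import matplotlib.pyplot as plt

# loss_curve_ holds the training loss at each iteration of the solver;
# "model" and "best_model" are assumed to be already-fitted classifiers
plt.plot(model.loss_curve_, label="initial model")
plt.plot(best_model.loss_curve_, label="best model")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.legend()
plt.show()
```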

When provided a dictionary containing candidate values for each of the above hyperparameters, GridSearchCV runs a cross-validated fit for every combination and identifies the best estimator based on the scoring metric provided.
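
A sketch of what that looks like here; the candidate values below are illustrative assumptions, not the exact grid from my notebook:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Candidate values to try for each hyperparameter (illustrative choices)
param_grid = {
    "hidden_layer_sizes": [(300,), (200, 100), (250, 150)],
    "activation": ["relu", "logistic", "tanh"],
    "solver": ["adam", "sgd"],
}

# Cross-validated search over every combination, scored on accuracy
grid = GridSearchCV(MLPClassifier(max_iter=500, random_state=42),
                    param_grid, scoring="accuracy", cv=5, n_jobs=-1)
grid.fit(x_train, y_train)
```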

Once the grid search is fitted to the training data, we can find the parameters that resulted in the highest score.
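
With GridSearchCV, these live in the fitted object’s attributes:

```python
print(grid.best_params_)                            # highest-scoring combination
print(f"Best CV accuracy: {grid.best_score_:.2%}")  # its cross-validated score
best_model = grid.best_estimator_                   # refit on the full training set
```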

The first attempt at tuning points towards a logistic activation function and two hidden layers. The accuracy score went up to 80% (from 62%). Let’s investigate whether adding neurons to each layer would yield another bump in accuracy.

Using a logistic activation function and two hidden layers of 250 and 150 neurons brings us to 82% accuracy, a significant improvement from our initial model at 62%!
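
For reference, the winning configuration looks roughly like this (the remaining settings are assumptions carried over from the earlier sketch):

```python
from sklearn.neural_network import MLPClassifier

# Two hidden layers of 250 and 150 neurons with a logistic activation,
# per the tuning results above; the other settings are assumed, not tuned
best_model = MLPClassifier(hidden_layer_sizes=(250, 150), activation="logistic",
                           solver="adam", max_iter=500, random_state=42)
best_model.fit(x_train, y_train)
print(f"Tuned accuracy: {best_model.score(x_test, y_test):.2%}")
```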

Finally, I recommend using this methodology to practice tuning other hyperparameters such as alpha, batch size, and learning rate. In my experimentation, they did not yield an increase in accuracy but figuring out how to tune them will help build that intuition mentioned earlier.
