Speech Emotion Recognition using Deep Learning

Toshita
4 min read · Dec 8, 2023

Speech emotion recognition is the task of processing audio containing a human voice to recognize the emotional state of the speaker. Multimodal systems that combine text and audio processing are commonly used. Conventional methods rely on deep learning models such as convolutional neural networks, recurrent neural networks, and audio spectrogram transformers.

About the dataset

The CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset) is a comprehensive and widely used dataset in the field of affective computing, specifically designed for research on emotion recognition from audio and visual cues. Created by researchers at the University of Pennsylvania, the dataset is notable for its emphasis on multimodal emotional expression, incorporating both facial expressions and vocalizations.

The dataset files follow a specific naming format. Each file name consists of four blocks separated by underscores. The first block is the actor ID, the second is the sentence ID, which indicates which sentence the actor has spoken, the third is the emotion, and the fourth is the emotion level/intensity. For example, in the file name 1001_DFA_ANG_XX.wav, 1001 denotes the actor ID, DFA is the sentence spoken ("Don't forget a jacket"), ANG represents the angry emotion, and XX means unspecified intensity.
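
As a quick illustration of this naming convention, the four blocks can be pulled out of a file name with a few lines of Python (the helper function name here is ours, not part of the dataset):

```python
import os

def parse_crema_filename(path):
    """Split a CREMA-D file name such as 1001_DFA_ANG_XX.wav into its four blocks."""
    actor_id, sentence_id, emotion, intensity = os.path.basename(path).replace(".wav", "").split("_")
    return {"actor": actor_id, "sentence": sentence_id, "emotion": emotion, "intensity": intensity}

print(parse_crema_filename("1001_DFA_ANG_XX.wav"))
# -> {'actor': '1001', 'sentence': 'DFA', 'emotion': 'ANG', 'intensity': 'XX'}
```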

Spectrograms

Although there are conventional approaches to classify audio using text transcription, emotion recognition needs to go beyond these techniques. Audio features such as Mel-frequency cepstral coefficients (MFCCs) and spectrograms capture the emotional content directly from the acoustic signal, providing a richer source of information than transcribed text alone. Emotions are often expressed through non-verbal cues in speech, such as changes in pitch, tempo, and rhythm. These cues are inherently present in audio features but may be lost when relying solely on transcribed text. Also, audio features are more cross-linguistically applicable since they are based on universal acoustic patterns of speech. Transcribed text may lose nuances in emotional expression that are specific to certain languages or cultural contexts.

To take full advantage of these differences and identify emotional cues in spoken data, we first process the raw audio files and convert them to spectrograms and MFCC features.
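
As a rough sketch of this preprocessing step, the librosa library can extract both MFCCs and a log-mel spectrogram from a raw audio file; the sampling rate, number of coefficients, and file name below are illustrative assumptions rather than the exact settings used in the project.

```python
import librosa
import numpy as np

def extract_features(path, sr=16000, n_mfcc=40):
    """Load an audio file and return its MFCCs and a log-mel spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    log_mel = librosa.power_to_db(mel, ref=np.max)           # spectrogram in decibels
    return mfcc, log_mel

mfcc, log_mel = extract_features("1001_DFA_ANG_XX.wav")
```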

Let’s start modeling!

We used two models here:

1. Convolutional Neural Network (CNN)

The input dataset was split into train and test with a split ratio of 90:10.

The labels were encoded such that every emotion had a number associated with it and the input MFCC data was scaled.
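
A minimal sketch of this preparation step is shown below, assuming the MFCC features and emotion labels have already been collected into arrays; the placeholder data, scikit-learn utilities, and random seed are illustrative choices rather than the exact ones used in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from tensorflow.keras.utils import to_categorical

# Placeholder data: in the project, X holds MFCC feature vectors and y the emotion codes from the file names.
X = np.random.rand(600, 40)
y = np.random.choice(["ANG", "DIS", "FEA", "HAP", "NEU", "SAD"], size=600)

# 90:10 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, stratify=y, random_state=42)

# Encode each emotion as an integer, then one-hot encode the labels
encoder = LabelEncoder()
y_train_oh = to_categorical(encoder.fit_transform(y_train), num_classes=len(np.unique(y)))
y_test_oh = to_categorical(encoder.transform(y_test), num_classes=len(np.unique(y)))

# Scale the MFCC inputs
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```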

The model, with four 1-D convolution layers, a kernel size of 5, and a varying number of filters, was then trained for 200 epochs using the categorical cross-entropy loss.

The activation used was ReLU and a dropout of 0.1 was added between the dense layers of the fully connected part of the network.

The output layer has 6 units, one for each of the 6 emotions to be predicted.

The final layer uses the softmax activation, and the class with the highest probability is taken as the predicted label for each input audio clip.
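
A possible Keras sketch of this architecture is shown below. The exact filter counts, pooling layers, optimizer, and input length are not specified above, so those are assumptions made for illustration.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout

def build_cnn(input_len, n_classes=6):
    """1-D CNN with four convolution layers (kernel size 5) and a 6-way softmax output."""
    model = Sequential([
        Conv1D(256, kernel_size=5, padding="same", activation="relu", input_shape=(input_len, 1)),
        Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        MaxPooling1D(pool_size=2),
        Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        Conv1D(32, kernel_size=5, padding="same", activation="relu"),
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(64, activation="relu"),
        Dropout(0.1),                              # dropout between the dense layers
        Dense(n_classes, activation="softmax"),    # one output unit per emotion
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

model = build_cnn(input_len=40)
# model.fit(X_train[..., None], y_train_oh, validation_data=(X_test[..., None], y_test_oh), epochs=200)
```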

2. Long Short-Term Memory (LSTM) Network

The first step in implementing the LSTM model is the creation of a dictionary mapping emotion labels to numerical values.

The librosa library is then used to extract the MFCC features that serve as the numerical inputs to the model.

The dataset is then split, using 12% of the data for training and the remaining 88% for testing and validation, and the categorical labels are converted to one-hot encoded format.
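
A minimal sketch of the label dictionary and the one-hot conversion might look like the following; the particular emotion-to-index ordering is an assumption, and the MFCC extraction itself mirrors the librosa snippet shown earlier.

```python
from tensorflow.keras.utils import to_categorical

# Dictionary mapping each CREMA-D emotion code to an integer index (ordering assumed)
emotion_to_idx = {"ANG": 0, "DIS": 1, "FEA": 2, "HAP": 3, "NEU": 4, "SAD": 5}

labels = ["ANG", "SAD", "HAP"]   # emotion codes parsed from the file names
y = to_categorical([emotion_to_idx[e] for e in labels], num_classes=len(emotion_to_idx))
print(y.shape)                   # (3, 6): one-hot rows, one column per emotion
```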

The model is a sequential LSTM with three layers and a softmax activation on the output layer.

The model is compiled with the categorical cross-entropy loss and the RMSProp optimizer, and trained for 200 epochs with a batch size of 6.
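
A hedged Keras sketch of this setup is shown below, reading "three layers" as two LSTM layers followed by a softmax-activated Dense output; the layer sizes and input shape are assumptions, while the loss, optimizer, epoch count, and batch size follow the description above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def build_lstm(timesteps, n_features, n_classes=6):
    """Sequential LSTM classifier with a softmax output over the six emotions."""
    model = Sequential([
        LSTM(128, return_sequences=True, input_shape=(timesteps, n_features)),
        LSTM(64),
        Dense(n_classes, activation="softmax"),   # softmax output layer
    ])
    model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
    return model

model = build_lstm(timesteps=100, n_features=40)
# model.fit(X_train, y_train_oh, epochs=200, batch_size=6, validation_data=(X_val, y_val_oh))
```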

Results

For the CNN model, the loss progression during training is shown below:

For the LSTM model, the loss progression is shown below:

The above is a confusion matrix generated for the LSTM model. It shows which emotions are confused with one another and the number of correct predictions on the CREMA-D dataset.
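
For reference, such a confusion matrix can be generated with scikit-learn from the true and predicted labels of the held-out split; the toy labels below are placeholders, not project results.

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Toy labels standing in for the held-out CREMA-D predictions
y_true = ["ANG", "SAD", "HAP", "SAD", "NEU"]
y_pred = ["ANG", "SAD", "NEU", "SAD", "NEU"]

labels = ["ANG", "DIS", "FEA", "HAP", "NEU", "SAD"]
cm = confusion_matrix(y_true, y_pred, labels=labels)   # rows: true, columns: predicted
ConfusionMatrixDisplay(cm, display_labels=labels).plot()
plt.show()
```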

The 1-D CNN performs best on the angry emotion, while the LSTM model performs best on the sad and angry emotions.

You can follow and learn more about this project here:

Acknowledgment

This project was completed under the guidance of Professor Abhijit Mishra, School of Information, The University of Texas at Austin.

Members of the team:

Daniela Lizarazo — Student, The University of Texas at Austin.

Kritika Patade — Student, The University of Texas at Austin.

Toshita Sharma — Student, The University of Texas at Austin.
