Interpretability of Speech Emotion Recognition

Srikanth Konjeti
The Observe.AI Tech Blog
5 min read · Sep 19, 2022

This blog post refers to our technical paper accepted at Interspeech 2022, Incheon, South Korea.

Problem statement:

Interpretability of Speech Emotion Recognition systems is important for understanding the reasons behind the emotions identified in a speech conversation. It is equally important to identify and understand how emotions vary within a conversation. In our work, we address the interpretability of our proposed Speech Emotion Recognition system and identify varying emotions within a speech utterance.

Proposed approach:

The speech emotion of an individual is characterized both by what the person says and by how they say it. While the transcribed text represents what a person says (represented by text embeddings), how the person speaks is determined by spectral variations in the speech signal (represented by speech embeddings).

We propose SSPM-ATT (self-supervised pre-trained model with attention), an utterance-level emotion prediction model that uses word-level text and speech embeddings derived from the pre-trained BERT and wav2vec models.
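
To make the architecture concrete, here is a minimal sketch of attention pooling over word-level hidden vectors for an utterance-level prediction. A standard LSTM stands in for the paper's time-gated LSTM (TGLSTM), and the layer sizes and single-utterance (unbatched) interface are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Soft attention over word-level hidden vectors -> one utterance vector."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, hidden):  # hidden: (T, hidden_dim)
        weights = torch.softmax(self.scorer(hidden).squeeze(-1), dim=0)  # (T,)
        pooled = (weights.unsqueeze(-1) * hidden).sum(dim=0)             # (hidden_dim,)
        return pooled, weights

class SSPMAttSketch(nn.Module):
    """Utterance-level classifier over word-level speech+text embeddings."""
    def __init__(self, embed_dim, hidden_dim, num_classes=4):
        super().__init__()
        # A plain LSTM stands in here for the paper's time-gated LSTM (TGLSTM).
        self.rnn = nn.LSTM(embed_dim, hidden_dim)
        self.pool = AttentionPooling(hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, word_embeddings):  # (T, embed_dim): one row per word
        hidden, _ = self.rnn(word_embeddings.unsqueeze(1))  # (T, 1, hidden_dim)
        hidden = hidden.squeeze(1)
        pooled, attn_weights = self.pool(hidden)
        return self.classifier(pooled), attn_weights, hidden
```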

A WORD-SSPM-ATT model is derived as a downstream word-level emotion prediction model, using the frozen time-gated LSTM (TGLSTM) hidden unit vectors of each word from SSPM-ATT as input. The target class for each word-level prediction is the utterance-level ground-truth class, and a simple feed-forward network is used to train the word-level model.
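
A minimal sketch of the corresponding word-level head; the paper only specifies a simple feed-forward network, so the two-layer shape and hidden sizes here are assumptions.

```python
import torch.nn as nn

class WordEmotionHead(nn.Module):
    """Word-level classifier over frozen TGLSTM hidden vectors."""
    def __init__(self, hidden_dim, num_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, num_classes),
        )

    def forward(self, word_hidden):   # (T, hidden_dim), detached from SSPM-ATT
        return self.net(word_hidden)  # (T, num_classes): one logit row per word

# Each word inherits the utterance's single ground-truth label as its target:
# loss = nn.functional.cross_entropy(head(word_hidden), utt_label.expand(T))
```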

In the figure above, the right side shows the SSPM-ATT model; the wav2vec and word embeddings are the word-level speech and text embeddings, derived by mean pooling of frame and token embeddings respectively. The left side of the figure shows the WORD-SSPM-ATT model.
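
The word-level embeddings themselves come from mean pooling: wav2vec frame embeddings are averaged over each word's time span, and BERT token embeddings over each word's sub-tokens. A sketch, assuming per-word span indices are available (e.g. from a forced aligner; the (start, end) format is hypothetical):

```python
import torch

def word_level_embedding(unit_embeddings, word_spans):
    """Mean-pool frame (or token) embeddings over each word's span.

    unit_embeddings: (num_units, dim) tensor of wav2vec frame embeddings
                     (or BERT token embeddings).
    word_spans:      list of (start, end) indices per word, end exclusive.
    Returns a (num_words, dim) tensor of word-level embeddings.
    """
    return torch.stack([unit_embeddings[s:e].mean(dim=0) for s, e in word_spans])
```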

We use the word-level emotion predictions from the WORD-SSPM-ATT model to derive interpretability measures and sub-utterance level emotion classification insights.

Interpretability measures for emotion insights

The word-level attention weights across the TGLSTM hidden unit vectors of an utterance help interpret the emotion predictions; we quantify this using our proposed interpretability measures over the attention weights.

We denote the interpretability measures as a Human Level Indicator Matrix, as they can be used by humans to gain insight into emotion predictions. Statistics of the attention weights and word-level emotion predictions are used to derive the different interpretability measures (a sketch computing them follows the list):

  • Mean and standard deviation of attention weights
  • Class conditioned sum and entropy of attention weights for each class, where attention weights correspond to words predicted as a class using WORD-SSPM-ATT
  • Class conditioned words of importance: the list of words contributing to each emotion class prediction

  • Class distribution: Count of number of words corresponding to each predicted class
  • Class entropy: Entropy of the normalised class distribution over all classes
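
Here is a sketch of how these measures could be computed from an utterance's attention weights and word-level predictions. The dictionary keys and the use of natural-log entropy are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def interpretability_measures(attn, word_preds, words, num_classes=4):
    """Per-utterance entries of the Human Level Indicator Matrix (a sketch).

    attn:       (T,) numpy array of attention weights over the utterance's words.
    word_preds: (T,) numpy array of predicted class indices (from WORD-SSPM-ATT).
    words:      list of the T word strings in the utterance.
    """
    eps = 1e-12
    measures = {"attn_mean": attn.mean(), "attn_std": attn.std()}

    for c in range(num_classes):
        mask = word_preds == c
        w = attn[mask]
        # Class-conditioned sum and entropy of weights of words predicted as c.
        measures[f"class_{c}_attn_sum"] = w.sum()
        p = w / (w.sum() + eps)
        measures[f"class_{c}_attn_entropy"] = -(p * np.log(p + eps)).sum()
        # Class-conditioned words of importance.
        measures[f"class_{c}_words"] = [words[i] for i in np.flatnonzero(mask)]

    # Class distribution and the entropy of its normalised form.
    counts = np.bincount(word_preds, minlength=num_classes)
    dist = counts / counts.sum()
    measures["class_distribution"] = counts
    measures["class_entropy"] = -(dist * np.log(dist + eps)).sum()
    return measures
```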

These interpretability measures for each utterance constitute the Human Level Indicator Matrix. It helps us understand and interpret the distribution of emotion predictions across words, which words are important for a particular predicted emotion class, and the weight each word carries for the predicted classes.

Sub-utterance level emotion prediction

Any subsequence of words within an utterance is a sub-utterance. Consecutive word-level predictions are combined to predict emotions at the sub-utterance level, such that all words within a sub-utterance share the same predicted emotion.
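
A minimal sketch of this merging step: runs of identical word-level predictions collapse into sub-utterance spans.

```python
from itertools import groupby

def sub_utterances(word_preds):
    """Merge runs of identical word-level predictions into sub-utterance spans.

    word_preds: per-word emotion labels in utterance order.
    Returns (label, start, end) tuples with end exclusive.
    """
    spans, start = [], 0
    for label, run in groupby(word_preds):
        length = sum(1 for _ in run)
        spans.append((label, start, start + length))
        start += length
    return spans

# sub_utterances(["neu", "neu", "ang", "ang", "ang", "neu"])
# -> [("neu", 0, 2), ("ang", 2, 5), ("neu", 5, 6)]
```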

We have observed that the word-level hidden unit vectors of the TGLSTM in the SSPM-ATT model capture the nuances of emotions at the word level.

The figure above shows the word-level emotion predictions and the corresponding attention weights, as well as how consecutive word-level emotion predictions map to sub-utterance level emotion variability.

Even though the word-level predictions are learnt using a single target emotion label for the whole utterance, the word-level hidden unit vectors of the TGLSTM learn the emotion variability. The hypothesis supporting this is that hidden unit vectors for a word with similar contexts are similar even across utterances with different class labels, as the context is captured by the TGLSTM model.

Dataset

The database used for our experiments is the IEMOCAP database. The four classes are neutral (neu), happy (hap), angry (ang) and sad (sad). The text transcript and audio signal for each utterance are available in the database for extracting text and speech embeddings, and utterance-level labels are available as the ground truth. We use the IEMOCAP database both for analysis with respect to the various interpretability measures and for reporting emotion recognition results.

Results

Utterance level emotion recognition

The SSPM-ATT model, trained with cross-validation across the five sessions of the IEMOCAP database, achieved an average accuracy of 73.04%, which is comparable to other state-of-the-art emotion recognition approaches.

Sub-utterance level emotion recognition

Since sub-utterance level ground-truth labels are not available, we randomly selected 20 utterances with correct utterance-level predictions and with emotion variability within the utterance. Two human annotators annotated the sampled utterances, yielding 42 sub-utterances with emotion labels and segmentation timestamps. We achieved a percentage match of 73.53% (predictions vs. ground-truth human annotations) and a Cohen's Kappa score of 0.55 for sub-utterance emotion prediction, while 60% and 0.66 are achieved for the segmentation timestamps.
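
Both agreement numbers can be computed with standard tooling; here is a sketch using scikit-learn's cohen_kappa_score (the flat-list input format is an assumption):

```python
from sklearn.metrics import cohen_kappa_score

def agreement(predicted, annotated):
    """Percentage match and Cohen's kappa between predicted and human labels."""
    matches = sum(p == a for p, a in zip(predicted, annotated))
    pct_match = 100.0 * matches / len(annotated)
    kappa = cohen_kappa_score(annotated, predicted)
    return pct_match, kappa
```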

Insights using Interpretability measures

We have observed that a higher class entropy indicates that different emotion classes are blended within the same utterance, and vice versa. Sampling utterances with class entropy between 0.0 and 0.2 gives an utterance-level accuracy of 89.22%, while the same drops to 42.70% for utterances with class entropy between 1.0 and 1.2. We conclude that confidence in utterance-level emotion predictions is higher for utterances with lower class entropy.

The class conditioned words of importance and the corresponding attention weights help us identify the important words for the predicted sub-utterance level emotions.

Ablation studies

Zeroing out the attention weights above their mean for an utterance reduces accuracy by 2.44%, compared to a drop of only 0.21% when zeroing out the weights below the mean.

This shows that the attention weights above the mean are more important for emotion recognition than those below it.
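
A sketch of this ablation: zero out the weights on one side of their mean, then renormalise. Whether the original experiment renormalises is not stated, so that step is our assumption.

```python
import numpy as np

def mask_attention(attn, zero_above_mean=True):
    """Zero out attention weights on one side of their mean, then renormalise.

    zero_above_mean=True  -> drop the high-attention words (hurts accuracy more).
    zero_above_mean=False -> drop the low-attention words.
    """
    keep = attn <= attn.mean() if zero_above_mean else attn > attn.mean()
    masked = np.where(keep, attn, 0.0)
    return masked / masked.sum()  # renormalisation step is our assumption
```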

Conclusion

  • We used self-supervised speech and text embeddings to predict utterance-level emotions as a downstream task
  • We proposed the SSPM-ATT and WORD-SSPM-ATT models to obtain speech emotions at the utterance, word and sub-utterance levels, and validated their performance
  • Results show that utterance-level ground truth can be used to obtain sub-utterance level predictions via the word-level hidden unit vectors
  • Interpretability measures can be used to derive emotion insights such as important words, the spread of different emotions within an utterance, and confidence in utterance-level predictions

Learn more about how we’re changing conversation intelligence for contact centers around the world at Observe.AI.
