Unsupervised Dimensionality Reduction in Speech Emotion Recognition

Theodore Giannakopoulos
Behavioral Signals - Emotion AI
4 min read · Oct 21, 2019

Speech Emotion Recognition (SER) focuses on automatically analyzing speech signals to extract the underlying emotions of the speakers. This is more a matter of terminology, but SER is not to be confused with Sentiment Recognition, which is text-based analytics applied either on text documents or on the output of ASR or STT (Automatic Speech Recognition, Speech To Text) Systems. SER, on the other hand, makes use of the audio information itself, through analyzing the low-level audio features that are directly related to the spectral and prosodic characteristics of a human’s voice.

Simplified dimensionality reduction example (from 2D to 1D). The “classifier” in the initial space (left) is a line, while in the reduced space (right) it is a single point. This means a much simpler model, at the cost of a small performance drop (two points are misclassified in the 1-D space).

SER is typically a supervised classification task, i.e. each “sample” (speech utterance or segment) is represented by a feature vector and a respective class label. The higher the number of features (dimensionality), the more difficult it is for the classifier to “understand” the feature distributions. This “curse of dimensionality” can cause poor classification performance, especially when combined with insufficient amounts of training data. A solution to that is Dimensionality Reduction (DR), which is the process of extracting a low-dimensional representation from a given high-dimensional feature space. DR is useful for several reasons:

  1. Mitigates the curse of dimensionality, especially when few data are available for training and the initial feature space is high-dimensional, leading to better classification performance.
  2. Compresses the trained models (smaller trained models, therefore less memory at testing time). This is because a model trained on a smaller feature space has fewer parameters.
  3. A “by-product” of dimensionality reduction is its use as a visualization and interpretation tool: 2D and 3D reduced feature representations can be very useful tools for interpreting feature distributions among classes.
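As a minimal sketch of what DR looks like in practice, the following applies PCA with scikit-learn to a synthetic feature matrix (the matrix size and feature values here are made up for illustration; the actual audio features used in [1] differ):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 utterances, 136 audio features each (synthetic stand-in data)
X = rng.normal(size=(500, 136))

# Reduce to a 2-D representation, e.g. for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)  # (500, 136) -> (500, 2)
```

The same fitted `pca` object can then project unseen test utterances into the reduced space via `pca.transform`.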

The paper “Unsupervised low-rank representations for speech emotion recognition” [1], presented at Interspeech 2019, Graz, Austria, demonstrated the ability of widely adopted, unsupervised dimensionality reduction techniques to either improve the performance of typical SER classifiers or visualize emotional content in a way that can be used to interpret limitations of the datasets. In particular, widely adopted SER datasets such as IEMOCAP and EmoDB were used to evaluate classification performance, while a real-world proprietary dataset internally annotated at Behavioral Signals was used to evaluate the ability of the DR methods to visualize cross-domain emotional content. The research work presented in [1] tried to answer two questions:

Does unsupervised DR improve SER performance?

According to [1], the lesson learnt from the classification experiments on the reduced feature spaces is that typical unsupervised DR methods do not significantly improve classification performance. On the contrary, for some more advanced classifiers, such as SVM with an RBF kernel, there is a 2% performance drop. However, this is achieved with only about 1.5% of the initial dimensions, resulting in a classifier that is 150 times smaller and 80 times faster (sklearn implementation). In other words, the compression achieved through DR leads to similar classification performance in SER, while significantly reducing computational cost.
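The trade-off above can be sketched with a scikit-learn pipeline that chains PCA with an RBF-kernel SVM. This uses a synthetic classification dataset rather than the SER data from [1], and the dimension counts are illustrative, not the paper's:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for an SER feature matrix: 600 samples, 200 features
X, y = make_classification(n_samples=600, n_features=200, n_informative=10,
                           n_classes=3, random_state=0)

# Classifier on the full feature space vs. on ~1.5% of the dimensions
full = SVC(kernel="rbf")
reduced = make_pipeline(PCA(n_components=3), SVC(kernel="rbf"))

acc_full = cross_val_score(full, X, y, cv=5).mean()
acc_reduced = cross_val_score(reduced, X, y, cv=5).mean()
print(f"full: {acc_full:.2f}  reduced: {acc_reduced:.2f}")
```

The reduced pipeline stores far fewer support-vector coordinates (3 per vector instead of 200), which is where the memory and speed savings come from.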

How can simple DR be used to interpret emotion distributions among different datasets and domains?

Reducing the initial feature spaces to 3-D or 2-D representations can also offer a way to visualize emotional content and therefore help engineers interpret the way emotions are distributed in the “world” of audio features. The following figure illustrates the results of an unsupervised DR method (PCA in particular) into two dimensions, for the large proprietary Behavioral Signals dataset, which contains several domains; for this demo the following have been selected: TV interviews, movies and TV series. Subfigures illustrate the distributions of the speech segments into the two PCA dimensions for three emotional classes (anger, happiness and sadness) across the different domains. In addition, the decision surfaces of a simple 2-D classifier are illustrated (the figure below is a simplified version of the content distributions presented in [1]).
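A minimal sketch of this kind of interpretation, using synthetic feature clouds as stand-ins for the three emotional classes (the class names are taken from the article; the data and dimensions are invented for illustration): project to 2-D with PCA and inspect per-class centroids to see along which reduced dimension the classes separate.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Three synthetic 34-dim "emotion" clouds with shifted means (not real SER data)
X = np.vstack([rng.normal(loc=c, size=(100, 34)) for c in (-2.0, 0.0, 2.0)])
labels = np.repeat(["anger", "happiness", "sadness"], 100)

X_2d = PCA(n_components=2).fit_transform(X)

# Per-class centroids in the reduced space: which PCA dimension separates what?
centroids = {emo: X_2d[labels == emo].mean(axis=0)
             for emo in ("anger", "happiness", "sadness")}
for emo, (c1, c2) in centroids.items():
    print(f"{emo:>9}: PCA1={c1:+.2f}  PCA2={c2:+.2f}")
```

With these synthetic clouds the mean shift dominates the variance, so the centroids separate mostly along the first PCA dimension, mirroring the kind of reading done on the figure above.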

These simplified illustrations (to see the real experimental results please refer to [1]) demonstrate that:

  1. Anger is distributed between sadness and happiness in a similar way for the first two domains (Series and Movies), based on the primary PCA dimension (horizontal axis).
  2. For the interviews domain, the primary PCA dimension is not enough to discriminate between the emotional classes, but the anger and happiness classes are mostly discriminated based on the second PCA dimension. Interestingly, this unsupervised distribution is quite similar to the Valence-Arousal affective representation.

This example demonstrates how an unsupervised DR can be very sensitive to changes in domain when illustrating emotional content. Similarly, it has been shown in [1] that when unsupervised DR is used for visualization of emotional content, it can also be sensitive to speaker identities.

To sum up, unsupervised DR (1) can provide much more compact and (almost) equally accurate classifiers, and (2) can provide interesting visualizations of emotional content; however, these visualizations can be rather sensitive to domain changes and speaker identities.



PhD in audio signal analysis and machine learning. Over 15 years in academia and startups. Currently Director of Machine Learning at Behavioral Signals.