The BSC’s approach to Speech Emotion Recognition with Attention
Emotion recognition systems can be used to detect and monitor people’s emotions and help prevent mental health disorders
The Barcelona Supercomputing Center’s (BSC) approach won the EmoSPeech2024 challenge at IberLEF2024
Emotions play a vital role in society, influencing personal relationships and shaping decision-making. They are involved in the development and conscious experience of mental processes, and their influence on human health has been empirically demonstrated. Affective computing is a growing field that encompasses the efforts to recognize and interpret human emotions using algorithms and devices. Within this context, Speech Emotion Recognition (SER) stands out as a key branch, relying on acoustic features such as pitch, prominence, and phrasing to predict emotions.
One of the most promising applications of such models lies in healthcare, where they could enable real-time monitoring of patients’ emotional states. Monitoring a person’s emotions could play a crucial role in the early detection of certain mental health disorders and assist in patient diagnosis. Additionally, this technology could be employed in customer service and call centers, where understanding the emotional tone of interactions can improve communication and enhance user satisfaction.
The Barcelona Supercomputing Center – Centro Nacional de Supercomputación (BSC-CNS), through the Aina project launched by the Generalitat de Catalunya, developed a multimodal system that integrates speech and text to accurately classify the emotions conveyed in spoken language. This model won a speech emotion recognition competition, the EmoSPeech2024 challenge, organised within IberLEF2024, in which thirteen teams competed to obtain the highest possible F1-Score. This metric quantifies the performance of a model as the harmonic mean of precision, the fraction of the model’s positive predictions that are actually correct, and recall, the fraction of the true cases the model manages to detect. The BSC team obtained an F1-Score of 86.7%, finishing in first position.
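As an illustration of how the two quantities combine into the F1-Score, the short snippet below computes the metric from a set of made-up counts (these numbers are purely illustrative and are not the competition’s data):

```python
# Illustrative only: how precision, recall and the F1-Score relate.
# The counts below are made up and are not the competition's results.
true_positives = 80    # emotions that were correctly predicted
false_positives = 15   # emotions predicted where there were none
false_negatives = 10   # emotions the model failed to detect

precision = true_positives / (true_positives + false_positives)  # ~0.842
recall = true_positives / (true_positives + false_negatives)     # ~0.889
f1 = 2 * precision * recall / (precision + recall)               # ~0.865

print(f"precision={precision:.3f}  recall={recall:.3f}  F1={f1:.3f}")
```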
The system takes both text and speech as input. Initially, a data augmentation module generates synthetic speech data to enhance the system’s robustness. The data is then passed through two self-supervised models: XLSR-Wav2Vec2.0, which handles speech, and XLM-RoBERTa Large for Spanish, which processes text. Both models produce 1,024-dimensional hidden-state vectors, which are concatenated and pooled into a single vector using attention. This vector is then processed by two dense layers responsible for the final classification.
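What follows is a minimal PyTorch sketch of this pipeline, not the authors’ code: the checkpoint names are the publicly available base models closest to those mentioned above, while the number of emotion classes, the size of the intermediate dense layer, and the module names are assumptions; the data augmentation step is omitted.

```python
# Minimal sketch of the multimodal fusion head described in the text (not the authors' code).
import torch
import torch.nn as nn
from transformers import AutoModel

class MultimodalEmotionClassifier(nn.Module):
    def __init__(self, num_classes: int = 6, hidden_dim: int = 1024):
        super().__init__()
        # Self-supervised encoders; the authors' exact checkpoints may differ.
        self.speech_encoder = AutoModel.from_pretrained("facebook/wav2vec2-large-xlsr-53")
        self.text_encoder = AutoModel.from_pretrained("xlm-roberta-large")
        # Single trainable vector used for attention pooling (described in the next paragraph).
        self.u = nn.Parameter(torch.randn(hidden_dim))
        # Two dense layers responsible for the final classification.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_values, input_ids, attention_mask=None):
        # Hidden states: (batch, speech_frames, 1024) and (batch, text_tokens, 1024).
        speech_h = self.speech_encoder(input_values).last_hidden_state
        text_h = self.text_encoder(input_ids=input_ids,
                                   attention_mask=attention_mask).last_hidden_state
        # Concatenate both sequences along the time dimension.
        h = torch.cat([speech_h, text_h], dim=1)
        # Attention pooling: one scalar weight per time step from the dot product with u.
        weights = torch.softmax(h @ self.u, dim=1)        # (batch, steps)
        pooled = (weights.unsqueeze(-1) * h).sum(dim=1)   # (batch, 1024)
        return self.classifier(pooled)                    # (batch, num_classes)
```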
The attention pooling mechanism used in this work differs from the most common implementation. While the classical approach involves computing Queries, Keys, and Values, the method used by the authors relies on a single trainable vector, u, whose transpose is multiplied with each hidden-state vector h_i to obtain that vector’s weight in the sum. By reducing the number of parameters involved in this operation, the model has lower capacity and is therefore less prone to overfitting on small datasets.
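Written out as equations, and assuming the usual softmax normalisation over the time steps (the description above does not spell the normalisation out), this pooling amounts to:

\[
\alpha_i = \frac{\exp\left(u^{\top} h_i\right)}{\sum_{j} \exp\left(u^{\top} h_j\right)},
\qquad
c = \sum_{i} \alpha_i h_i
\]

where h_i is the i-th hidden-state vector, u is the single trainable vector, and c is the pooled representation passed to the dense layers. The vector u contributes only as many parameters as the hidden dimension, whereas full Query, Key, and Value projections would require three d-by-d matrices.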