A Multimodal Approach to Performing Emotion Recognition

By Nithanth Ram


This work was done as a final project for EE382V: Activity Sensing and Recognition, taught by Professor Edison Thomaz at the University of Texas at Austin.

Introduction

Recent research and advances in Human-Computer Interaction (HCI) have gradually bridged the abstract gap between humans and computing systems. Significant hardware and software upgrades have yielded applications that push the boundaries of tangible HCI progress. With the advent of new technologies like flexible, multimodal sensors and advances in deep learning, it's now possible to create more robust HCI systems. One area that could benefit greatly from these advances is emotion recognition. Given the current emphasis on mental health and personal productivity, an emotion recognition system would be highly valuable.

Multimodal sensing combines or integrates data from multiple sensors and modalities to extract richer contextual information about an event or situation than any single sensor could provide. This project aimed to create a multimodal emotion recognition system using audio and image processing. Additionally, it explored the benefits and drawbacks of early fusion and late fusion approaches, which are discussed in more detail later.

Prior Work

Extensive prior research has been conducted in the field of emotion recognition, and it provided valuable guidance for designing this recognition system. By studying the work of others, I gained insight into various approaches and design choices. Tzirakis et al. (2017) employed convolutional neural networks and LSTM networks for outlier filtering and classification across audio and visual inputs to build an end-to-end emotion recognition system. However, implementing such complex deep learning models would be computationally expensive for a student project, so I focused on investigating conventional classification algorithms from the scikit-learn library. Dzedzickis et al. (2020) presented an exhaustive overview of sensors and methods applicable to human emotion recognition, revealing multiple information channels beyond the obvious ones like electrocardiography (heart) and electroencephalography (brain). Additionally, Venkataramanan and Rajamohan (2019) explored the significance of MFCCs, pitch, and energy as critical features for audio recognition, which guided my feature exploration and engineering for the audio modality.

Dataset: RAVDESS

For this project, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) was used for training and testing all models. The audio samples are short 3–4 second recordings of one of two statements: “Kids are talking by the door” and “Dogs are sitting by the door”. The dataset spans 24 actors (male and female), two modalities (audio, video), two vocal channels (speech, song), and eight emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprised. Instead of using the entire dataset, I took a subset of 10 actors and only used the speech vocal channel to decrease the training load.
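As a rough illustration of how such a subset can be selected, the sketch below parses the RAVDESS filename convention (seven dash-separated two-digit fields encoding modality, vocal channel, emotion, intensity, statement, repetition, and actor). The directory layout, helper name, and specific actor IDs are hypothetical.

```python
import glob
import os

# RAVDESS filenames look like "03-01-06-01-02-01-12.wav":
# modality-vocal_channel-emotion-intensity-statement-repetition-actor
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

def select_samples(root_dir, actors, vocal_channel="01"):
    """Collect (path, emotion) pairs for the chosen actors and the speech channel."""
    samples = []
    for path in glob.glob(os.path.join(root_dir, "**", "*.wav"), recursive=True):
        parts = os.path.basename(path).split(".")[0].split("-")
        if parts[1] == vocal_channel and int(parts[6]) in actors:
            samples.append((path, EMOTIONS[parts[2]]))
    return samples

# e.g. a 10-actor subset, speech channel only (hypothetical paths)
subset = select_samples("ravdess/", actors=set(range(1, 11)))
```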

RAVDESS Training Data Analysis

Looking at the subset of data used for training, both the gender distribution and the emotion distribution are practically uniform (with the exception of the ‘neutral’ samples). Later, I introduce more variance into the data to see how well the system holds up or degrades.

Feature Extraction

Audio

Feature extraction for the audio portion was performed by segmenting and framing the audio signals with Librosa. Frames of length 0.1 seconds with 50% overlap were extracted as samples. As for the features themselves, the first 20 Mel-Frequency Cepstral Coefficients (MFCCs) were extracted from the Mel spectrogram and added to the feature matrix. Additionally, the mean and standard deviation of the zero-crossing rate (ZCR), the signal’s root mean square (RMS) value, and the spectral rolloff were included. These features were selected based on insights from prior audio emotion recognition research. The MFCCs, widely used in most audio recognition problems, effectively capture the shape of the short-term power spectrum envelope, which models a person’s voice and speech well. Because they distinguish individual phonemes accurately, MFCCs are a solid choice of audio feature.
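A minimal sketch of this extraction step with Librosa is shown below. The frame length and 50% overlap follow the description above; the helper name and the exact per-frame aggregation (means and standard deviations over each frame's sub-windows) are assumptions about the implementation.

```python
import librosa
import numpy as np

def audio_features(path, frame_sec=0.1, n_mfcc=20):
    """Return one feature vector per 0.1 s frame (50% overlap)."""
    y, sr = librosa.load(path)                 # default 22050 Hz sample rate
    frame_len = int(frame_sec * sr)
    hop_len = frame_len // 2                   # 50% overlap
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop_len)

    features = []
    for frame in frames.T:
        mfcc = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=n_mfcc)
        zcr = librosa.feature.zero_crossing_rate(frame)
        rms = librosa.feature.rms(y=frame)
        rolloff = librosa.feature.spectral_rolloff(y=frame, sr=sr)
        features.append(np.concatenate([
            mfcc.mean(axis=1),                 # first 20 MFCCs
            [zcr.mean(), zcr.std()],           # zero-crossing rate
            [rms.mean(), rms.std()],           # signal RMS
            [rolloff.mean(), rolloff.std()],   # spectral rolloff
        ]))
    return np.vstack(features)
```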

Librosa audio signals and Mel-frequency spectrograms (on the right, a male actor; on the left, a female actor; both are speaking the same sentence with an angry emotion).

The ZCR, RMS, and spectral rolloff of the signal are also rich features, since they provide context about the articulation and intensity of an audio signal: two attributes that can distinguish discrete emotions. The chroma vector seemed intuitive to use at first, since it captures the pitch-class profile of an audio signal, but including it decreased test performance significantly. This may have been due to confounds between pitch and speaker gender.

Image

Image Feature Extraction

For image feature extraction, the video of each data sample was passed through a frame extractor built with OpenCV. Frames from the beginning and end of each sequence were discarded to allow the actors time to visually emote the specific emotion. Initially, an attempt was made to denoise the frames with thresholding, but this technique removed crucial facial features needed for image-based emotion classification. Instead, the images were grayscaled and edge-detected with the Canny algorithm from scikit-image. They were also downscaled to reduce the number of pixels in the feature matrix.
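A rough sketch of that frame pipeline, using OpenCV and scikit-image, follows. The trim margin, downscale factor, and function name are illustrative assumptions rather than the exact values used.

```python
import cv2
import numpy as np
from skimage.feature import canny
from skimage.transform import rescale

def video_features(path, trim=0.2, scale=0.25):
    """Grayscale, downscale, and Canny-edge the middle frames of a clip."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()

    # Discard the first and last frames so the actor has time to emote
    start, end = int(len(frames) * trim), int(len(frames) * (1 - trim))

    features = []
    for frame in frames[start:end]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        small = rescale(gray, scale, anti_aliasing=True)   # shrink the pixel count
        edges = canny(small)                               # boolean edge map
        features.append(edges.astype(np.float32).ravel())  # flatten into a row
    return np.vstack(features)
```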

Individual Models

Before constructing the multimodal classifier, individual models needed to be trained and analyzed to evaluate how well each modality could solve the emotion recognition task on its own. Three models were trained for each modality with an 80/20 train-test split, and 10-fold cross-validation was performed to guard against overfitting. The three classifiers were the Random Forest, Naive Bayes, and Multilayer Perceptron (MLP) classifiers, all from the scikit-learn library. Deep learning approaches such as CNNs could have been used, given their power in image and audio classification and their ability to handle multimodal data, but most of the prior work examined had already explored deep learning techniques. Therefore, to take a different route (and due to computational resource limitations), the efficacy of these traditional algorithms for emotion recognition was investigated. The hyperparameters of the three models for each modality were tuned with GridSearchCV.
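A sketch of the per-modality training loop with scikit-learn is shown below. GaussianNB is assumed for the Naive Bayes model, and the parameter grids are illustrative placeholders, not the grids that were actually tuned.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

def train_modality(X, y):
    """Fit the three classifiers on one modality with a 10-fold CV grid search."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

    candidates = {
        "random_forest": (RandomForestClassifier(),
                          {"n_estimators": [100, 300], "max_depth": [None, 20]}),
        "naive_bayes":   (GaussianNB(), {}),
        "mlp":           (MLPClassifier(max_iter=500),
                          {"hidden_layer_sizes": [(64,), (128, 64)]}),
    }

    fitted = {}
    for name, (clf, grid) in candidates.items():
        search = GridSearchCV(clf, grid, cv=10)     # 10-fold cross-validation
        search.fit(X_tr, y_tr)
        fitted[name] = (search.best_estimator_, search.score(X_te, y_te))
    return fitted
```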

Individual Model Performance

As seen above, the Random Forest classifier performed best for both modalities, with accuracy and F1-scores of 0.71 and 0.68 for audio and 0.62 and 0.55 for image. The audio models also outperformed the image models across all three classifiers. The audio Random Forest’s 0.71 accuracy and 0.68 F1-score are not spectacular, but they provide a good baseline to work from. The question is whether sensor fusion techniques can improve on them.

Sensor Fusion

Early Fusion

Early Fusion Diagram

One approach to combining the individual models into a multimodal system is early fusion, where a single feature matrix is constructed by merging the data from both modalities into one comprehensive representation of each data sample. In this case, the image and audio features were matched up with each other, pairing each video frame with the corresponding audio signal segment based on their respective timestamps. However, two major issues arose with the early fusion technique. First, the combined feature matrix became extremely large, significantly increasing the training load and computation time. More importantly, the data across modalities needed to be precisely synchronized, which proved challenging because of the mismatch in sampling rates: the audio was sampled at 22050 Hz while the video was captured at 30 frames per second. This discrepancy caused data alignment issues, which showed up in the early fusion model’s poor performance. The same classification models were trained on the new, augmented feature matrix (with 10-fold cross-validation), but they exhibited a sharp degradation in performance compared to the individual modality models.
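A minimal sketch of the alignment and concatenation described above, assuming the per-frame feature matrices from the earlier extraction steps. Mapping each audio frame to the nearest video frame by timestamp is the crux, and also the source of the synchronization problem; the function name and hop parameters are hypothetical.

```python
import numpy as np

def early_fuse(audio_feats, video_feats, audio_hop_sec=0.05, fps=30):
    """Pair each audio frame with the video frame closest in time, then concatenate."""
    fused = []
    for i, a in enumerate(audio_feats):
        t = i * audio_hop_sec                              # timestamp of this audio frame
        j = min(int(round(t * fps)), len(video_feats) - 1) # nearest video frame index
        fused.append(np.concatenate([a, video_feats[j]]))
    return np.vstack(fused)
```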

Early Fusion Model Performance
MLP Early Fusion Confusion Matrix

As shown in the results, the early fusion approach led to a significant degradation in performance for the optimal Random Forest classifier, with its accuracy score decreasing to 0.51 and its F1-score dropping to 0.48. The early fusion models were riddled with inaccurate predictions, especially demonstrated by the very poor performance of the MLP classifier. An analysis of its confusion matrix revealed that the predictions were wildly incorrect, suggesting that early fusion was not the ideal approach for building this multimodal system.

Late Fusion

Late fusion is a different multimodal technique that combines the independent outputs of the individual models to produce a single resultant output. This ensemble-learning technique intuitively avoids the dense-input constraint observed in early fusion. The base input models can either pass their outputs to another layer containing a third model or combine their predictions using a voting system.

Late Fusion Diagram

The latter approach was taken to build this late fusion model. The Random Forest classifiers from the individual training phase, being the optimal models for both modalities, were jointly incorporated into the late fusion model. For each data sample, the entire video and audio signals were passed to their respective input models. These models segmented/framed each signal and kept a running frequency count of their predictions over the analyzed frames. The per-emotion counts from the two models were then summed as a vector in the output layer, and the emotion with the highest total was designated the predicted label. Since the audio Random Forest model performed better than the image model, the audio model’s predictions were given a weighting coefficient of 1.15 (the ratio of the audio model’s accuracy to the image model’s). The equation below shows how the predictions over one data sample were aggregated to vote on one final output prediction.

Late Fusion Model Prediction Equation
Random Forest Late Fusion Confusion Matrix
Late Fusion Random Forest Model Performance
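A sketch of the weighted-vote aggregation captured by the prediction equation above: in effect, the predicted label is the emotion that maximizes 1.15 times its audio vote count plus its image vote count. The fitted models and feature extractors are assumed to come from the earlier steps, and the function name is hypothetical.

```python
from collections import Counter

AUDIO_WEIGHT = 1.15   # ratio of audio to image accuracy (0.71 / 0.62)

def late_fuse_predict(audio_model, image_model, audio_feats, video_feats):
    """Weighted vote over per-frame predictions from both modalities."""
    audio_votes = Counter(audio_model.predict(audio_feats))
    image_votes = Counter(image_model.predict(video_feats))

    totals = Counter()
    for emotion, count in audio_votes.items():
        totals[emotion] += AUDIO_WEIGHT * count
    for emotion, count in image_votes.items():
        totals[emotion] += count

    # The emotion with the highest weighted total wins
    return max(totals, key=totals.get)
```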

In terms of performance, the late fusion Random Forest model demonstrated a significant improvement over the individual modality baselines. By employing the late fusion technique, the model achieved an accuracy of 0.79 and an F1-score of 0.76, surpassing the individual audio and image model scores. These improved metrics indicate that the late fusion approach effectively leveraged the complementary information from both modalities. Analyzing the confusion matrix reveals a much cleaner prediction distribution compared to the early fusion models. The predictions were far more accurate this time around. Additionally, each class of emotions exhibited relatively high (or at least improved) precision, recall, and F1-scores.

Late Fusion Random Forest Model Class Metrics

Out of all the models built, trained, and tested, the late fusion Random Forest classifier performs the best. This result, in a microcosm, demonstrates how a multimodal approach can improve on single-modality models when the proper fusion technique is chosen: in this case, late fusion.

In-the-wild Model Performance

The late fusion Random Forest model performed well on the RAVDESS dataset, but it generalized less well to newly collected samples. To test this, in-the-wild samples were collected within the household (due to COVID-19 restrictions). With four household members, three samples were collected for each emotion to observe how the late fusion system handled newly collected data. The model’s accuracy degraded to 0.66, with an F1-score of 0.56, indicating the need for more diverse training data.

Generalizability is a top priority for recognition systems intended to handle input variance in real time. In this case, more varied samples need to be fed to the model to at least maintain the previous baseline accuracy. Additionally, the cross-cultural effect of language on emotion recognition was investigated. Throughout the system design and model training process, the focus was primarily on the computing side, neglecting the human aspect. Specifically, the aim was to see whether speaking a different language had a noticeable effect on the model’s emotion predictions. For the emotions “happy,” “sad,” “angry,” and “neutral,” five samples were taken across the same participants, but this time the sentences were spoken in Tamil (the language spoken in the household). The model’s accuracy and F1-score decreased to 0.65 and 0.55, respectively.

In-the-wild Model Performance of Late Fusion Random Forest Model

The degradation in performance may be attributed to cross-cultural variances in communication. Different languages may have different vocal intonations, pitch inflections, facial expressions, etc., when they are spoken and emoted. As a result, the model, trained purely on English data, may not recognize these differences. While not enough Tamil samples were collected to make a definitive statement, analyzing the impact of different languages on this late fusion Random Forest model is an intriguing avenue for further exploration. Generalizability is challenging to achieve, as it has many dimensions to consider, especially when prioritizing accessibility with these systems.

Key Takeaways

This project has been an invaluable learning experience, providing insights into various aspects of feature engineering and multimodal system design. One of the key lessons learned was the rigorous nature of feature selection and extraction, where there is no one-size-fits-all solution. A strong set of features for one model may not work for another, exemplifying the ‘no free lunch’ theorem, which suggests that optimization techniques are bound to the specific structure of the problem being tackled.

The value of MFCCs as a feature for audio-based recognition systems was also observed, demonstrating their effectiveness in capturing the nuances of speech and voice. On the other hand, the experience with early fusion highlighted its fundamental limitations, such as the burden of dense feature vectors and data synchronization issues. These challenges essentially bottleneck the creation of complex multimodal systems, making it difficult to envision early fusion being utilized beyond two sensor modalities.

Despite the shortcomings of early fusion in this instance, the power of leveraging a multimodal approach was evident. The late fusion Random Forest model effectively aggregated predictions from multiple modalities, showcasing how data from different perspectives can aid in accurate event classification. However, selecting the appropriate fusion technique is critical, as an improper method can severely degrade a model’s accuracy.

Furthermore, the project highlighted the cross-cultural impact of language on emotion recognition performance. Different languages and cultures have distinct communication styles that may not be recognized by a system trained solely on an English corpus. To achieve better generalizability across diverse cultures, varied linguistic data samples must be included during the training phase.

Conclusion

Among the models explored, the late fusion Random Forest model emerged as the optimal choice, achieving an accuracy of 0.79 and an F1-score of 0.76. This highlights how a multimodal approach leveraging late fusion can improve upon the efficacy of single-modality models. However, when evaluated with household-collected data, the model’s performance degraded, resulting in an accuracy of 0.68 and an F1-score of 0.62. This demonstrates that extending the input data and conducting more rigorous training is necessary to enhance generalizability.

Furthermore, the use of a different language, Tamil, resulted in the late fusion Random Forest model’s accuracy declining to 0.65. This finding underscores the existence of cultural and linguistic differences in human communication and emotion expression. Investigating this notion further is crucial to assess the feasibility of building a widely culture-encompassing emotion recognition system.

Future Work and Applications

Future work with this project involves exploring deep learning approaches to potentially improve the system’s accuracy. Convolutional neural networks (CNNs) are currently the gold standard for image and audio recognition tasks. They have the inherent capability to create multimodal systems by allowing multiple input channels in their architecture. Specifically, existing CNN architectures such as VGG16 and VGG19 can be investigated to assess their impact on model performance. Additionally, transformer-based models may also have the potential to classify emotions more effectively, and it would be worthwhile to attempt solving this problem from a deep learning perspective.

Another avenue for extension is making the model more generalizable to individuals with disabilities, in addition to different languages. Existing emotion recognition systems often overlook differently-abled groups during data curation and training, introducing inherent biases into the model’s core learning. It is crucial to train these systems to extend to people with disabilities, such as those with Down syndrome, facial palsy, or speech impediments, ensuring their inclusion with the same degree of accuracy as the training data permits.

Ultimately, the goal is to build out and deploy this system in a real-world setting. Conceptually, a rudimentary design has been mapped out, and it would be interesting to observe how the late fusion Random Forest model performs in a realized system.

Conceptual Diagram of Emotion Recognition System

The potential applications of a multimodal emotion recognition system are plentiful. One significant application lies in virtual counseling and monitoring of individuals’ mental health. Given the recent rise in mental health awareness and the potential decline in average mental health due to the pandemic, a virtual emotion analysis and monitoring system would be vital in tracking users’ mental well-being. Additionally, such a system could be utilized to track early child development, notifying individuals about any areas of concern that need to be addressed when fostering and nurturing a young child. Finally, an emotion recognition system could benefit the visually-impaired community by enhancing their ability to communicate with others. A system consisting of a camera and headphones could provide visually-impaired individuals with better insights into the emotional state of the person they are interacting with.

This project has been an excellent learning experience, kickstarting my journey in ML/AI. I am eager to continue this work in the future, potentially pursuing it as a research avenue. Thank you for your engagement and special thanks to my professor Dr. Edison Thomaz for his support!

References

  1. Livingstone, S. R., & Russo, F. A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Zenodo. https://zenodo.org/record/1188976
  2. Morency, L.-P., & Baltrušaitis, T. (2017). Tutorial on Multimodal Machine Learning [Slides]. Carnegie Mellon University. https://www.cs.cmu.edu/~morency/MMML-Tutorial-ACL2017.pdf
  3. Venkataramanan, K., & Rajamohan, H. R. (2019). Emotion Recognition from Speech.
  4. Tzirakis, P., Trigeorgis, G., Nicolaou, M. A., Schuller, B. W., & Zafeiriou, S. (2017). End-to-End Multimodal Emotion Recognition Using Deep Neural Networks. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1301–1309. https://doi.org/10.1109/JSTSP.2017.2764438
  5. Dzedzickis, A., Kaklauskas, A., & Bucinskas, V. (2020). Human Emotion Recognition: Review of Sensors and Methods. Sensors, 20(3), 592. https://doi.org/10.3390/s20030592

Packages used: Librosa, OpenCV, Scikit-Learn, Scikit-Image, Matplotlib, Numpy, Pandas, Seaborn
