Multimodal Emotion Recognition

Mustapha Ajeghrir
6 min read · Apr 19, 2022


Introduction

Emotion recognition aims to identify a person’s emotional state, a dynamic process: the emotions behind each individual’s actions differ from person to person. Human beings communicate their feelings in many different ways, and interpreting these emotions accurately is essential for meaningful communication. Emotion recognition also matters for everyday social contact, since emotions play an important role in shaping human actions. The following video shows an application of multimodal emotion recognition.

The aim of this blog is to give some insight into emotion recognition on videos by exploiting all three modalities: video, audio, and text (the video transcript).

Why videos?

Unlike static images, videos capture dynamic visual scenes with temporal and audio signals. Neighboring frames serve as a form of natural data augmentation, providing various object (pose, appearance), camera (geometry), and scene (illumination, object placements) configurations. They also capture the chronological order of actions and events critical for temporal reasoning. In these ways, the time dimension provides critical information that can improve the robustness of computer vision systems. Furthermore, the audio track in video can contain both natural sounds and spoken language that can be transcribed into text. These multimodal (sound and text) signals provide complementary information that can aid learning visual representations.

Unimodal architectures

Before going deep into multimodal architectures, let’s first focus on unimodal ones:

Audio modality:

Let’s focus on 2 main models, Wav2vec2 and HuBERT. They are very similar:

  • Wav2vec2 uses a transformer-based architecture. Its feature extractor works over 25 ms audio windows strided by 20 ms (so two consecutive windows overlap by 5 ms), with the audio sampled at 16 kHz. The audio is first passed through a CNN to extract low-level features, then through a context network (a transformer encoder) to obtain the context vectors. It is possible to start from a pretrained checkpoint using HuggingFace (a minimal loading sketch is shown after this list). The figure below shows the architecture of Wav2vec2.
Wav2Vec2 architecture
  • HuBERT is similar to Wav2vec2 except during training: it uses offline clustering of MFCC representations to determine the training targets in a prior step, instead of the Gumbel softmax used in Wav2vec2. The figure below shows the architecture; on the right we can see the architecture of Wav2vec2.
  • Performance: the base models contain around 95M parameters and the Large models around 317M. HuBERT performs as well as or better than Wav2vec2 when fine-tuned, and in our own experiments on IEMOCAP, HuBERT performed better. In terms of inference speed, the two models are essentially identical: processing a 2-second 16 kHz audio clip takes about as long as processing a 170-token text with BERT base. The inference speed experiment can be found HERE.
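For illustration, here is a minimal sketch of loading a pretrained Wav2vec2 checkpoint from HuggingFace and extracting context vectors for a short clip. The checkpoint name, the dummy clip, and the mean-pooling step are illustrative assumptions, not the exact setup of our experiments.

```python
# Minimal sketch: extract Wav2vec2 context vectors for a 2-second, 16 kHz clip.
# "facebook/wav2vec2-base" is the public base checkpoint (~95M parameters).
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# Dummy waveform standing in for a real 2-second recording at 16 kHz
waveform = np.random.randn(16000 * 2).astype(np.float32)

inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    context_vectors = model(**inputs).last_hidden_state  # (1, ~100 frames, 768)

# One simple way to get an utterance-level embedding: mean-pool over time,
# then feed it to an emotion classification head of your choice.
utterance_embedding = context_vectors.mean(dim=1)
print(utterance_embedding.shape)  # torch.Size([1, 768])
```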

Text modality:

Given the transcript of a conversation along with speaker information for each constituent utterance, the ERC (emotion recognition in conversations) task aims to identify the emotion of each utterance from several pre-defined emotions. Formally, given an input sequence of N utterances [(u1, p1), (u2, p2), …, (uN, pN)], where each utterance ui consists of T words ui,j and is spoken by party pi, the task is to predict the emotion label ei of each utterance ui.

Let’s first focus on recognizing emotions with no conversation context (DistilBERT), then move to a conversation-aware model:

  • DistilBERT is a distilled version of BERT with fewer layers and parameters, making it 40% smaller and 60% faster. It encodes each utterance in isolation, without the conversation axis, which can make it hard to capture long-term conversational semantics (a minimal classification sketch is shown after this list).
  • TL-ERC addresses this by adding a recurrent neural network layer on top of the utterance encoder, so the model takes the dynamics of a conversation into account. A choice of datasets (for the source and target tasks) is possible; in this work, we started from weights pretrained on the Cornell movie-dialogs dataset and fine-tuned the model for emotion recognition on the IEMOCAP dataset. The performance on the test data is about 46% accuracy.
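As a quick illustration of the context-free setup, here is a minimal sketch of DistilBERT as an utterance-level emotion classifier with HuggingFace. The number of labels and the example utterance are placeholders; in practice the classification head would be fine-tuned on an ERC dataset such as IEMOCAP.

```python
# Minimal sketch: DistilBERT as a context-free utterance emotion classifier.
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

NUM_EMOTIONS = 6  # placeholder number of emotion classes

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=NUM_EMOTIONS
)

utterance = "I can't believe you did that!"
inputs = tokenizer(utterance, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits  # (1, NUM_EMOTIONS)

print(logits.argmax(dim=-1).item())  # index of the predicted emotion class
```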

Image modality:

For the image modality, it is recommended to use a pretrained facial action unit detector. Action units are descriptors of the face that are linked to expression and emotion. We use OpenFace, a free and open-source tool, to extract 18 facial action units, which we feed to a Bi-LSTM model as in the figure below (a minimal sketch of such a model follows). You can learn more in this paper.
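Below is a minimal sketch of such a Bi-LSTM classifier over per-frame action units. The hidden size and the number of emotion classes are illustrative assumptions, not the exact hyperparameters we used.

```python
# Minimal sketch: a Bi-LSTM over sequences of 18 OpenFace action units per frame.
import torch
import torch.nn as nn

class AUBiLSTM(nn.Module):
    def __init__(self, num_aus=18, hidden=64, num_emotions=4):  # assumed sizes
        super().__init__()
        self.lstm = nn.LSTM(num_aus, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_emotions)

    def forward(self, x):                    # x: (batch, frames, num_aus)
        _, (h, _) = self.lstm(x)             # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)  # concat forward/backward final states
        return self.classifier(h)            # (batch, num_emotions)

model = AUBiLSTM()
dummy_clip = torch.randn(2, 100, 18)  # 2 clips, 100 frames, 18 AU intensities each
print(model(dummy_clip).shape)        # torch.Size([2, 4])
```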

Bi-modal architectures

In this paper, two modalities are used: audio and OpenFace’s action units. For audio, Wav2Vec 1.0 is used, though it is entirely possible to upgrade to Wav2Vec 2.0. The audio and face embeddings are then concatenated and fed to a simple logistic regression model (see the sketch below).
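The fusion step itself is very simple. Here is a minimal sketch with dummy data; the embedding dimensions and the number of classes are assumptions, not values from the paper.

```python
# Minimal sketch: late fusion by concatenating audio and face embeddings,
# then classifying with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

n_samples = 200
audio_emb = np.random.randn(n_samples, 512)        # assumed audio embedding size
face_emb = np.random.randn(n_samples, 18)          # 18 OpenFace action units
labels = np.random.randint(0, 4, size=n_samples)   # 4 assumed emotion classes

fused = np.concatenate([audio_emb, face_emb], axis=1)  # simple concatenation fusion
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print(clf.score(fused, labels))  # accuracy on the dummy training data
```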

Tri-modal architectures

For the three modalities, let’s focus on the model proposed in the MOSEI-UMONS paper. This Transformer-based model makes use of all three modalities (audio, text, video) and uses the linguistic modality to modulate the others through co-modulation. The features fed to the transformer blocks are computed as follows:

  • Acoustic: mel-spectrograms are extracted using the librosa library with 80 filter banks, so the embedding size is 80 (a minimal extraction sketch is shown after this list).
  • Visual: an R(2+1)D-152 model is used to extract spatio-temporal features; it takes as input a clip of 32 RGB frames of the video. Each frame is scaled to 128x171 and then cropped to a 112x112 window. The features are extracted by taking the output of the spatio-temporal pooling, and the feature vector for the entire video is obtained by sliding a window of 32 RGB frames with a stride of 8 frames. We chose not to crop out the face region and instead keep the entire image as input to the network: the video is already centered on the person, and we expect body movements such as hand gestures to be good indicators for emotion recognition and sentiment analysis.
  • Linguistic: after tokenization and lowercasing, each word is embedded into a 300-dimensional vector using GloVe.
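For the acoustic features mentioned above, here is a minimal sketch of extracting an 80-bin mel-spectrogram with librosa. The file path and the log scaling are illustrative choices rather than the exact settings of the MOSEI-UMONS paper.

```python
# Minimal sketch: 80-bin mel-spectrogram features with librosa.
import librosa
import numpy as np

# Load audio at 16 kHz ("utterance.wav" is a placeholder path)
y, sr = librosa.load("utterance.wav", sr=16000)

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel)  # log scale, common for acoustic features

# Transpose to (time, 80) so each time step is an 80-dimensional feature vector
features = log_mel.T.astype(np.float32)
print(features.shape)
```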

The model was trained and tested on the CMU-MOSEI dataset. (See paper for evaluation results)

How to combine modalities?

Once we end up with an embedding for each modality, it is important to have an efficient technique to combine those embeddings. We have just discussed two possible ways: concatenation (used in the bi-modal example) and co-modulation (used in the tri-modal example). This Google AI blog post shows that a particularly effective way to do cross-modal fusion is through bottlenecks, which collate and condense information from each modality before sharing it with the others, while still allowing free attention flow within each modality (a simplified sketch is shown below).
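To make the bottleneck idea concrete, here is a simplified sketch in the spirit of that approach (not the blog’s or paper’s exact implementation): each modality runs self-attention over its own tokens plus a small set of shared bottleneck tokens, and those bottlenecks are the only channel through which the modalities exchange information. All sizes are assumptions.

```python
# Simplified sketch of fusion bottlenecks between two modalities.
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=256, num_bottlenecks=4, num_heads=4):
        super().__init__()
        self.bottlenecks = nn.Parameter(torch.randn(1, num_bottlenecks, dim))
        self.layer_audio = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.layer_video = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, audio_tokens, video_tokens):
        b = audio_tokens.size(0)
        bn = self.bottlenecks.expand(b, -1, -1)
        n = bn.size(1)

        # Each modality attends only within itself plus the shared bottlenecks
        a_out = self.layer_audio(torch.cat([audio_tokens, bn], dim=1))
        v_out = self.layer_video(torch.cat([video_tokens, bn], dim=1))

        # Average the updated bottleneck tokens: they act as the narrow channel
        # through which the two modalities exchange information
        fused_bn = (a_out[:, -n:] + v_out[:, -n:]) / 2
        return a_out[:, :-n], v_out[:, :-n], fused_bn

layer = BottleneckFusionLayer()
audio = torch.randn(2, 50, 256)  # 2 samples, 50 audio tokens
video = torch.randn(2, 30, 256)  # 2 samples, 30 video tokens
a, v, bn = layer(audio, video)
print(a.shape, v.shape, bn.shape)
```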

Note:

This article is a condensed version of what my team did in a Multimodal Emotion Recognition project. The members of the team:

  • AJEGHRIR Mustapha (me :D)
  • TAZI Nouamane
  • KHALD Asmae
  • BOUKHARI Taha
  • IGUILIZ Salah-Eddine
  • MARTIN-CHIVICO Pierre-Louis

