Affective Computing using Deep Learning - Part 1: MAHNOB-HCI dataset analysis

Ashutosh Singh
6 min read · Aug 27, 2023


Table of contents

  1. Introduction
  2. MAHNOB-HCI dataset
  3. Stimuli
  4. Emotion Recognition problem
  5. Visualising raw signals
  6. Final Remarks

1. Introduction

Affective computing is a multidisciplinary field that involves the research and advancement of systems and devices capable of identifying, understanding, analyzing, and replicating human emotions. This field combines elements from computer science, psychology, and cognitive science.
Emotion detection research is becoming very popular due to its wide range of applications. Human emotions are complex and are reflected, completely or partially, in multiple types of signals such as facial expressions, heart rate, body temperature and brain activity. Current datasets for emotion detection generally contain these signals in the form of multiple modalities such as facial video/audio, ECG, Galvanic Skin Response (GSR), respiration amplitude, eye-gaze tracking data and Electroencephalogram (EEG). While each of these modalities plays an important role in human emotion detection, some of the signals are more difficult and expensive to record than others.
There are multiple ways to formulate the problem of emotion classification: one can classify between discrete emotions such as joy, anger, sadness and disgust, or use a dimensional model and associate each emotion with a point in a continuous space. Several dimensional models exist; most of them use valence and arousal as the two dimensions, so that every emotion can be represented as a point in this 2-dimensional space.

Figure 1: Valence (negative/positive) and arousal (low/high) as the two dimensions of emotion[1]

Valence and arousal values in the MAHNOB-HCI dataset lie on a discrete scale from 1 to 9.
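As a toy illustration of the dimensional model, the small sketch below maps a (valence, arousal) pair on the 1–9 scale to one of the four quadrants of Figure 1. It is not part of the dataset tooling; the midpoint of 5 and the example emotion labels are my own assumptions.

```python
def quadrant(valence: float, arousal: float, midpoint: float = 5.0) -> str:
    """Map a (valence, arousal) rating on the 1-9 scale to a quadrant label.

    The midpoint and the example emotion labels are illustrative assumptions,
    not part of the MAHNOB-HCI annotation scheme.
    """
    if valence >= midpoint and arousal >= midpoint:
        return "positive valence / high arousal (e.g. joy)"
    if valence >= midpoint:
        return "positive valence / low arousal (e.g. calm)"
    if arousal >= midpoint:
        return "negative valence / high arousal (e.g. fear)"
    return "negative valence / low arousal (e.g. sadness)"


print(quadrant(8, 7))  # positive valence / high arousal (e.g. joy)
```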

2. MAHNOB-HCI Dataset

The MAHNOB-HCI dataset[2], where HCI stands for Human Computer Interaction, is a very commonly used dataset for emotion or valence/arousal classification. The dataset itself is a combination of two experiments in which human test subjects are presented with stimuli (videos and images) and then asked to rate their emotional state on a discrete scale of 1–9 for valence and arousal.

Figure 2: Summary of MAHNOB-HCI dataset experiments[2].

The two types of experiments are illustrated in the figures below.

Figure 3: Setup for experiment type 1. The subject’s baseline is recorded, a stimulus is then presented to elicit an emotion, and the subject is asked to rate their valence/arousal on a scale of 1–9[3]
Figure 4: Setup for experiment type 2. The subject’s baseline is recorded and an image with a textual tag overlaid on it is shown. The subject is then asked whether they agree or disagree with the tag given the image. This is just an example and not from the real dataset[3]

Multiple bio-signals are collected in the MAHNOB-HCI dataset. Below is a list of the signals that will be used for valence-arousal detection in this study. We will refer to these signals by the acronyms given in the first column.

Table 2: List of physiological signals recorded in MAHNOB-HCI dataset

3. Stimuli

Now let’s dive a little deeper into the dataset, starting with the stimuli. The MAHNOB-HCI dataset uses videos (and images) as stimuli to elicit emotional responses from the participants. The emotion elicitation experiment contains 20 video clips selected from multiple commercially produced movies. The authors ran a preliminary online study to collect emotion tags for each video clip, in which each clip received at least 10 annotations from over 50 participants.

Figure: Video stimuli and the corresponding emotion (as surveyed in the preliminary study)[3]

Now let’s see how the subjects rated the stimulus video clips and what the median valence and arousal rating was for each of them.

Figure: Valence (left) and arousal (right) ratings for each stimulus. The size of a dot is proportional to the number of votes for the given rating, i.e. if more subjects rate 5 it is represented by a bigger dot. Source [3]

These visualisations confirm that in most cases the majority of participants agree on a given rating for both arousal and valence; however, there are also quite significant differences. For example, for valence we see a much larger spread for some clips such as earworm_f.avi. Arousal ratings are generally more spread out than valence ratings, even for high-arousal stimuli like joy (79.avi, 80.avi and 90.avi).
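To reproduce this kind of per-stimulus summary, the self-reports can simply be aggregated by clip. The sketch below assumes the ratings have already been collected into a pandas DataFrame with hypothetical columns stimulus, valence and arousal; the actual MAHNOB-HCI self-reports are stored per trial in session metadata files and would first need to be parsed into such a table.

```python
import pandas as pd

# Hypothetical table of self-reports, one row per (subject, trial).
ratings = pd.DataFrame({
    "stimulus": ["79.avi", "79.avi", "earworm_f.avi", "earworm_f.avi"],
    "valence":  [7, 8, 3, 6],
    "arousal":  [6, 7, 4, 5],
})

# Median rating and inter-quartile range per clip, mirroring the figure above.
summary = ratings.groupby("stimulus").agg(
    valence_median=("valence", "median"),
    valence_iqr=("valence", lambda s: s.quantile(0.75) - s.quantile(0.25)),
    arousal_median=("arousal", "median"),
    arousal_iqr=("arousal", lambda s: s.quantile(0.75) - s.quantile(0.25)),
)
print(summary)
```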

4. Emotion Recognition problem

Since the rating scale for valence/arousal is discrete, it is quite counterintuitive to model emotion recognition as a regression problem when using MAHNOB-HCI. We can instead treat it as a classification problem.

The next problem is that we have a very small number of trials available, so classifying between 9 classes might be too difficult (also considering the imbalance in the dataset). Instead, we can formulate a coarser binary classification problem for valence and arousal by transforming the rating scale from 1–9 to 0–1.

Figure: Coarser discretisation of the 9-point rating scale to a 0–1 or low/high scale[3]

The problem is not exactly solved yet: there is a trade-off involved in deciding whether to treat 5 as high or low, since it is the most frequent rating in the dataset.

Figure: Variation of valence with arousal. A large number of samples have a rating of 5, which leads to a large bias if 5 is treated as LA/LV[3]

The figure below shows the class balance in data coming from media with preliminary emotion tags of sadness, joy and neutral, for both thresholds of 4.5 and 5.5. We can see the trade-off in class distribution depending on the threshold: for the preliminary tag sadness, the number of high-valence samples decreases as we move the threshold from 4.5 to 5.5, which is expected. However, for joy the number of high-valence samples also decreases, which is undesired. A solution to this problem could be to use different thresholds for different stimuli.

Figure: Distribution of High/Low Valence samples for preliminary tags of sadness(Low Valence), Joy(High Valence) and Neutral for (a) threshold = 4.5 and (b) threshold = 5.5. Source [3]

I chose the threshold of 4.5 for all experiments in this series.
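Concretely, the binarisation can be written as a one-liner over the 1–9 self-reports. The snippet below is a minimal sketch with toy data; the thresholding itself is exactly the rule described above, and the loop shows how the class balance shifts between the two candidate thresholds.

```python
import pandas as pd

def binarise(ratings: pd.Series, threshold: float = 4.5) -> pd.Series:
    """Map 1-9 valence/arousal self-reports to 0 (low) / 1 (high)."""
    return (ratings > threshold).astype(int)

# Toy ratings to illustrate how the class balance shifts with the threshold:
# with 4.5 a rating of 5 counts as high, with 5.5 it counts as low.
valence = pd.Series([5, 5, 3, 7, 8, 5, 2, 6])
for thr in (4.5, 5.5):
    labels = binarise(valence, thr)
    print(f"threshold={thr}: high={labels.sum()}, low={len(labels) - labels.sum()}")
```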

5. Visualising raw signals

Finally, we can visualise some raw signals from the different modalities, such as skin temperature and GSR, for different subjects.
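The physiological recordings in MAHNOB-HCI are distributed as Biosemi BDF files, so they can be loaded with any EDF/BDF reader. Below is a minimal sketch using MNE; the file path and the channel names ('Temp', 'GSR1', 'Resp') are assumptions on my part and should be checked against the channel list of your copy of the dataset.

```python
import matplotlib.pyplot as plt
import mne

# Path and channel names below are assumptions; check them against your copy
# of the dataset (raw.ch_names lists everything in the recording).
raw = mne.io.read_raw_bdf("Sessions/10/Part_1_S_Trial1_emotion.bdf", preload=True)
print(raw.ch_names)

picks = ["Temp", "GSR1", "Resp"]      # hypothetical physiological channel names
data = raw.get_data(picks=picks)      # shape: (n_channels, n_samples)

fig, axes = plt.subplots(len(picks), 1, sharex=True, figsize=(10, 6))
for ax, name, signal in zip(axes, picks, data):
    ax.plot(raw.times, signal)
    ax.set_ylabel(name)
axes[-1].set_xlabel("time (s)")
plt.tight_layout()
plt.show()
```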

Figure: Skin temperatures for the different stimuli presented, and the corresponding ratings, for subject 11. The vertical blue line separates the baseline period from the actual stimulus period. Source [3]

It’s clearly visible from the figure above that the skin temperature is not in the normal range for subject 11, which might be due to the placement of the temperature sensor. Looking at the same results for subject 10, we see that the baseline temperatures are higher than those in the stimulus period, which becomes a problem if the baseline period is ignored when detecting the emotional state (a simple baseline-correction sketch follows the figure below).

Figure: Skin temperatures for the different stimuli presented, and the corresponding ratings, for subject 10. Source [3].
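One simple way to make the temperature comparable across subjects, instead of ignoring the baseline entirely, is to subtract the baseline mean from the whole trial. A minimal sketch, assuming the temperature channel is already loaded as a 1-D array; the 30-second baseline length is a placeholder and should be replaced with the actual baseline duration of the trial.

```python
import numpy as np

def baseline_correct(signal: np.ndarray, sfreq: float, baseline_s: float = 30.0) -> np.ndarray:
    """Subtract the mean of the baseline period from the whole trial.

    baseline_s is a placeholder; use the actual baseline duration of the trial.
    """
    n_baseline = int(baseline_s * sfreq)
    return signal - signal[:n_baseline].mean()

# corrected_temp = baseline_correct(temp_signal, sfreq=256.0)
```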

I will also show GSR and respiration results for completeness, but it is not possible (though we can speculate :)) to derive similar insights directly from those signals, since unlike temperature they vary much faster.

Figure: GSR signals for multiple stimuli with different preliminary emotion tags for subjects 9 and 10. The vertical line marks the end of the baseline. The legends contain the stimulus name along with the associated preliminary emotion and the arousal/valence ratings given. Source [3]

Looking at the response to 111.avi for subject 17, we see almost no response at all. Such subject-level variation is quite common in the MAHNOB-HCI dataset and requires a deeper analysis.

Figure: Resp signal for subject 17: frequency spectrum, signal with trend removed, and raw signal (left to right)

The signals seem quite similar except for large variations here and there, which might be artefacts of gasping or laughing (this can be verified by looking at the recorded participant video).
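The detrended signal and the frequency spectrum shown above can be reproduced with SciPy. A minimal sketch, assuming the respiration channel is a 1-D NumPy array sampled at sfreq Hz; the synthetic example signal is only there to make the snippet self-contained.

```python
import numpy as np
from scipy.signal import detrend, welch

def resp_spectrum(resp: np.ndarray, sfreq: float):
    """Remove the linear trend and estimate the power spectrum (Welch's method)."""
    resp_detrended = detrend(resp)
    freqs, power = welch(resp_detrended, fs=sfreq, nperseg=int(30 * sfreq))
    return resp_detrended, freqs, power

# Synthetic 0.25 Hz breathing-like signal with a slow drift, just to run the sketch.
sfreq = 256.0
t = np.arange(0, 120, 1 / sfreq)
resp = np.sin(2 * np.pi * 0.25 * t) + 0.01 * t
_, freqs, power = resp_spectrum(resp, sfreq)
print(freqs[np.argmax(power)])  # dominant breathing frequency, ~0.25 Hz
```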

6. Final Remarks

The MAHNOB-HCI dataset is not an easy dataset to analyse and derive insights from: there is strong variance in reactions to a given stimulus among different subjects. Even when there seems to be a consensus on the valence and arousal ratings, the perceived scale of these ratings varies a lot between subjects, which is evident from their physiological signals. For example, I might be very joyful and still give a valence rating of only 5, relative to what I perceive as the most joyful I have ever been.

The dataset is also quite noisy; I found a number of trials where the ECG signals contain a lot of sensor noise. So a lot of cleaning up might be required before doing any analysis.

Temperature does not seem to be representative of the stimulus, as we saw in the raw temperature signals.

Proceed to part-2 for a literature review of data fusion in affective computing in the context of deep learning, and to part-3 for a deeper analysis of MAHNOB-HCI.

[1]: Enrique Muñoz-de-Escalona and José Cañas. Online measuring of available resources.

[2]: Mohammad Soleymani, Jeroen Lichtenauer, Thierry Pun, and Maja Pantic. A multimodal database for affect recognition and implicit tagging. IEEE Trans. Affect. Comput., 3(1):42–55, Jan 2012.

[3]: Ashutosh Singh. Master’s thesis, Fraunhofer IIS and University of Erlangen-Nuremberg.
