Affective Computing using Deep Learning-Part 3: Latent Subgroups Analysis for MAHNOB-HCI

Ashutosh Singh
6 min read · Sep 3, 2023


Please make sure you read Part-1 for exploratory analysis and intuitions from affective computing datasets, and if you want to explore work around data fusion for affective computing using deep learning, please read Part-2.

The results of LOSO (Leave One Subject Out) evaluation did not look good in our valence/arousal classification experiments, for both unimodal and fusion methods: the predictions had very high uncertainty [5]. We decided to explore the MAHNOB-HCI dataset for emotion classification a little more, first using simple handcrafted statistical features and then with more advanced methods, to study the characteristics of the dataset from a high/low valence (arousal) and subject-level behaviour perspective.

Handcrafted Statistical Features

We use simple statistical features as described in [1], namely the max, min, mean, median, variance, and the difference between the max and min values, computed for the original signal as well as for its 1st- and 2nd-order differences, as described in Figure-1.

Figure-1: Flow chart for this analysis. We use t-SNE fit with `n_components=2`. Source: Master thesis presentation by Ashutosh Singh @ FAU Erlangen and Fraunhofer IIS, Erlangen.
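For concreteness, here is a minimal sketch of this feature-extraction step in Python. The function names and the placeholder data are mine, not from the thesis code; only the list of statistics comes from [1].

```python
import numpy as np
from sklearn.manifold import TSNE

def stat_features(x):
    """Six summary statistics of a 1-D signal segment."""
    return np.array([x.max(), x.min(), x.mean(),
                     np.median(x), x.var(), x.max() - x.min()])

def handcrafted_features(signal):
    """18-D feature vector: the six statistics of the raw signal
    and of its 1st- and 2nd-order differences, following [1]."""
    return np.concatenate([stat_features(signal),
                           stat_features(np.diff(signal, n=1)),
                           stat_features(np.diff(signal, n=2))])

# Hypothetical usage: `segments` would be the per-trial baseline/stimulus
# slices of a physiological signal (Resp, GSR, ...) as 1-D arrays.
segments = [np.random.randn(2560) for _ in range(100)]  # placeholder data
features = np.stack([handcrafted_features(s) for s in segments])
embedding = TSNE(n_components=2).fit_transform(features)
```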

The idea behind this analysis was to see whether there is some difference between the behaviour of these signals during baseline and stimulus periods (see Part-1), and also whether different stimuli (different videos) lead to any differences. These features can be used to see how the signals differ when presented with different types of stimuli (anger, disgust, etc., from the preliminary emotion tags; see Part-1).

Figure-2: t-SNE visualisation of simple statistical features for the Resp signal. ‘x’ marks denote baseline periods and ‘o’ marks stimulus periods.
Figure-3: t-SNE visualisation of simple statistical features for the GSR signal. ‘x’ marks denote baseline periods and ‘o’ marks stimulus periods.

There aren’t any visible clusters, either between baseline- and stimulus-period features or between different emotion tags; see how ‘joy’ and ‘anger/sadness’ are spread out almost identically. These results show that these simple statistical features are not good enough for emotion classification, at least on the MAHNOB-HCI dataset.

Latent Features from Unsupervised Learning

To overcome the shortcomings of the handcrafted features, we decided to use unsupervised learning to model the data. The idea was that we could then compare the distributions of signals from different stimuli to get some estimate of any significant difference between, for example, high and low valence samples.

Unsupervised learning makes a lot of sense in this case, as the MAHNOB-HCI dataset has emotion labels for only a small fraction of trials compared to the total number of trials (see Part-1 for this breakdown), so we can possibly learn rich representations from the unlabelled data as well.

Self-supervised learning may be a better approach if these representations are to be used for downstream tasks, but we stick to unsupervised learning with a VAE, given that our task is comparing distributions.
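For reference, a minimal PyTorch sketch of this kind of VAE. The actual architecture and hyper-parameters in the thesis differ; `latent_dim=16` and the simple MLP encoder/decoder here are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignalVAE(nn.Module):
    """Minimal VAE with a diagonal-Gaussian latent space."""
    def __init__(self, in_dim, latent_dim=16, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    """Reconstruction term plus KL divergence to the standard normal prior."""
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```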

The generated latent representations z follow a diagonal Gaussian distribution, z ~ N(μ, diag(σ²)).

Since the covariance matrix is diagonal, meaning there is no covariance amongst the dimensions of the latent space, each dimension may be treated independently.

After generating latent representations with the VAE, we primarily use them to analyse subgroups in the dataset from multiple perspectives. We wish to confirm the presence of any such subgroups from the perspective of the stimuli, the subjects, and the self-ratings for valence and arousal. Any dominant subgroups (for example subject-level ones) which are conditionally independent of the class labels might explain covariate shifts in the classifier training data, leading to unstable training and high uncertainty in predictions.

The presence of such groups is identified using a distance metric between the distributions of these groups. We use the Wasserstein metric [2, 3] to measure this distance.

In short, subgroups based on self-ratings for valence and arousal should (ideally) have larger distances than any other groups in the dataset.

We calculate this distance for each dimension of the latent space (owing to the diagonal distribution) to get a vector d, which is then aggregated into the final scalar distance, as illustrated in Figure-4.

Figure-4: Illustration of process used to calculate the distance between the groups.
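A sketch of this computation using `scipy.stats.wasserstein_distance`. Note that aggregating the per-dimension vector with an L2 norm is my assumption here; the thesis may aggregate differently.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def group_distance(z_a, z_b):
    """Distance between two groups of latent codes, each of shape
    (n_samples, n_dims). The diagonal posterior lets us compute a
    1-D Wasserstein distance per latent dimension."""
    per_dim = np.array([wasserstein_distance(z_a[:, k], z_b[:, k])
                        for k in range(z_a.shape[1])])
    return np.linalg.norm(per_dim)  # assumption: L2 norm as aggregation
```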

To check how inter-class distances look for a standard dataset, we also report the Wasserstein distances for the MNIST dataset [4], where the separate classes are treated as groups.

Figure-5: Wasserstein distances calculated for the distributions of the 10 classes of the MNIST dataset, using the method illustrated in Figure-4.

The first observation to make here is that the distribution of class 0 is farthest from that of class 1, the digit that looks least like a 0.

Figure-6: Wasserstein distances between HV, LV, HA and LA subgroups. Groups were created based on the nature of the stimulus, using the preliminary emotion tags: amusement and joy videos are treated as high-valence stimuli, and sadness, fear and disgust videos as low-valence stimuli.
Figure-7: Wasserstein distances between HV, LV, HA and LA subgroups. Groups were created based on the participants’ self-ratings.

The distances amongst the different subgroups of MNIST are quite high compared to those for the MAHNOB-HCI dataset. However, MNIST is much simpler than MAHNOB, so the next question we had was: “Are the distances between high/low valence (or arousal) large enough?”

In the context of MAHNOB-HCI we had an obvious way of answering this question, at least to a certain extent: by using the inter-subject variance as a reference.

Again, at this point I think it would be interesting to try an approach where the feature representations are learned with a loss that explicitly tries to decrease these inter-subject distances while increasing the rating-based differences.

We can repeat the same exercise of creating subgroups, but this time with one group for each subject (participant).

Figure-8: Distances between groups for each subject.

The mean value of the inter-subject distances comes out to be 0.0313 ± 0.009, which is much larger than the highest distances between both the rating-based and the stimulus-based groups, which are 0.0069 and 0.0032 respectively.
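A short sketch of how these subject-level statistics can be computed, reusing `group_distance()` from the earlier sketch. `latents_by_subject`, a mapping from subject id to that subject’s latent codes, is hypothetical.

```python
import itertools
import numpy as np

def inter_subject_stats(latents_by_subject):
    """Mean and std of pairwise distances between all subject groups,
    using group_distance() from the sketch above."""
    pairs = itertools.combinations(latents_by_subject.values(), 2)
    d = np.array([group_distance(a, b) for a, b in pairs])
    return d.mean(), d.std()
```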

Conclusion

We hypothesise that the high inter-subject differences, compared to the rating-based differences in the dataset, might be causing the high uncertainty that we see in the classifier predictions during our experiments. Fusion helps the performance a little, but the uncertainty is still too high to be able to evaluate the performance using LOSO.

We also see a common trend of high-accuracy results in publications that are not using LOSO.

Unsupervised and self-supervised learning can help a lot, since the MAHNOB-HCI dataset has a lot of samples without emotion labels. We can also skip LOSO and deal with this high inter-subject variance if we decide to look at the emotion recognition system as personalised for each subject: fine-tuned for one specific person, and perhaps deployed on their smart-watch.

[1]: Wei Wei, Qingxuan Jia, Yongli Feng, and Gang Chen. Emotion recognition based on weighted fusion strategy of multichannel physiological signals. Comput Intell Neurosci, 2018:5296523, July 2018.

[2]: L. N. Vaserstein. Markov processes over denumerable products of spaces, describing large systems of automata. Probl. Peredachi Inf., 5(3), 1969.

[3]: L. V. Kantorovich. Mathematical methods of organizing and planning production (1939). Management Science, 6(4), 1960.

[4]: Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.

[5]: Master thesis — Ashutosh Singh @ Fraunhofer-IIS and University of Erlangen-Nuremberg
