Affective Computing Using Deep Learning-Part 2: Data Fusion, Introduction and Literature Review

Ashutosh Singh
4 min read · Aug 27, 2023


Data fusion is the process of combining data from multiple modalities, with the goal of extracting complementary information from each modality to build representations that are more accurate than those provided by any individual modality for machine learning tasks such as classification or regression. Fusion strategies are commonly grouped into three categories: Early Fusion, Intermediate Fusion and Late Fusion.

Figure: Illustrations of the different fusion strategies: (a) Late Fusion, (b) Early Fusion, (c) Intermediate Fusion. Functions represent parameterised models; the red arrow represents backpropagation during training. Source: [1]

Late Fusion

Late fusion, also known as decision fusion, is the process of combining the decisions made by separate machine/deep learning models, one per modality. For example, in classification, the class probabilities predicted for each modality may be averaged and normalised across all modalities, or combined using boosting-style techniques.
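As a concrete illustration, here is a minimal late-fusion sketch in PyTorch. The modality names, class count and averaging rule are illustrative assumptions, not taken from any of the cited papers: each modality has its own trained classifier, and their class probabilities are averaged before the final decision is taken.

```python
# Minimal late-fusion sketch (assumed setup, not from any cited paper):
# each modality has its own classifier; class probabilities are averaged
# across modalities and the argmax is the fused decision.
import torch
import torch.nn.functional as F

def late_fusion(logits_per_modality):
    """logits_per_modality: list of tensors, each of shape (batch, n_classes)."""
    probs = [F.softmax(logits, dim=-1) for logits in logits_per_modality]
    fused = torch.stack(probs, dim=0).mean(dim=0)   # average over modalities
    return fused.argmax(dim=-1)                     # fused class decision

# Example with two hypothetical modalities (e.g. EEG and video), 4 classes
eeg_logits = torch.randn(8, 4)
video_logits = torch.randn(8, 4)
print(late_fusion([eeg_logits, video_logits]))
```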

Intermediate Fusion

Intermediate fusion is the process of combining intermediate representations from different modalities. These intermediate representations are learnt separately for every modality before the fusion step. A simple example is having a separate encoder for each modality and then fusing the representations from these encoders through some interaction. In this work we focus on intermediate fusion.
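The following is a minimal intermediate-fusion sketch; the layer sizes, modality names and the choice of concatenation as the interaction are illustrative assumptions, not the architecture used in the thesis. Each modality gets its own encoder, the latent representations are concatenated, and a shared head produces the class logits.

```python
# Minimal intermediate-fusion sketch (sizes and modality names are assumptions):
# per-modality encoders, concatenation of latents, shared classification head.
import torch
import torch.nn as nn

class IntermediateFusionNet(nn.Module):
    def __init__(self, eeg_dim=128, video_dim=512, latent_dim=64, n_classes=4):
        super().__init__()
        self.eeg_encoder = nn.Sequential(nn.Linear(eeg_dim, latent_dim), nn.ReLU())
        self.video_encoder = nn.Sequential(nn.Linear(video_dim, latent_dim), nn.ReLU())
        self.head = nn.Linear(2 * latent_dim, n_classes)  # fusion by concatenation

    def forward(self, eeg, video):
        z = torch.cat([self.eeg_encoder(eeg), self.video_encoder(video)], dim=-1)
        return self.head(z)

model = IntermediateFusionNet()
logits = model(torch.randn(8, 128), torch.randn(8, 512))
print(logits.shape)  # torch.Size([8, 4])
```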

Early Fusion

Early fusion refers to the process of joining/combining multiple modalities before feeding them to the machine/deep learning model. The modalities may be combined in multiple ways, for example channel-wise concatenation of PSD-based images.
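A minimal early-fusion sketch follows, assuming PSD-based images of matching spatial size from two modalities: the inputs are concatenated channel-wise and fed to a single CNN. The shapes and the four emotion classes are assumptions for illustration.

```python
# Minimal early-fusion sketch (assumed shapes): PSD-based "images" from two
# modalities are concatenated along the channel axis and fed to one CNN.
import torch
import torch.nn as nn

eeg_psd = torch.randn(8, 3, 32, 32)   # e.g. PSD images from EEG frequency bands
ppg_psd = torch.randn(8, 1, 32, 32)   # e.g. PSD image from a peripheral signal
fused_input = torch.cat([eeg_psd, ppg_psd], dim=1)   # channel-wise concatenation

cnn = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 4),                                 # 4 emotion classes (assumed)
)
print(cnn(fused_input).shape)  # torch.Size([8, 4])
```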

Deep Learning based Fusion for Affective Computing — Literature Review

There has been a lot of work on data fusion for Affective Computing and Emotion Recognition. I am only listing a few papers that I researched during the course of my thesis and found most interesting for my work. Their common attribute is the use of deep learning for automatic feature extraction instead of handcrafted features (although there are many works in that direction as well).

DeepVANet: A Deep End-to-End Network for Multi-modal Emotion Recognition [2]

  • Fuses video and bio-signals at the feature level.
  • Bio-signals are fused in an early-fusion style as channels of 1D spatial signals.
  • 1D convolutions and LSTMs are used as modelling units.
  • Mentions per-subject evaluation, similar to LOSO; data leakage found at the trial level.

Utilizing Deep Learning Towards Multi-modal Bio-sensing and Vision-based Affective Computing [3]

  • Reports results for individual modalities along with fusion results.
  • Uses a pre-trained VGG-16 network to extract features from PSD images derived from EEG signals.
  • For ECG/PPG, the signals are converted into spectrograms and again a pre-trained VGG-16 is used to extract features; a minimal feature-extraction sketch follows this list.
  • Validation with 10-fold cross-validation and an 80/20 split; LOSO not employed.
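The sketch below shows what feature extraction with a pre-trained VGG-16 might look like for PSD/spectrogram images; it is not the authors' exact pipeline, and the input preprocessing and shapes are assumptions.

```python
# Rough sketch of feature extraction from PSD/spectrogram images with a
# pre-trained VGG-16 (not the authors' exact pipeline; preprocessing assumed).
import torch
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten())
feature_extractor.eval()

spectrogram_batch = torch.randn(8, 3, 224, 224)      # PSD/spectrogram images as RGB
with torch.no_grad():
    features = feature_extractor(spectrogram_batch)  # (8, 25088) feature vectors
print(features.shape)
```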

Deep Learning Method for Selecting Effective Models and Feature Groups in Emotion Recognition Using an Asian Multimodal Database [4]

  • Uses a genetic algorithm to select models for the given dataset.
  • An LSTM model is used for each EEG channel, and a simple concatenation of the final LSTM outputs is fed to an FCN.
  • LOSO not employed.
  • The authors obtained continuous emotional-state tagging from the MAHNOB-HCI authors (this could be a good idea and might solve the problem of localising emotion that I also faced during my thesis).

Multimodal Emotion Recognition Using a Hierarchical Fusion Convolutional Neural Network [5]

  • Handcrafted features are extracted individually from EEG and PPS (PPG, GSR, Resp, Temp).
  • The preprocessed EEG and PPS signals are formed into a unified vector and then used as input to a CNN to extract features.
  • Handcrafted features and convolutional features are combined with weighted fusion.
  • Random Forest is used as the final classifier. LOSO not employed.

Automatic Emotion Recognition Using Temporal Multimodal Deep Learning [6]

  • EEG and BVP signals are used.
  • The signals have individual CNN encoders, and the CNN outputs are concatenated at each time step t. The authors call this their early fusion; a rough sketch follows this list.
  • They also present an alternative where the modalities have individual CNN-LSTM encoders and the final outputs are concatenated for classification; this is presented as late fusion in the paper.
  • Segments of length 10 s proved to be the best. LOSO is employed, but on a dataset other than MAHNOB-HCI.
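The sketch below is a rough illustration of the per-time-step feature concatenation idea described above; all shapes, layer sizes and the final classification step are assumptions rather than the paper's configuration.

```python
# Rough sketch of per-time-step CNN feature concatenation followed by an LSTM
# (shapes and layer sizes are assumptions, not the paper's configuration).
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    def __init__(self, eeg_ch=32, bvp_ch=1, feat=16, n_classes=4):
        super().__init__()
        self.eeg_cnn = nn.Sequential(nn.Conv1d(eeg_ch, feat, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.bvp_cnn = nn.Sequential(nn.Conv1d(bvp_ch, feat, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.lstm = nn.LSTM(2 * feat, 32, batch_first=True)
        self.head = nn.Linear(32, n_classes)

    def forward(self, eeg, bvp):
        # eeg: (batch, T, eeg_ch, samples), bvp: (batch, T, bvp_ch, samples)
        fused = []
        for t in range(eeg.size(1)):                   # encode each time step
            f = torch.cat([self.eeg_cnn(eeg[:, t]), self.bvp_cnn(bvp[:, t])], dim=-1)
            fused.append(f)
        out, _ = self.lstm(torch.stack(fused, dim=1))  # (batch, T, 32)
        return self.head(out[:, -1])                   # classify from last time step

model = TemporalFusion()
print(model(torch.randn(2, 10, 32, 128), torch.randn(2, 10, 1, 128)).shape)
```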

CNN and LSTM-Based Emotion Charting Using Physiological Signals [7]

  • EEG is preprocessed into 2D images, while ECG and GSR are used as 1D time-series signals.
  • ECG and GSR are processed using a 1D CNN-LSTM.
  • The final fusion is a majority-vote mechanism across the ECG, EEG and GSR outputs; a minimal sketch follows this list.
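The sketch below illustrates majority voting across three modality classifiers; the tie-breaking behaviour and shapes are assumptions rather than the paper's exact mechanism.

```python
# Minimal majority-vote sketch across three modality classifiers
# (tie-breaking behaviour is an assumption, not the paper's exact rule).
import torch

def majority_vote(preds_per_modality):
    """preds_per_modality: list of class-index tensors, each of shape (batch,)."""
    stacked = torch.stack(preds_per_modality, dim=0)   # (n_modalities, batch)
    return torch.mode(stacked, dim=0).values           # most frequent class per sample

eeg_pred = torch.tensor([0, 1, 2, 1])
ecg_pred = torch.tensor([0, 1, 1, 1])
gsr_pred = torch.tensor([2, 1, 2, 0])
print(majority_vote([eeg_pred, ecg_pred, gsr_pred]))  # tensor([0, 1, 2, 1])
```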

Make sure you read Part-1 for exploratory analysis and intuitions from affective computing datasets, and Part-3 for deeper latent analysis using Variational Autoencoders.

[1]: Master's thesis — Ashutosh Singh @ Fraunhofer-IIS and University of Erlangen-Nuremberg

[2]: Yuhao Zhang, Md Zakir Hossain, and Shafin Rahman. DeepVANet: A deep end-to-end network for multi-modal emotion recognition. In Carmelo Ardito, Rosa Lanzilotti, Alessio Malizia, Helen Petrie, Antonio Piccinno, Giuseppe Desolda, and Kori Inkpen, editors, Human-Computer Interaction – INTERACT 2021, pages 227–237, Cham, 2021. Springer International Publishing.

[3]: Siddharth, Tzyy-Ping Jung, and Terrence J. Sejnowski. Utilizing deep learning towards multi-modal bio-sensing and vision-based affective computing. CoRR, abs/1905.07039, 2019.

[4]: Jun-Ho Maeng, Dong-Hyun Kang, and Deok-Hwan Kim. Deep learning method for selecting effective models and feature groups in emotion recognition using an asian multimodal database. Electronics, 9(12), 2020.

[5]: Yong Zhang, Cheng Cheng, and Yidie Zhang. Multimodal emotion recognition using a hierarchical fusion convolutional neural network. IEEE Access, 9:7943–7951, 2021.

[6]: Bahareh Nakisa, Mohammad Naim Rastgoo, Andry Rakotonirainy, Frederic Maire, and Vinod Chandran. Automatic emotion recognition using temporal multimodal deep learning. IEEE Access, 8:225463–225474, 2020.

[7]: Muhammad Najam Dar, Muhammad Usman Akram, Sajid Gul Khawaja, and Amit N. Pujari. CNN and LSTM-based emotion charting using physiological signals. Sensors, 20(16), 2020.
