Deep Hierarchical Fusion with Application in Sentiment Analysis

Efthymis Georgiou
Behavioral Signals - Emotion AI
4 min read · Oct 21, 2019

This article briefly describes our work entitled Deep Hierarchical Fusion with Application in Sentiment Analysis, and its purpose is two-fold. The first is to introduce the reader to the multimodal machine learning research area and to underline the need to exploit different modalities in order to capture all of the semantic and affective information in a message. The second is to give an insight into our work and, specifically, to describe how exactly the proposed algorithm combines (fuses) different modalities.

Why go Multimodal

Human communication is a complex process which naturally involves multiple modalities. For example, the spoken sentence “Get out of here” conveys negative sentiment when expressed in an angry manner. However, the same sentence spoken in a positive way might be part of a friendly conversation. This example demonstrates the need to take the acoustic modality into account.

Conversely, the sentences “I love you” and “I hate you”, spoken in the same calm vocal tone, cannot be classified as positive and negative respectively without exploiting the supplementary information carried by the textual modality.

These simple examples illustrate the need to take into account both the acoustic and the textual modality when identifying the emotion carried in a message. This task is known as Sentiment Analysis, and the procedure of combining different modalities is known as Multimodal Fusion.

Sentiment Analysis (Photo by Fausto García on Unsplash)

The Basic Idea

The proposed architecture consists of three main parts. The first is a textual sentiment analyzer and the second is an acoustic sentiment analyzer. The heart of the proposed approach lies in the third part, the fusion architecture called Deep Hierarchical Fusion (DHF). The acoustic and textual networks are used as encoders: each encodes the information of a single modality in a meaningful way and supplies the fusion architecture with these single-modality representations.
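To make this concrete, here is a minimal PyTorch-style sketch of the two unimodal encoders. These are not the paper's exact networks; the layer types and dimensions below are illustrative assumptions only:

```python
import torch
import torch.nn as nn

class UnimodalEncoder(nn.Module):
    """Encodes a single modality into per-word and sentence-level representations."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        # x: (batch, seq_len, input_dim), e.g. word embeddings or word-aligned acoustic features
        outputs, (h_n, _) = self.rnn(x)
        # per-word encodings for the fusion levels, plus a sentence-level summary
        return outputs, h_n[-1]

# Hypothetical dimensions, for illustration only.
text_encoder = UnimodalEncoder(input_dim=300, hidden_dim=128)   # e.g. word embeddings
audio_encoder = UnimodalEncoder(input_dim=74, hidden_dim=128)   # e.g. acoustic features
```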

The basic idea behind DHF is to fuse at multiple interconnected stages and to constantly propagate both the multimodal and the unimodal information forward.

Deep Hierarchical Fusion architecture [1].

Let’s explain the idea a bit more. The information “flows” in two directions, denoted by vertical and horizontal arrows. The vertical arrows represent the unimodal encodings that are fed to the fusion classifier, whereas the horizontal ones represent the forward propagation of fused information. The fusion levels are hierarchically arranged, because the goal is to predict the sentiment (third level) of a sentence (second level), which naturally consists of a combination of words (first level).

For instance, the sentence “Get out of here” is processed at the first level word-by-word (e.g. “Get”, then “Get out”, etc.), and a representation is then fed to the sentence level. The sentence level processes this representation along with the single-modality ones and feeds its output to the high level, which in turn extracts a final representation suitable for classification.
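The sketch below puts the three levels together. It is only an illustration of the flow described above, not the paper's exact model: plain concatenation stands in for the fusion operators, and it reuses the hypothetical encoders from the previous snippet (assuming the acoustic features are word-aligned, so both sequences share a length):

```python
class DeepHierarchicalFusion(nn.Module):
    """Word -> sentence -> high-level fusion; concatenation is a stand-in operator."""
    def __init__(self, hidden_dim=128, num_classes=2):
        super().__init__()
        # Word level: fuse the per-word text/audio encodings with an RNN.
        self.word_rnn = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        # Sentence level: combine the word-level output with the unimodal
        # sentence encodings (the vertical arrows in the figure).
        self.sentence_layer = nn.Linear(3 * hidden_dim, hidden_dim)
        # High level: extract the final representation and classify.
        self.high_layer = nn.Linear(hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text_steps, audio_steps, text_sent, audio_sent):
        # Word level: "Get", then "Get out", and so on.
        _, (h_n, _) = self.word_rnn(torch.cat([text_steps, audio_steps], dim=-1))
        word_repr = h_n[-1]
        # Sentence level: fused info flows forward together with unimodal info.
        sent_repr = torch.relu(self.sentence_layer(
            torch.cat([word_repr, text_sent, audio_sent], dim=-1)))
        # High level: representation suitable for sentiment classification.
        high_repr = torch.relu(self.high_layer(sent_repr))
        return self.classifier(high_repr)

# Example forward pass with random inputs (batch of 4 sentences, 10 words each).
dhf = DeepHierarchicalFusion()
text_steps, text_sent = text_encoder(torch.randn(4, 10, 300))
audio_steps, audio_sent = audio_encoder(torch.randn(4, 10, 74))
logits = dhf(text_steps, audio_steps, text_sent, audio_sent)  # shape: (4, 2)
```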

This idea naturally introduces depth into the fusion architecture itself, since the multiple fusion levels are themselves hierarchically arranged.

The Results

A set of experiments is described in the paper in order to demonstrate the benefits of the proposed approach. The first result is that, in the audio-textual domain, the proposed architecture outperforms all competing approaches by a small margin. Even more interesting is the fact that the fusion experiments show an overall performance boost of 3.1%, which, according to the paper, is the largest boost reported for audio-textual sentiment analysis tasks.

An ablation study suggests that the most important module (in terms of accuracy) is the high level, followed by the sentence level and finally the word level. The last experiment explores the noise robustness of the model. One may observe that for a reasonable amount of noise the fusion classifier outperforms the single-modality ones, while a significant noise injection naturally degrades the approach.
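For readers who want to try a similar robustness check, the snippet below shows one simple way to inject noise into the acoustic features, reusing the objects from the previous sketch. The paper's actual noise protocol may differ, so treat this purely as an assumed setup:

```python
def add_noise(features, scale):
    """Add zero-mean Gaussian noise; a larger scale means a noisier input."""
    return features + scale * torch.randn_like(features)

# Evaluate the fused model against unimodal baselines at increasing noise levels.
for scale in [0.0, 0.1, 0.5, 1.0]:
    noisy_audio = add_noise(torch.randn(4, 10, 74), scale)
    audio_steps, audio_sent = audio_encoder(noisy_audio)
    logits = dhf(text_steps, audio_steps, text_sent, audio_sent)
    # ...compare accuracy of fused vs. single-modality predictions here...
```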

The efficacy of the proposed architecture is attributed to the multiple fusion levels, which introduce depth, as well as to the reuse of both fused and single-modality representations.

The paper “Deep Hierarchical Fusion with Application in Sentiment Analysis” [1] was presented at Interspeech 2019 in Graz, Austria, and is publicly available at the following link:
https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3243.pdf

[1] Efthymios Georgiou, Charilaos Papaioannou, and Alexandros Potamianos. “Deep Hierarchical Fusion with Application in Sentiment Analysis.” Proc. Interspeech 2019 (2019): 1646–1650.
