A Paper A Day: #25 Hierarchical Attention Networks for Document Classification

Jun 17, 2017

Today we discuss a paper by Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy on document classification using hierarchical attention networks.

Summary

The paper proposes a hierarchical attention network for document classification. The model has two distinctive characteristics:

It has a hierarchical structure that mirrors the hierarchical structure of documents;
It has two levels of attention mechanisms applied at the word- and sentence-level, enabling it to attend differentially to more and less important content when constructing the document representation.

Experiments conducted on six large-scale text classification tasks demonstrate that the proposed architecture outperforms previous methods by a substantial margin. Visualization of the attention layers illustrates that the model selects qualitatively informative words and sentences.

Hierarchical Attention Networks

The overall architecture of the Hierarchical Attention Network (HAN) is shown in the figure below. It consists of several parts: a word sequence encoder, a word-level attention layer, a sentence encoder and a sentence-level attention layer.

Hierarchical Attention Network (credit: Yang et al. 2016)

Assume that a document has L sentences s_i, and that each sentence contains T_i words. The proposed model projects the raw document into a vector representation, on which a classifier is built to perform document classification.

Word Encoder: Given a sentence with words w_it, t ∈ [0, T], the words are first embedded into vectors through an embedding matrix W_e. A bidirectional GRU then produces an annotation for each word by summarizing information from both directions, so that each annotation incorporates the contextual information around the word.
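The word encoder can be written compactly as follows. This is a sketch in the notation of Yang et al.; the symbols x_it (word embedding) and h_it (word annotation) do not appear in the summary above and are introduced here for illustration. The forward GRU reads the sentence from the first word to the last, the backward GRU from the last word to the first, and the two hidden states are concatenated.

```latex
\begin{align*}
x_{it} &= W_e\, w_{it} \\
\overrightarrow{h}_{it} &= \overrightarrow{\mathrm{GRU}}(x_{it}) \\
\overleftarrow{h}_{it} &= \overleftarrow{\mathrm{GRU}}(x_{it}) \\
h_{it} &= [\overrightarrow{h}_{it};\ \overleftarrow{h}_{it}]
\end{align*}
```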

Word Attention: Not all words contribute equally to the representation of the sentence meaning. Hence, an attention mechanism is used to extract the words that are important to the meaning of the sentence, and the representations of those informative words are aggregated to form a sentence vector.
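The word-attention step can be sketched in plain Python. This is a minimal sketch, assuming the formulation used in the paper: each annotation h is passed through a one-layer MLP, u = tanh(W h + b), scored against a context vector, normalized with a softmax, and used to weight the annotations. The function name attention_pool, the toy vectors, and the fixed parameters are illustrative assumptions; in the actual model W, b, and the context vector are learned during training.

```python
import math

def attention_pool(annotations, W, b, context):
    """Collapse a list of annotation vectors into one vector via attention.

    annotations: list of T vectors (lists of floats, each of size d)
    W: k x d projection matrix (list of rows); b: bias of size k
    context: context vector of size k (learned in the paper, fixed here)
    Returns (weights, pooled); the weights sum to 1.
    """
    scores = []
    for h in annotations:
        # Hidden representation u = tanh(W h + b)
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b_i)
             for row, b_i in zip(W, b)]
        # Importance score: dot product of u with the context vector
        scores.append(sum(u_i * c_i for u_i, c_i in zip(u, context)))

    # Softmax over positions (subtract the max for numerical stability)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the original annotations
    dim = len(annotations[0])
    pooled = [sum(a * h[j] for a, h in zip(weights, annotations))
              for j in range(dim)]
    return weights, pooled

# Toy usage: three 2-d "word annotations", identity projection,
# and a hand-picked context vector (all hypothetical values).
words = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
context = [1.0, 1.0]
weights, sentence_vec = attention_pool(words, W, b, context)
```

In this toy setup the second word receives the largest weight, so the resulting sentence vector is dominated by its annotation. The same function could be applied one level up, to sentence annotations with a sentence-level context vector.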

Sentence Attention: To reward sentences that are clues for correctly classifying a document, an attention mechanism is applied again, this time with a sentence-level context vector that measures the importance of each sentence.
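The sentence-level step mirrors the word-level one. As a sketch in the paper's notation (the symbols below are not defined in this summary and are introduced for illustration): h_i is the annotation of sentence i produced by the sentence encoder, u_s is the sentence-level context vector, α_i is the resulting attention weight, and v is the document vector. The last line shows the softmax classifier built on top of v, with W_c and b_c as its parameters.

```latex
\begin{align*}
u_i &= \tanh(W_s h_i + b_s) \\
\alpha_i &= \frac{\exp(u_i^\top u_s)}{\sum_{j} \exp(u_j^\top u_s)} \\
v &= \sum_i \alpha_i h_i \\
p &= \mathrm{softmax}(W_c v + b_c)
\end{align*}
```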

Experiments

The effectiveness of the model is evaluated on six large-scale document classification data sets. These data sets fall into two types of document classification tasks: sentiment estimation and topic classification. 80% of the data is used for training, 10% for validation, and the remaining 10% for testing.
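An 80/10/10 split like the one described above might look as follows. This is only a sketch: the summary does not say whether the data is shuffled or stratified, so the random shuffle with a fixed seed and the helper name split_80_10_10 are assumptions made for illustration.

```python
import random

def split_80_10_10(examples, seed=0):
    """Shuffle the examples and split them into train/validation/test parts."""
    items = list(examples)
    # Fixed seed so the split is reproducible (an assumption, not from the paper)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

# Toy usage with 1000 integer "documents"
train, val, test = split_80_10_10(range(1000))
```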

Experimental results demonstrate that the model performs significantly better than previous methods. Visualization of the attention layers illustrates that the model is effective in picking out important words and sentences.

Written by Amr Sharaf