A Paper A Day: #25 Hierarchical Attention Networks for Document Classification
Today we discuss a paper by Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy on document classification using hierarchical attention networks.
The paper proposes a hierarchical attention network for document classification. The model has two distinctive characteristics:
- It has a hierarchical structure that mirrors the hierarchical structure of documents;
- It has two levels of attention mechanisms applied at the word- and sentence-level, enabling it to attend differentially to more and less important content when constructing the document representation.
Experiments conducted on six large-scale text classification tasks demonstrate that the proposed architecture outperforms previous methods by a substantial margin. Visualization of the attention layers illustrates that the model selects qualitatively informative words and sentences.
Hierarchical Attention Networks
The overall architecture of the Hierarchical Attention Network (HAN) is shown in the figure below. It consists of several parts: a word sequence encoder, a word-level attention layer, a sentence encoder and a sentence-level attention layer.
Assume that a document has L sentences s_i and each sentence contains T_i words. The proposed model projects the raw document into a vector representation, on which a classifier is built to perform document classification.
Word Encoder: Given a sentence with words w_it, t ∈ [1, T], the words are first embedded into vectors through an embedding matrix W_e. A bidirectional GRU then produces annotations of the words by summarizing information from both directions, thereby incorporating contextual information into each annotation.
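To make the encoder concrete, here is a minimal NumPy sketch of a bidirectional GRU over a sentence's word embeddings. The parameter names (Wz, Uz, etc.) and the dictionary layout are my own illustrative choices, not the paper's notation, and gates are shown without bias terms for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, P):
    """One GRU step: update gate z, reset gate r, candidate state h~."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h)
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h)
    h_tilde = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h))
    return (1 - z) * h + z * h_tilde

def bi_gru(X, P):
    """Run a GRU forward and backward over word embeddings X (T x d_in)
    and concatenate the hidden states into contextual annotations (T x 2d)."""
    T = len(X)
    d = P["Wz"].shape[0]
    h_f, h_b = np.zeros(d), np.zeros(d)
    fwd, bwd = [], [None] * T
    for t in range(T):                    # left-to-right pass
        h_f = gru_step(X[t], h_f, P)
        fwd.append(h_f)
    for t in reversed(range(T)):          # right-to-left pass
        h_b = gru_step(X[t], h_b, P)
        bwd[t] = h_b
    return np.concatenate([np.stack(fwd), np.stack(bwd)], axis=1)
```

In the actual model the forward and backward GRUs have separate parameters; they are shared here only to keep the sketch short.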
Word Attention: Not all words contribute equally to the representation of the sentence meaning. Hence, an attention mechanism is used to extract the words that are important to the meaning of the sentence, and the representations of those informative words are then aggregated to form a sentence vector.
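The word-level attention can be sketched as follows: each word annotation h_it is passed through a one-layer MLP, scored against a learned word context vector u_w, normalized with softmax, and the annotations are summed with these weights. This is a minimal NumPy sketch; the shapes and variable names are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def word_attention(H, W, b, u_w):
    """Aggregate word annotations H (T x 2d) into one sentence vector.

    u_it = tanh(W h_it + b)        hidden representation of each word
    a_it = softmax(u_it . u_w)     importance weight via context vector u_w
    s    = sum_t a_it h_it         attention-weighted sum of annotations
    """
    U = np.tanh(H @ W + b)         # (T, d_a)
    alpha = softmax(U @ u_w)       # (T,) importance weights
    return alpha @ H, alpha        # sentence vector (2d,), weights
```

The context vector u_w is randomly initialized and learned jointly with the rest of the network; it can be seen as a learned query for "which words are informative".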
Sentence Attention: To reward sentences that are clues to correctly classifying a document, an attention mechanism is again applied, this time with a sentence-level context vector that measures the importance of each sentence.
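Because the sentence-level attention has the same form as the word-level one, the final stage can be sketched by reusing the same scoring function over sentence vectors and feeding the resulting document vector to a softmax classifier. Again a NumPy sketch with illustrative parameter names, not the paper's exact notation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(H, W, b, u):
    """Score each row of H against a learned context vector u and
    return the attention-weighted sum of the rows."""
    alpha = softmax(np.tanh(H @ W + b) @ u)
    return alpha @ H

def classify_document(sentence_vectors, W_s, b_s, u_s, W_c, b_c):
    """Sentence-level attention pools sentence vectors (L x 2d) into a
    document vector v; a softmax layer then predicts class probabilities
    p = softmax(W_c v + b_c)."""
    v = attend(sentence_vectors, W_s, b_s, u_s)
    return softmax(W_c @ v + b_c)
```

In training, the negative log-likelihood of the correct label under these probabilities is the loss being minimized.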
The effectiveness of the model is evaluated on six large-scale document classification data sets. These data sets fall into two types of document classification tasks: sentiment estimation and topic classification. 80% of the data is used for training, 10% for validation, and the remaining 10% for testing.
Experimental results demonstrate that the model performs significantly better than previous methods, and visualization of the attention layers illustrates that it is effective in picking out important words and sentences.