An Implementation of the Hierarchical Attention Network (HAN) in Tensorflow — Part One

Nitin Venkateswaran
Dec 25, 2018


A chapter of my NLP Deep Learning Journey

How it all started

As a Data Scientist and as someone fascinated by Natural Language Processing using Deep Learning, I find reading Deep Learning-based NLP papers very instructive and deeply interesting.

As an ex-Software Engineer, I am tempted to implement some of the ideas in the papers I read, and apply the implementations to real-world problems.

One such real-world problem I recently found is the ‘Quora Insincere Questions Classification’¹ challenge that is currently being hosted on Kaggle.

According to the competition description, an insincere question is defined as a question intended to make a statement rather than to look for helpful answers. Some characteristics that can signify an insincere question are provided:

  • Has a non-neutral tone
  • Has an exaggerated tone to underscore a point about a group of people
  • Is rhetorical and meant to imply a statement about a group of people
  • Is disparaging or inflammatory
  • Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
  • Makes disparaging attacks/insults against a specific person or group of people
  • Based on an outlandish premise about a group of people
  • Disparages against a characteristic that is not fixable and not measurable
  • Isn’t grounded in reality
  • Based on false information, or contains absurd assumptions
  • Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

I thought I would try to implement a Hierarchical Attention Network (HAN)², to tackle this problem.

So, the question: Why a HAN?

Because:

  • Hierarchical Attention Networks are considered state-of-the-art for document classification
  • There have been attempts to use Attention-based networks for fake news detection³, which led me to wonder whether a HAN, which is an attention-based document classifier, could work for insincere questions
  • The deciding factor for me was that the HAN seemed doable to implement in a framework of my choice (Tensorflow)

Okay, so what is a HAN?

The Hierarchical Attention Network (HAN) is a deep neural network initially proposed by Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy of Carnegie Mellon University and Microsoft Research, Redmond, for the task of Document Classification.

Using my own words and terms, a HAN attempts to classify a document based on the knowledge it can infer about the document from its composite parts, in other words the sentences and words that make up the document. The ‘hierarchical’ in HAN comes from the design choice that this knowledge is built up hierarchically, starting from the words in a sentence and then moving on to the sentences in a document.

Technically speaking:

1. Each word in a sentence is first converted into a word embedding representation⁴.

Yay, GLoVE!

An embedding of a word is a collection of numbers (typically in a ‘vector’ format) that captures the ‘meaning’ of the word as it is most widely understood.

The competition provided GLoVE 300-dimensional embeddings for around 2.2 million words, which were used directly.

[Two separate entries were added to the embeddings dictionary provided, representing a NULL token for padding, and an UNK token for unknown words in the document]
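As a rough illustration (a minimal sketch of my own, not the actual code that comes in Part Two), here is one way such an embedding matrix with the two extra entries could be built, assuming the GLoVE vectors have already been loaded into a Python dictionary called glove that maps each word to its 300-dimensional vector:

    import numpy as np

    EMB_DIM = 300

    def build_embedding_matrix(glove):
        # Reserve id 0 for the NULL/padding token and id 1 for the UNK token
        word_to_id = {'<NULL>': 0, '<UNK>': 1}
        vectors = [np.zeros(EMB_DIM, dtype=np.float32),           # NULL: all zeros
                   np.random.normal(scale=0.1, size=EMB_DIM)]     # UNK: small random vector
        for word, vector in glove.items():
            word_to_id[word] = len(word_to_id)
            vectors.append(vector)
        # Rows of the matrix line up with the ids in word_to_id
        return word_to_id, np.vstack(vectors).astype(np.float32)

    # `glove` is assumed to be the dictionary loaded from the competition's embedding file
    word_to_id, embedding_matrix = build_embedding_matrix(glove)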

2. Each sentence with its embedded words is fed to a bi-directional GRU encoder.

Natural Language Processing? Easy! Bring it on!

The encoder aims to find a ‘representation’ of the sentence based on the words the sentence is made of, using the embeddings of those words to do so. An important output of the process is the intermediate ‘hidden state’ generated for each word by the GRU.

As the words of each sentence, in their embedding form, go through the encoding process, intermediate outputs called ‘hidden states’ are produced at each word-step in the sentence. These ‘hidden states’ are essentially more collections of numbers (a.k.a. vectors), with one vector mapped to each word.

[Though in practice, in the Tensorflow implementation, each word has a collection of such vectors, with the collection representing different feature sets for that word, not unlike the ‘depth’ of a filter set used in CNNs].

The ‘hidden state’ for a word-step is generally produced by awesomely complex mathematical combinations⁵ of the word’s embedding input and the previous word’s hidden state; these combinations aim to balance the influence of the two on the hidden state produced for the word (one common formulation is sketched after the next paragraph). But, take note (in my own words and terms…)

It can be seen that, as a GRU is evaluated in sequence of word-steps, the hidden state of the final word in a sentence becomes a cool and funky combination of the hidden states of the words preceding it, and is influenced by them. Similarly, the final output of the encoder after all words are processed (assuming a forward direction only for now) contains cool and funky combinations of all the words in the sentence, and hopefully ‘represents’ something from all of them; perhaps not dissimilar to a human reading the words of a complete sentence from start to finish and getting a ‘feeling’ or ‘takeaway’ message from the process of reading that sentence.
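For the mathematically curious, one common formulation of these combinations (essentially the one in the HAN paper, and walked through in reference 5) is the standard GRU update below, where x_t is the word’s embedding input, h_{t-1} is the previous word-step’s hidden state, \sigma is the logistic function and \odot is element-wise multiplication; the ‘update gate’ z_t and ‘reset gate’ r_t do the balancing described above:

    z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)                        (update gate)
    r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)                        (reset gate)
    \tilde{h}_t = \tanh(W_h x_t + r_t \odot (U_h h_{t-1}) + b_h)     (candidate state)
    h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t            (new hidden state)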

A Bi-directional GRU encoder, then, creates these ‘hidden states’ by evaluating the word-steps in both directions — front to back and back to front (as if forwards was not complicated enough…).

At the end of the encoding process, each word-step ends up with a ‘forward’ hidden state and a ‘backward’ hidden state. These hopefully represent the word as it relates to the other words in the sentence, under the aegis and context of the encoder’s quest to determine a ‘Meaning’ or ‘Representation’ of the sentence (and yes, there are two relations, one forward and one backward).

Before moving to the next step, the two hidden states are ‘concatenated’, or combined, to form its input. [In the Tensorflow implementation, the feature sets for each word-step produced by the forward and backward passes are stacked to form a bigger feature set that is twice as large.]
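To make this concrete, here is a minimal, hedged sketch (TF 1.x style, continuing from the embedding sketch above, and again not the actual Part Two code) of the word-level bidirectional GRU encoder. word_ids and sent_lengths are assumed to hold the padded word ids and true lengths of a batch of sentences, and the hidden size of 50 per direction follows the original paper:

    import tensorflow as tf

    HIDDEN = 50  # GRU size per direction (the HAN paper uses 50)

    word_ids = tf.placeholder(tf.int32, [None, None], name='word_ids')    # [sentences, max_words]
    sent_lengths = tf.placeholder(tf.int32, [None], name='sent_lengths')  # true word count per sentence

    embeddings = tf.constant(embedding_matrix)                    # GLoVE matrix from the earlier sketch
    embedded_words = tf.nn.embedding_lookup(embeddings, word_ids) # [sentences, max_words, 300]

    with tf.variable_scope('word_encoder'):
        (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
            tf.nn.rnn_cell.GRUCell(HIDDEN), tf.nn.rnn_cell.GRUCell(HIDDEN),
            embedded_words, sequence_length=sent_lengths, dtype=tf.float32)

    # forward and backward hidden states stacked per word-step
    word_encodings = tf.concat([out_fw, out_bw], axis=-1)         # [sentences, max_words, 2*HIDDEN]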

3. A ‘Representation’ of the sentence is created using a ‘Word Attention Context’.

Think of the Attention Context as…well…

Think of the ‘Attention Context’ as a big, worldly-wise brain with great Knowledge (usually gained from lots of deep-learning training). It looks at the combined ‘hidden states’ of each word in a sentence, somehow ‘figures out’ which words are important in determining the ‘Representation’ of the sentence, gives those words a higher weightage, and then comes up with a final collection of numbers (a.k.a. ‘sentence vector’) that ‘Represents’ the sentence from its constituent, appropriately weighted words in the context of the classification task.

What may be key to understanding the essence of the HAN is that this attention context, or worldly-wise brain, learns its Knowledge from the data it sees and the purpose that data is put to, i.e., a labeled classification task.

Because the attention context is a trainable set of parameters, it can be trained to associate certain words or collections of words with the label of the document, and weight them appropriately during an inference task. As is the case in Deep Learning, more data samples can make this worldly-wise brain more astute, given the context of its function.
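Continuing the hedged sketch from above (my own rough version, not the Part Two code), the attention context is simply a trainable vector; the attention weights come from a one-layer projection of each word encoding scored against that vector and run through a softmax, and the sentence vector is the weighted sum:

    ATTN_SIZE = 100  # size of the attention projection; my own choice for the sketch

    def attention(encodings, scope):
        with tf.variable_scope(scope):
            # project each encoding through a small MLP, then score it against the
            # trainable context vector (the 'worldly-wise brain')
            projected = tf.layers.dense(encodings, ATTN_SIZE, activation=tf.nn.tanh)
            context = tf.get_variable('context_vector', [ATTN_SIZE])
            scores = tf.tensordot(projected, context, axes=1)      # [batch, steps]
            weights = tf.nn.softmax(scores, axis=-1)               # attention weights
            # (padding positions would be masked out in a full implementation)
            # weighted sum of the encodings -> one vector per sentence (or document)
            return tf.reduce_sum(encodings * tf.expand_dims(weights, -1), axis=1)

    sentence_vectors = attention(word_encodings, 'word_attention') # [sentences, 2*HIDDEN]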

With the Attention Context having contextually evaluated and come up with a Sentence Representation, it’s time to take things to the next level…

4. Each document with its representation of sentences is fed to a bi-directional GRU encoder.

ENCODE THIS!!

Yes, you guessed it. Another GRU encoder enters the picture, only this time it generates ‘hidden states’ for each sentence in the document from the sentence vectors, with the aim of coming up with a ‘Document Representation’ in the process.

Forward and backward passes over the sentence vectors in a document are used during the encoder’s quest to generate a document Representation, in the context of the classification task. These passes result in a forward hidden state and a backward hidden state for each sentence, which are stacked together (a.k.a. concatenated) as input to the next step of the process. The essence of the process is exactly the same as that of the word-level encoder, just applied at the next step in the hierarchy, to the sentences in a document, using the previously created ‘sentence vectors’.
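Sketched in the same hedged, TF 1.x style as before, and assuming each document in the batch was padded to a fixed number of sentences so the sentence vectors can be regrouped by document (the batch geometry below is my own example, not the author’s):

    NUM_DOCS, MAX_SENTS = 32, 30   # example batch geometry; an assumption for the sketch
    doc_lengths = tf.placeholder(tf.int32, [NUM_DOCS], name='doc_lengths')  # true sentence count per document

    # regroup the per-sentence vectors into documents (requires sentences = NUM_DOCS * MAX_SENTS)
    doc_inputs = tf.reshape(sentence_vectors, [NUM_DOCS, MAX_SENTS, 2 * HIDDEN])

    with tf.variable_scope('sentence_encoder'):
        (s_fw, s_bw), _ = tf.nn.bidirectional_dynamic_rnn(
            tf.nn.rnn_cell.GRUCell(HIDDEN), tf.nn.rnn_cell.GRUCell(HIDDEN),
            doc_inputs, sequence_length=doc_lengths, dtype=tf.float32)

    sentence_encodings = tf.concat([s_fw, s_bw], axis=-1)          # [NUM_DOCS, MAX_SENTS, 2*HIDDEN]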

5. A Representation of the Document is created using a ‘Sentence Attention Context’

Gee, Brain…

A second Attention Context or worldly-wise brain enters the picture, this time learning whatever it can about the weightages (or ‘attention’) it should give to sentences in a document as it tries to classify the document to a label.

The concept is exactly the same as the ‘Word Attention Context’, just applied to the next step of the hierarchy, and producing a ‘document vector’ in the process.
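In the sketch, this is literally the same attention helper applied one level up; its context_vector now plays the role of the ‘Sentence Attention Context’:

    document_vectors = attention(sentence_encodings, 'sentence_attention')  # [NUM_DOCS, 2*HIDDEN]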

To my understanding, the ‘Word Attention Context’ and the ‘Sentence Attention Context’ turn out to be the essence of, and the most important parameters learnt by, a Hierarchical Attention Network.

6. The document vector is run through a classifier

Because the document vector consists of a collection of many feature sets, any linear classifier can be applied at this point.

For the Insincere Questions task, which is a binary classification, a final weight multiplication (plus bias add) was applied to the document vector and the output squashed through a logistic unit for the classification.
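A minimal sketch of that final step, with the loss and optimizer being my own assumptions for illustration (training details are left for Part Two):

    labels = tf.placeholder(tf.float32, [NUM_DOCS], name='labels')          # 1 = insincere, 0 = sincere
    logits = tf.squeeze(tf.layers.dense(document_vectors, 1), axis=-1)      # weight multiplication + bias
    probabilities = tf.sigmoid(logits)                                      # squashed through the logistic unit
    loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)                  # optimizer choice is an assumption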

7. And just because this is supposed to be a serious post..

The official diagram of a HAN from the original paper by Yang et al. (2016) is shown below. It is my hope that, if you’ve somehow made it to this point, the diagram makes some sense to you.

[The softmax at the end is used for multi-class classification. u_s and u_w on the left of the diagram are Pinky and the Brain (i.e., the Sentence Attention Context and the Word Attention Context), respectively.]

Great, now I (think I) know what a HAN is. Where’s the Tensorflow implementation!?

Stay tuned for Part Two of this post, where the Tensorflow implementation of the above concepts is discussed in detail (with a github link!)

EDIT: Part two is now available here

Thanks for reading.

References

  1. https://www.kaggle.com/c/quora-insincere-questions-classification
  2. Hierarchical Attention Networks for Document Classification, by Yang et al. (2016): https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf
  3. 3HAN, a Deep Neural Network for Fake News Detection, by Singhania et al. (2017): http://infosource.biz/srao/papers/3han.pdf
  4. GLoVE: https://nlp.stanford.edu/projects/glove/
  5. Machine Translation and Advanced Recurrent LSTMs and GRUs: https://www.youtube.com/watch?v=QuELiw8tbx8&index=9&list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6
