An Implementation of the Hierarchical Attention Network in Tensorflow — Part Two

Nitin Venkateswaran
9 min read · Dec 25, 2018


A chapter of my NLP Deep Learning Journey

In my last post, the Hierarchical Attention Network and its architecture were (somewhat irreverently) discussed in detail. That post can be found here.

This post will focus on the technical aspects of implementing the network in Tensorflow, and my learnings along the way.

Recap — Hierarchical Attention Network (HAN)¹

A diagram of the architecture of the HAN is reproduced below, from the original paper. The main difference between the diagram and the Tensorflow implementation presented here is that the last softmax classification layer is replaced by a binary classification layer.

Taken from the paper by Yang et al. (2016)

Key Learnings and Challenges

The key learnings and challenges faced with the Tensorflow implementation were as follows:

  1. A key challenge in creating a vectorized implementation was multiplying the trainable word and sentence context vectors with the word and sentence hidden states to generate the attention weights.
  2. Another key challenge was ‘rolling up’ the sentence vectors into the correct documents after processing the word embeddings on a flattened sentence matrix. This is challenging because each document has a different number of sentences.
  3. Related to the above is the realization that the X-features and Y-labels (which are applied at the document level) must stay in sync throughout the roll-up of sentence vectors.
  4. Performance was a key challenge, given the size of the word embedding matrix provided by the competition (~2.2 million words).

Pre-processing

The pre-processing steps are:

  1. Convert words to GloVe word embeddings², using a manually inserted UNK embedding with values drawn randomly from a Gaussian distribution, and a NULL embedding of zero vectors which is used during the padding process (a sketch follows after this list)
  2. Pad sentences and documents with zeroes during the batching process
  3. Get rid of all punctuation after splitting the document into constituent sentences; keep only letters (preserving case) and digits

Stopwords were kept, and no stemming or lemmatization was done, since pre-trained word embeddings were available.
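A minimal sketch of how the UNK and NULL embeddings might be appended to the GloVe matrix (glove_matrix, embedding_matrix and the index convention are assumptions for illustration, not taken from the project code):

import numpy as np

glove_dim = 300

# NULL embedding of zero vectors, used for padding (assumed to sit at index 0)
null_embedding = np.zeros((1, glove_dim), dtype=np.float32)

# UNK embedding with values drawn randomly from a Gaussian distribution
unk_embedding = np.random.normal(loc=0.0, scale=0.1,
                                 size=(1, glove_dim)).astype(np.float32)

# glove_matrix: [vocab_size, glove_dim] array loaded from the GloVe file (assumed)
embedding_matrix = np.vstack([null_embedding, unk_embedding, glove_matrix])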

A challenge (Challenge #4) was to find a way to avoid passing the entire embedding matrix to a Tensorflow placeholder during every mini-batch run, for performance reasons, given that the matrix, with 2.2 million words and 300 dimensions (a 2.2M by 300 matrix), was ~5.6GB.

A solution was to load the entire embedding matrix into memory via numpy, look up and extract only the embeddings of the words in the mini-batch of sentences into a separate, smaller numpy array, and feed that smaller array to the Tensorflow input placeholder directly. This too had issues of its own, not least the memory cost of loading the full matrix.

The final solution uses Numpy’s memory maps³: a memory-mapped numpy array stored on disk. This frees up CPU memory, and by extracting only the subset of embeddings needed at each mini-batch run from the on-disk array and feeding it to the Tensorflow placeholder, performance is much faster. The memory map also needs to be initialized and created only once, and is simply read from disk when needed, instead of loading the entire embedding array into memory every time the application runs. The only downside of the approach is that the embeddings are no longer trainable and cannot be fine-tuned for the classification task at hand.
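A minimal sketch of this approach (the file name, vocab_size, glove_dim and batch_word_indices are assumptions; the actual pipeline may differ):

import numpy as np

# One-time setup: write the embedding matrix to a memory-mapped file on disk
embedding_mmap = np.memmap('embeddings.dat', dtype=np.float32, mode='w+',
                           shape=(vocab_size, glove_dim))
embedding_mmap[:] = embedding_matrix[:]
embedding_mmap.flush()

# At training time: open the memmap read-only and pull out only the rows
# needed for the current mini-batch of word indices
embedding_mmap = np.memmap('embeddings.dat', dtype=np.float32, mode='r',
                           shape=(vocab_size, glove_dim))
batch_embeddings = embedding_mmap[batch_word_indices]   # small array fed to the placeholder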

The Tensorflow Graph

Below is the full Tensorflow graph of my implementation of the paper by Yang et al. (2016), as visualized in Tensorboard. Let’s walk through it step-by-step:

1. layer_inputs

This layer receives a numpy array of word embeddings for the batch of sentences directly (this was done for performance reasons, as mentioned earlier).

At this step, all the sentences in the documents of the mini-batch have been flattened out into separate rows. This allows efficient vectorized processing of the sentences, but keep in mind there is additional work to be done later to roll the sentence vectors back up into their documents for the next step in the hierarchy (a sketch of the flattening idea follows).
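For intuition, a minimal numpy sketch of that flattening step (the names docs and flat_sentences are hypothetical; the actual batching code is not shown in this post):

import numpy as np

# docs: list of documents in the mini-batch, each a list of padded sentence
# matrices of shape [max_sentence_len, glove_dim] (an assumption for this sketch)
flat_sentences = np.stack([sentence for doc in docs for sentence in doc])
# flat_sentences has shape [total_sentences_in_batch, max_sentence_len, glove_dim]
# and is what gets fed to the "input" placeholder below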

# Now load the inputs and convert them to word vectors
with tf.variable_scope("layer_inputs"):
    inputs = tf.placeholder(dtype=tf.float32,
                            shape=[None, max_sentence_len, glove_dim],
                            name="input")
    batch_sequence_lengths = tf.placeholder(dtype=tf.int32,
                                            shape=[None],
                                            name="sequence_length")

2. layer_word_hidden_states

This layer runs a bi-directional GRU⁴ and gets the forward and backward hidden states for each word in each sentence. A vector of the sentence lengths (number of words in each sentence without padding) in each mini-batch is also required by the Tensorflow API. The forward and backward states are concatenated and ready for the next step.

with tf.variable_scope("layer_word_hidden_states"):    ((fw_outputs,bw_outputs),
_) = (
tf.nn.bidirectional_dynamic_rnn(cell_fw=cell_fw,
cell_bw=cell_bw,
inputs=inputs,
sequence_length=batch_sequence_lengths,
dtype=tf.float32,
swap_memory=True,
))
outputs_hidden = tf.concat((fw_outputs, bw_outputs), 2)
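The forward and backward cells cell_fw and cell_bw are assumed to be defined earlier in the graph. A minimal sketch of how they might be created as GRU cells (the choice of output_size hidden units per direction is an assumption, not taken from the original code):

# Assumed definition of the word-level GRU cells used by bidirectional_dynamic_rnn above;
# output_size hidden units per direction is an assumption for this sketch
cell_fw = tf.nn.rnn_cell.GRUCell(num_units=output_size)
cell_bw = tf.nn.rnn_cell.GRUCell(num_units=output_size)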

3. layer_word_attention

This layer develops and trains the word attention context.

First, the word attention context vector is initialized. Second, the word hidden states are put through a non-linearity (tanh in the original paper, but relu is used below) to get ready for the word attention context vector.

Now we face Challenge #1: how to multiply the non-linearity output with the context vector in a vectorized way. The solution is to use the tf.tensordot API⁵ for 3-D tensor multiplication, which multiplies the context vector of shape [#features] with the vectorized word annotations of shape [batch_size, num_words, #features] to return a matrix of shape [batch_size, num_words].

The output is then fed through a custom softmax function (not shown here), which calculates the attention weights only across the valid words in a sentence, basically ignoring the padding applied for shorter sentences. The attention matrix is then multiplied (scalar multiplication) with the original word hidden state tensor, and the product is reduced by summation across the words in each sentence (axis = 1).

The end result is a matrix of shape [batch_size, #features * 2], i.e. one sentence vector for each sentence in the flattened batch, scored by the word attention context vector.

with tf.variable_scope("layer_word_attention"):

initializer = tf.contrib.layers.xavier_initializer()

# Big brain #1
attention_context_vector = tf.get_variable(name='attention_context_vector',shape=[output_size ],initializer=initializer,dtype=tf.float32)

input_projection = tf.contrib.layers.fully_connected(outputs_hidden, output_size ,
activation_fn=tf.nn.relu)
vector_attn = tf.tensordot(input_projection,attention_context_vector,axes=[[2],[0]],name="vector_attn") attn_softmax = tf.map_fn(lambda batch:
sparse_softmax(batch)
, vector_attn, dtype=tf.float32)

attn_softmax = tf.expand_dims(input=attn_softmax,axis=2,name='attn_softmax')

weighted_projection = tf.multiply(outputs_hidden, attn_softmax)
outputs = tf.reduce_sum(weighted_projection, axis=1)
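The custom sparse_softmax used above is not shown in this post. A hedged sketch of one possible masked softmax, assuming padded positions carry a score of exactly zero (this may differ from the author's version):

def sparse_softmax(scores):
    # scores: 1-D attention scores for one sentence, with zeros at padded positions (assumed)
    mask = tf.cast(tf.not_equal(scores, 0.0), tf.float32)
    # subtract the max for numerical stability, then zero out padded positions
    exp_scores = tf.exp(scores - tf.reduce_max(scores)) * mask
    return exp_scores / (tf.reduce_sum(exp_scores) + 1e-10)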

4. layer_gather

Now we come to the most fun part of the implementation:

Remember that the earlier step produced sentence vectors for a batch of flattened sentences (i.e. unrolled from their documents and grouped together) for efficient vectorization. To repeat the process at the document level, the sentences need to be rolled up into their documents. This is conceptually challenging, given that there is no fixed number of sentences in a document. This is Challenge #2 (and by extension, Challenge #3).

The solution adopted was to loop over the possible number of sentences in a document (in practice, up to a maximum of 12 sentences per document in this data, though a cutoff on the number of sentences in a document can be applied as a hyper-parameter). For each value of this number-of-sentences parameter (i.e. from 1 to 12), filter in all the sentence vectors belonging to documents with exactly that many sentences (for example, for the value two, filter in all sentences from the flattened sentence matrix that belong to documents with exactly two sentences).

Applying a tf.reshape at this point to roll the filtered sentences up into their documents is trivial. This process is repeated until the loop completes, and the outputs from each iteration are concatenated to recover a batch size equal to the number of documents in the original batch extracted from the raw file before flattening. Note that this process assumes that the sentence order from the raw document file was preserved during the flattening process. Challenge #2 solved.

A similar process is followed to re-map the labels to documents during the roll-up; in essence, the label is broadcast to all the sentences in a document during the flattening process, and then rolled up and reduced by a max (or average/min, given this is binary classification) operation to re-attach the label to its document. Challenge #3 solved.

The last detail remaining — how to implement a loop in Tensorflow, given its static graph and lazy execution nature?

tf.while_loop⁶ to the rescue!

tf.while_loop takes in a while condition (implemented as while_cond below) and a loop body (implemented as body below). The loop runs from 1 to the maximum number of sentences in a document in your data. The body returns the padded, rolled-up sentence vectors as a document matrix, ready for the sentence encoder after concatenation across the full batch / mini-batch.

To ensure that the loop outputs values of an expected shape that can be concatenated after each loop step, the ‘shape_invariants’⁶ parameter of the API was used. This functions like an assert statement, checking the shapes of the outputs at the end of each loop iteration.

By looping over the number-of-sentences-in-a-document parameter and aggregating the results of each iteration, a partially vectorized implementation of rolling sentence vectors up into documents is achieved.

with tf.variable_scope('layer_gather'):

    # Initialization
    tf_padded_final = tf.zeros(shape=[1, sent_cutoff_seq, output_size * 2])
    tf_y_final = tf.zeros(shape=[1, 1], dtype=tf.int32)

    sentence_batch_len = tf.placeholder(shape=[None], dtype=tf.int32, name="sentence_batch_len")
    sentence_index_offsets = tf.placeholder(shape=[None, 2], dtype=tf.int32, name="sentence_index_offsets")

    sentence_batch_length_2 = tf.placeholder(shape=[None], dtype=tf.int32, name="sentence_batch_len_2")
    ylen_2 = tf.placeholder(shape=[None], dtype=tf.int32, name="ylen_2")

    i = tf.constant(1)

    # This rolls up sentences dynamically, one sentence-batch shape at a time.
    def while_cond(i, tf_padded_final, tf_y_final):
        mb = tf.constant(sent_cutoff_seq)
        return tf.less_equal(i, mb)

    def body(i, tf_padded_final, tf_y_final):
        tf_mask = tf.equal(sentence_batch_length_2, i)
        tf_slice = tf.boolean_mask(outputs, tf_mask, axis=0)
        tf_y_slice = tf.boolean_mask(ylen_2, tf_mask, axis=0)  # reshaping the y to fit the data

        tf_slice_reshape = tf.reshape(tf_slice, shape=[-1, i, tf_slice.get_shape().as_list()[1]])
        tf_y_slice_reshape = tf.reshape(tf_y_slice, shape=[-1, i])
        tf_y_slice_max = tf.reduce_max(tf_y_slice_reshape, axis=1, keep_dims=True)  # the elements should be the same across the col

        pad_len = sent_cutoff_seq - i

        tf_slice_padding = [[0, 0], [0, pad_len], [0, 0]]
        tf_slice_padded = tf.pad(tf_slice_reshape, tf_slice_padding, 'CONSTANT')

        tf_padded_final = tf.concat([tf_padded_final, tf_slice_padded], axis=0)
        tf_y_final = tf.concat([tf_y_final, tf_y_slice_max], axis=0)

        i = tf.add(i, 1)

        return i, tf_padded_final, tf_y_final

    _, tf_padded_final_2, tf_y_final_2 = tf.while_loop(
        while_cond, body, [i, tf_padded_final, tf_y_final],
        shape_invariants=[i.get_shape(),
                          tf.TensorShape([None, sent_cutoff_seq, output_size_sent * 2]),
                          tf.TensorShape([None, 1])])

    # Give it a haircut
    tf_padded_final_2 = tf_padded_final_2[1:, :]
    tf_y_final_2 = tf_y_final_2[1:, :]
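For completeness, a hedged numpy sketch of how the feeds for sentence_batch_length_2 and ylen_2 might be built during flattening (doc_sentence_counts and doc_labels are hypothetical names; this assumes the flattened sentences preserve document order, as noted above):

import numpy as np

# doc_sentence_counts: number of sentences in each document of the batch (assumed)
# doc_labels: binary label of each document (assumed)
# Each value is broadcast to every sentence of its document so the flattened rows
# stay aligned with the flattened sentence matrix fed to the word encoder.
feed_sentence_batch_length_2 = np.concatenate(
    [np.full(n, n, dtype=np.int32) for n in doc_sentence_counts])
feed_ylen_2 = np.concatenate(
    [np.full(n, label, dtype=np.int32) for n, label in zip(doc_sentence_counts, doc_labels)])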

5. layer_sentence_hidden_states

This layer applies the sentence encoder to the output of the previous step. A vector of the number of sentences in each document in the batch is expected by the dynamic_rnn API (sentence_batch_len, below). The forward and backward states are concatenated, as was done with the word-level hidden states.

with tf.variable_scope('layer_sentence_hidden_states'):

    ((fw_outputs_sent, bw_outputs_sent), _) = (
        tf.nn.bidirectional_dynamic_rnn(cell_fw=cell_sent_fw,
                                        cell_bw=cell_sent_bw,
                                        inputs=tf_padded_final_2,
                                        sequence_length=sentence_batch_len,
                                        dtype=tf.float32,
                                        swap_memory=True,
                                        ))
    outputs_hidden_sent = tf.concat((fw_outputs_sent, bw_outputs_sent), 2)

6. layer_sentence_attention

The sentence attention context is developed here. The process is exactly analogous to the development of the word attention context, but at the next level of the hierarchy. The output is a matrix of shape [document_batch_size, #features * 2], which is a document-level representation ready for a classifier.

with tf.variable_scope('layer_sentence_attention'):

    initializer_sent = tf.contrib.layers.xavier_initializer()

    # Big brain #2 (or is this Pinky..?)
    attention_context_vector_sent = tf.get_variable(name='attention_context_vector_sent',
                                                    shape=[output_size_sent],
                                                    initializer=initializer_sent,
                                                    dtype=tf.float32)

    input_projection_sent = tf.contrib.layers.fully_connected(outputs_hidden_sent, output_size_sent,
                                                              activation_fn=tf.nn.relu)
    vector_attn_sent = tf.tensordot(input_projection_sent, attention_context_vector_sent,
                                    axes=[[2], [0]], name="vector_attn_sent")
    attn_softmax_sent = tf.map_fn(lambda batch: sparse_softmax(batch),
                                  vector_attn_sent, dtype=tf.float32)

    attn_softmax_sent = tf.expand_dims(input=attn_softmax_sent, axis=2, name='attn_softmax_sent')

    weighted_projection_sent = tf.multiply(outputs_hidden_sent, attn_softmax_sent)
    outputs_sent = tf.reduce_sum(weighted_projection_sent, axis=1)

7. layer_classification

A logistic unit is applied after a simple weight multiplication and bias addition on the output of the previous layer. The code is not shown in the post.
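A hedged sketch of such a layer (W_out, b_out, logits and predictions are hypothetical names, not from the original code):

with tf.variable_scope('layer_classification'):
    # Single logistic unit over the document vectors from the sentence attention layer
    doc_vec_dim = outputs_sent.get_shape().as_list()[1]
    W_out = tf.get_variable('W_out', shape=[doc_vec_dim, 1],
                            initializer=tf.contrib.layers.xavier_initializer())
    b_out = tf.get_variable('b_out', shape=[1], initializer=tf.zeros_initializer())
    logits = tf.matmul(outputs_sent, W_out) + b_out   # [document_batch_size, 1]
    predictions = tf.nn.sigmoid(logits)               # probability of the positive class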

8. Loss and Optimizer

Binary cross-entropy loss was applied, with a weight on the positive class given that the dataset is very imbalanced (only ~6% positive class). The Tensorflow API used is tf.nn.weighted_cross_entropy_with_logits⁷.

The Adam optimizer was used with an initial learning rate of 0.001. Interestingly, Adam trained faster than the SGD-with-momentum optimizer initially proposed by the paper's authors. Mini-batch training was used, as recommended by the authors.
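A hedged sketch of the loss and optimizer wiring (it reuses the hypothetical logits from the classification sketch above; pos_weight is the positive-class weight hyper-parameter):

with tf.variable_scope('loss_and_optimizer'):
    # tf_y_final_2 holds the rolled-up document labels from layer_gather;
    # pos_weight up-weights the rare positive class.
    targets = tf.cast(tf_y_final_2, tf.float32)
    loss = tf.reduce_mean(
        tf.nn.weighted_cross_entropy_with_logits(targets=targets,
                                                 logits=logits,
                                                 pos_weight=pos_weight))
    train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)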

Available Model Hyper-parameters

The following model hyper-parameters were incorporated into the implementation:

  1. Learning rate for the optimizer
  2. Cutoff for the number of words in a sentence
  3. Cutoff for the number of sentences in a document
  4. Number of features (hidden units) for the RNN cell (GRU / LSTM)
  5. Threshold for predicting the positive class
  6. Mini-batch size
  7. Weight attached to the positive label in the loss API

Results on Kaggle Competition ‘Quora Insincere Questions Classification’

The best F1 score as submitted to Kaggle was 0.635 at the time of writing. For comparison, the top F1 score in the competition at the time of writing was 0.711.

Possible improvements

The HAN as implemented has the limitation that it is difficult to incorporate n-gram features given the sequential nature of RNN processing.

A way around this may be to use a CNN to extract features, applying convolution windows of different n-gram lengths across the sentences and max-pooling the intermediate features for each window size. These max-pooled features can be concatenated to give more features for the document, which can be combined with the HAN-extracted features and fed to an appropriate classifier.

The CNN feature extraction could be done for the entire document (similar to the paper ‘Convolutional Neural Networks for Sentence Classification’ by Kim (2014)⁸), or hierarchically, at the sentence and then the document level.
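A hedged sketch of this idea (not part of the current implementation; cnn_ngram_features and its parameters are hypothetical), applying convolution windows of several n-gram widths over the word embeddings and max-pooling each:

def cnn_ngram_features(inputs, filter_sizes=(2, 3, 4), num_filters=64):
    # inputs: [batch, num_words, glove_dim] word embeddings (assumed)
    pooled = []
    for size in filter_sizes:
        # 1-D convolution over the word dimension with an n-gram-sized window
        conv = tf.layers.conv1d(inputs, filters=num_filters, kernel_size=size,
                                activation=tf.nn.relu)
        # max-pool over time, one feature vector per window size
        pooled.append(tf.reduce_max(conv, axis=1))
    # Concatenated n-gram features; these could be concatenated with the
    # HAN document vector before the final classifier.
    return tf.concat(pooled, axis=1)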

Suggestions and Feedback are welcome.

Github!

The github link for this project is present here: https://github.com/nitinvwaran/kaggle_projects

The project is currently ongoing!

References

  1. Hierarchical Attention Networks for Document Classification by Yang et al. (2016): https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf
  2. GloVe: Global Vectors for Word Representation: https://nlp.stanford.edu/projects/glove/
  3. Numpy memory maps documentation: https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.memmap.html
  4. Machine Translation and Advanced Recurrent LSTMs and GRUs (lecture by Stanford University School of Engineering): https://www.youtube.com/watch?v=QuELiw8tbx8&index=9&list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6
  5. tf.tensordot API documentation: https://www.tensorflow.org/api_docs/python/tf/tensordot
  6. tf.while_loop API documentation: https://www.tensorflow.org/api_docs/python/tf/while_loop
  7. tf.nn.weighted_cross_entropy_with_logits API documentation: https://www.tensorflow.org/api_docs/python/tf/nn/weighted_cross_entropy_with_logits
  8. Convolutional Neural Networks for Sentence Classification by Yoon Kim (2014): https://www.aclweb.org/anthology/D14-1181

