NLP Zero to One: Bi-Directional LSTM Part(10/30)

Constant Error Carousel

Kowshik chilamkurthy
Mar 2 · 4 min read
Generated by Author

A standard RNN or LSTM accumulates memory context only in the forward direction. But we would also like the model to incorporate both the “forward” context and the “backward” context into a prediction. This can be achieved with a model architecture that runs over both the forward sequence (“He is a good person”) and the backward sequence (“person good a is He”). An RNN built specifically for this kind of bi-directional processing is called a Bidirectional RNN. Bidirectional networks typically outperform forward-only RNNs in many NLP tasks, such as language modeling, sequence labelling tasks such as part-of-speech tagging, and sequence classification tasks such as sentiment analysis and topic classification.

In this blog, we will look at the bi-directional variants of RNNs and LSTMs, and also at a few applications.

Bidirectional RNN

A Bidirectional RNN is essentially two independent RNNs put together. This structure gives the network both backward and forward information about the sequence at every time step. The forward and backward RNNs in a Bidirectional RNN have separate hidden states, “hf” and “hb” respectively. Together, the forward RNN and the backward RNN constitute a single bidirectional layer.

For an input sequence X = {x1, x2, ..., xT}, the forward context RNN receives the inputs in forward order t = {1, 2, ..., T}, and the backward context RNN receives the inputs in reverse order t = {T, T−1, ..., 1}. Here we apply forward propagation twice, once for the forward cells and once for the backward cells.
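The two passes described above can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming a plain tanh RNN cell and randomly initialised weights; the names `rnn_pass`, `bidirectional_pass`, `init_params`, and the chosen dimensions are hypothetical and not taken from any library.

```python
import numpy as np

def rnn_pass(xs, W_x, W_h, b):
    """Run a simple tanh RNN over xs (T x input_dim); return all hidden states (T x hidden_dim)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return np.stack(states)

def bidirectional_pass(xs, fwd_params, bwd_params):
    """Apply forward propagation twice: once in forward order, once in reverse order."""
    hf = rnn_pass(xs, *fwd_params)               # reads x1, x2, ..., xT
    hb = rnn_pass(xs[::-1], *bwd_params)[::-1]   # reads xT, ..., x1, then flipped back to time order
    return hf, hb

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 3, 4               # illustrative sizes (assumed)

def init_params():
    """Random weights for one direction: (W_x, W_h, b)."""
    return (rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
            rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)),
            np.zeros(hidden_dim))

xs = rng.normal(size=(T, input_dim))
fwd_params = init_params()                       # the two RNNs have independent weights
bwd_params = init_params()
hf, hb = bidirectional_pass(xs, fwd_params, bwd_params)
print(hf.shape, hb.shape)
```

After the flip, `hf[t]` summarises the inputs up to position t and `hb[t]` summarises the inputs from position t to the end, so both are available at every time step.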

Training

The challenge is to make sure that both networks learn simultaneously and stay in sync, so that what they learn can be collated for the final prediction. Bi-directional RNNs propose a simple architecture that stitches the forward and backward RNN networks together. The idea is to run the backward RNN in the direction opposite to the forward RNN, so that the inputs and outputs of the two networks line up exactly at every time step in the sequence.

The total error is the sum of the losses at each time step for both networks. We can sum the gradients for each of the weights in our network and update them with the accumulated gradients. This is very similar to what we do in the uni-directional case.

We then concatenate the forward and backward hidden states to obtain the combined hidden state, which is fed into the output layer to predict the output. For example, in a standard NLP task where we are trying to predict the next word, this combined hidden state is used to generate an output layer, which is passed through a softmax layer to produce a probability distribution over the entire vocabulary.
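As a rough sketch of this step, the snippet below concatenates per-time-step forward and backward hidden states and passes them through a linear output layer followed by a softmax. The hidden states here are random stand-ins rather than the output of a trained network, and `W_out`, `b_out`, and the sizes are assumptions made purely for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
T, hidden_dim, vocab_size = 5, 4, 10             # illustrative sizes (assumed)

# Random stand-ins for the time-aligned hidden states "hf" and "hb".
hf = rng.normal(size=(T, hidden_dim))
hb = rng.normal(size=(T, hidden_dim))

# Combined hidden state: one vector of size 2 * hidden_dim per time step.
h = np.concatenate([hf, hb], axis=-1)

# Hypothetical output layer mapping the combined state to vocabulary scores.
W_out = rng.normal(scale=0.1, size=(2 * hidden_dim, vocab_size))
b_out = np.zeros(vocab_size)

probs = softmax(h @ W_out + b_out)               # shape (T, vocab_size)
print(probs.shape)
```

Each row of `probs` is a probability distribution over the vocabulary for one time step. Note that the output layer's input width doubles compared with a uni-directional RNN of the same hidden size.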

Illustration of combined representation

The final hidden units from the forward and backward passes are combined to represent the entire sequence. This combined hidden state serves as a representation of the entire sequence. One limitation of bi-directional RNNs is that the full input sequence must be known before prediction, which limits their use in some practical NLP problems.
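One detail that is easy to get wrong is which backward state counts as "final". A minimal sketch, assuming the backward states are stored in time order (so index 0 corresponds to x1) and using random stand-in values:

```python
import numpy as np

rng = np.random.default_rng(2)
T, hidden_dim = 5, 4                             # illustrative sizes (assumed)

# Random stand-ins for time-aligned hidden states.
hf = rng.normal(size=(T, hidden_dim))            # hf[t]: forward state after reading x1..x(t+1)
hb = rng.normal(size=(T, hidden_dim))            # hb[t]: backward state after reading xT..x(t+1)

final_forward = hf[-1]                           # forward RNN has seen the whole sequence at the last position
final_backward = hb[0]                           # backward RNN has seen the whole sequence at the first position

# Fixed-size representation of the entire sequence.
sequence_repr = np.concatenate([final_forward, final_backward])
print(sequence_repr.shape)
```

Taking `hb[-1]` instead would use a backward state that has only seen the last input, so it would not summarise the whole sequence.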

Note:

1. Constant Error Carousel is another very important characteristic of LSTMs. It allows an LSTM to have a smooth and uninterrupted flow of gradients during backpropagation, which helps prevent vanishing gradients. If the forget gate and input gate are mostly 1, the cell state effectively adds up the inputs. This design is called a “constant error carousel”.

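2. Teacher Forcing: During training, the network is fed the correct previous words from the training sequence rather than its own predictions. The task is to minimise the error in predicting the next word in the training sequence, using cross-entropy as the loss function.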

3. Sequence Labelling: The network’s task is to assign a label chosen from a small fixed set of labels to each element of a sequence. Tasks like part-of-speech tagging and named entity recognition come under this umbrella.
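To make note 1 (the constant error carousel) a little more concrete, here is a simplified numerical sketch. It assumes scalar gates held constant over time and the standard cell update c_t = f * c_(t-1) + i * g_t, ignoring the fact that in a real LSTM the gates themselves depend on the inputs and hidden state. The function names and values are hypothetical.

```python
import numpy as np

def run_cell(c0, candidates, forget_gate, input_gate):
    """Apply c_t = f * c_(t-1) + i * g_t with constant scalar gates."""
    c = c0
    for g in candidates:
        c = forget_gate * c + input_gate * g
    return c

def grad_wrt_initial_cell(forget_gate, steps):
    """With constant gates, dc_T / dc_0 is the forget gate multiplied by itself `steps` times."""
    return forget_gate ** steps

candidates = np.array([0.2, -0.1, 0.4, 0.3])     # made-up candidate values g_t

# With both gates at 1, the cell state simply adds up the inputs.
c_open = run_cell(0.0, candidates, forget_gate=1.0, input_gate=1.0)

steps = 50
grad_open = grad_wrt_initial_cell(1.0, steps)    # gradient passes through unchanged
grad_leaky = grad_wrt_initial_cell(0.5, steps)   # gradient shrinks geometrically

print(c_open, grad_open, grad_leaky)
```

With the forget gate at 1 the gradient along the cell state stays at 1 no matter how many steps it travels, whereas a gate of 0.5 drives it towards zero after a few dozen steps.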


Previous: NLP Zero to One: LSTM Part(9/40)

Next: NLP Theory and Code: Encoder-Decoder Models (Part 11/30)

Nerd For Tech
