In the last blog we discussed the vanilla RNN architecture and its limitations. Vanishing gradients are a major drawback that limits RNNs to modelling short sequences: vanilla RNNs fail to learn in the presence of time lags greater than 5–10 discrete time steps between relevant input events and target signals. This limits the vanilla RNN’s applicability to many practical problems, and to NLP in particular, as sentences are often far longer than 10 words.
Long Short-Term Memory (LSTM) was popularized as a better variant of the recurrent neural network architecture, specifically designed to address the vanishing gradient problem. LSTM tweaks the internal structure of the RNN’s recurrent unit to mitigate vanishing gradients, and LSTMs have been applied with great success to translation and sequence generation. In this blog, we will discuss the neural architecture of LSTM. Please refer to my previous blog if you are not familiar with RNNs.
LSTMs are a lot like the RNNs we learnt about: they have a similar control flow. In RNNs, information (the hidden state, and gradients during back-propagation) is passed across time steps through the same repeated transformation, which is what causes gradients to vanish. What LSTM does is use simple gates to control how information and gradients propagate through the recurrent unit. With these different gates, the LSTM memory unit processes data and passes information on as it propagates forward. Let’s see how this information is processed in the LSTM memory unit: first we define the cell state, and then the gates that are used to process the information.
The cell state is like the flowing memory of the network. You can imagine it as a kind of conveyor belt that carries relevant information across the time steps. This is the information our memory unit sees at every point in time, and the cell state “Ct” can be seen as a summary of what the model has seen/learnt up to time “t”. It runs straight down the entire sequence, with only minor linear interactions, as you can see in the plot. This lets information from earlier time steps make its way to later time steps, reducing the effects of short-term memory.
The gates are different neural networks that decide which information to forget, ignore, or keep in the memory cell. Since the gating mechanisms are themselves neural network layers, the gates in the LSTM memory unit are learnt from the data the model has seen. Let’s discuss each gate carefully.
1. Forget Gate Layer
First of all, we need to decide to what extent the information learnt up to the previous time step, the cell state “Ct−1”, is still useful. The forget gate layer takes care of this operation.
The forget gate layer takes “ht−1” and “xt” and outputs values between 0 and 1; we can enforce this range because it is a sigmoid function. A 1 represents “completely keep the cell state Ct−1”, while a 0 represents “completely get rid of Ct−1”.
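As a minimal NumPy sketch (not the author’s code; the weight names W_f, b_f and the toy sizes are assumptions), the forget gate is just a sigmoid layer over the concatenation of “ht−1” and “xt”:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3  # assumed toy sizes
rng = np.random.default_rng(0)
W_f = rng.standard_normal((hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

def forget_gate(h_prev, x_t):
    """Returns f_t in (0, 1): per-dimension, how much of C_{t-1} to keep."""
    return sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

f_t = forget_gate(np.zeros(hidden_size), np.ones(input_size))
```

Because the sigmoid squashes every value into (0, 1), multiplying f_t element-wise with Ct−1 smoothly interpolates between keeping and erasing each component of the old cell state.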
2. Input Gate Layer
Next, at every time step “t” our memory unit takes an input “xt”. The question is: to what extent should the input “xt” be stored in the cell state? The input gate layer decides how much of this new information we keep in the cell state. This happens in two parts.
1. A tanh layer which takes “ht−1” and “xt” and creates a vector of candidate values, “C̃t”, that could be added to the cell state.
2. A sigmoid layer which takes “ht−1” and “xt” and outputs values between 0 and 1, which decide how much of the candidate values “C̃t” we let in.
After these two layers, combined with the forget gate, the old cell state “Ct−1” is updated to the new cell state “Ct”.
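Putting the forget gate and the two input-gate parts together gives the cell-state update Ct = ft · Ct−1 + it · C̃t. A minimal sketch, assuming toy sizes and weight names (W_f, W_i, W_c, etc. are illustrative, not from the original post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3  # assumed toy sizes
rng = np.random.default_rng(1)
shape = (hidden_size, hidden_size + input_size)
W_f, W_i, W_c = (rng.standard_normal(shape) for _ in range(3))
b_f, b_i, b_c = (np.zeros(hidden_size) for _ in range(3))

def update_cell(h_prev, x_t, c_prev):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate: how much of C_{t-1} to keep
    i_t = sigmoid(W_i @ z + b_i)          # input gate: how much candidate to admit
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate values C~_t
    return f_t * c_prev + i_t * c_tilde   # new cell state C_t

c_t = update_cell(np.zeros(hidden_size), np.ones(input_size), np.zeros(hidden_size))
```

Note that the update is element-wise addition and multiplication, the “minor linear interactions” along the conveyor belt described earlier.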
3. Output Gate Layer
Finally, we need to decide what the hidden state at time t, “ht”, will be. This output is a function of the updated cell state “Ct”. This also happens in two steps.
1. A tanh layer which takes “Ct” as input and outputs a vector “V”.
2. A sigmoid layer which takes “ht−1” and “xt” and outputs values “ot” between 0 and 1, which are multiplied with the vector “V” to give the hidden state “ht”.
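The two steps above can be sketched as follows (again a hedged illustration; W_o, b_o and the sizes are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3  # assumed toy sizes
rng = np.random.default_rng(2)
W_o = rng.standard_normal((hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)

def output_gate(h_prev, x_t, c_t):
    v = np.tanh(c_t)                                           # step 1: squash C_t
    o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)   # step 2: output gate
    return o_t * v                                             # hidden state h_t

h_t = output_gate(np.zeros(hidden_size), np.ones(input_size), np.ones(hidden_size))
```

The tanh keeps the hidden state bounded in (−1, 1), while the sigmoid gate decides how much of each filtered cell-state component is exposed as output.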
Complete Memory Unit
The training process of an LSTM is very similar to that of an RNN. It’s important to re-emphasise that the gates in the LSTM memory unit are also parameterised, and these parameters are tuned/updated with gradient-descent optimisation methods.
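Gathering the three gates into one memory unit, a single forward step looks like the sketch below (a toy NumPy illustration under assumed names and initialisation, not a production implementation; real training would add back-propagation through these same parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LSTMCell:
    """One LSTM memory unit; all four layers' weights are trainable parameters."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        shape = (hidden_size, hidden_size + input_size)
        # The parameters below are exactly what gradient descent would tune.
        self.W_f, self.W_i, self.W_c, self.W_o = (
            0.1 * rng.standard_normal(shape) for _ in range(4)
        )
        self.b_f, self.b_i, self.b_c, self.b_o = (
            np.zeros(hidden_size) for _ in range(4)
        )

    def step(self, x_t, h_prev, c_prev):
        z = np.concatenate([h_prev, x_t])
        f_t = sigmoid(self.W_f @ z + self.b_f)      # forget gate
        i_t = sigmoid(self.W_i @ z + self.b_i)      # input gate
        c_tilde = np.tanh(self.W_c @ z + self.b_c)  # candidate values
        c_t = f_t * c_prev + i_t * c_tilde          # cell state update
        o_t = sigmoid(self.W_o @ z + self.b_o)      # output gate
        h_t = o_t * np.tanh(c_t)                    # new hidden state
        return h_t, c_t

# Run the unit over a toy sequence of 5 identical inputs.
cell = LSTMCell(input_size=3, hidden_size=4)
h, c = np.zeros(4), np.zeros(4)
for x_t in np.ones((5, 3)):
    h, c = cell.step(x_t, h, c)
```

Because the cell state is updated by element-wise multiplication and addition rather than repeated full matrix transformations, gradients can flow along it with far less attenuation, which is the core of how LSTM addresses the vanishing gradient problem.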