A deep dive into the world of gated Recurrent Neural Networks: LSTM and GRU
In the world of deep learning, the RNN is the go-to model whenever a problem requires sequence-based learning, and this has propelled the research community to come up with interesting improvements over the vanilla RNN.
One such prominent improvement is the introduction of gated RNNs: the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), both of which have taken the world of deep learning by storm.
In this post, we will take a closer look at both LSTM and GRU to understand what makes them so effective. Eventually, we will compare the two models and see how the GRU, which is a much simpler version of the LSTM, performs better in many scenarios.
Table of Contents:
- Recalling Gated RNN
- LSTM Architecture
- Delving deep into LSTM
- Variant of LSTM
- GRU Architecture and Advantages
For a better understanding of this article, I recommend going through my previous post: Understanding Gated RNN.
Flashback: Recalling the gated RNN
As we know, the gated RNN architecture has three gates which control the flow of information in the network, namely:
- Input Gate/Write Gate
- Keep Gate/Forget Gate
- Output Gate/Read Gate
We have also seen that the values these gates take need not be binary; they can be any value between 0 and 1, so each gate is effectively a small neural network layer computed from the inputs to the cell.
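To make this concrete, here is a minimal NumPy sketch of one such gate: a sigmoid applied to an affine function of the current input and the previous hidden state. The weight names (W_g, U_g, b_g) and the sizes are illustrative, not tied to any particular implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 4-dimensional input, 3-dimensional hidden state.
rng = np.random.default_rng(0)
W_g = rng.normal(size=(3, 4))   # weights on the current input x_t
U_g = rng.normal(size=(3, 3))   # weights on the previous hidden state h_prev
b_g = np.zeros(3)               # bias

x_t = rng.normal(size=4)
h_prev = rng.normal(size=3)

# A gate is just a sigmoid-activated layer: every entry lies in (0, 1),
# so it acts as a soft, per-element switch rather than a hard 0/1 decision.
g_t = sigmoid(W_g @ x_t + U_g @ h_prev + b_g)
print(g_t)
```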
Generally, all basic RNNs are built from a repeating module, as shown below:
LSTM Architecture
An LSTM follows the same basic architecture shown above, with a slight modification inside the repeating module.
Now, let’s go through the notation one by one:
Delving Deep into the operations of LSTM:
Now, let’s break down the step-by-step operations in the LSTM architecture to keep things simple. 😊
Any LSTM model has a cell state and a hidden state at any given time step t. In the diagram, the horizontal line indicates the cell state, which helps preserve information for a long time, hence the name long-term memory, whereas the hidden state is a function of the cell state and is used for predicting shorter contexts.
Intuition: Consider a text whose main subject is a place; for convenience, let’s say it is “India”. While describing it, the writer may not repeat the word “India” many times, but would rather mention it at the start of the text. So it is important for the model to keep “India” in its memory long enough to predict something about the rest of the text. This is where the cell state preserves information in its long-term memory, while the hidden state keeps shorter contexts.
Here, the cell state may have information added to or removed from it, and this is controlled by the gates. Each gate is composed of a sigmoid neural net layer and a point-wise multiplication operation. The output of the sigmoid layer is always between 0 and 1, where 0 means “let nothing through” and 1 means “let everything through”.
The first step in an LSTM is to decide how much of the past information we are going to keep and how much we are going to ignore, which is given by the output of a sigmoid layer controlled by our forget gate.
The forget gate output ft is always between 0 and 1 and is combined element-wise with the previous cell state Ct-1.
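In the standard LSTM formulation, the forget gate is computed as ft = σ(Wf · [ht-1, xt] + bf) and then scales the old cell state element-wise. Below is a minimal NumPy sketch of just this step; the weight names and sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_in, n_hid = 4, 3                            # illustrative sizes

W_f = rng.normal(size=(n_hid, n_hid + n_in))  # acts on [h_prev, x_t]
b_f = np.zeros(n_hid)

h_prev = rng.normal(size=n_hid)
x_t = rng.normal(size=n_in)
C_prev = rng.normal(size=n_hid)               # previous cell state C_{t-1}

# f_t in (0, 1): how much of each element of the old cell state to keep.
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

# Element-wise scaling of the previous cell state.
kept_memory = f_t * C_prev
print(f_t, kept_memory)
```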
The next step is deciding what new information to write to the cell state, which essentially has two parts: a sigmoid activation (the input gate layer) and a tanh activation. The sigmoid decides which values to update, and the tanh creates the vector of new candidate values. The equations for these two functions are shown below:
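In the standard formulation, these two functions are it = σ(Wi · [ht-1, xt] + bi) and C̃t = tanh(WC · [ht-1, xt] + bC). Here is a minimal NumPy sketch of both, again with illustrative weight names and sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n_in, n_hid = 4, 3

W_i = rng.normal(size=(n_hid, n_hid + n_in))   # input-gate weights
b_i = np.zeros(n_hid)
W_c = rng.normal(size=(n_hid, n_hid + n_in))   # candidate-value weights
b_c = np.zeros(n_hid)

h_prev = rng.normal(size=n_hid)
x_t = rng.normal(size=n_in)
concat = np.concatenate([h_prev, x_t])

# i_t decides which elements of the cell state get updated.
i_t = sigmoid(W_i @ concat + b_i)
# C_tilde holds the new candidate values, squashed to (-1, 1) by tanh.
C_tilde = np.tanh(W_c @ concat + b_c)
print(i_t, C_tilde)
```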
Note: You may well wonder why we use tanh, and not a ReLU activation, while creating the vector of new values. It is because ReLU ranges from 0 to ∞, which means we would only ever add non-negative values to the cell state and its values could blow up over time, whereas tanh ranges between -1 and 1, so the updates stay bounded and can both increase and decrease what is stored.
Now, let’s combine the two vectors obtained above to get the update equation for Ct.
In our example, this is where we want to add the new subject by forgetting the old one, “India”.
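Written out, the standard update is Ct = ft * Ct-1 + it * C̃t, where * denotes element-wise multiplication. Here is a small sketch of this combination step, assuming ft, it and C̃t were computed as in the earlier sketches (random values stand in for them here):

```python
import numpy as np

rng = np.random.default_rng(3)
n_hid = 3

# Stand-ins for the quantities computed in the previous sketches.
f_t = rng.uniform(size=n_hid)              # forget gate output, in (0, 1)
i_t = rng.uniform(size=n_hid)              # input gate output, in (0, 1)
C_tilde = np.tanh(rng.normal(size=n_hid))  # candidate values, in (-1, 1)
C_prev = rng.normal(size=n_hid)            # previous cell state C_{t-1}

# Forget part of the old memory, then write in the scaled new candidates.
C_t = f_t * C_prev + i_t * C_tilde
print(C_t)
```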
Now that we have obtained the vector Ct, we have to decide what we are going to output. The output again has two parts: a sigmoid layer decides which parts of the cell state to expose, and the cell state is passed through tanh and multiplied by that sigmoid output, so that we only emit the values we chose, i.e. we decide how much information to send to the next layer and how much is used to predict the output.
If we observe carefully, this sigmoid layer plays the role of the output gate/read gate. The respective equations of this step are as follows:
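In the standard formulation, these are ot = σ(Wo · [ht-1, xt] + bo) and ht = ot * tanh(Ct). A minimal sketch, with illustrative names and sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
n_in, n_hid = 4, 3

W_o = rng.normal(size=(n_hid, n_hid + n_in))   # output-gate weights
b_o = np.zeros(n_hid)

h_prev = rng.normal(size=n_hid)
x_t = rng.normal(size=n_in)
C_t = rng.normal(size=n_hid)                   # cell state from the update step above

# o_t decides which parts of the (squashed) cell state are exposed.
o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)

# The hidden state is a filtered view of the cell state.
h_t = o_t * np.tanh(C_t)
print(h_t)
```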
A variant of the LSTM model:
If we observe the final equation for Ct, the coefficients ft and it (the forget- and input-gate outputs) can theoretically take the same values, i.e. both can be 0 or both can be 1. So, what happens if both are 0? It essentially means the model is erasing the past memory while also writing nothing new. And if both are 1? It means the whole of the past memory is kept and, at the same time, a completely new value is written to it. Neither extreme makes much sense.
So, a slight modification was made to the architecture, wherein the equation for Ct was changed to:
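This variant is often described as an LSTM with coupled forget and input gates: the separate input gate is replaced by (1 - ft), giving Ct = ft * Ct-1 + (1 - ft) * C̃t. A small sketch of the modified update:

```python
import numpy as np

rng = np.random.default_rng(5)
n_hid = 3

f_t = rng.uniform(size=n_hid)              # forget gate output, in (0, 1)
C_tilde = np.tanh(rng.normal(size=n_hid))  # candidate values
C_prev = rng.normal(size=n_hid)            # previous cell state

# Coupled update: whatever fraction of the old memory we keep (f_t),
# the complementary fraction (1 - f_t) is used to write new content.
C_t = f_t * C_prev + (1.0 - f_t) * C_tilde
print(C_t)
```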
The above equation implies that if we keep the whole of the past information, we write nothing new to the memory, and if we erase the memory completely, we write an entirely new value to it. This way, the model ties the write and forget decisions together so that they cannot independently take the same extreme values at any given time step.
This is where we slowly move towards GRUs:
The Gated Recurrent Unit incorporates the above-mentioned modification along with a few others, which give it an edge over the LSTM and have led to its wide adoption in practice.
GRU: Architecture and Advantages
The GRU has a slightly different architecture: it combines the forget and input gates into a single gate called the update gate, and it merges the cell state and the hidden state, along with a few other changes, resulting in the architecture below:
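Concretely, a standard GRU step uses an update gate zt, a reset gate rt and a candidate hidden state h̃t, and keeps only a single state vector ht. The sketch below shows one such step; the weight names, the sizes, and the exact placement of zt versus (1 - zt) follow one common convention and are meant as illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
n_in, n_hid = 4, 3

# One weight matrix per transform, each acting on a concatenation; names are illustrative.
W_z = rng.normal(size=(n_hid, n_hid + n_in))   # update gate
W_r = rng.normal(size=(n_hid, n_hid + n_in))   # reset gate
W_h = rng.normal(size=(n_hid, n_hid + n_in))   # candidate hidden state

h_prev = rng.normal(size=n_hid)
x_t = rng.normal(size=n_in)
concat = np.concatenate([h_prev, x_t])

z_t = sigmoid(W_z @ concat)                    # how much to update
r_t = sigmoid(W_r @ concat)                    # how much of the past to use
h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state

# A single state h_t plays the role of both cell state and hidden state;
# the update gate couples "keep old" and "write new", as in the coupled LSTM variant.
h_t = (1.0 - z_t) * h_prev + z_t * h_tilde
print(h_t)
```

Notice that with the same hidden size there is one fewer weight matrix per step than in the LSTM, which is where the parameter savings discussed below come from.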
The main advantage of the GRU is its reduced number of parameters compared to the LSTM, usually with little or no compromise in accuracy, which results in faster convergence and a model that generalizes better.
This has been the main reason behind the adoption of GRUs over LSTMs in many current practical scenarios.
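As a rough, back-of-the-envelope illustration of the parameter savings (ignoring framework-specific details such as duplicated bias vectors): an LSTM layer has four gate/candidate transforms while a GRU layer has three, each acting on the concatenation of the hidden state and the input.

```python
def gated_rnn_params(n_in, n_hid, num_transforms):
    """Parameters of one layer: each transform maps [h, x] -> h and has a bias."""
    return num_transforms * (n_hid * (n_hid + n_in) + n_hid)

n_in, n_hid = 128, 256                                          # illustrative layer sizes
lstm_params = gated_rnn_params(n_in, n_hid, num_transforms=4)   # i, f, o gates + candidate
gru_params = gated_rnn_params(n_in, n_hid, num_transforms=3)    # z, r gates + candidate

print(f"LSTM: {lstm_params:,} parameters")   # 4/3 of the GRU count
print(f"GRU:  {gru_params:,} parameters")
```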
I really appreciate your patience in getting through to this point; I know this article was a bit long. I hope you now have a clear picture of LSTMs and GRUs. Any feedback is much appreciated.