Bi-directional RNN & Basics of LSTM and GRU

Madhu Ramiah · Published in Analytics Vidhya · Jul 9, 2019

In my previous blog about Recurrent Neural Networks, we discussed the vanishing and exploding gradient problems. In this blog we will talk about how to handle these issues and which models we can use for that. In addition, we will also talk about bi-directional RNNs and where they are used.

Let’s discuss online and offline modes a little before we proceed to bidirectional RNNs. In text summarization, you basically need to see the whole text before you can write a summary of it; this is called ‘offline’ mode. In contrast, the auto-compose feature you see when composing an email has access only to the previous words in the sequence, not to any future words; this is called ‘online’ mode. In an offline sequence model, even though we know the whole text beforehand, a plain RNN still learns only word by word, from start to end. It would be even more beneficial if the model could also see the future words, so it could solve the problem more effectively. For this case, we use bi-directional RNNs.

Bi-Directional Recurrent Neural Network:

In a bidirectional RNN, we run two separate RNNs over the same sequence: one from left to right (forward) and the other from right to left (backward). But now comes the question: how would you combine the two RNNs? Look at the figure below to get a clear understanding.

Bi-directional RNN for a word sequence

Consider the word sequence “I love mango juice”. The forward layer reads the sequence as is, while the backward layer reads it in reverse order: “juice mango love I”. At each time step, the output is generated by concatenating the hidden states of the forward and backward layers, with the weights learned accordingly. This setup can be used for POS tagging problems as well (a minimal Keras sketch follows).
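As a rough illustration, here is a minimal sketch of a bidirectional RNN for a tagging-style task in Keras. The vocabulary size of 10,000, sequence length of 50, and 20 output tags are made-up values just for this example:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Toy setup (illustrative numbers): 10,000-word vocabulary, sentences padded
# to 50 tokens, and 20 possible tags per token.
model = models.Sequential([
    layers.Input(shape=(50,)),                        # 50 word ids per sentence
    layers.Embedding(input_dim=10000, output_dim=64),
    # The Bidirectional wrapper runs one SimpleRNN left-to-right and another
    # right-to-left, then concatenates their hidden states at each time step.
    layers.Bidirectional(layers.SimpleRNN(32, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(20, activation="softmax")),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

The wrapper’s default merge mode is concatenation, which matches the combination of forward and backward states described above.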

LSTM (Long Short Term Memory)

When the sequences are short, we can use an RNN effectively because vanishing gradients are not much of a problem. But for long sequences there is not much a traditional RNN can do, which is why it wasn’t widely used for such tasks. This is what led to LSTMs, which use a slightly different neuron structure designed with one basic thing in mind: the gradients shouldn’t vanish even if the sequence is very long.

  1. In LSTM, we will refer to a neuron as a cell. In a traditional RNN, the only way the model can remember something is by updating the hidden states and their respective weights. In an LSTM this problem is solved by using an explicit memory unit for learning and remembering tasks; it stores the information that is relevant for learning.
  2. It also uses something called a “gating mechanism”, which regulates the information the network stores: whether to pass the information to the next step or forget it.
  3. The constant error carousel is another very important characteristic of the LSTM. It allows the LSTM to have a smooth, uninterrupted flow of gradients during backpropagation.

LSTM Architecture

  1. The big rectangular box in the figure is called the ‘cell’. At time t it takes an input x(t), the previous hidden state h(t-1), and the previous cell state c(t-1). The cell state is nothing but the explicit memory unit.
  2. The cell gives two outputs at any given time t: the hidden state h(t) and the cell state c(t).
  3. The constant error carousel is responsible for transferring the gradients smoothly from c(t-1) to c(t).
  4. A sigmoid function outputs values between 0 and 1, while a tanh function outputs values between -1 and 1. These are the two main activation functions used in an LSTM.
  5. We feed the concatenation of x(t) and h(t-1) through a sigmoid activation function and multiply its output with the previous cell state c(t-1). This multiplication operation is called a ‘gate’. If the value from the sigmoid function is close to 1, the multiplication yields a value close to c(t-1), which means we erase only a little of the previous memory and retain most of it. On the contrary, if the sigmoid output is close to 0, the multiplication yields a value close to 0, which means we erase almost everything from the previous cell state (memory). This whole part is called the ‘forget gate’.
  6. The next gate is the ‘update gate’, which uses a sigmoid and a tanh function. Their outputs are multiplied together and then added to the output of the forget gate. The tanh function controls how much to increase or decrease the value of the next cell state, while the sigmoid function decides how much of that information should actually be written to the new cell state c(t).
  7. The last gate is the ‘output gate’. Here a sigmoid function is multiplied with the tanh of the cell state, releasing values to the hidden state for both the feed-forward and recurrent sides. The higher the values of the sigmoid and tanh functions, the more information is transmitted to the next hidden state h(t).
  8. You can see that the three sigmoid and one tanh activation functions, whose input is the concatenation of h(t-1) and x(t), each have their own weights, say w(f), w(i), w(c) and w(o). The total number of parameters required to train an LSTM is therefore about four times that of a plain RNN, so the computational cost is considerably higher. To address this, the GRU was introduced. (A one-step sketch of these gates follows this list.)
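To make the gate arithmetic concrete, here is a minimal NumPy sketch of a single LSTM step. It follows the standard LSTM equations rather than any particular library’s implementation, and the weight matrices and biases are assumed to be given:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])     # concatenation of h(t-1) and x(t)
    f = sigmoid(W_f @ z + b_f)            # forget gate: how much of c(t-1) to keep
    i = sigmoid(W_i @ z + b_i)            # update gate: how much new information to write
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate values for the cell state
    c_t = f * c_prev + i * c_tilde        # new cell state: the constant error carousel path
    o = sigmoid(W_o @ z + b_o)            # output gate: how much of the cell state to expose
    h_t = o * np.tanh(c_t)                # new hidden state
    return h_t, c_t
```

Each of W_f, W_i, W_c and W_o is a separate weight matrix, which is where the roughly four-fold parameter increase over a plain RNN comes from.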

Gated Recurrent Unit (GRU)

  1. The GRU was introduced fairly recently, in 2014. It reduces the number of parameters compared to the LSTM, but if the GRU doesn’t work well for a problem, we may have to roll back to the LSTM.
  2. In a GRU, there is no explicit memory unit; the memory is combined with the hidden state of the network.
  3. The GRU merges the LSTM’s forget and update gates into a single update gate, and thus the number of parameters is reduced.
  4. Compared with the LSTM, the GRU performs well, though it may show a slight dip in accuracy. But it has fewer trainable parameters, which makes it advantageous to use (see the parameter comparison sketched below).
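As a rough illustration of the parameter savings, here is a sketch in Keras comparing an LSTM layer and a GRU layer of the same size. The input shape and the 32 units are arbitrary choices for this example, and the exact GRU count also depends on Keras’s bias convention:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(50, 64))       # 50 time steps, 64 features each
lstm_model = tf.keras.Model(inputs, layers.LSTM(32)(inputs))
gru_model = tf.keras.Model(inputs, layers.GRU(32)(inputs))

# The LSTM has four weight blocks (forget, update, candidate, output),
# while the GRU has three (update, reset, candidate), so the GRU layer
# ends up with roughly three quarters of the LSTM layer's parameters.
print("LSTM params:", lstm_model.count_params())
print("GRU params: ", gru_model.count_params())
```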

Conclusion:

We talked about bi-directional RNNs. These days, most vanilla RNNs have been replaced by LSTMs and GRUs, which let us handle sequence data in an extremely effective manner. In my next blog, I will explain how to use RNNs for a POS tagging application in Keras.

Hope you enjoyed my blog. Thanks for reading :) Leave your comments or questions below, or contact me on LinkedIn.
