Recurrent Neural Networks — Complete and In-depth
What is RNN?
A recurrent neural network (RNN) is a type of deep learning network that processes an input sequence step by step, storing context in memory states/cell states, and uses that context to predict future words or sentences.
Why RNN?
RNNs work well with inputs that come in the form of sequences. As an example, consider: “I like eating ice-creams. My favorite is chocolate ____”.
For humans it is obvious to fill the blank with the word ice-cream, but a machine has to understand the context and remember the earlier words in the sentence to predict the next one. This is where RNNs are useful.
Applications: speech recognition (Google Voice Search), machine translation (Google Translate), time-series forecasting, sales forecasting, etc.
Architecture and working of RNN
Let’s consider x11, x12, x13 as inputs and O1, O2, O3 as the outputs of hidden layers 1, 2, and 3 respectively. The inputs are sent to the network at different time steps: x11 is sent to hidden layer 1 at time t1, x12 at t2, and x13 at t3.
Also, let’s assume the weights are shared across time steps during forward propagation.
The output O3 is dependent on O2 which in turn is dependent on O1 as we see below.
O1 = f(x11*w) → where w is the weight and f is the activation function.
O2 = f(O1 + x12*w)
O3 = f(O2 + x13*w)
(For simplicity, the recurrent connection here carries weight 1; in a full RNN the previous output is also multiplied by a recurrent weight.)
Finally, O3 is the predicted output, denoted ŷ.
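The recurrence above can be sketched in code. Below is a minimal NumPy sketch of this simplified model, using tanh as the activation f and treating the recurrent connection with weight 1; the sequence values and the weight are made-up illustrations.

```python
import numpy as np

def rnn_forward(xs, w):
    """Simplified recurrence from the text: O_t = f(O_{t-1} + x_t * w),
    with O_0 = 0 and f = tanh."""
    o = 0.0
    outputs = []
    for x in xs:
        o = np.tanh(o + x * w)   # previous output feeds back into the next step
        outputs.append(o)
    return outputs               # outputs[-1] plays the role of ŷ

y_hat = rnn_forward([0.5, -0.1, 0.3], w=0.8)[-1]
```

Because each step reuses the same weight w, the network has a fixed number of parameters no matter how long the sequence is.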
Now the loss is calculated as (y − ŷ)². The goal is to reduce this loss until ŷ matches y, reaching the global minimum, which establishes the appropriate weights for the network. This is achieved during backpropagation, where optimizers adjust the weights.
An example of the application of the chain rule of differentiation during backward propagation:
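As a concrete (hypothetical) instance of the chain rule at work, the gradient of the squared loss with respect to w in the simplified recurrence above can be computed by backpropagating through time, and checked against a numerical estimate:

```python
import numpy as np

def forward(xs, w):
    o = 0.0
    for x in xs:
        o = np.tanh(o + x * w)
    return o

def loss(xs, w, y):
    return (y - forward(xs, w)) ** 2

def grad_w(xs, w, y):
    """dLoss/dw via the chain rule (backpropagation through time)."""
    o, pre = 0.0, []
    for x in xs:                  # forward pass, keeping pre-activations
        a = o + x * w
        o = np.tanh(a)
        pre.append(a)
    dL_do = -2.0 * (y - o)        # dLoss/dO_T
    g = 0.0
    for t in reversed(range(len(xs))):
        d_a = dL_do * (1 - np.tanh(pre[t]) ** 2)  # through tanh
        g += d_a * xs[t]          # direct contribution of w at step t
        dL_do = d_a               # gradient flowing back into O_{t-1}
    return g

xs, w, y = [0.5, -0.1, 0.3], 0.8, 1.0
eps = 1e-6
numeric = (loss(xs, w + eps, y) - loss(xs, w - eps, y)) / (2 * eps)
```

The analytic gradient from `grad_w` and the finite-difference estimate `numeric` agree to many decimal places, which is exactly what the chain rule guarantees.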
Bi-Directional RNN
Example: “I’m ____ hungry, and I can eat 3 large pizzas in one go for lunch today.” Forget machines, even humans cannot pick an appropriate word for the blank without reading the entire sentence. In this scenario we use Bi-Directional Recurrent Neural Nets, which provide not only information from the past but also information from the future.
The concept of a Bi-Directional RNN is to couple two hidden layers that read the same input in opposite directions and combine their outputs. The key idea is that the output of the hidden layer of interest carries information from both the past and the future. See the architecture below.
To make it clear: to predict the output ŷ13 we have O1, O2 (from the forward direction) and also O'3 (from the reverse direction). The drawback of a Bi-RNN is that it is slow, since the sequence must be processed in both directions.
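A sketch of the idea, reusing the same simplified tanh recurrence as before: run it once left-to-right and once right-to-left, then pair the two states at each position (the weights here are arbitrary illustrations):

```python
import numpy as np

def rnn_pass(xs, w):
    o, outs = 0.0, []
    for x in xs:
        o = np.tanh(o + x * w)
        outs.append(o)
    return outs

def birnn(xs, w_fwd, w_bwd):
    fwd = rnn_pass(xs, w_fwd)                # left-to-right states (the past)
    bwd = rnn_pass(xs[::-1], w_bwd)[::-1]    # right-to-left states, re-aligned
    # each position now sees both the past (fwd) and the future (bwd)
    return [(f, b) for f, b in zip(fwd, bwd)]

states = birnn([0.5, -0.1, 0.3], w_fwd=0.8, w_bwd=0.6)
```

Note the two passes are independent, which is why a Bi-RNN costs roughly twice the computation of a one-directional RNN.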
Drawbacks of RNN -
1. Vanishing gradient problem — This occurs with certain activation functions (such as sigmoid, whose derivative is small). During backpropagation the weight updates shrink from layer to layer, and at some point the new weight becomes practically equal to the old weight; nothing changes, and training the network becomes difficult.
2. Exploding gradient problem — Here the weight updates are so huge that the network becomes unstable and cannot learn from the training data, so the global minimum can never be reached.
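Both problems stem from gradients being multiplied step after step during backpropagation through time. A toy illustration (the per-step factors here are made up):

```python
def gradient_magnitude(factor, steps):
    """Gradient through `steps` time steps is roughly a product of
    per-step factors; repeated multiplication shrinks or blows it up."""
    g = 1.0
    for _ in range(steps):
        g *= factor
    return g

vanishing = gradient_magnitude(0.5, 50)   # factor < 1: shrinks toward 0
exploding = gradient_magnitude(1.5, 50)   # factor > 1: blows up
```

With a per-step factor below 1 the gradient after 50 steps is effectively zero, and with a factor above 1 it is astronomically large; neither gives a usable weight update for the early time steps.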
Therefore, LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Units) serve better.
LSTM — Long Short-Term Memory
· LSTMs mitigate the vanishing gradient problem.
· LSTMs have two states, the hidden state and the cell state, as opposed to RNNs, which have only a hidden state.
· LSTMs forget some information that is not important when the context changes thus working very efficiently even for long sentences, which is not the case with RNN.
Architecture and Working of LSTM
The main components of LSTM are-
1. Memory Cell
2. Input Gate
3. Forget Gate
4. Output Gate
Below is the structure of the LSTM. Let’s understand its operation.
1. Forget Gate
Here, the inputs ht-1 and xt are passed to the sigmoid activation function, which outputs values between 0 and 1: 0 means completely forget and 1 means completely retain the information. The sigmoid function is used precisely because it acts as a gate.
Note: bf is the bias and Wf is the combined weight of the 2 inputs.
2. Input Gate
The purpose of this stage is to identify new information and add it to the cell state. This is done in two steps.
Step 1: The sigmoid layer outputs a value between 0 and 1 based on the inputs ht-1 and xt, as seen in the diagram above. At the same time, these inputs are passed to the tanh layer, which outputs values between -1 and 1, creating a vector of candidate values.
Step 2: The outputs of the sigmoid layer and the tanh layer are multiplied element-wise.
Now the cell state is updated from Ct-1 (the previous cell state) to Ct (the current cell state), as we see above.
3. Output Gate
First, the cell state is passed through the tanh function, and simultaneously the inputs ht-1 and xt are sent through the sigmoid layer. The two results are then multiplied, and ht, the output of this memory cell, is passed on to the next cell.
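Putting the three gates together, a single LSTM step can be sketched as follows. This is a minimal NumPy sketch with random, untrained placeholder weights; stacking all four gate weights into one matrix is a common convention, not the article's figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_prev; x_t] to the 4 gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])          # forget gate: keep/drop old cell state
    i = sigmoid(z[H:2*H])        # input gate: how much new info to add
    g = np.tanh(z[2*H:3*H])      # candidate values for the cell state
    o = sigmoid(z[3*H:4*H])      # output gate
    c_t = f * c_prev + i * g     # update cell state: forget, then add
    h_t = o * np.tanh(c_t)       # hidden state, the output of this cell
    return h_t, c_t

rng = np.random.default_rng(0)
H, D = 4, 3                                  # hidden size, input size
W = rng.standard_normal((4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((5, D)):        # a sequence of 5 input vectors
    h, c = lstm_step(x, h, c, W, b)
```

Because the cell state is updated additively (f * c_prev + i * g) rather than being squashed through an activation at every step, gradients can flow back through long sequences far more easily than in a plain RNN.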
Gated Recurrent Unit
GRUs are used for faster computation and lower memory consumption; LSTMs perform better when accuracy is key. GRUs do not have a cell state, only a hidden state.
Architecture and Working of GRU
Main Components of GRU are-
1. Update Gate (zt)
2. Reset Gate (rt)
The below diagram represents GRU
1. Update Gate — determines the amount of information that must be passed forward.
Here, W(z) is the weight associated with xt, U(z) is the weight associated with the input from the previous state ht-1, and σ is the sigmoid activation function.
The output zt will be between 0 and 1, determining how much information is passed on.
2. Reset Gate — decides the amount of past information to forget.
Here, W(r) is the weight associated with xt, U(r) is the weight associated with the input from the previous state ht-1, and σ is the sigmoid activation function.
The output rt will be between 0 and 1, determining how much information is forgotten.
Now, the important step is computing the current memory content, which uses the reset gate. The reset gate picks out the important information (values close to 1) and suppresses the rest (values close to 0).
Mathematically we calculate as below,
Now, finally, we use the formula below.
Using this formula we calculate the current state ht as an interpolation, controlled by the update gate, between the previous state and the current memory content; ht is then passed on to the succeeding cells.
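The update gate, reset gate, candidate state, and final interpolation can be combined into one step. Below is a minimal NumPy sketch with untrained placeholder weights; the W/U naming follows the formulas above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate zt
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate rt
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev))  # current memory content
    # interpolate between the old state and the candidate via the update gate
    return (1 - z) * h_prev + z * h_cand

rng = np.random.default_rng(1)
H, D = 4, 3                                   # hidden size, input size
Wz, Wr, Wh = (rng.standard_normal((H, D)) * 0.1 for _ in range(3))
Uz, Ur, Uh = (rng.standard_normal((H, H)) * 0.1 for _ in range(3))
h = np.zeros(H)
for x in rng.standard_normal((5, D)):         # a sequence of 5 input vectors
    h = gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh)
```

Compared with the LSTM step, there is no separate cell state and one fewer gate, which is where the speed and memory savings come from.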
Sequence to Sequence Learning
The idea behind sequence to sequence learning is to map an input sequence to an output sequence, for example converting input received in one language into another language. Ex: English → Somali.
Types of Sequence to Sequence Learning
1. Sequence to Sequence — the number of outputs equals the number of inputs.
2. Sequence to Vector — a single output is given for ’n’ inputs.
3. Vector to Sequence — ’n’ outputs are produced for 1 input.
4. Vector to Vector — a single output is produced for a single input.
The below diagram summarizes the architecture of the above 4 learning methods
Encoders — Decoders / Sutskever Neural Machine Translation Model
It is not always the case that the input sequence and the output sequence are of the same length. Example —
In the above translation, the English phrase has 3 words but the Somali one has 2. In this scenario, encoders and decoders are employed.
Architecture and Working of Encoders-Decoders
Encoders are input networks that consist of LSTM or GRU cells and Decoders are output networks that are also made up of LSTM or GRU cells.
Encoder — We input A, B, C words to the encoder network and we get a context vector ‘w’ which has summarized information of the inputs.
Note: When the network hits <EOS> it stops the process.
Decoder — The context vector ‘w’ is sent to the decoder network as we see in the diagram above. For each of the inputs to the decoder network, we get output (X, Y, Z).
The final output of the decoder network is compared with the target sequence and the loss is calculated. This loss is reduced until the predicted outcome equals the actual outcome, using optimizers during backpropagation.
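The flow above can be sketched with plain tanh RNN cells standing in for the LSTM/GRU cells (an illustrative, untrained sketch; the <EOS> handling is simplified to a fixed number of decoder steps):

```python
import numpy as np

def rnn_step(x, h, Wx, Wh):
    return np.tanh(Wx @ x + Wh @ h)

def encode(xs, Wx, Wh, H):
    h = np.zeros(H)
    for x in xs:                 # read the input words A, B, C ...
        h = rnn_step(x, h, Wx, Wh)
    return h                     # the context vector 'w'

def decode(context, steps, Wh, Wy):
    h, outs = context, []
    for _ in range(steps):       # emit X, Y, Z ... (<EOS> simplified away)
        h = np.tanh(Wh @ h)
        outs.append(Wy @ h)      # per-step output, e.g. token scores
    return outs

rng = np.random.default_rng(2)
H, D = 4, 3                                  # hidden size, "word" vector size
Wx = rng.standard_normal((H, D)) * 0.1
Wh = rng.standard_normal((H, H)) * 0.1
Wy = rng.standard_normal((D, H)) * 0.1
ctx = encode(rng.standard_normal((3, D)), Wx, Wh, H)   # 3 input "words"
out = decode(ctx, steps=2, Wh=Wh, Wy=Wy)               # 2 output "words"
```

Note the input has 3 steps and the output 2, mirroring the English → Somali example: the fixed-size context vector is what decouples the two lengths.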
The drawback of the encoder-decoder is that the context vector must summarize the whole input sequence, yet not all the words in the input sequence are valuable enough to include in that summary. This is overcome by using attention-based models.
Attention Models
Concept: imagine you are listening to a speech; at the end of it you will not remember each and every word uttered by the speaker, but you will retain the gist or summary of the speech. This is the concept behind attention models.
Architecture and Working of Attention Model
We have a neural network between the encoder and the decoder, and its output is the input to the decoder. The key point is that this network's output emphasizes whichever input has the maximum attention or focus, i.e., the word that matters most for the current prediction.
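A minimal sketch of that focusing step, assuming dot-product scoring (one common choice): score each encoder state against the current decoder state, softmax the scores into attention weights, and take the weighted sum as the context passed to the decoder:

```python
import numpy as np

def attention(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state   # one score per input word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax: total focus sums to 1
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights

enc = np.array([[1.0, 0.0],                   # 3 encoder states (made up)
                [0.0, 1.0],
                [1.0, 1.0]])
ctx, w = attention(np.array([1.0, 0.0]), enc)
```

Unlike the single fixed context vector of the plain encoder-decoder, this context is recomputed at every decoding step, so each output word can focus on different input words.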
To learn advanced concepts refer to the amazing articles linked below-
Transformers — http://jalammar.github.io/illustrated-transformer/
BERT — http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
GPT-3 — http://jalammar.github.io/how-gpt3-works-visualizations-animations/
Acknowledgments —
- Krish Naik — https://www.youtube.com/user/krishnaik06/featured
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Shriram Vasudevan — https://www.youtube.com/channel/UCma2b1uVLajAq9nHSEJh9HQ
- Sequence to Sequence Learning -https://papers.nips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf
Reach me at —
Email — tejasta@gmail.com
LinkedIn — https://www.linkedin.com/in/tejasta/
Thanks for reading!