Recurrent Neural Networks — Complete and In-depth

Tejas T A
Published in Analytics Vidhya · 8 min read · Dec 2, 2020

What is RNN?

A recurrent neural network (RNN) is a type of deep learning network that remembers the input sequence, stores it in memory states/cell states, and uses it to predict the upcoming words/sentences.

Why RNN?

RNNs work well with inputs that come in the form of sequences. As an example, consider: I like eating ice-creams. My favorite is chocolate ____.

For humans, it is obvious to fill the blank with the word ice-cream, but a machine has to understand the context and remember the previous words in the sentence to predict the subsequent word. This is where RNNs are useful.

Applications: speech recognition (Google Voice Search), machine translation (Google Translate), time-series forecasting, sales forecasting, etc.

Architecture and working of RNN

Let’s consider x11, x12, and x13 as inputs and O1, O2, and O3 as the outputs of hidden layers 1, 2, and 3 respectively. The inputs are fed to the network at different time steps: x11 is sent to hidden layer 1 at time t1, x12 at t2, and x13 at t3.

Also, let’s assume the weights are shared (the same) across time steps in the forward propagation.

The output O3 is dependent on O2 which in turn is dependent on O1 as we see below.

O1 = f(x11*w), where w is the weight and f is the activation function.

O2 = f(O1+x12*w)

O3 = f(O2 + x13*w)

Finally, the output O3 is the network's prediction, indicated by ŷ.
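
For illustration, here is a minimal Python/NumPy sketch of this forward pass, following the simplified equations above; the input values, the single shared weight w, and the tanh activation are assumptions made just for this example.

```python
import numpy as np

def forward_pass(inputs, w, f=np.tanh):
    """Unrolled forward pass of the simplified RNN described above.

    inputs: the sequence [x11, x12, x13] fed in at times t1, t2, t3
    w:      the single shared weight
    f:      the activation function
    """
    o = 0.0
    outputs = []
    for x in inputs:           # one time step per input
        o = f(o + x * w)       # O_t = f(O_{t-1} + x_t * w), with O_0 = 0
        outputs.append(o)
    return outputs             # [O1, O2, O3]

O1, O2, O3 = forward_pass([0.5, 0.1, 0.9], w=0.7)   # illustrative inputs x11, x12, x13
print("y_hat =", O3)           # the last output is the prediction ŷ
```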

Architecture and Working of Simple RNN

Now, the loss function is calculated as (y - ŷ)^2. The goal is to reduce the loss to the point where y = ŷ, i.e., to reach the global minimum, which establishes the appropriate weights for the network. This is achieved in backpropagation by using optimizers to adjust the weights.

An example of the application of the chain rule of differentiation during backward propagation:

Chain Rule of Differentiation
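
Using the simplified forward equations above, with loss L = (y - ŷ)^2 and ŷ = O3, a sketch of how the chain rule unrolls the gradient of the loss with respect to the shared weight w is:

∂L/∂w = ∂L/∂O3 · [ ∂O3/∂w + ∂O3/∂O2 · ( ∂O2/∂w + ∂O2/∂O1 · ∂O1/∂w ) ]

The gradient has to flow back through every earlier time step, which is exactly where the vanishing and exploding gradient problems discussed below come from.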

Bi-Directional RNN

Example: I’m ____ hungry, and I can eat 3 large pizzas in one go for lunch today. Forget machines; even humans cannot predict an appropriate word for the blank without reading the entire sentence. In this scenario, we make use of Bi-Directional Recurrent Neural Nets, which not only carry information from the past but also hold information from the future.

The concept of a Bi-Directional RNN is to couple 2 hidden layers that receive the same input but process it in opposite directions, and to combine their outputs. The benefit is that the output for a particular hidden layer of interest will have information from the past and also from the future. See the architecture below,

Bi-Directional RNN architecture

To make it clear, to predict the output ŷ13 we have O1 and O2 (from the forward direction), and also O'3 (from the reverse direction). The drawback of Bi-RNNs is that they are slow.
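
As a rough sketch (not from the original article), a bi-directional layer can be built in Keras by wrapping a recurrent layer with Bidirectional; the vocabulary size, sequence length, and layer sizes below are arbitrary illustrative values.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

vocab_size, seq_len = 10000, 20              # illustrative values

model = Sequential([
    Embedding(vocab_size, 64),               # word indices -> dense vectors
    Bidirectional(LSTM(32)),                 # reads the sequence forwards and backwards
    Dense(vocab_size, activation="softmax")  # e.g. predict the missing word
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

dummy_batch = np.random.randint(0, vocab_size, size=(2, seq_len))
print(model(dummy_batch).shape)              # (2, vocab_size)
```

Because Bidirectional runs one recurrent layer over the sequence in each direction and concatenates their outputs, the prediction can use both past and future context, at roughly twice the computation of a one-directional RNN.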

Drawbacks of RNN -

1. Vanishing gradient problem - This occurs when we use certain activation functions. During backpropagation, the weight updates become very small from layer to layer, and at some point the new weight becomes practically equal to the old weight; there is no change, and training the network becomes difficult.

2. Exploding gradient problem - In this case, the weight updates are so huge that the network cannot learn from the training data, and hence the global minimum can never be reached.

Therefore, LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) serve better.

LSTM — Long Short-Term Memory

· LSTMs solve the vanishing gradient problem.

· LSTMs have 2 states, i.e., a hidden state and a cell state, as opposed to RNNs, which only have a hidden state.

· LSTMs forget information that is no longer important when the context changes, thus working efficiently even for long sentences, which is not the case with RNNs.

Architecture and Working of LSTM

The main components of an LSTM are:

1. Memory Cell

2. Input Gate

3. Forget Gate

4. Output Gate

Below is the structure of an LSTM. Let’s understand its operation.

LSTM Architecture
1. Forget Gate

Here, the inputs ht-1 and xt are passed to the sigmoid activation function, which outputs values between 0 and 1, where 0 means completely forget and 1 means completely retain the information. We use the sigmoid function because it acts as a gate.

Note: bf is the bias and Wf is the combined weight of the 2 inputs.
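
Following the standard LSTM notation (as in the Understanding LSTMs post listed in the acknowledgments), this gate can be written as ft = σ(Wf·[ht-1, xt] + bf), where [ht-1, xt] denotes the concatenation of the two inputs.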

2. Input Gate


The motive of this stage is to identify new information and add it to the cell state. This is done in 2 steps.

Step 1: The sigmoid layer outputs a value between 0 and 1 based on the inputs ht-1 and xt, as seen in the diagram above. At the same time, these inputs are passed to the tanh layer, which outputs values between -1 and 1 and creates a vector of candidate values.

Step 2: The outputs of the sigmoid layer and the tanh layer are multiplied.

Updating the cell state

Now, the cell state is updated from Ct-1 (the previous cell state) to Ct (the current cell state), as we see above.
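
In the same standard notation, the two steps and the update above correspond roughly to: it = σ(Wi·[ht-1, xt] + bi) for the sigmoid layer, C̃t = tanh(WC·[ht-1, xt] + bC) for the tanh layer, and Ct = ft*Ct-1 + it*C̃t for the cell-state update.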

3. Output Gate


First, the cell state is passed through a tanh function; simultaneously, we send the inputs ht-1 and xt to the sigmoid layer. The two results are then multiplied, and ht, the output of this memory cell, is passed on to the next cell.
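
In the same notation, ot = σ(Wo·[ht-1, xt] + bo) and ht = ot*tanh(Ct). Putting all the gates together, here is a minimal NumPy sketch of a single LSTM step; the sizes and the randomly initialized weights are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the gate equations above."""
    concat = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ concat + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ concat + b["i"])      # input gate
    c_hat = np.tanh(W["c"] @ concat + b["c"])    # candidate values
    c_t = f_t * c_prev + i_t * c_hat             # updated cell state
    o_t = sigmoid(W["o"] @ concat + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                     # new hidden state / cell output
    return h_t, c_t

hidden, inputs = 4, 3                            # illustrative sizes
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((hidden, hidden + inputs)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}
h_t, c_t = lstm_step(rng.standard_normal(inputs), np.zeros(hidden), np.zeros(hidden), W, b)
print(h_t)
```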

Gated Recurrent Unit

GRUs are used for faster computation and lower memory consumption, while LSTMs perform better when accuracy is the key. GRUs do not have a cell state, only a hidden state.

Architecture and Working of GRU

Main components of a GRU are:

1. Update Gate (zt)

2. Reset Gate (rt)

The diagram below represents a GRU.

Block Diagram of Gated Recurrent Unit
1. Update Gate - determines the amount of information that must be passed forward.
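
In the usual notation, this gate is computed as zt = σ(W(z)·xt + U(z)·ht-1).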

Here, W(z) is the weight associated with xt, U(z) is the weight associated with the previous hidden state ht-1, and σ is the sigmoid activation function.

The output zt will be between 0 and 1 and determines how much information is passed on.

2. Reset Gate - decides the amount of past information to forget.
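
Analogously, this gate is computed as rt = σ(W(r)·xt + U(r)·ht-1).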

Here, W(r) is the weight associated with xt, U(r) is the weight associated with the previous hidden state ht-1, and σ is the sigmoid activation function.

The output rt will be between 0 and 1 and determines how much information is forgotten.

Now, the important step is computing the current memory content (the candidate hidden state), which uses the reset gate. The reset gate retains the important information (values close to 1) and drops the rest (values close to 0).

Mathematically, we calculate this memory content as below:
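
In the standard GRU formulation, the current memory content (candidate hidden state) is h̃t = tanh(W·xt + rt ⊙ (U·ht-1)), where ⊙ is element-wise multiplication, so the reset gate scales how much of the previous hidden state is used.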

Now, finally, we use the formula below:
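
One common form of this final step is ht = zt ⊙ ht-1 + (1 - zt) ⊙ h̃t, an interpolation between the previous hidden state and the new memory content, controlled by the update gate.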

Using this formula, we calculate the current hidden state ht, which is passed on to the succeeding cells.
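
For completeness, here is a minimal NumPy sketch of a single GRU step following the equations above; as before, the sizes and random weights are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU time step following the equations above."""
    z_t = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])             # update gate
    r_t = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])             # reset gate
    h_hat = np.tanh(W["h"] @ x_t + r_t * (U["h"] @ h_prev) + b["h"])   # memory content
    return z_t * h_prev + (1.0 - z_t) * h_hat                          # new hidden state

hidden, inputs = 4, 3                                                  # illustrative sizes
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((hidden, inputs)) for k in "zrh"}
U = {k: rng.standard_normal((hidden, hidden)) for k in "zrh"}
b = {k: np.zeros(hidden) for k in "zrh"}
print(gru_step(rng.standard_normal(inputs), np.zeros(hidden), W, U, b))
```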

Sequence to Sequence Learning

The idea behind sequence-to-sequence learning is that input data received in one language is converted into another language. Ex: English → Somali.

Types of Sequence to Sequence Learning

1. Sequence to Sequence - the number of outputs equals the number of inputs.

2. Sequence to Vector - a single output is produced for 'n' inputs.

3. Vector to Sequence - 'n' outputs are produced for a single input.

4. Vector to Vector - a single output is produced for a single input.

The diagram below summarizes the architecture of the above 4 learning methods.

4 methods of Seq-Seq learning

Encoders — Decoders / Sutskever Neural Machine Translation Model

It is not always the case that the input sequence and output sequence will be of the same length. Example —

Ex of application of encoder-decoder

In the above translation, we see that the English sentence has 3 words but the Somali one has 2. In this scenario, encoders and decoders are employed.

Architecture and Working of Encoders-Decoders

Encoders are input networks consisting of LSTM or GRU cells, and decoders are output networks that are also made up of LSTM or GRU cells.

Encoder-Decoder architecture

Encoder - We input the words A, B, C to the encoder network and get a context vector 'w' that summarizes the information in the inputs.

Note: When the network hits the <EOS> (end of sequence) token, it stops the process.

Decoder - The context vector 'w' is sent to the decoder network, as we see in the diagram above. For each input to the decoder network, we get an output (X, Y, Z).

The final output of the decoder network is compared with the target sequence and the loss is calculated. The loss is then minimized, using optimizers during backpropagation, until the predicted outcome matches the actual outcome.
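
A rough Keras sketch of this setup is shown below; the token counts and latent dimension are made-up illustrative values, and the teacher-forcing training layout is a common way to train such a model rather than anything specific from the article.

```python
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

num_encoder_tokens, num_decoder_tokens, latent_dim = 71, 93, 256   # illustrative

# Encoder: read the input sequence and keep only its final states (the context vector 'w')
encoder_inputs = Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: start from the context vector and emit the output sequence step by step
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```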

The drawback of encoder-decoder models - the context vector has to summarize the whole input sequence, but not all the words in the input sequence are equally valuable to include in that summary. This is overcome by using an attention-based model.

Attention Models

Concept: Imagine you are listening to a speech; at the end of it, you will not remember each and every word uttered by the speaker, but you will retain the gist or summary of the speech. This is the concept behind Attention models.

Architecture and Working of Attention Model

We have a neural network between the encoder and the decoder, and the output of this network is the input to the decoder. The key point is that this output emphasizes the input word that has the maximum attention or focus, i.e., the word that is most important for the prediction.
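
One minimal way to sketch this idea in Keras is the built-in dot-product Attention layer, which scores every encoder output against each decoder state and returns a weighted summary of the most relevant inputs; this is an illustration under those assumptions, not the article's exact architecture.

```python
from tensorflow.keras.layers import Input, LSTM, Dense, Attention, Concatenate
from tensorflow.keras.models import Model

num_encoder_tokens, num_decoder_tokens, latent_dim = 71, 93, 256   # illustrative

encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_outputs, state_h, state_c = LSTM(
    latent_dim, return_sequences=True, return_state=True)(encoder_inputs)

decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_outputs, _, _ = LSTM(
    latent_dim, return_sequences=True, return_state=True)(
    decoder_inputs, initial_state=[state_h, state_c])

# For each decoder step, attend over all encoder outputs and take a weighted sum
context = Attention()([decoder_outputs, encoder_outputs])
merged = Concatenate()([decoder_outputs, context])
outputs = Dense(num_decoder_tokens, activation="softmax")(merged)

model = Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```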

The architecture of Attention Based Model

To learn advanced concepts, refer to the amazing articles linked below:

Transformers - http://jalammar.github.io/illustrated-transformer/

BERT - http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

GPT-3 - http://jalammar.github.io/how-gpt3-works-visualizations-animations/

Acknowledgments

1. Krish Naik - https://www.youtube.com/user/krishnaik06/featured
2. Understanding LSTMs - https://colah.github.io/posts/2015-08-Understanding-LSTMs/
3. Shriram Vasudevan - https://www.youtube.com/channel/UCma2b1uVLajAq9nHSEJh9HQ
4. Sequence to Sequence Learning - https://papers.nips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf

Reach me at —

Email — tejasta@gmail.com

LinkedIn — https://www.linkedin.com/in/tejasta/

Thanks for reading!
