Understanding Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM)
In this article, we are going to understand Recurrent Neural Networks and Long Short-Term Memory. We will go through the basics and how they work. So let's get started.
RNN stands for Recurrent Neural Network. It is a type of neural network that contains memory and is best suited for sequential data. RNNs are used by Apple's Siri and Google's Voice Search. Let's discuss some basic concepts of RNN.
Before going deeper, let's first understand forward propagation and backward propagation.
Forward and Backward Propagation
Forward Propagation: This is the simplest type of neural network. Data flows only in the forward direction, from the input layer through the hidden layers to the output layer. The network may contain one or more hidden layers, and all the nodes are fully connected. We perform forward propagation to get the output of the model, check its accuracy, and compute the error.
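As a concrete illustration, here is a minimal forward-propagation sketch, assuming a tiny one-hidden-layer network; all sizes and weights are made up for illustration:

```python
# A minimal forward pass: input -> hidden -> output, then compute the error.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -0.2, 0.1])        # input layer (3 features)
W1 = np.random.randn(4, 3) * 0.1      # input -> hidden weights
b1 = np.zeros(4)
W2 = np.random.randn(1, 4) * 0.1      # hidden -> output weights
b2 = np.zeros(1)

h = sigmoid(W1 @ x + b1)              # hidden layer activation
y_hat = sigmoid(W2 @ h + b2)          # model output
error = (0.0 - y_hat) ** 2            # squared error against a target of 0.0
```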
Backward Propagation: Back-propagation is the method used to train neural networks. (If there are many hidden layers, the network may be referred to as a deep neural network.) Once forward propagation is complete, we calculate the error. This error is then back-propagated through the network to update the weights.
We go backward through the neural network to find the partial derivatives of the error (loss function) with respect to the weights. Each partial derivative is multiplied by the learning rate to calculate the step size. This step size is then subtracted from the original weight to obtain the new weight. That is how a neural network learns during the training process.
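For instance, a single update for one weight might look like this (the gradient value here is made up):

```python
# One gradient-descent step: step size = learning rate * gradient,
# subtracted from the weight so the error decreases.
learning_rate = 0.01
gradient = 0.8                  # dError/dWeight (an illustrative value)
step_size = learning_rate * gradient
weight = 0.5
weight = weight - step_size     # the new weight after one update
```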
1. What is RNN?
A Recurrent Neural Network is a generalization of a feed-forward neural network that has an internal memory. An RNN is recurrent in nature: it performs the same function for every input, while the output for the current input depends on the previous computation. After producing the output, the output is copied and sent back into the recurrent network. When making a decision, the network considers the current input together with the output it has learned from the previous input.
Unlike feed-forward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. In other neural networks, all the inputs are independent of each other. But in an RNN, all the inputs are related to each other.
Best suited for sequential data
RNN is best suited for sequential data. It can handle arbitrary input/output lengths, and it uses its internal memory to process arbitrary sequences of inputs.
This makes RNNs well suited for predicting what comes next in a sequence of words. Like a human brain, particularly in conversations, more weight is given to recent information when anticipating the rest of a sentence.
For example, an RNN trained to translate text might learn that "dog" should be translated differently if it is preceded by the word "hot".
RNN has internal memory
An RNN has memory capabilities: it memorizes previous data. While making a decision, it takes into consideration the current input as well as what it has learned from the inputs it received previously. The output from the previous step is fed as input to the current step, creating a feedback loop.
So, it calculates its current state using the current input and the previous state. In this way, information cycles through a loop.
In a nutshell, we can say that an RNN has two inputs: the present and the recent past. This is important because a sequence of data contains crucial information about what is coming next, which is why an RNN can do things other algorithms can't. A minimal sketch of this recurrence follows.
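Here is that recurrence, h_t = tanh(W_xh · x_t + W_hh · h_(t-1) + b), written out in code; all sizes and weights are made up for illustration:

```python
# A minimal RNN cell: the new state depends on the current input AND the
# previous state, which is how the "memory" loop works.
import numpy as np

input_size, hidden_size = 3, 5
W_xh = np.random.randn(hidden_size, input_size) * 0.1   # input -> hidden
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden -> hidden
b_h = np.zeros(hidden_size)

h = np.zeros(hidden_size)                     # previous state starts at zero
for x_t in np.random.randn(4, input_size):    # a toy sequence of 4 inputs
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # current state from input + past
```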
The different types of RNN are:
- One to One RNN
- One to Many RNN
- Many to One RNN
- Many to Many RNN
We have reviewed the basic idea of RNN; now let's move on to the different types of RNN and explore them in depth.
Types of RNN
So we have established that recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while maintaining hidden states. RNN models are mostly used in the fields of natural language processing and speech recognition. Let's look at the types:
One to One RNN
One to One RNN (Tx=Ty=1) is the most basic and traditional type of neural network, giving a single output for a single input. It is also known as a Vanilla Neural Network and is used to solve regular machine learning problems.
One to Many
One to Many (Tx=1, Ty>1) is a kind of RNN architecture applied in situations that produce multiple outputs for a single input. A basic example of its application is music generation: RNN models are used to generate a piece of music (multiple outputs) from a single musical note (single input), along the lines of the sketch below.
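A rough sketch of the one-to-many pattern, where each output is fed back as the next input; all sizes and weights are illustrative:

```python
# One-to-Many: a single input seeds the state, then the network unrolls,
# feeding each output back in as the next input (as in music generation).
import numpy as np

hidden_size = 8
W_xh = np.random.randn(hidden_size, 1) * 0.1
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1
W_hy = np.random.randn(1, hidden_size) * 0.1

x = np.array([0.7])                  # the single input (Tx = 1), e.g. one note
h = np.zeros(hidden_size)
outputs = []
for _ in range(5):                   # Ty = 5 outputs
    h = np.tanh(W_xh @ x + W_hh @ h)
    x = W_hy @ h                     # the output becomes the next input
    outputs.append(x)
```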
Many to One
Many-to-One RNN architecture (Tx>1, Ty=1) is commonly seen in sentiment analysis models. As the name suggests, this kind of model is used when multiple inputs are required to produce a single output.
Take, for example, a Twitter sentiment analysis model. In that model, a text input (words as multiple inputs) yields a fixed sentiment (single output). Another example is a movie rating model that takes review text as input and produces a rating from 1 to 5. A minimal sketch of the pattern follows.
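A rough many-to-one sketch along those lines; all sizes and weights are illustrative:

```python
# Many-to-One: read the whole sequence, then make one prediction
# from the final hidden state (e.g. positive vs. negative sentiment).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_size, hidden_size = 3, 8
W_xh = np.random.randn(hidden_size, input_size) * 0.1
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1
W_hy = np.random.randn(1, hidden_size) * 0.1

sequence = np.random.randn(6, input_size)    # Tx = 6 toy "word vectors"
h = np.zeros(hidden_size)
for x_t in sequence:
    h = np.tanh(W_xh @ x_t + W_hh @ h)
sentiment = sigmoid(W_hy @ h)                # Ty = 1: a single score in (0, 1)
```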
Many-to-Many
As is pretty evident, the Many-to-Many RNN (Tx>1, Ty>1) architecture takes multiple inputs and gives multiple outputs, but Many-to-Many models come in two kinds:
1. Tx = Ty:
This refers to the case when the input and output layers have the same size. This can also be understood as every input having an output, and a common application is Named Entity Recognition.
2. Tx != Ty:
Many-to-Many architecture can also be found in models where the input and output layers are of different sizes; the most common application of this kind of RNN architecture is machine translation. For example, "I love you", the three magical words of the English language, translates to only two words in Spanish: "te amo". Thus, machine translation models can return more or fewer words than the input string, because a non-equal Many-to-Many RNN architecture works in the background, as sketched below.
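A rough encoder-decoder sketch of this non-equal Many-to-Many pattern; all names and sizes are illustrative:

```python
# Many-to-Many with Tx != Ty: the encoder compresses the input sequence into
# a state, and the decoder unrolls that state into a sequence of a different
# length (in the spirit of machine translation).
import numpy as np

input_size, hidden_size = 3, 8
W_enc_x = np.random.randn(hidden_size, input_size) * 0.1
W_enc_h = np.random.randn(hidden_size, hidden_size) * 0.1
W_dec_h = np.random.randn(hidden_size, hidden_size) * 0.1
W_dec_y = np.random.randn(input_size, hidden_size) * 0.1

source = np.random.randn(3, input_size)   # Tx = 3 ("I love you")
h = np.zeros(hidden_size)
for x_t in source:                        # encoder: read the input
    h = np.tanh(W_enc_x @ x_t + W_enc_h @ h)

target = []
for _ in range(2):                        # decoder: Ty = 2 ("te amo")
    h = np.tanh(W_dec_h @ h)
    target.append(W_dec_y @ h)
```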
Vanishing and Exploding Gradients
Let's first understand: what is a gradient?
Gradient: As discussed in the back-propagation section above, a gradient is a partial derivative of a function with respect to its inputs. A gradient measures how much the output of a function changes if you change the inputs a little bit. You can also think of a gradient as the slope of a function: the higher the gradient, the steeper the slope, and the faster a model can learn. If the slope is almost zero, the model stops learning. A gradient simply measures the change in all weights with regard to the change in error.
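A quick numerical illustration of "how much the output changes if you change the input a little bit", using a toy function and finite differences:

```python
# Estimate the slope of f at w by nudging the input a tiny amount.
def f(w):
    return w ** 2        # a toy loss function

w, eps = 1.5, 1e-6
gradient = (f(w + eps) - f(w - eps)) / (2 * eps)   # ~= 2 * w = 3.0
```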
Gradient issues in RNN
While training an RNN, the gradients can sometimes become too small or too large, which makes training very difficult. This leads to the following issues:
1. Poor Performance
2. Low Accuracy
3. Long Training Period
Exploding Gradient: The exploding gradient issue occurs when excessively large importance is assigned to the weights. In this case, the values of the gradient become too large and the slope tends to grow exponentially. This can be solved using the following methods:
1. Identity Initialization
2. Truncated Back-propagation
3. Gradient Clipping (sketched below)
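As a sketch of the third method, gradient clipping simply rescales any gradient whose norm exceeds a chosen threshold; the threshold and values here are made up:

```python
# If the gradient norm exceeds max_norm, scale the gradient back down.
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)   # rescale to the maximum norm
    return grad

grad = np.array([30.0, -40.0])   # an "exploding" gradient (norm 50)
grad = clip_gradient(grad)       # its norm is now 5
```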
Vanishing Gradient: This issue occurs when the values of the gradient are too small, and the model stops learning or takes far too long as a result. This can be solved using the following methods:
1. Weight Initialization
2. Choosing the right Activation Function
3. LSTM (Long Short-Term Memory)
The best way to solve the vanishing gradient issue is to use an LSTM; the tiny demo below shows why gradients vanish in the first place.
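Back-propagation through time multiplies many per-step derivatives together; with sigmoid activations, each derivative is at most 0.25, so the product shrinks fast:

```python
# After 20 time steps the best-case gradient has shrunk to ~9.1e-13,
# so the earliest steps receive almost no learning signal.
grad = 1.0
for _ in range(20):      # 20 time steps
    grad *= 0.25         # the maximum possible sigmoid derivative
print(grad)              # ~9.1e-13
```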
2. What is LSTM (Long Short-Term Memory)?
As we discussed, RNNs are not able to memorize data for a long time and begin to forget their previous inputs. LSTMs are used to overcome this problem of vanishing and exploding gradients; they are the solution to the short-term memory problem. Also, when new information arrives, an RNN completely modifies the existing information, because it cannot distinguish between important and less important information. In an LSTM, by contrast, there is only a small modification of the existing information when new information is added, because the LSTM contains gates that determine the flow of information.
The gates decide which data is important and may be useful in the future, and which data has to be erased. The three gates are the input gate, the output gate and the forget gate (a code sketch of all three follows this list).
- Forget Gate: This gate decides which information is important and should be stored, and which information can be forgotten. It removes unimportant information from the neuron cell, which improves performance. The gate takes two inputs: the output generated by the previous cell and the input of the current cell. These are multiplied by the required weights, a bias is added, and the sigmoid function is applied to the result, producing a value between 0 and 1. Based on this value we decide which information to keep: if the value is 0 the forget gate removes that information, and if it is 1 the information is important and has to be remembered.
- Input Gate: This gate is used to add information to the neuron cell. It is responsible for deciding what values should be added to the cell. A candidate array of information to be added is created using the tanh activation function, which generates values between -1 and 1. The sigmoid function then acts as a filter and regulates what information is actually added to the cell.
- Output Gate: This gate is responsible for selecting important information from the current cell and showing it as the output. It creates a vector of values using the tanh function, which range from -1 to 1. It uses the previous output and the current input, passed through a sigmoid function, as a regulator to decide which values should be shown as output.
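Putting the three gates together, one LSTM step can be sketched as follows; the weights act on the concatenation of the previous output and the current input, and all sizes are illustrative:

```python
# One LSTM step: forget, input and output gates controlling the cell state.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_size, hidden_size = 3, 4
z = hidden_size + input_size
W_f, W_i, W_c, W_o = (np.random.randn(hidden_size, z) * 0.1 for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(hidden_size)

x = np.random.randn(input_size)
h_prev, c_prev = np.zeros(hidden_size), np.zeros(hidden_size)
hx = np.concatenate([h_prev, x])     # previous output + current input

f = sigmoid(W_f @ hx + b_f)          # forget gate: 0 = erase, 1 = keep
i = sigmoid(W_i @ hx + b_i)          # input gate: filter what to add
c_tilde = np.tanh(W_c @ hx + b_c)    # candidate values in (-1, 1)
c = f * c_prev + i * c_tilde         # new cell state
o = sigmoid(W_o @ hx + b_o)          # output gate
h = o * np.tanh(c)                   # new hidden state / output
```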
Squashing / Activation Functions in LSTM
1. Logistic (sigmoid): Outputs range from 0 to 1.
2. Hyperbolic Tangent (tanh): Outputs range from -1 to 1.
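A quick check of both ranges:

```python
# Sigmoid squashes to (0, 1); tanh squashes to (-1, 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid(z))    # [~0.0, 0.5, ~1.0]
print(np.tanh(z))    # [~-1.0, 0.0, ~1.0]
```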
Tips for Improving Model Performance:
- We can improve model performance by adding more layers; it is generally better to make the network deeper (more layers) than wider with fewer layers.
- But we must take care not to overfit the data, and for that we can try various regularization methods.
- Batch normalization: Batch normalization (batchnorm) is a technique to improve the performance and accuracy of a neural network. The normalization happens per batch, which is why it is called batch normalization. We normalize (mean = 0, standard deviation = 1) the output of a layer before applying the activation function, and then feed it into the next layer. So, instead of just normalizing the inputs to the network, we normalize the inputs to each hidden layer within the network, as in the sketch below.
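A minimal by-hand sketch of the idea for a single layer's pre-activations; the sizes are illustrative, and real batchnorm layers also learn a scale and shift, which is omitted here:

```python
# Normalize each feature over the batch to mean 0 and std 1,
# then apply the activation function.
import numpy as np

batch = np.random.randn(32, 16) * 3 + 5       # 32 samples, 16 hidden units
mean = batch.mean(axis=0)
std = batch.std(axis=0)
normalized = (batch - mean) / (std + 1e-8)    # per-feature normalization
activated = np.tanh(normalized)               # then apply the activation
```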
Thanks for reading! I am going to be writing more deep learning articles in the future, so follow me to be informed about them. I am also a freelancer; if you have freelancing work on data-related projects, feel free to reach out over LinkedIn. Nothing beats working on real projects!