Recurrent Neural Networks: An Intuitive Approach, Part 3

Niketh Narasimhan
16 min read · Jul 31, 2020

Please find the links to the earlier parts:

Part1:

Part2:

Contents:

  1. Recurrent Neural Networks
  2. RNN architecture
  3. Backpropagation through time
  4. LSTM
  5. GRU

Recurrent Neural Networks:

A recurrent neural network (RNN) is a type of deep learning neural network designed to process sequential data and recognize patterns in it. The term “recurrent” comes from the way the network repeats the same computation for every element of the sequence.

The primary intention behind an RNN is to produce an output that depends not only on the current input but also on the context supplied by the inputs that came before it.

The core concepts behind RNN are sequences and vectors. Let’s look at both:

  • A vector is a numerical representation of raw data in a form the machine can process. It is a kind of text-to-machine translation of data as discussed earlier.
  • A sequence can be described as a collection of data points with some defined order (usually time-based, although other ordering criteria are possible). An example of a sequence is time series stock market data: a single point shows the current price, while the sequence over a certain period shows how the price changes.

Salient points of an RNN

1. Unlike other types of neural networks, which process each element independently of the others, recurrent neural networks keep track of the relations between different segments of data, or in more general terms, context.

2. Since understanding context is critical to interpreting information of any kind, recurrent neural networks are well suited to recognizing and generating data based on patterns that depend on a specific context.

3. In essence, an RNN is a network with loops that carry context forward, so that every element of the sequence is processed with the output building upon the previous computations.

4. The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other. But for many tasks that’s a very bad idea. If you want to predict the next word in a sentence, you had better know which words came before it.

5. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps.

Basic RNN structure

RNN structure:

In its most basic form, an RNN can be thought of as comprising three components:

  1. The input layer represents the information to be processed.
  2. The hidden layer represents the algorithms at work.
  3. The output layer shows the result of the operation.

The hidden layer contains a temporal loop that enables the algorithm not only to produce an output but also to feed it back to itself.

This means the neurons have a feature that can be compared to short-term memory. The recurrence lets them “remember” the state (i.e., context) from the previous time step and pass that information to themselves in the “future” to further analyze data.

A recurrent neural network and the unfolding in time of the computation involved in its forward computation.

In the above diagram an RNN is being unfolded. By unfolding we mean writing out the network for the complete sequence. For example, if our word contains 3 letters, the network would be unfolded into 3 layers, one for each letter. A better way of visualizing this is shown below.

Now let us try to delve into what the relationship between successive inputs can be.

Suppose there is a deeper network with one input layer, three hidden layers and one output layer. Then, like other neural networks, each hidden layer will have its own set of weights and biases: say (w1, b1) for the first hidden layer, (w2, b2) for the second and (w3, b3) for the third. This means that each of these layers is independent of the others, i.e. they do not memorize the previous outputs.

Now the RNN will do the following:

  • The RNN converts the independent activations into dependent activations by providing the same weights and biases to all the layers. This keeps the number of parameters from growing with depth, and the network memorizes each previous output by feeding it as input to the next hidden layer.
  • Hence these three layers can be joined together into a single recurrent layer in which the weights and biases of all the hidden layers are the same.
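The effect of sharing weights can be sketched by counting parameters. This is only an illustration: the layer width of 4 is an arbitrary assumption, and the arrays are filled with zeros because only their sizes matter here.

```python
import numpy as np

hidden = 4  # hypothetical layer width, chosen only for illustration

# Feedforward view: three hidden layers, each with its own (w, b) pair.
separate = [(np.zeros((hidden, hidden)), np.zeros(hidden)) for _ in range(3)]
n_separate = sum(w.size + b.size for w, b in separate)

# Recurrent view: a single (w, b) pair reused at every step.
w_shared, b_shared = np.zeros((hidden, hidden)), np.zeros(hidden)
n_shared = w_shared.size + b_shared.size

print(n_separate, n_shared)  # 60 20
```

The recurrent layer keeps the same parameter count no matter how many steps it is unrolled for.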

The steps can be summarized as

Steps involved In RNN

Let us take an example to understand the above steps. For simplicity, let us choose a word instead of a sentence, say ‘State’. We are going to feed the RNN the 4 letters s, t, a, t in order and ask it to predict the last letter, ‘e’. Since ‘s’ is the first letter and has no previous state, we start the walkthrough with ‘t’.

Note: If we choose a sentence, the inputs will be word embeddings of the words; for letters we can use one-hot encoding.

So at the time the letter “t” is supplied to the network, a recurrence formula is applied to the letter “t” and the previous state, which came from the letter “s”. These are known as the time steps of the input. So if at time t the input is “t”, at time t-1 the input was “s”. The recurrence formula is applied to both, and we get a new state.

As can be seen above, each current state is a function of the current input and the previous state, and each successive input is called a time step. The function used in its most basic form is generally tanh. If the weight at the recurrent neuron is Whh and the weight at the input neuron is Wxh, we can write the equation for the state at time t as:

h_t = tanh(Whh * h_t-1 + Wxh * x_t)

where h_t-1 is the previous state and x_t is the current input.

The recurrent neuron in this case is just taking the immediately previous state into consideration. For longer sequences the equation can involve multiple such states. Once the final state is calculated we can go on to produce the output.
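As a minimal sketch of the one-hot encoding mentioned in the note, the letters of ‘state’ can be turned into vectors as follows. The alphabetical ordering of the vocabulary and the helper name `one_hot` are assumptions made for this illustration.

```python
import numpy as np

word = "state"
vocab = sorted(set(word))                      # ['a', 'e', 's', 't']
index = {ch: i for i, ch in enumerate(vocab)}  # letter -> position in the vector

def one_hot(ch):
    """Return a vector with 1 at the letter's position and 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[index[ch]] = 1.0
    return vec

inputs = [one_hot(ch) for ch in word[:-1]]     # the four inputs s, t, a, t
target = one_hot(word[-1])                     # the letter to predict, e
```

Each letter becomes a vector of length 4 because the word contains four distinct letters.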

Now, once the current state is calculated we can calculate the output state as

Note: The final output is typically calculated using a softmax classifier, although other functions can also be used.
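The forward pass described above can be sketched in NumPy. This is a sketch under stated assumptions rather than a definitive implementation: the weights are random, the sizes are arbitrary, the initial state is taken to be all zeros, and `W_hy` is an assumed name for the output weights that the text does not define.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 4, 3   # illustrative sizes

# Randomly initialised weights; W_hy is an assumed name for the output weights.
W_xh = rng.normal(scale=0.5, size=(hidden_size, vocab_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.5, size=(vocab_size, hidden_size))

def softmax(z):
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

def rnn_forward(xs):
    """Run the recurrence over all inputs and return (probabilities, final state)."""
    h = np.zeros(hidden_size)    # assumed zero state before the first letter
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x)   # the same weights at every time step
    return softmax(W_hy @ h), h

# One-hot rows for s, t, a, t with the vocabulary ordered as a, e, s, t.
xs = np.eye(vocab_size)[[2, 3, 0, 3]]
probs, h_final = rnn_forward(xs)
```

With untrained weights the probabilities are meaningless; training would adjust the three matrices so that the entry for ‘e’ becomes the largest.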

Overall, the operation of an RNN can be one of three types:

  1. One to many: a single input produces multiple outputs, as in image captioning, where an image is described with words.
  2. Many to one: several inputs produce one output, as in sentiment analysis, where the text is interpreted as positive or negative.
  3. Many to many: as in machine translation, where the words of the text are translated according to the context they represent as a whole.

The key concepts behind RNNs are:

  • Backpropagation through time: the training procedure, which links one time step to the next.
  • Vanishing/exploding gradients: a training problem that has to be controlled to preserve the accuracy of the results.
  • Long short-term memory units: a gated architecture that helps recognize long-range patterns in the data.

Let us try to understand each of these.

Backpropagation in a recurrent neural network:

Backpropagation in a recurrent neural network is also known as backpropagation through time (BPTT).

In the case of an RNN, if yt is the predicted value and ȳt is the actual value at time t, the error is calculated as a cross-entropy loss:

E_t(ȳ_t, y_t) = -ȳ_t log(y_t)

E(ȳ, y) = -∑_t ȳ_t log(y_t)

We typically treat the full sequence (sentence) as one training example, so the total error is just the sum of the errors at each time step (word).

Just like we sum up the errors, we also sum up the gradients at each time step for one training example:

∂E/∂W = ∑_t ∂E_t/∂W

Since a recurrent neural network can involve hundreds of time steps T, the network can take a long time to converge.

BPTT is just like backpropagation in an ordinary artificial neural network, except that the gradients and errors are summed over the time steps t.
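The summed cross-entropy loss can be sketched as follows. The function name `sequence_cross_entropy` and the toy numbers are assumptions for illustration; `y_true` plays the role of ȳ (the actual one-hot values) and `y_pred` the role of y (the predicted probabilities).

```python
import numpy as np

def sequence_cross_entropy(y_true, y_pred):
    """Return (total loss, per-step losses); each row is one time step."""
    per_step = -np.sum(y_true * np.log(y_pred), axis=1)   # E_t for every t
    return per_step.sum(), per_step                       # E is the sum over t

# Two time steps, three possible classes.
y_true = np.array([[0, 1, 0], [1, 0, 0]], dtype=float)
y_pred = np.array([[0.2, 0.7, 0.1], [0.5, 0.3, 0.2]])
total, per_step = sequence_cross_entropy(y_true, y_pred)
```

Because the targets are one-hot, each per-step loss reduces to the negative log of the probability assigned to the correct class.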

Vanishing and exploding gradient problem:

Just as a normal neural network suffers from gradients becoming close to zero when sigmoid or tanh functions are used, so does a recurrent neural network.

Derivative of tanh and sigmoid approach 0
  1. As we can see, the derivatives of the tanh and sigmoid functions approach 0 at both ends, where the curves flatten out. When this happens we say the corresponding neurons are saturated.
  2. Saturated neurons have a near-zero gradient and drive the gradients in previous layers towards 0. With small values in the matrix and repeated matrix multiplications, the gradient values shrink exponentially fast and effectively vanish after a few time steps.
  3. Gradient contributions from “far away” steps become zero, so the state at those steps doesn’t contribute to what you are learning. Since backpropagation is applied sequentially through all the steps to calculate the error, long-term dependencies don’t get registered.
  4. Consider a long-range dependency: in a long sentence of more than 10 words, you may want to capture the dependency between the first and the last word. While propagating the error backwards, however, the gradient may get close to 0 by around the 5th step, so the contributions of all the earlier time steps are effectively 0. Let us illustrate further.

Vanishing gradients aren’t exclusive to RNNs. They also happen in deep Feedforward Neural Networks. It’s just that RNNs tend to be very deep (as deep as the sentence length in our case), which makes the problem a lot more common.
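The shrinking can be seen numerically in a deliberately simplified setting. The sketch below assumes a scalar RNN, h_t = tanh(w * h_t-1 + x_t), with a hypothetical weight of 0.9 and a constant input of 1.0, and tracks the derivative of the current state with respect to the initial state.

```python
import numpy as np

w = 0.9                 # hypothetical recurrent weight
h, grad = 0.0, 1.0      # grad tracks d h_t / d h_0 via the chain rule
history = []
for t in range(20):
    h = np.tanh(w * h + 1.0)        # constant input of 1.0 at every step
    grad *= w * (1.0 - h ** 2)      # derivative of tanh(z) is 1 - tanh(z)^2
    history.append(grad)

print(history[0], history[-1])
```

Every step multiplies the gradient by a factor smaller than 1, so after 20 steps the influence of the initial state is negligible.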

The exploding gradient problem occurs when the gradients grow too large, depending on the activation function and network parameters. The resulting weight updates are too large, so training becomes unstable and the network takes a long time to converge.

Exploding gradients can, however, be handled with gradient clipping, that is, capping the gradients at a predefined value.

Vanishing gradients can be mitigated by using an appropriate activation function such as ReLU and by choosing a proper initialization for the weight matrix.

Vanishing gradients can also be mitigated by using LSTMs.
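Gradient clipping can be sketched in two common variants. The function name `clip_by_norm` and the thresholds are assumptions for illustration; clipping by value caps each element separately, while clipping by norm rescales the whole vector and keeps its direction.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])                   # L2 norm is 50
by_norm = clip_by_norm(g, max_norm=5.0)      # approximately [3, 4], same direction
by_value = np.clip(g, -5.0, 5.0)             # [5, 5], each element capped
```

A gradient whose norm is already below the threshold passes through unchanged.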

LSTM (Long Short-Term Memory) networks and GRUs (Gated Recurrent Units)

Long Short-Term Memory networks are usually just known as LSTMs.

As we have seen in the previous section, RNNs are prone to short-term memory: long-term dependencies are not captured due to the vanishing gradient problem. To overcome this problem we can use LSTM networks and GRUs.

Note: Gated RNN architectures like the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) are used to address the vanishing gradient problem associated with vanilla RNNs (the basic RNN structure).

LSTMs and GRUs were created as a method to mitigate short-term memory using mechanisms called gates. Gates are just small neural networks that regulate the flow of information through the sequence chain.

Before we proceed any further, let us try to understand how we can make our neural networks into logic gates:

Logic gates

The most common logic gates are AND, OR and NOT. The AND gate returns 1 only if both inputs are 1, and 0 otherwise. The OR gate returns 1 if at least one input is 1, and returns 0 only if both inputs are 0. Lastly, the NOT gate returns the inverse of its input: if the input is 0 it returns 1, and if the input is 1 it returns 0.

In order for the neural network to become a logical network, we need to show that an individual neuron can act as an individual logic gate. To show that a neural network can carry out any logical operation, it would be enough to show that a neuron can function as a NAND gate (which it can). However, to make things more understandable, let’s dive in and show how a neuron can act as any of the gates we will need, namely the AND and OR gates as well as a comparison gate for x>0.

Note: Remember that a sigmoid function outputs values between 0 and 1

Comparison x1>0

This is the innermost part of our function. We can calculate this expression quite easily using a single neuron and a sigmoid activation function. In this case there will only be one variable input to the neuron, x1, and we would like the neuron to output a value close to 1 if x1 is positive and a value close to 0 if x1 is negative. Although it isn’t important for this problem, let’s say that we want an output of 0.5 if x1 is exactly zero.
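A minimal sketch of such a comparison neuron follows. The weight of 10 and the zero bias are assumptions: any sufficiently large positive weight makes the sigmoid steep around zero, and a zero bias gives exactly 0.5 at x1 = 0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w1 = 10.0   # assumed large weight so the sigmoid is steep around zero
b1 = 0.0    # zero bias gives exactly 0.5 when x1 is 0

def greater_than_zero(x1):
    """Close to 1 for positive x1, close to 0 for negative x1."""
    return sigmoid(w1 * x1 + b1)

print(greater_than_zero(2.0), greater_than_zero(-2.0), greater_than_zero(0.0))
```

Inputs very close to zero still give intermediate values; a larger weight narrows that uncertain band.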

AND Gate

Now that we can do greater-than comparisons, the next most inner function of the target logical expression is the AND operator. In this case the input to the sigmoid function will be z=w3*a1 + w4*a2 + b3. Here w3 and w4 are weights and a1 and a2 are the activations of the first layer of neurons. Variable a1 is very close to one if x1>0 and very close to zero if x1<0; the value of a2 behaves similarly for x2.

OR Gate

The next most inner part of our logical expression is the OR gate. For an OR gate we want the output to be close to 1 when one or more of the inputs is ~1 and zero otherwise. The input to the Sigmoid function is z=w7*a3 + w8*a4 + b5. Here a3 represents whether both x1 and x2 were positive, and a4 represents whether both x1 and x2 were negative.
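The two gates can be sketched as single sigmoid neurons. The weight and bias values (20 with a bias of -30 for AND, 20 with a bias of -10 for OR) are assumed choices that happen to work, not values given in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def and_gate(a1, a2, w=20.0, b=-30.0):
    """Only when both inputs are ~1 does the weighted sum exceed the bias."""
    return sigmoid(w * a1 + w * a2 + b)

def or_gate(a3, a4, w=20.0, b=-10.0):
    """A single input of ~1 is already enough to exceed the bias."""
    return sigmoid(w * a3 + w * a4 + b)

# Print the rounded truth table: inputs, AND output, OR output.
for p in (0, 1):
    for q in (0, 1):
        print(p, q, round(float(and_gate(p, q))), round(float(or_gate(p, q))))
```

The only difference between the two neurons is the bias, which sets how many inputs must be active before the output switches on.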

Putting it all together

We now have all of the pieces we need to create our neural network emulator of the logical expression above. Putting everything together, the full architecture of the network looks like this:

Note: The values of the weights and biases can be chosen appropriately.
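As a sketch of the assembled network, the code below evaluates “both inputs positive OR both inputs negative”, which is the expression implied by the descriptions of a3 and a4. All weights and biases are assumed values, and the “both negative” neuron is built as an AND of the negated first-layer activations, which is one possible construction among several.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def network(x1, x2):
    # Layer 1: comparison neurons, close to 1 when the input is positive.
    a1 = sigmoid(10.0 * x1)
    a2 = sigmoid(10.0 * x2)
    # Layer 2: a3 is close to 1 when both inputs are positive (AND);
    # a4 is close to 1 when both are negative (AND of the negations).
    a3 = sigmoid(20.0 * a1 + 20.0 * a2 - 30.0)
    a4 = sigmoid(-20.0 * a1 - 20.0 * a2 + 10.0)
    # Layer 3: OR of a3 and a4.
    return sigmoid(20.0 * a3 + 20.0 * a4 - 10.0)

for x1, x2 in [(1, 1), (-1, -1), (1, -1), (-1, 1)]:
    print(x1, x2, round(float(network(x1, x2))))
```

The output is close to 1 when the two inputs share a sign and close to 0 otherwise, provided the inputs are not too close to zero.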

Coming back to the gated inputs

When we arrange our calendar for the day, we prioritize our appointments, right? If we need to make some space for anything important, we know which meeting could be canceled to accommodate it.

It turns out that an RNN doesn’t do so. In order to add new information, it transforms the existing information completely by applying a function. Because of this, the entire information is modified as a whole, i.e. there is no distinction between ‘important’ information and ‘not so important’ information.

LSTMs, on the other hand, make small modifications to the information through multiplications and additions. With LSTMs, the information flows through a mechanism known as the cell state. This way, LSTMs can selectively remember or forget things. The information at a particular cell state has three different dependencies.

We’ll visualize this with an example. Let’s take the example of predicting stock prices for a particular stock. The stock price of today will depend upon:

  1. The trend that the stock has been following in the previous days, maybe a downtrend or an uptrend.
  2. The price of the stock on the previous day, because many traders compare the stock’s previous day price before buying it.
  3. The factors that can affect the price of the stock for today. This can be a new company policy that is being criticized widely, or a drop in the company’s profit, or maybe an unexpected change in the senior leadership of the company.

These dependencies can be generalized to any problem as:

  1. The previous cell state (i.e. the information that was present in the memory after the previous time step)
  2. The previous hidden state (i.e. this is the same as the output of the previous cell)
  3. The input at the current time step (i.e. the new information that is being fed in at that moment)

LSTM architecture:

A typical LSTM network is composed of memory blocks called cells (the rectangles that we see in the image). Two states are transferred to the next cell: the cell state and the hidden state. The memory blocks are responsible for remembering things, and manipulations of this memory are done through three major mechanisms, called gates. Each of them is discussed below.

Forget Gate

Let us take an example of a text prediction problem.

This gate takes in two inputs; h_t-1 and x_t.

h_t-1 is the hidden state from the previous cell (that is, the output of the previous cell) and x_t is the input at that particular time step. The given inputs are multiplied by the weight matrices and a bias is added. Following this, the sigmoid function, which outputs a value between 0 and 1, is applied to this value.

Note: Basically, the sigmoid function decides which values to keep. If the output is close to 0, the forget gate forgets that piece of information; if it is close to 1, the forget gate keeps that piece of information.
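The forget gate can be sketched as follows. The weight names `W_f` and `b_f`, the sizes and the random values are all assumptions for illustration; the sketch also assumes h_t-1 and x_t are concatenated into one vector before the multiplication.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden, n_in = 3, 2                              # illustrative sizes

W_f = rng.normal(size=(hidden, hidden + n_in))   # assumed name: forget-gate weights
b_f = np.zeros(hidden)                           # assumed name: forget-gate bias

h_prev = rng.normal(size=hidden)                 # h_t-1
x_t = rng.normal(size=n_in)                      # x_t
c_prev = rng.normal(size=hidden)                 # previous cell state

f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
c_kept = f_t * c_prev    # element-wise: near 0 forgets, near 1 keeps
```

Since every entry of `f_t` lies between 0 and 1, the gate can only shrink the entries of the cell state, never enlarge them.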

Forget gate

Input Gate:

Let us take another text prediction problem.

The input gate is responsible for the addition of information to the cell state. This addition of information is a three-step process, as seen from the diagram above.

  1. Regulating what values need to be added to the cell state by involving a sigmoid function. This is basically very similar to the forget gate and acts as a filter for all the information from h_t-1 and x_t.
  2. Creating a vector containing all possible values that can be added (as perceived from h_t-1 and x_t) to the cell state. This is done using the tanh function, which outputs values from -1 to +1.
  3. Multiplying the created vector (the output of the tanh function) by the regulatory filter (the sigmoid gate) and then adding this useful information to the cell state via an addition operation.

Once this three-step process is complete, we have ensured that only information that is important and not redundant is added to the cell state.
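The three steps can be sketched as follows. The names `W_i` and `W_c`, the omission of biases, the sizes and the random values are assumptions made to keep the illustration short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
hidden, n_in = 3, 2
# Stand-in for the concatenation of h_t-1 and x_t.
concat = np.concatenate([rng.normal(size=hidden), rng.normal(size=n_in)])

W_i = rng.normal(size=(hidden, hidden + n_in))   # filter weights (assumed name)
W_c = rng.normal(size=(hidden, hidden + n_in))   # candidate weights (assumed name)

i_t = sigmoid(W_i @ concat)        # step 1: filter with values in (0, 1)
c_tilde = np.tanh(W_c @ concat)    # step 2: candidate values in (-1, 1)

c_after_forget = rng.normal(size=hidden)   # stand-in for the cell state after the forget gate
c_t = c_after_forget + i_t * c_tilde       # step 3: scale the candidates, then add
```

Each entry of the cell state changes by less than 1 in magnitude, because it is the product of a value in (0, 1) and a value in (-1, 1).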

It can be summed up as below:

Output Gate:

Let us take an extension of the same problem

The functioning of an output gate can again be broken down to three steps:

  1. Creating a vector after applying tanh function to the cell state, thereby scaling the values to the range -1 to +1.
  2. Making a filter using the values of h_t-1 and x_t, such that it can regulate the values that need to be output from the vector created above. This filter again employs a sigmoid function.
  3. Multiplying the vector created in step 1 by this regulatory filter, and sending the result out as an output and also to the hidden state of the next cell.

The filter in the above example will make sure that it diminishes all other values but ‘Bob’. Thus the filter needs to be built on the input and hidden state values and be applied on the cell state vector.
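Putting the forget, input and output gates together, one full LSTM step can be sketched as below. The function name `lstm_step`, the dictionary layout of the weights and the random values are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W and b are dicts keyed by 'f', 'i', 'c', 'o' (assumed layout)."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate filter
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate values
    c_t = f_t * c_prev + i_t * c_tilde        # updated cell state
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate filter
    h_t = o_t * np.tanh(c_t)                  # squashed cell state, then filtered
    return h_t, c_t

rng = np.random.default_rng(2)
hidden, n_in = 3, 2
W = {k: rng.normal(size=(hidden, hidden + n_in)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}
h_t, c_t = lstm_step(rng.normal(size=n_in), np.zeros(hidden), np.zeros(hidden), W, b)
```

The returned `h_t` and `c_t` are the two states that would be passed on to the next cell.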

GRU:

The GRU is like a long short-term memory with a forget gate but has fewer parameters than LSTM, as it lacks an output gate.

Update gate: It determines how much of the past knowledge needs to be passed along into the future. Its role is comparable to the combined forget and input gates of an LSTM recurrent unit.

Reset gate: It determines how much of the past knowledge to forget when the new candidate state is computed.

Intuition: The reset gate works in the opposite direction to the update gate. Here, the sigmoid function maps the value into the range between 0 and 1; values closer to 0 will not be used further, while values closer to 1 will be passed forward.

Current memory content: It is often overlooked during a typical discussion of the Gated Recurrent Unit network. It is incorporated into the reset gate, just as the input modulation gate is a sub-part of the input gate, and is used to introduce some non-linearity into the input and to make the input zero-mean. Another reason to make it a sub-part of the reset gate is to reduce the effect that previous information has on the current information that is being passed into the future.

Note: In mathematics, the Hadamard product is a binary operation that takes two matrices of the same dimensions and produces another matrix of the same dimension as the operands where each element i, j is the product of elements i, j of the original two matrices.
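A single GRU step can be sketched as follows, with `*` on NumPy arrays acting as the Hadamard product. The weight names, the omission of biases and the blending convention in the last line are assumptions; some references swap the roles of z_t and 1 - z_t.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step (biases omitted for brevity)."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)                        # update gate
    r_t = sigmoid(W_r @ hx)                        # reset gate
    # Current memory content: the reset gate scales h_prev element-wise.
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    return (1.0 - z_t) * h_prev + z_t * h_tilde    # blend old state and new content

rng = np.random.default_rng(3)
hidden, n_in = 3, 2
W_z, W_r, W_h = (rng.normal(size=(hidden, hidden + n_in)) for _ in range(3))
h_t = gru_step(rng.normal(size=n_in), np.zeros(hidden), W_z, W_r, W_h)
```

Unlike the LSTM, there is no separate cell state: the hidden state is the only thing carried to the next step.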

Advantages of Recurrent Neural Network

  1. An RNN carries information forward through time. This ability to remember previous inputs is what makes it useful in time series prediction, and gated variants such as the Long Short-Term Memory extend how far back that memory reaches.
  2. Recurrent neural networks are even used with convolutional layers to extend the effective pixel neighborhood.

Disadvantages of Recurrent Neural Network

  1. Gradient vanishing and exploding problems.
  2. Training an RNN is a very difficult task.
  3. It cannot process very long sequences when tanh or ReLU is used as the activation function.
