( And why it got dethroned )

Recurrent Neural Networks

Understanding the former Emperor of Natural Language Processing.

Pickleprat
14 min read · Mar 20, 2024

If you have ever done an introductory course on deep learning, or even just watched Veritasium videos to feel intellectually superior to normies about the modern-day buzzword, then you know there are three basic architectures that let you deal with pretty much any type of data you're handed.

  • Artificial Neural Networks
Typically used to map individual one-dimensional input vectors to one-dimensional output vectors. For example, given today's weather parameters, predict whether or not it will rain, or predict the company's sales over the next few months. (Credit to EDUCBA)
  • Convolutional Neural Networks
These specialize in grid-based data like images, where instead of a bunch of numbers in a 1-dimensional vector you're dealing with multidimensional grids, and the position of a value in the grid matters for generating the output. For example, to identify the dog in an image, the individual pixels are not great predictors of whether the image is a dog; it's the presence of their neighbors, taken collectively, that helps a lot. (Credit to skyengine.ai)

At first glance it seems that these two architectures would be enough to solve every problem out there.

You want Facial Recognition? CNN’s got YOU! You want to predict emotion based on people’s facial expressions? CNN’s here for you brother!

You can even combine the two architectures if you want to. At the end of the day, facial recognition software converts 2-D matrices into 1-D probabilities to identify which face we're looking at, so it's all good!

Just arrange the two architectures in different combinations and you've got yourself a solution to pretty much everything! You're on top of the world now, set out to conquer it with your skill of creating custom architectures for custom business problems, and nothing can ever stop you from doing that…

But then you actually step into the real world where you’re faced with the following problem…

An image of lightning strikes across the United States on a specific date.

Consider the map above. You are provided with years' worth of lightning maps across the United States, each for a specific date and time. What if you are tasked with predicting these lightning strikes across the United States based on these satellite images?

You'll think, "Oh, well, it's an image, so we'll obviously use a CNN to process it."

Uh-huh, okay, and then what is the output going to be?

"Well, I guess since we want to know the status of each pixel, we'll just keep the total pixel count as the number of output neurons. Each neuron will represent a pixel in the output image. Each neuron will use binary cross-entropy instead of categorical cross-entropy, which means it will emit a probability between 0 and 1.

This way, we get an output vector with the dimensions of the total number of pixels, and each pixel gets some probability of receiving a lightning strike!"
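(Just to make that answer concrete, here's a minimal sketch of the per-pixel idea in PyTorch. The 32x32 map size, the hidden layer width, and the little model itself are all made up for illustration, not anything prescribed by the problem.)

import torch
import torch.nn as nn

# Hypothetical per-pixel classifier for a 32x32 lightning map:
# one output neuron per pixel, each trained with binary cross-entropy.
num_pixels = 32 * 32

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(num_pixels, 256), nn.ReLU(),
    nn.Linear(256, num_pixels),                   # one logit per pixel
)

x = torch.randn(8, 1, 32, 32)                     # a batch of 8 input maps
y = torch.randint(0, 2, (8, num_pixels)).float()  # target strikes (0/1 per pixel)

logits = model(x)
probs = torch.sigmoid(logits)                     # probability of lightning per pixel
loss = nn.BCEWithLogitsLoss()(logits, y)
print(probs.shape, loss.item())                   # torch.Size([8, 1024]) ...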

Okay woah-genius, but what if we want to predict the output for a specific date and time? What are you going to take as input?

"Well, I guess we'll take the last image we have as input and try to predict the next image based on that."

Alright, but what about the fact that the output neurons have no dependency on EACH OTHER? Based on the sample image above, it's not hard to tell that neighboring locations influence each other when a lightning strike happens. Not to mention that a lightning strike falling today could influence a lightning strike tomorrow, the day after, and so on, anytime in the near future. The output isn't dependent on just yesterday's input but on a SEQUENCE of past inputs.

“Ummm….”

Um is right. You didn't study the third crucial architecture. You got too caught up in the coolness of ANNs and CNNs and forgot about…

  • Recurrent Neural Networks

These architectures are special because they can deal with sequences. They carry a hidden state forward: the state produced from the previous input is combined with the current input to produce the current output.

// Combines the current input with the previous hidden state
h(t) = tanh( W_xh * X(t) + W_hh * h(t - 1) )

// Converts the hidden state into the current output by sending
// it through a feed-forward layer

Y(t) = softmax( W_hy * h(t) + b_y )

Where X(t) is the t-th input in the sequence, h(t - 1) is the hidden state produced when X(t - 1) was the input, and Y(t - 1) was the output at that step. W_hh is the recurrent (self-connecting) weight matrix, W_xh is the weight matrix that transforms the current input, and W_hy maps the hidden state to the output.

So to answer the lightning question (and there are probably better ways to solve it), what we can do is take the images as a 5-10 day input sequence, flatten each image by passing it through a convolutional layer, and then pass the flattened features through a recurrent layer. The hidden state of the RNN is carried forward and reused as the next image comes through the CNN layer.
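Here's a rough sketch of what that could look like in PyTorch. To be clear, this is just one illustrative way to wire a CNN encoder into an RNN; the class name, layer sizes, 32x32 map size and 7-day window are all assumptions made for the example.

import torch
import torch.nn as nn

class LightningPredictor(nn.Module):          # hypothetical model, for illustration only
    def __init__(self, img_size=32, hidden_size=128):
        super().__init__()
        # CNN encoder: turns each daily map into a flat feature vector
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
        )
        feat_dim = 8 * (img_size // 2) * (img_size // 2)
        # RNN: carries information forward across the sequence of days
        self.rnn = nn.RNN(feat_dim, hidden_size, batch_first=True)
        # Head: one probability per pixel of the next map
        self.head = nn.Linear(hidden_size, img_size * img_size)

    def forward(self, x):                      # x: (batch, days, 1, H, W)
        b, t = x.shape[:2]
        feats = self.encoder(x.flatten(0, 1))  # (batch * days, feat_dim)
        feats = feats.view(b, t, -1)           # (batch, days, feat_dim)
        out, _ = self.rnn(feats)               # (batch, days, hidden_size)
        return torch.sigmoid(self.head(out[:, -1]))  # (batch, H * W)

model = LightningPredictor()
maps = torch.randn(2, 7, 1, 32, 32)            # 2 samples, 7 days of 32x32 maps each
probs = model(maps)                            # per-pixel lightning probabilities
print(probs.shape)                             # torch.Size([2, 1024])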

Woah! That’s just the high level overview. Let’s dive in slightly deeper into it.

Essentially, the flow is something we can write out in Python.

import torch
import torch.nn as nn

torch.random.seed()
'''
Consider a sequence tensor X consisting of 5 time steps, each a
4-dimensional vector. The leading 1 is the batch dimension.
'''
X = torch.randn(size=torch.Size((1, 5, 4)))

# We initialize the hidden state with the special zero state.
h = [torch.zeros(torch.Size((1, 1, 3)))]

# y is going to collect the output at each step.
y = []

# Weight matrices for the input-to-hidden, hidden-to-hidden and
# hidden-to-output transformations.
W_xh = torch.randn(size=torch.Size((1, 4, 3)))
W_hh = torch.randn(size=torch.Size((1, 3, 3)))
W_hy = torch.randn(size=torch.Size((1, 3, 2)))

# Implementing the computations.
for t, x in enumerate(X[0]):
    x = torch.unsqueeze(torch.unsqueeze(x, 0), 0)

    # essentially h(t) = tanh(x * W_xh + h(t - 1) * W_hh)
    h.append(torch.tanh(torch.bmm(x, W_xh) + torch.bmm(h[t], W_hh)))

    # y(t) = softmax(h(t) * W_hy)
    y.append(nn.functional.softmax(torch.bmm(h[t + 1], W_hy), dim=2))

# Converting the lists into tensors.
h = torch.stack(h, dim=2)
y = torch.stack(y, dim=2)

Now let’s check out our hidden states.

Hidden state at t = 1: tensor([-0.9026,  0.9546,  0.9818])
Hidden state at t = 2: tensor([ 0.0371, 0.1854, -0.3838])
Hidden state at t = 3: tensor([ 0.7934, 0.7028, 0.6505])
Hidden state at t = 4: tensor([-0.8487, -0.5144, -0.6398])
Hidden state at t = 5: tensor([ 0.6910, 0.9936, -0.3610])

We also have the outputs, but they're not really that important for now. You can see how the flow works: this is how the recurrent layer typically behaves inside a recurrent neural network.

Here, we are initializing W_xh, W_hh, and W_hy randomly, along with our one-batch input tensor X. We're ignoring the biases, although that is also an option you can set when you use the PyTorch classes. We take the tanh of the hidden state, and get a softmax output by processing the hidden state at T = t.

Note that we received this output for the following input:

Input value at t = 1: tensor([[-0.6631, -2.2810, -0.4344, -0.9516]])
Input value at t = 2: tensor([[ 0.6260, 0.3256, 1.2832, -0.7670]])
Input value at t = 3: tensor([[ 1.1265, 0.5882, -0.6086, -0.5184]])
Input value at t = 4: tensor([[ 0.3995, -1.1024, 1.7330, -0.5669]])
Input value at t = 5: tensor([[ 0.8767, 0.8954, -0.6853, -1.8283]])
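By the way, the hand-rolled loop above is essentially what PyTorch's nn.RNN does internally for a single layer. Here's a small self-contained check of that claim, with its own freshly created weights and arbitrary sizes; the only subtlety is that nn.RNN stores its weight matrices transposed relative to how we wrote them.

import torch
import torch.nn as nn

torch.manual_seed(0)

X = torch.randn(1, 5, 4)                 # (batch, seq_len, input_size)
W_xh = torch.randn(4, 3)
W_hh = torch.randn(3, 3)

# Manual unrolling, same recurrence as above (no output layer, no biases).
h = torch.zeros(1, 3)
manual = []
for t in range(X.shape[1]):
    h = torch.tanh(X[:, t] @ W_xh + h @ W_hh)
    manual.append(h)
manual = torch.stack(manual, dim=1)      # (1, 5, 3): hidden state at every step

# The same computation with nn.RNN, after copying our weights into it.
rnn = nn.RNN(input_size=4, hidden_size=3, bias=False, batch_first=True)
with torch.no_grad():
    rnn.weight_ih_l0.copy_(W_xh.T)       # nn.RNN keeps weights as (hidden, input)
    rnn.weight_hh_l0.copy_(W_hh.T)

out, h_n = rnn(X)
print(torch.allclose(out, manual, atol=1e-6))   # True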

So what exactly are these hidden states h? Well, a hidden state carries the combined information of the inputs seen so far. The hidden state at t = 2 represents what the model has accumulated from the inputs at t = 1 and t = 2, starting from the initial state h(0).

The hidden state at t = 3 carries information from hidden states 2, 1, and 0, and so on.

But how far in the past can the RNN retain this information?

Well, according to this paper, a standard RNN can only retain information for about 5-10 time steps, as quoted here:

“Standard RNN cannot bridge more than 5–10 time steps ([22]). This is due to that back-propagated error signals tend to either grow or shrink with every time step. Over many time steps the error therefore typically blows-up or vanishes ([5, 42]). Blown-up error signals lead straight to oscillating weights, whereas with a vanishing error, learning takes an unacceptable amount of time, or does not work at all.”

In simpler words, the gradients that flow back through the time steps during BPTT (backpropagation through time) tend to either blow up or shrink down.

But what does that even mean? Well, let me make it as simple as possible by getting into a little bit of math. (Said no one ever, but there really is no better way to explain it without using math. Don't worry, I'll keep it pretty high level.)

So far the equations we created are as follows:

h(t) = tanh( W_xh * X(t) + W_hh * h(t - 1) )
O(t) = softmax( W_hy * h(t) + b_y )

We'll use '->' to show dependency.

Let t = 3
In that case:

h(3) -> X(3) and h(2)

but

h(2) -> X(2) and h(1)

and

h(1) -> X(1) and h(0)

The above is a dependency map. You can see that the hidden state at t = 3 depends on the input at t = 3 and the hidden state at t = 2.

But the hidden state at t = 2 depends on the input at t = 2 and the hidden state at t = 1. And so on.

So let's write it out a little more concretely:

h(3) = f(X(3), h(2)) = f( W_xh * X(3) + W_hh * h(2) )
h(2) = f(X(2), h(1)) = f( W_xh * X(2) + W_hh * h(1) )
h(1) = f(X(1), h(0)) = f( W_xh * X(1) + W_hh * h(0) )

# h(0) = constant
h(0) = h0

This implies that if you want the derivative of the output with respect to W_hh, the first term to take care of is W_hh * h(2). Its derivative would simply be h(2), except that h(2) is ALSO dependent on W_hh through W_hh * h(1), so we have to apply the product rule of differentiation and keep going down the chain.

So our equation looks something like this:

h(3) = f( W_xh * X(3) +
          W_hh * f( W_xh * X(2) +                 --- h(2)
                    W_hh * f( W_xh * X(1) +       --- h(1)
                              W_hh * h(0)
                            )                     --- h(1)
                  )                               --- h(2)
        )

and

O = W_hy * h(3)

Now, this is just for 3 time steps. As you increase the number of time steps, the nesting gets deeper and the derivative chain keeps expanding.

Is that an issue, you may ask? Well, the issue is that the function f is a tanh, so every value that comes out of it lies between -1 and 1, and its derivative is at most 1. Each extra time step multiplies the gradient by another W_hh and another tanh-derivative factor, so over long sequences the gradients either become REALLY REALLY small, or, if the weights are large (or you swap in a ReLU activation), they can become REALLY REALLY massive.
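To make that concrete, here is the standard backpropagation-through-time factor written out (a sketch of the usual argument in LaTeX notation, using the same h(t) = tanh(W_xh X(t) + W_hh h(t-1)) recurrence as above):

% How much the hidden state at step t feels a change at step k:
\frac{\partial h_t}{\partial h_k}
  = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}
  = \prod_{i=k+1}^{t} \operatorname{diag}\big(\tanh'(a_i)\big)\, W_{hh},
\qquad a_i = W_{xh} X_i + W_{hh} h_{i-1}

Since |tanh'| <= 1 everywhere, if the weights in W_hh are small this product shrinks exponentially as t - k grows; if they are large, or if the activation's derivative doesn't damp anything (as with ReLU), the product can grow exponentially instead.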

These are known as the vanishing and exploding gradient problems, respectively. And to counter this exact problem, Sepp Hochreiter and Jürgen Schmidhuber published their paper on a new kind of recurrent neural network, the LSTM, which is something we will talk about in the next post.

PyTorch Implementation of RNNs

Now that we're done with the theory, let's try to understand how these things are implemented in PyTorch, and how to use and thoroughly understand the API for maximum customization.

The RNN class extends the RNNBase class, which holds the shared machinery for the ménage à trois of RNN, LSTM and GRU.
All three architectures inherit functionality from RNNBase; however, two of RNNBase's children, LSTM and GRU, override its methods. The reason behind this was expressed, at the time of writing, as follows:

  1. The implementation of LSTM and GRU required Python's Union and Any types, which were not supported by TorchScript. TorchScript is a statically typed, scriptable subset of Python developed by the PyTorch team for more convenient deployment; being statically typed, it supports things like enums, explicit type declarations, and container types such as queues and stacks.
  2. TorchScript also didn't have support for Function or Callable Python types, which is why, instead of calling RNNBase's _rnn_impls, the LSTM and GRU implementations call _VF.lstm and _VF.gru directly.

This is only temporary, however, and they have added a TODO to update the code as soon as the missing types land in TorchScript. Why believe me when you can hear it from them:

“TODO: remove the overriding implementations for LSTM and GRU when TorchScript supports expressing these two modules generally.”

They’ve got it under control boys.

Now, all of these RNNs are built around something called an RNN cell, which processes the input sequence one element at a time.

For this purpose, a generic RNNCellBase class has been created in the same module, which passes its functionality down to the RNNCell, LSTMCell and GRUCell classes.

The RNN supports two modes, RNN_TANH and RNN_RELU. The explanation is simple: they choose which activation function the cell applies. The mode is set via the nonlinearity parameter of the RNN class (the RNNCell has a matching nonlinearity argument).
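For example (a trivial snippet with made-up sizes), switching the mode is just a constructor argument:

import torch.nn as nn

# The same RNN in its two modes: tanh (the default) and ReLU.
rnn_tanh = nn.RNN(input_size=4, hidden_size=3, nonlinearity='tanh')
rnn_relu = nn.RNN(input_size=4, hidden_size=3, nonlinearity='relu')
print(rnn_tanh.mode, rnn_relu.mode)   # RNN_TANH RNN_RELU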

Now let’s talk a little bit about each of these Base classes.

RNN Module

The RNN Cell implements the Elman RNN cell with ReLU or tanh activation functions. The idea behind it is fairly simple:

h'(t) = tanh( W_ih * X(t) + b_ih + W_hh * h(t-1) + b_hh )

It’s going to take the current X and previous hidden state, do a weighted sum and then transform it using the activation function.

You can decide whether or not you wish to use b_hh and b_ih by setting the bias parameter. If bias=False, the bias terms are removed. By default it is set to True.
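Here's a tiny sketch of one cell update that matches the formula above, with arbitrary sizes; the second half just recomputes the same thing by hand from the cell's own parameters.

import torch
import torch.nn as nn

# One step of an Elman cell: h' = tanh(W_ih * x + b_ih + W_hh * h + b_hh)
cell = nn.RNNCell(input_size=4, hidden_size=3, bias=True, nonlinearity='tanh')

x = torch.randn(2, 4)        # a batch of 2 input vectors
h = torch.zeros(2, 3)        # previous hidden state (here: the zero state)
h_next = cell(x, h)          # next hidden state, shape (2, 3)

# The same update written out by hand using the cell's parameters.
manual = torch.tanh(x @ cell.weight_ih.T + cell.bias_ih
                    + h @ cell.weight_hh.T + cell.bias_hh)
print(torch.allclose(h_next, manual, atol=1e-6))   # True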

RNN cells are the building blocks the RNN class uses to create your architecture. By setting hidden_size you're telling the RNN how many hidden neurons you want in each cell. By setting num_layers, you're stacking multiple RNN cells on top of each other in a 2-D grid, where the x-axis is time and the y-axis is the layers.

Inputs and Outputs

For inputs, the RNN takes a 3-D or 2-D tensor, depending on whether or not your input is batched.

If your input is batched, by default the RNN class assumes your data is arranged in the following format:

(Sequence_length, Batch_size, Input_size)

However, it is often more convenient to arrange your data as:

(Batch_size, Sequence_length, Input_size)

To do this, set the batch_first parameter to True; it is False by default.

You will also see a factor D in the output shapes further below. D comes from the bidirectional parameter: if bidirectional=True then D = 2, otherwise D = 1. By default bidirectional=False.
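A quick sketch of the two layouts (the numbers are arbitrary, chosen only to make the shapes easy to read):

import torch
import torch.nn as nn

seq_len, batch, input_size, hidden = 10, 15, 4, 6

# Default layout: (seq_len, batch, input_size)
rnn_seq_first = nn.RNN(input_size, hidden)                    # batch_first=False
out1, _ = rnn_seq_first(torch.randn(seq_len, batch, input_size))
print(out1.shape)    # torch.Size([10, 15, 6])

# Batch-first layout: (batch, seq_len, input_size)
rnn_batch_first = nn.RNN(input_size, hidden, batch_first=True)
out2, _ = rnn_batch_first(torch.randn(batch, seq_len, input_size))
print(out2.shape)    # torch.Size([15, 10, 6])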

Along with the input sequence itself, the RNNCell also allows you to pass an initial hidden state, and the same option exists on the RNN class if you ever wish to start from something other than zeros.

The input documentation for the RNNCell is as follows:

Inputs: input, hidden
— input : tensor containing input features
— hidden : tensor containing the initial hidden state
Defaults to zero if not provided.

The output of the RNNCell is going to be

Outputs: h(t)
— h (t) of shape `(batch, hidden_size)`: tensor containing the next hidden state
for each element in the batch

Which is essentially (Batch_size, hidden_size), where hidden_size is the size you set via the hidden_size parameter of the RNN class.
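Putting the last two points together, here's a minimal sketch (sizes made up) of passing an explicit initial hidden state to the RNN class and reading off the shapes:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=6, batch_first=True)

X = torch.randn(3, 5, 4)        # (batch, seq_len, input_size)
h0 = torch.zeros(1, 3, 6)       # (num_layers, batch, hidden_size); optional, defaults to zeros

out, h_n = rnn(X, h0)
print(out.shape, h_n.shape)     # torch.Size([3, 5, 6]) torch.Size([1, 3, 6])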

You can stack these RNN cells on top of each other for a multi-layer RNN. The number of stacked layers is decided by the num_layers parameter of the RNN class.

This essentially repeats the cell you've created num_layers times. The concept is called RNN stacking. At each time step, the hidden state produced by the first layer becomes the input to the second layer at that same time step. So each cell communicates upward to the layer above it and forward in time within its own layer.

This is an example of a stacked RNN.
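To make the stacking picture concrete, here's a rough sketch of two stacked cells unrolled by hand (sizes arbitrary). It mirrors what num_layers=2 does internally: the lower cell reads the input, the upper cell reads the lower cell's hidden states, and the output of the stack at each step is the top cell's state.

import torch
import torch.nn as nn

input_size, hidden_size, seq_len = 4, 3, 5

cell1 = nn.RNNCell(input_size, hidden_size)    # bottom layer: reads the raw input
cell2 = nn.RNNCell(hidden_size, hidden_size)   # top layer: reads layer 1's hidden states

x = torch.randn(seq_len, 1, input_size)        # a single sequence, batch of 1
h1 = torch.zeros(1, hidden_size)
h2 = torch.zeros(1, hidden_size)

outputs = []
for t in range(seq_len):
    h1 = cell1(x[t], h1)     # layer 1: current input + its own previous state
    h2 = cell2(h1, h2)       # layer 2: layer 1's state + its own previous state
    outputs.append(h2)       # the stack's output at time t

outputs = torch.stack(outputs)    # (seq_len, 1, hidden_size): like the RNN's output tensor
h_n = torch.stack([h1, h2])       # (num_layers, 1, hidden_size): like the RNN's h_n
print(outputs.shape, h_n.shape)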

The output of the RNN class is also a tuple:

  1. Output tensor: the final-layer hidden state at each of the time steps. In the diagram above, these are the hidden states popping out of the top-most row of boxes. The dimensions of this tensor are (Sequence_length, batch_size, D * hidden_size) if batch_first=False, and (batch_size, Sequence_length, D * hidden_size) if batch_first=True.
  2. h_n tensor: the final hidden state from every stacked layer. In the diagram, these are the outputs of every box in the right-most column of the grid. The dimensions of the h_n tensor are (D * num_layers, batch_size, hidden_size), regardless of the batch_first parameter.

So intuitively, your first assumption would be that, for every batch element, the last time step of the output tensor and the last layer's entry in the h_n tensor are going to match. Let's test it out.

import torch.nn as nn
import torch

batch_size = 15
sequence_length = 10
input_size = 4
hidden_size = 6
num_layers = 20
bmap = {1: False, 2: True}
D = 1

X = torch.randn(size=torch.Size((sequence_length, batch_size, input_size)))

# Rearrange into (batch_size, sequence_length, input_size) since batch_first=True.
X = torch.permute(X, (1, 0, 2))
rnn = nn.RNN(
    input_size=input_size,
    hidden_size=hidden_size,
    num_layers=num_layers,
    bidirectional=bmap[D],
    batch_first=True,
)

output, h_n = rnn(X)
print(f"Top layer's outputs of shape (batch_size, sequence_length, \
D * hidden_size): {output.size()}")

print(f"Final hidden states of shape (D * num_layers, batch_size, \
hidden_size): {h_n.size()}")
Top layer's outputs of shape (batch_size, sequence_length, D * hidden_size): torch.Size([15, 10, 6])

Final hidden states of shape (D * num_layers, batch_size, hidden_size): torch.Size([20, 15, 6])

And now let's verify whether our understanding is correct. If it is, then for every sequence in the batch, the last time step of the output should equal the corner block of the diagram, i.e. the top layer's entry in h_n.

h_n = torch.permute(h_n, (1, 0, 2))
h_n[:, -1, :] == output[:, -1, :]

And….

tensor([[True, True, True, True, True, True],
[True, True, True, True, True, True],
[True, True, True, True, True, True],
[True, True, True, True, True, True],
[True, True, True, True, True, True],
[True, True, True, True, True, True],
[True, True, True, True, True, True],
[True, True, True, True, True, True],
[True, True, True, True, True, True],
[True, True, True, True, True, True],
[True, True, True, True, True, True],
[True, True, True, True, True, True],
[True, True, True, True, True, True],
[True, True, True, True, True, True],
[True, True, True, True, True, True]])

Voila! Yatta desu ne! Phrases from two languages in a single sentence. That's wonderful, but more importantly, we got exactly the output we expected.

The reason we had to permute the h_n tensor is that it is always returned with shape (D * num_layers, batch_size, hidden_size), regardless of the batch_first parameter.

With this, I think we can conclude our exploration of the RNN. We will now endeavor to understand the next architecture, the LSTM. That architecture was proposed by Sepp Hochreiter and Jürgen Schmidhuber, and it tackles the vanishing and exploding gradient problem by using gates that control the flow of information (and therefore of gradients), so that the gradients neither explode nor shrink away.

The LSTM has an important place in the natural language hall of fame. It may not have revolutionized NLP the way the Transformer architecture later did, but before it, an RNN's sequential processing could only bridge sentences of about 5-10 words. The LSTM didn't improve on that by a factor of 2 or 3; it improved it by a factor of about 100.

The LSTM is the architecture that allowed sequences with thousands of tokens to be processed without hitting the vanishing or exploding gradient problem. If you've ever worked with language data when your only option was the plain RNN, you know what a sight for sore eyes this would have been.

How does it do that you may ask? Find out in the next post here.
