Recurrent Neural Networks (RNNs)

Archit Saxena
4 min read · Jan 3, 2023


A recurrent neural network (RNN) is a type of artificial neural network that takes the output from the previous step as an input to the next step. It can retain information from the past and use that information to process new input.

Fig 1: RNN; Source: Simplilearn

In some standard NLP techniques like BoW, Word2Vec or TF-IDF, we lose the structure of the sentence (the order of the words is not preserved), which hurts the accuracy of the model. RNNs come to the rescue with their key feature, hidden states, which allow the network to remember information from previous inputs and use it to influence the processing of new input. This is achieved through recurrent (feedback) connections among the hidden units, which carry information from one time step to the next.

RNNs are well-suited to tasks that involve sequential data, such as natural language processing, speech recognition, and time series forecasting. They have been used to achieve state-of-the-art results in many of these tasks, and have become a popular choice for building machine-learning models that operate on sequential data.

Any neural network has an input layer, hidden layer(s) and an output layer. In Fig 1 above, the repeated hidden states across time steps are drawn compactly as a single recurrent layer. Fig 2 shows how it looks after unfolding (unrolling) through time.

Fig 2: Unfolded RNN layer

This means that the hidden state from the previous time step is fed into the next time step when making predictions, repeating the same structure at every step, hence the name ‘Recurrent’ Neural Network.

  • X: Input, e.g. a word or any other element of sequential data.
  • O: Output, e.g. the next word or sequence element predicted by the network.
  • h(t): The hidden state at time t, which acts as the “memory” of the network.
  • V: The recurrent weights that carry the hidden state (memory) from one time step to the next.

Fig 3: A simple RNN block

Forward Pass

In the forward pass, the output is calculated as follows:

Fig 4: Forward Pass equations

where f is a non-linear activation function; for example, with f = tanh we get h(t) = tanh(a(t)).

h(t) is calculated from the current input x(t) and the previous time step’s hidden state h(t-1). The function f is a non-linear transformation such as tanh or ReLU. U is the weight matrix applied to the input X, and V is the weight matrix applied to the memory, i.e. the previous hidden state.
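
As a concrete illustration, here is a minimal NumPy sketch of this forward pass. Only U, V and the tanh non-linearity come from the description above; the output weights W, the biases b and c, and all the shapes are illustrative assumptions.

```python
import numpy as np

def rnn_forward(xs, U, V, W, b, c):
    """Run a simple RNN over a sequence of input vectors xs.

    U: input-to-hidden weights, V: hidden-to-hidden (memory) weights,
    W: hidden-to-output weights (illustrative), b/c: hidden/output biases.
    """
    h = np.zeros(V.shape[0])           # initial hidden state h(0)
    outputs = []
    for x in xs:                       # one step per element of the sequence
        a = U @ x + V @ h + b          # combine current input with previous memory
        h = np.tanh(a)                 # h(t) = tanh(a(t)), the new hidden state
        o = W @ h + c                  # output at this time step
        outputs.append(o)
    return outputs, h

# Toy usage: a sequence of 5 random 3-dimensional inputs, hidden size 4, output size 2
rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]
U, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
W, b, c = rng.normal(size=(2, 4)), np.zeros(4), np.zeros(2)
outputs, h_last = rnn_forward(xs, U, V, W, b, c)
print(len(outputs), h_last.shape)      # 5 outputs, final hidden state of shape (4,)
```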

Backpropagation Through Time (BPTT)

In backpropagation through time, we compute the gradient of the loss and propagate it backwards through the unrolled network. The gradient at time step t accumulates contributions from all of the later time steps, i.e. t+1, t+2, …, and each backward step multiplies by the recurrent weights and the activation’s derivative. When the sequence length is large, these repeated multiplications usually shrink the gradient towards zero. This is the vanishing-gradient problem, and it is why a plain RNN cannot remember long sequences (the same mechanism can also make gradients explode).
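
To see why the repeated multiplications matter, here is a toy sketch (my own illustration, not from the article): backpropagating a gradient through many steps of a fixed recurrent matrix whose largest singular value is below 1 makes its norm decay exponentially.

```python
import numpy as np

# Toy illustration of vanishing gradients in BPTT: the gradient reaching
# time step t is (roughly) a product of per-step Jacobians. With a recurrent
# matrix whose spectral norm is 0.9, that product shrinks exponentially.
rng = np.random.default_rng(42)
V = rng.normal(size=(4, 4))
V = 0.9 * V / np.linalg.norm(V, 2)   # rescale so the largest singular value is 0.9

for steps in (1, 10, 50, 100):
    g = np.ones(4)                   # pretend gradient arriving at the last time step
    for _ in range(steps):
        g = V.T @ g                  # one backward step through time
    print(steps, np.linalg.norm(g))  # the norm decays roughly like 0.9 ** steps
```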

Types of RNNs

Fig 5: Types of RNNs; Source: Andrej Karpathy
  1. One-to-one: Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification).
  2. One-to-many: Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
  3. Many-to-one: Sequence input (e.g. sentiment analysis, where a given sentence is classified as expressing a positive or negative sentiment).
  4. Many-to-many: Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French).
  5. Many-to-many: Synced sequence input and output (e.g. video classification where we wish to label each frame of the video).
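
As a rough sketch of the difference (with illustrative shapes and random weights, not taken from Fig 5), the same recurrent loop yields a many-to-many or a many-to-one pattern depending on which outputs we keep:

```python
import numpy as np

rng = np.random.default_rng(1)
U, V, W = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
xs = [rng.normal(size=3) for _ in range(6)]     # a length-6 input sequence

h = np.zeros(4)
per_step_outputs = []
for x in xs:
    h = np.tanh(U @ x + V @ h)                  # the same recurrent update in every case
    per_step_outputs.append(W @ h)

many_to_many = per_step_outputs                 # keep every output, e.g. label each video frame
many_to_one = per_step_outputs[-1]              # keep only the last output, e.g. sentence sentiment
print(len(many_to_many), many_to_one.shape)
```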

Applications of RNNs

Some common applications are —

  • Processing a sequence of words (e.g. product reviews)
  • Machine/Language Translation
  • Speech recognition
  • Image captioning
  • Time-series forecasting

Limitations of RNNs

  • They can’t remember very long sequences, because they run into the vanishing-gradient problem.
  • They are slow to train, as BPTT takes time. To improve this, we can use Truncated BPTT, in which we divide the input sequence into small chunks and apply BPTT on those chunks (see the sketch below).
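
A minimal truncated-BPTT sketch using PyTorch’s nn.RNN is shown below; the model, hyperparameters and random data are purely illustrative. The key step is detaching the hidden state at each chunk boundary, so gradients only flow within a chunk.

```python
import torch
import torch.nn as nn

# Illustrative setup: a sequence of length 100 processed in chunks of 20
seq_len, chunk_len, batch, in_dim, hid_dim = 100, 20, 8, 3, 16
rnn = nn.RNN(in_dim, hid_dim, batch_first=True)
head = nn.Linear(hid_dim, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(batch, seq_len, in_dim)          # toy inputs
y = torch.randn(batch, seq_len, 1)               # toy targets, one per time step

h = torch.zeros(1, batch, hid_dim)
for start in range(0, seq_len, chunk_len):       # split the sequence into chunks
    xc = x[:, start:start + chunk_len]
    yc = y[:, start:start + chunk_len]
    h = h.detach()                               # cut the graph: gradients stop at the chunk boundary
    out, h = rnn(xc, h)
    loss = nn.functional.mse_loss(head(out), yc)
    opt.zero_grad()
    loss.backward()                              # BPTT only within this chunk
    opt.step()
```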

Because of these limitations, we have modified architectures built on the simple RNN, called Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU).

We will discuss these two in detail in upcoming articles.

If you like it, please leave a 👏.

Feedback/suggestions are always welcome.
