Recurrent Neural Network — Lesson 3: Working and Mathematics of Vanilla RNNs

Machine Learning in Plain English
2 min read · Aug 10, 2023

Working and Mathematics of Vanilla RNNs

A vanilla RNN, also known as a simple or Elman RNN, processes inputs sequentially, maintaining an internal state (hidden state) that encodes information about the inputs it has processed so far.

The hidden state h_t at time t is computed by applying a non-linear activation function f (typically tanh or ReLU) to a weighted combination of the current input and the previous hidden state: the input x_t is multiplied by its weight matrix W_x, the previous hidden state h_(t-1) by its weight matrix W_h, and a bias term b is added. In mathematical terms:

h_t = f(W_x * x_t + W_h * h_(t-1) + b)

The output y_t at time t is then computed by applying a linear transformation to h_t, using output weights W_y and a bias b_y:

y_t = W_y * h_t + b_y
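
To make the two equations concrete, here is a minimal NumPy sketch of the forward pass. The function name, dimensions, and random initialisation are illustrative assumptions rather than part of the lesson; only the two update rules above are taken from the text.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, W_y, b, b_y, h0=None):
    """Run a vanilla RNN over a sequence of input vectors.

    xs     : inputs, shape (T, input_dim)
    W_x    : input-to-hidden weights, shape (hidden_dim, input_dim)
    W_h    : hidden-to-hidden weights, shape (hidden_dim, hidden_dim)
    W_y    : hidden-to-output weights, shape (output_dim, hidden_dim)
    b, b_y : hidden and output bias vectors
    h0     : optional initial hidden state (defaults to zeros)
    """
    h = np.zeros(W_h.shape[0]) if h0 is None else h0
    hs, ys = [], []
    for x_t in xs:
        # h_t = f(W_x * x_t + W_h * h_(t-1) + b), with f = tanh
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        # y_t = W_y * h_t + b_y
        y = W_y @ h + b_y
        hs.append(h)
        ys.append(y)
    return np.array(hs), np.array(ys)

# Example: 5 time steps, 3-dim inputs, hidden size 4, 2-dim outputs.
rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2
xs = rng.normal(size=(T, input_dim))
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_y = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

hs, ys = rnn_forward(xs, W_x, W_h, W_y, b, b_y)
print(hs.shape, ys.shape)  # (5, 4) (5, 2)
```

Note that the same weight matrices W_x, W_h, and W_y are reused at every time step; only the hidden state h changes as the sequence is consumed.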

Problems: Vanishing and Exploding Gradients

During training, gradients of the loss are used to update the weights. In an RNN, however, the gradient must be backpropagated through time: at every step it is multiplied by the recurrent Jacobian, so repeated factors smaller than one make it shrink exponentially (vanish), while repeated factors larger than one make it grow exponentially (explode). These two failure modes are known as the vanishing gradient problem and the exploding gradient problem, respectively.
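
The compounding effect is easy to see numerically. The sketch below is an illustrative experiment, not from the original lesson: it backpropagates a random gradient vector through 50 steps of the tanh RNN above, multiplying it at each step by the transpose of the recurrent Jacobian diag(1 - h_t^2) @ W_h. The hidden states are simulated at random, and the weight scales 0.5 and 2.5 are arbitrary choices made only to expose the two regimes.

```python
import numpy as np

def backprop_norm_through_time(W_h, T=50, seed=0):
    """Track the norm of a gradient vector as it is backpropagated
    through T time steps of a tanh RNN with recurrent weights W_h.

    The per-step Jacobian is diag(1 - h_t**2) @ W_h; the tanh
    derivative is evaluated at simulated hidden states, since we
    only care about how the norm scales with depth.
    """
    rng = np.random.default_rng(seed)
    hidden_dim = W_h.shape[0]
    grad = rng.normal(size=hidden_dim)
    norms = []
    for _ in range(T):
        h = np.tanh(rng.normal(size=hidden_dim))   # a plausible hidden state
        jacobian = np.diag(1.0 - h ** 2) @ W_h     # d h_t / d h_(t-1)
        grad = jacobian.T @ grad                   # one step of backprop
        norms.append(np.linalg.norm(grad))
    return norms

hidden_dim = 16
rng = np.random.default_rng(1)
base = rng.normal(size=(hidden_dim, hidden_dim)) / np.sqrt(hidden_dim)

small = backprop_norm_through_time(0.5 * base)   # small recurrent weights
large = backprop_norm_through_time(2.5 * base)   # large recurrent weights

print(f"after 50 steps, small-weight grad norm: {small[-1]:.2e}")  # vanishes
print(f"after 50 steps, large-weight grad norm: {large[-1]:.2e}")  # explodes
```

With the smaller recurrent weights the gradient norm collapses toward zero within a few dozen steps, while with the larger weights it blows up; both behaviours are the exponential shrinking and growing described above.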

The vanishing gradient problem makes it difficult for the RNN to learn parameters that depend on inputs from early time steps, which in turn makes it hard to capture long-term dependencies in the data. Exploding gradients, on the other hand, lead to unstable and inefficient training.

Short-term Memory of Vanilla RNNs

Due to the vanishing gradient problem, vanilla RNNs tend to have a “short-term memory”, i.e., they can struggle to maintain the influence of information from earlier time steps as the sequence gets longer. While they can typically handle short sequences relatively well, their performance can degrade for longer sequences. This is a significant limitation as many sequential tasks involve important dependencies over varying time scales.
