Recurrent Neural Network (RNN) Architecture Explained

Sushmita Poudel · 6 min read · Aug 28, 2023

This article provides insights into RNNs and the concept of backpropagation through time (BPTT), and then examines the problems of vanishing and exploding gradients in RNNs.

Recurrent Neural Networks (RNNs) were introduced to address the limitations of traditional neural networks, such as FeedForward Neural Networks (FNNs), when it comes to processing sequential data. An FNN takes inputs and processes each one independently through a number of hidden layers, without considering the order of the inputs or the context provided by the others. Because of this, it cannot handle sequential data effectively or capture the dependencies between inputs. As a result, FNNs are not well suited for tasks such as language modeling, machine translation, speech recognition, time series analysis, and many other applications that require sequential processing. To address these limitations, RNNs come into the picture.

RNNs overcome these limitations by introducing a recurrent connection that allows information to flow from one time step to the next. This recurrent connection gives the network an internal memory: the output of each step is fed back as an input to the next step, so the network can carry information forward from previous steps and use it in the current one. This enables the model to learn temporal dependencies and to handle inputs of variable length.
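Schematically, this recurrence can be read as a simple loop that threads a hidden state through the sequence. The Python sketch below is purely illustrative (step_fn stands in for whatever function combines the current input with the previous state; it is not a specific library's API):

def run_rnn(step_fn, inputs, initial_state):
    # Illustrative recurrence: the state produced at each step is fed
    # back in at the next step, so earlier inputs influence later outputs.
    state = initial_state
    outputs = []
    for x_t in inputs:
        state = step_fn(x_t, state)   # new state depends on the input AND the old state
        outputs.append(state)
    return outputs, state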

fig 1: FeedForward Neural Network (FNN). Image by Author
fig 2: Recurrent Neural Network (RNN). Image by Author

Architecture Of RNN

For a clearer understanding of the concept of RNNs, let’s look at the unfolded RNN diagram.

fig 3: RNN Unfolded. Image by Author.

The RNN takes an input vector X and generates an output vector y by scanning the data sequentially from left to right, updating the hidden state and producing an output at each time step. It shares the same parameters across all time steps: the same set of parameters, represented by U, V, W, is used consistently throughout the network. U is the weight matrix governing the connection from the input layer X to the hidden layer, W is the weight matrix for the connections between hidden layers across time steps, and V is the weight matrix for the connection from the hidden layer to the output layer y. This sharing of parameters allows the RNN to capture temporal dependencies and process sequential data efficiently by retaining information from previous inputs in its current hidden state.

At each time step t, the hidden state aₜ is computed based on the current input xₜ , previous hidden state aₜ₋₁ and model parameters as illustrated by the following formula:

aₜ = f(aₜ₋₁, xₜ; θ)          (1)

It can also be written as,

aₜ = f(U * xₜ + W * aₜ₋₁ + b)
where,

  • aₜ is the hidden state (the output of the hidden layer) at time step t.
  • xₜ is the input at time step t.
  • θ represents the set of learnable parameters (weights and biases).
  • U is the weight matrix governing the connections from the input to the hidden layer; U ∈ θ.
  • W is the weight matrix governing the connections from the hidden layer to itself (recurrent connections); W ∈ θ.
  • V is the weight matrix governing the connections from the hidden layer to the output layer; V ∈ θ.
  • aₜ₋₁ is the hidden state at time step t-1.
  • b is the bias vector for the hidden layer; b ∈ θ.
  • f is the activation function.
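As a minimal NumPy sketch of this update (the dimensions, the random initialization, and the choice of tanh for f are assumptions made purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 5

U = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_size)                                   # hidden-layer bias

def rnn_step(x_t, a_prev):
    # One application of equation (1): a_t = f(U * x_t + W * a_{t-1} + b)
    return np.tanh(U @ x_t + W @ a_prev + b)

x_t = rng.normal(size=input_size)    # input at time step t
a_prev = np.zeros(hidden_size)       # previous hidden state a_{t-1}
a_t = rnn_step(x_t, a_prev)
print(a_t.shape)                     # (5,)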

For a finite number of time steps T = 4, we can expand the computation graph of the Recurrent Neural Network illustrated in Figure 3 by applying equation (1) T-1 times.

a₄ = f(a₃, x₄; θ)          (2)

Equation (2) can be expanded as,

a₄ = f(U * x₄ + W * a₃ + b)

a₃ = f(U * x₃ + W * a₂ + b)

a₂ = f(U * x₂ + W * a₁ + b)
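To make the unrolling concrete, the short sketch below (same assumed shapes and tanh nonlinearity as above) computes a₄ both by iterating the step in a loop and by nesting the calls exactly as in the expanded equations, and checks that the two agree:

import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size = 3, 5
U = rng.normal(scale=0.1, size=(hidden_size, input_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def step(x_t, a_prev):
    return np.tanh(U @ x_t + W @ a_prev + b)

xs = [rng.normal(size=input_size) for _ in range(4)]   # x_1, x_2, x_3, x_4
a1 = step(xs[0], np.zeros(hidden_size))                # a_1, starting from a zero initial state

# Iterative unrolling: apply equation (1) three more times to reach a_4.
a = a1
for x in xs[1:]:
    a = step(x, a)

# Nested form, mirroring the expanded equations: a_4 = f(x_4, f(x_3, f(x_2, a_1)))
a4_nested = step(xs[3], step(xs[2], step(xs[1], a1)))

print(np.allclose(a, a4_nested))   # True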

The output at each time step t, denoted ŷₜ, is computed from the hidden state aₜ using the following formula:

ŷₜ = f(aₜ; θ)          (3)

Equation (3) can be written as,

ŷₜ = f(V * aₜ + c)

when t=4, ŷ₄ = f(V * a₄ + c)

where,

  • ŷₜ is the output predicted at time step t.
  • V is the weight matrix governing the connections from the hidden layer to the output layer.
  • c is the bias vector for the output layer.
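As a small sketch of this output computation (the output size and the softmax choice for f are illustrative assumptions; equation (3) only says the output is some function of V * aₜ + c):

import numpy as np

rng = np.random.default_rng(2)
hidden_size, output_size = 5, 4

V = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden-to-output weights
c = np.zeros(output_size)                                   # output-layer bias

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

a_t = rng.normal(size=hidden_size)   # hidden state at time step t
y_hat_t = softmax(V @ a_t + c)       # equation (3): predicted output at time step t
print(y_hat_t.sum())                 # probabilities sum to 1.0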

Backpropagation Through Time (BPTT)

Backpropagation involves adjusting the model’s parameters (weights and biases) based on the error between the predicted output and the actual target value. The goal of backpropagation is to improve the model’s performance by minimizing the loss function. Backpropagation Through Time (BPTT) is a variant of backpropagation used to train RNNs, in which the error is propagated backward through time, all the way to the initial time step t = 1. It involves two key steps: a forward pass and a backward pass.

  1. Forward Pass: During the forward pass, the RNN processes the input sequence through time, from t = 1 to t = n, where n is the length of the input sequence. At each time step, the following computations take place:

zₜ = U * xₜ + W * aₜ₋₁ + b
aₜ = tanh(zₜ)
ŷₜ = softmax(V * aₜ + c)

After processing the entire sequence, the RNN has generated a sequence of predicted outputs ŷ = [ŷ₁, ŷ₂, …, ŷₜ]. The loss is then computed by comparing the predicted output ŷ at each time step with the actual target output y. Using the mean squared error, the loss function is given by:

L(y, ŷ) = (1/t) * Σ (yₜ - ŷₜ)²          (MSE loss)
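Putting these pieces together, here is a NumPy sketch of the forward pass over a short toy sequence, using the tanh/softmax choices from the equations above and the MSE loss just written (the sizes, the random inputs, and the one-hot targets are illustrative assumptions, not part of the original formulation):

import numpy as np

rng = np.random.default_rng(3)
T, input_size, hidden_size, output_size = 4, 3, 5, 4

# Shared parameters, reused at every time step.
U = rng.normal(scale=0.1, size=(hidden_size, input_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(output_size, hidden_size))
b, c = np.zeros(hidden_size), np.zeros(output_size)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = rng.normal(size=(T, input_size))                        # input sequence x_1 .. x_T
ys = np.eye(output_size)[rng.integers(output_size, size=T)]  # one-hot targets (toy data)

a = np.zeros(hidden_size)                    # initial hidden state
predictions = []
for t in range(T):
    z = U @ xs[t] + W @ a + b                # z_t
    a = np.tanh(z)                           # hidden state a_t
    predictions.append(softmax(V @ a + c))   # prediction y_hat_t

# Loss averaged over the time steps, following the MSE formula above.
loss = np.mean([np.sum((ys[t] - predictions[t]) ** 2) for t in range(T)])
print(loss)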

  2. Backward Pass: The backward pass in BPTT involves computing the gradients of the loss function with respect to the network’s parameters (U, W, V, and the biases) across all time steps.

Let’s explore the concept of backpropagation through time by computing the gradients of loss at time step t=4. The figure below also serves as an illustration of backpropagation for time step 4.

fig 4: Back Propagation Through Time (BPTT). Image by Author

Derivative of loss L w.r.t V

Loss L is a function of predicted value ŷ, so using the chain rule ∂L/∂V can be written as,

∂L/∂V = (∂L/∂ŷ) * (∂ŷ/∂V)

Derivative of loss L w.r.t W

Applying the chain rule, ∂L/∂W can be written as follows. The loss at the 4th time step depends on ŷ₄, because the loss is computed as a function of ŷ₄, which in turn depends on the current hidden state a₄; a₄ is influenced by both W and a₃; a₃ in turn depends on both a₂ and W; and a₂ depends on a₁ and, again, on W.

∂L₄/∂W = (∂L₄/∂ŷ₄ * ∂ŷ₄/∂a₄ * ∂a₄/∂W) + (∂L₄/∂ŷ₄ * ∂ŷ₄/∂a₄ * ∂a₄/∂a₃ * ∂a₃/∂W) + (∂L₄/∂ŷ₄ * ∂ŷ₄/∂a₄ * ∂a₄/∂a₃ * ∂a₃/∂a₂ * ∂a₂/∂W) + (∂L₄/∂ŷ₄ * ∂ŷ₄/∂a₄ * ∂a₄/∂a₃ * ∂a₃/∂a₂ * ∂a₂/∂a₁ * ∂a₁/∂W)

Derivative of loss L w.r.t U

Similarly, ∂L/∂U can be written as,

∂L₄/∂U = (∂L₄/∂ŷ₄ * ∂ŷ₄/∂a₄ * ∂a₄/∂U) + (∂L₄/∂ŷ₄ * ∂ŷ₄/∂a₄ * ∂a₄/∂a₃ * ∂a₃/∂U) + (∂L₄/∂ŷ₄ * ∂ŷ₄/∂a₄ * ∂a₄/∂a₃ * ∂a₃/∂a₂ * ∂a₂/∂U) + (∂L₄/∂ŷ₄ * ∂ŷ₄/∂a₄ * ∂a₄/∂a₃ * ∂a₃/∂a₂ * ∂a₂/∂a₁ * ∂a₁/∂U)

Here we sum the gradient contributions across all time steps, which is the key difference between BPTT and the regular backpropagation approach.
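These gradient sums translate fairly directly into code. The sketch below is a minimal BPTT backward pass for the toy forward pass shown earlier (same assumed shapes, tanh/softmax, and MSE loss); the running variable d_a_next carries the ∂aₜ₊₁/∂aₜ chain backward through time, so the products written above are accumulated implicitly rather than expanded term by term:

import numpy as np

rng = np.random.default_rng(3)
T, input_size, hidden_size, output_size = 4, 3, 5, 4

U = rng.normal(scale=0.1, size=(hidden_size, input_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(output_size, hidden_size))
b, c = np.zeros(hidden_size), np.zeros(output_size)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = rng.normal(size=(T, input_size))
ys = np.eye(output_size)[rng.integers(output_size, size=T)]   # one-hot targets (toy data)

# Forward pass, storing hidden states and predictions for use in the backward pass.
a_s, y_hats = [], []
a = np.zeros(hidden_size)
for t in range(T):
    a = np.tanh(U @ xs[t] + W @ a + b)
    a_s.append(a)
    y_hats.append(softmax(V @ a + c))

# Backward pass: accumulate gradient contributions over all time steps.
dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
db, dc = np.zeros_like(b), np.zeros_like(c)
d_a_next = np.zeros(hidden_size)                          # gradient arriving from step t+1

for t in reversed(range(T)):
    d_yhat = (2.0 / T) * (y_hats[t] - ys[t])              # dL/d y_hat_t for the MSE loss
    yh = y_hats[t]
    d_o = (np.diag(yh) - np.outer(yh, yh)) @ d_yhat       # back through the softmax
    dV += np.outer(d_o, a_s[t])                           # dL/dV contribution at step t
    dc += d_o
    d_a = V.T @ d_o + d_a_next                            # from the output AND from a_{t+1}
    d_z = d_a * (1.0 - a_s[t] ** 2)                       # back through tanh
    a_prev = a_s[t - 1] if t > 0 else np.zeros(hidden_size)
    dU += np.outer(d_z, xs[t])                            # dL/dU contribution at step t
    dW += np.outer(d_z, a_prev)                           # dL/dW contribution at step t
    db += d_z
    d_a_next = W.T @ d_z                                  # gradient passed back to step t-1

print(dU.shape, dW.shape, dV.shape)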

Limitations of RNN

During backpropagation, gradients can become too small, leading to the vanishing gradient problem, or too large, resulting in the exploding gradient problem, as they propagate backward through time. With vanishing gradients, the gradient becomes so small that the network struggles to capture long-term dependencies effectively; training may still converge, but it can take a very long time. With exploding gradients, by contrast, the large gradients cause numerical instability during training, making the model deviate from the optimal solution and making it difficult for the network to converge to a good minimum.
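A quick toy illustration of why this happens (not the article's model; all numbers here are assumptions): the backward recurrence repeatedly multiplies the carried gradient by terms of the form Wᵀ · diag(1 - aₜ²), so over many time steps the gradient norm tends to shrink or grow roughly geometrically, depending on the scale of W:

import numpy as np

rng = np.random.default_rng(4)
hidden_size, T = 50, 100

def gradient_norm_after_T_steps(weight_scale):
    # Push a gradient vector backward through T tanh-RNN steps and return its norm.
    W = rng.normal(scale=weight_scale / np.sqrt(hidden_size),
                   size=(hidden_size, hidden_size))
    a = rng.uniform(-0.5, 0.5, size=hidden_size)   # stand-in hidden-state values
    grad = np.ones(hidden_size)
    for _ in range(T):
        grad = W.T @ (grad * (1.0 - a ** 2))       # one step of the backward recurrence
    return np.linalg.norm(grad)

print(gradient_norm_after_T_steps(0.5))   # shrinks toward zero -> vanishing gradient
print(gradient_norm_after_T_steps(3.0))   # blows up            -> exploding gradient

In practice, exploding gradients are often mitigated with gradient clipping, while the vanishing-gradient issue is the main motivation for the gated architectures mentioned below.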

fig 5: Gradient Descent. Image Source
fig 6: Vanishing and Exploding Gradient. Image Source

To address these problems, variants of the RNN such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks have been introduced.
