Demystifying Sequence Modeling: Understanding RNNs, LSTMs, and Seq2Seq

Zain ul Abideen
5 min read · Jun 26, 2023


Exploring the fundamentals and applications of sequence modeling.

Introduction

Sequence modeling is an important problem across many domains, including Natural Language Processing (NLP), speech recognition and speech synthesis, time series forecasting, music generation, and bioinformatics. What these tasks have in common is that they require persistence: the prediction of the next element depends on the history that came before it. For example, in the sentence “Hasan used to play football and he was pretty good at it”, the word ‘he’ can only be predicted if the information about ‘Hasan’ is carried forward to that point. So you need some sort of history block that can store previous information and carry it forward for later predictions. Traditional feed-forward ANNs fail at this because they cannot carry information across time steps. This gave birth to a new architecture called “Recurrent Neural Networks (RNNs)”.

Recurrent Neural Networks

A recurrent neural network is a type of deep learning neural net that processes the input sequence step by step, remembers what it has seen in its memory states, and uses that memory to predict the next words or sentences. RNNs have loops in them, allowing information to persist.

Image source: Sebastian Raschka, Vahid Mirjalili. Python Machine Learning

The single-layer RNN shown above has input x, output y, and hidden unit h. The right part of the diagram shows the same RNN unfolded over time. Consider the hidden unit h(t): it receives two inputs, x(t) and h(t-1). In this way, information is carried forward from one time step to the next.
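To make the recurrence concrete, here is a minimal NumPy sketch of a single RNN step. The weights are random placeholders rather than trained values; the point is only that h(t) is computed from the current input x(t) and the previous hidden state h(t-1).

```python
import numpy as np

# Minimal sketch of the recurrence described above (random, untrained weights).
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W_xh = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # h(t) depends on both the current input and the previous hidden state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_dim)                          # h(0)
for x_t in rng.normal(size=(5, input_dim)):       # a 5-step input sequence
    h = rnn_step(x_t, h)                          # h(t) = f(x(t), h(t-1))
print(h.shape)                                    # (4,)
```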

There are different types of sequence problems for which modified versions of this RNN architecture are used. Sequence problems can be broadly classified into the following categories:

Image source: The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy

A many-to-many architecture can be used for video captioning and machine translation, one-to-many for image captioning, and many-to-one for sentiment analysis. These are just a few applications of these modified architectures.

Drawbacks of RNNs

  1. Vanishing/Exploding Gradients: The total loss is the sum of the losses across all time steps. During backpropagation we take partial derivatives of this loss with respect to the weights, and applying the chain rule leaves us with a product of partial derivatives of hidden states at adjacent time steps. Because of this repeated multiplication, the gradients can shrink exponentially and the parameter updates become vanishingly small; this is the vanishing gradient problem. When the gradients instead grow exponentially, the updates become unstable and unpredictable; this is the exploding gradient problem. Both problems hinder the training of RNNs (a small numeric sketch after this list illustrates the compounding).
  2. Long-Term Dependencies: RNNs can pass information across small gaps, but when the last word of a long sentence depends on its first word, the gap is too wide and RNNs fail to carry the information across it.
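Here is a tiny numeric sketch of why the compounding matters: the gradient that reaches the earliest time steps contains a product of one factor per time step, so factors consistently below or above 1 shrink or blow up exponentially. The numbers below are made-up constants, not measurements from a real network.

```python
# The gradient flowing back to time step 0 contains a product of T per-step
# factors. If their magnitudes sit consistently below 1 the product vanishes;
# above 1 it explodes. The magnitudes here are illustrative placeholders.
T = 50
for factor in (0.9, 1.1):
    print(f"factor {factor}: product over {T} steps ≈ {factor ** T:.2e}")
# 0.9 shrinks to ~5e-03, while 1.1 grows to ~1e+02 over 50 steps.
```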

To overcome these problems we can use gradient clipping, skip connections, careful weight initialization, gradient regularization, and gated architectures like LSTMs and GRUs.
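As a small example of one of these mitigations, the sketch below rescales the gradient norm with PyTorch's clip_grad_norm_ utility. The toy RNN, data, and sizes are made up purely for illustration.

```python
import torch
import torch.nn as nn

# Toy setup to demonstrate gradient clipping (model and data are placeholders).
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.01)

x = torch.randn(4, 30, 8)          # batch of 4 sequences, 30 time steps each
y = torch.randn(4, 1)

out, h_n = rnn(x)                  # h_n holds the final hidden state
loss = nn.functional.mse_loss(head(h_n[-1]), y)
loss.backward()

# Rescale gradients whose global norm exceeds 1.0 to avoid exploding updates.
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
opt.step()
```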

Long Short Term Memory

LSTM is a type of deep learning neural net that maintains two different states: a hidden state and a cell state. It has three types of gates: input, forget, and output gates. These gates regulate the flow of information into and out of the memory cell, allowing LSTMs to selectively remember or forget information as needed.

Image source: Sebastian Raschka, Vahid Mirjalili. Python Machine Learning

Now I’ll explain how an LSTM works. One LSTM cell takes as input x(t), the hidden state h(t-1), and the cell state c(t-1). Based on h(t-1) and x(t), it first decides what information to throw away using the forget gate. Next, it decides which new information should be stored in the cell state, using the input gate and the input node. It then updates the cell state: c(t-1) is first scaled by the forget gate and then the new information is added, giving c(t). Finally, it computes the new hidden state h(t), the cell’s output, from the updated cell state c(t) and the output gate, which itself depends on x(t) and h(t-1).
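The sketch below follows this description gate by gate in NumPy. The weight matrices are random placeholders and the biases are omitted for brevity, so this is only an illustration of the computation, not a trained model.

```python
import numpy as np

# One LSTM step, mirroring the description above. Each gate sees [h(t-1), x(t)].
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Random placeholder weights for forget gate, input gate, input node, output gate.
W_f, W_i, W_g, W_o = (rng.normal(size=(hidden_dim, hidden_dim + input_dim)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)          # forget gate: what to discard from c(t-1)
    i = sigmoid(W_i @ z)          # input gate: how much new information to admit
    g = np.tanh(W_g @ z)          # input node: candidate cell contents
    c = f * c_prev + i * g        # updated cell state c(t): forget, then add
    o = sigmoid(W_o @ z)          # output gate
    h = o * np.tanh(c)            # new hidden state h(t)
    return h, c

h = c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c)
```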

LSTMs still face overfitting, memory limitations, and computational complexity, and many minor modifications to the LSTM architecture have been suggested. One such architecture is the Gated Recurrent Unit (GRU):

GRU
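As a quick, illustrative comparison of the two gated cells in PyTorch (all sizes below are arbitrary): the GRU merges the cell state and hidden state into a single state and uses two gates (reset and update) instead of three.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 8)                        # batch of 4, input size 8

lstm_cell = nn.LSTMCell(input_size=8, hidden_size=16)
h, c = lstm_cell(x)                          # LSTM tracks both h(t) and c(t)

gru_cell = nn.GRUCell(input_size=8, hidden_size=16)
h = gru_cell(x)                              # GRU keeps only a single state h(t)
```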

Sequence-to-Sequence

Seq2Seq is a special type of sequence modeling used for machine translation, text generation, summarization, and more. Its architecture is designed so that it can take a variable-length input and produce a variable-length output. It consists of an encoder and a decoder, each of which is a recurrent neural network.

Sequence-to-sequence learning with an RNN encoder and an RNN decoder

In the diagram above, you can see that the encoder takes one input token at each time step and updates its hidden state. All the information the encoder captures from the input sentence is passed to the decoder through the encoder’s last hidden state. This last hidden state is called the context vector, and it serves as a summary of the entire input sequence. The decoder RNN takes the context vector produced by the encoder and generates the output sequence token by token. At each time step, the decoder receives the previously generated token (or a start token at the first time step) together with its hidden state, and updates the hidden state based on both. It keeps generating tokens until a stopping condition is met, such as reaching a maximum length or producing an end-of-sequence token.
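A minimal PyTorch sketch of this encoder-decoder loop is shown below. The vocabulary sizes, embedding and hidden dimensions, and the SOS/EOS token ids are assumptions made up for illustration, and the model is untrained, so its outputs are meaningless; the point is the flow of the context vector and the token-by-token decoding.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 100, 120, 32, 64   # made-up sizes
SOS, EOS = 1, 2                                     # assumed start / end token ids

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
    def forward(self, src):                   # src: (batch, src_len) of token ids
        _, h_n = self.rnn(self.embed(src))
        return h_n                            # last hidden state = context vector

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)
    def forward(self, token, hidden):         # one decoding step
        output, hidden = self.rnn(self.embed(token), hidden)
        return self.out(output), hidden

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, SRC_VOCAB, (1, 7))     # one source sentence of 7 tokens
hidden = encoder(src)                         # context vector initialises the decoder

token = torch.tensor([[SOS]])                 # start token
generated = []
for _ in range(20):                           # greedy decoding, capped at 20 steps
    logits, hidden = decoder(token, hidden)
    token = logits.argmax(dim=-1)             # feed back the previous prediction
    if token.item() == EOS:                   # stop at the end-of-sequence token
        break
    generated.append(token.item())
print(generated)
```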

Drawbacks of Seq2Seq

  1. Context compression: All the information from the input sequence has to be compressed into a fixed-size context vector, so fine-grained details are lost.
  2. Short-term memory limitation: The underlying RNNs struggle to capture and retain information from distant time steps, making it hard to handle long sequences and long-term dependencies.
  3. Exposure Bias: During training, Seq2Seq models are often trained using a technique called teacher forcing, where the decoder is given the ground-truth output tokens as inputs at each time step. During inference, however, the model must generate output tokens based on its own predictions. This discrepancy between training and inference leads to exposure bias, causing the model to perform sub-optimally at inference time (a toy illustration follows this list).
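To make point 3 concrete, here is a toy sketch of the two feeding strategies. The decoder, sizes, and token ids are all made-up placeholders; only the line that chooses the next input token matters.

```python
import torch
import torch.nn as nn

# Made-up sizes and token ids, purely for illustration.
VOCAB, EMB, HID, SOS = 50, 16, 32, 1
embed = nn.Embedding(VOCAB, EMB)
rnn = nn.GRU(EMB, HID, batch_first=True)
out = nn.Linear(HID, VOCAB)

def decode_step(token, hidden):
    output, hidden = rnn(embed(token), hidden)
    return out(output), hidden

tgt = torch.randint(0, VOCAB, (1, 6))         # made-up ground-truth target sequence
for use_teacher_forcing in (True, False):
    token, hidden = torch.tensor([[SOS]]), torch.zeros(1, 1, HID)
    for t in range(tgt.size(1)):
        logits, hidden = decode_step(token, hidden)
        # Training usually feeds the reference token (teacher forcing);
        # inference has to feed back the model's own prediction.
        token = tgt[:, t:t + 1] if use_teacher_forcing else logits.argmax(dim=-1)
```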

Closing Remarks

Taking all of the above into consideration, recurrent neural networks made a big change in sequence modeling, and LSTMs and GRUs were introduced to overcome their shortcomings. But the most revolutionary change in machine learning came with the advent of the attention mechanism. In the next blog post, I will cover the attention mechanism and Bahdanau Attention in detail.

Thank you for reading!

Follow me on LinkedIn!
