Building Multi-Layer LSTM from Scratch

Yash Bhaskar
2 min read · Jun 25, 2024


In this article, we build a multi-layer LSTM from scratch for tasks like the ones discussed in the RNN article. Once you have these building blocks, you can adapt them to whatever model you want.

First, here is the code for a single LSTM cell in PyTorch:
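(The class name `LSTMCellScratch`, the per-gate weight layout, and the initialization details below are a minimal sketch of such a cell, not necessarily the author's exact code.)

```python
import math
import torch
import torch.nn as nn

class LSTMCellScratch(nn.Module):
    """A single LSTM cell with explicit per-gate weights."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # One weight matrix per gate, mapping [input, previous hidden] to hidden_size.
        std = 1.0 / math.sqrt(hidden_size)  # Xavier-style std, discussed below

        def weight():
            return nn.Parameter(
                torch.empty(input_size + hidden_size, hidden_size).normal_(0.0, std)
            )

        self.W_i, self.W_f, self.W_g, self.W_o = weight(), weight(), weight(), weight()
        self.b_i = nn.Parameter(torch.zeros(hidden_size))
        self.b_f = nn.Parameter(torch.zeros(hidden_size))
        self.b_g = nn.Parameter(torch.zeros(hidden_size))
        self.b_o = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, x, state):
        h_prev, c_prev = state                       # previous hidden and cell states
        z = torch.cat([x, h_prev], dim=1)            # concatenate input with hidden
        i = torch.sigmoid(z @ self.W_i + self.b_i)   # input gate
        f = torch.sigmoid(z @ self.W_f + self.b_f)   # forget gate
        g = torch.tanh(z @ self.W_g + self.b_g)      # candidate cell state
        o = torch.sigmoid(z @ self.W_o + self.b_o)   # output gate
        c = f * c_prev + i * g                       # point-wise update of the cell state
        h = o * torch.tanh(c)                        # new hidden state
        return h, c
```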

This choice of standard deviation (std) for weight initialization is not arbitrary: it follows the scheme known as "Xavier initialization", and one of the main reasons behind it is variance preservation.

Variance preservation: The goal is to maintain the variance of activations and gradients as they flow through the network. This helps prevent the vanishing or exploding gradient problem.

This initialization helps ensure that the input to each neuron has unit variance, which is particularly important at the start of training. It helps the network converge faster and avoid issues with very large or very small activations.
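As a small illustration of this effect (assuming unit-variance inputs and a hypothetical fan-in of 512), scaling random weights by 1/sqrt(fan_in) keeps the output variance close to the input variance:

```python
import torch

torch.manual_seed(0)
fan_in, fan_out, batch = 512, 512, 10_000
x = torch.randn(batch, fan_in)                           # unit-variance inputs

w_naive = torch.randn(fan_in, fan_out)                   # std = 1
w_xavier = torch.randn(fan_in, fan_out) / fan_in ** 0.5  # std = 1/sqrt(fan_in)

print((x @ w_naive).var().item())    # roughly 512: variance multiplied by fan_in
print((x @ w_xavier).var().item())   # roughly 1.0: variance preserved
```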

Need for non-linearity

The non-linearity is crucial because:

  1. It allows the LSTM to learn and represent complex, non-linear relationships in the data.
  2. Without it, the LSTM would only be capable of learning linear combinations of its inputs (see the short check after this list).
  3. The bounded outputs of sigmoid and tanh keep the gate values and cell updates in a fixed range, which helps keep activations and gradients from blowing up during training.
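A short check of point 2, using plain tensors (an illustrative example, not from the original post): two stacked linear maps without an activation collapse into a single matrix, while inserting a tanh breaks that equivalence.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)
W1, W2 = torch.randn(8, 8), torch.randn(8, 8)

# Without an activation, two stacked linear layers collapse into one matrix.
stacked = (x @ W1) @ W2
collapsed = x @ (W1 @ W2)
print(torch.allclose(stacked, collapsed, atol=1e-5))     # True

# With tanh in between, the mapping can no longer be written as a single matrix.
nonlinear = torch.tanh(x @ W1) @ W2
print(torch.allclose(nonlinear, collapsed, atol=1e-5))   # False
```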

Here is a flowchart showing how data passes through a multi-layer LSTM.

Simple Representation
How Data passes through Multi-layer LSTMs

Note: point-wise multiplication simply multiplies the corresponding elements of two tensors. It is not matrix multiplication.
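To make the flowchart concrete, here is a minimal sketch of the stacking logic, reusing the `LSTMCellScratch` cell defined above (the class name and loop structure are illustrative assumptions, not the author's exact code):

```python
class MultiLayerLSTM(nn.Module):
    """Stacks LSTM cells: the hidden states of layer k become the inputs of layer k+1."""

    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        self.hidden_size = hidden_size
        sizes = [input_size] + [hidden_size] * (num_layers - 1)
        self.cells = nn.ModuleList([LSTMCellScratch(s, hidden_size) for s in sizes])

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        batch, seq_len, _ = x.shape
        states = [(x.new_zeros(batch, self.hidden_size),
                   x.new_zeros(batch, self.hidden_size)) for _ in self.cells]
        outputs = []
        for t in range(seq_len):
            inp = x[:, t, :]                     # current time step enters layer 0
            for k, cell in enumerate(self.cells):
                h, c = cell(inp, states[k])      # run layer k's cell for this step
                states[k] = (h, c)
                inp = h                          # hidden state feeds the next layer up
            outputs.append(inp)                  # keep the top layer's hidden state
        return torch.stack(outputs, dim=1), states
```

For example, `MultiLayerLSTM(input_size=16, hidden_size=32, num_layers=2)` applied to a `(batch, seq_len, 16)` tensor would return a `(batch, seq_len, 32)` output along with the final `(h, c)` pair of every layer.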

I will publish similar articles for GRUs and Transformers soon…

Update:
RNN Article — Link
GRU Article — Link

Thank you for reading; I hope this article was useful.

Connect with me:
LinkedIn | GitHub | Medium | Email
