Building a Neural Network Zoo From Scratch: The Long Short-Term Memory Network

Gavin Hull
Dec 18, 2022


Visualization of the Long Short-Term Memory Network from Asimov Institute.

Long Short-Term Memory (LSTM) networks are one of the most well-known types of recurrent neural networks. Originally introduced by Sepp Hochreiter and Jürgen Schmidhuber to learn long-term dependencies in data by tackling the vanishing gradient problem, LSTMs have since been instrumental to the development of many important systems, key among them being Apple’s Siri.

Vanilla RNNs vs. LSTMs

There are two main differences between Vanilla RNNs and LSTMs: the number of gates and the number of memory cells.

Vanilla RNN Diagram vs. LSTM Diagram

As explained in my previous article, Vanilla RNNs have one memory cell, called a hidden state (denoted HS in the image above). The hidden state is used to retain information from the past and is what makes an RNN different from any other neural network. As LSTMs are also a type of recurrent neural network, they too have a hidden state, but they have another memory cell called the cell state (denoted CS in the image above). The purpose of this new type of memory is to learn long-term patterns, whereas the hidden state usually retains more short-term information. To better understand why this is, it’s important to first understand the new gates that the LSTM introduces.

The Forget Gate

The purpose of the forget gate is exactly as the name implies: to forget. Like our own memory, the LSTM network can only remember so much. To decide which information to remember and which to forget, the forget gate uses the sigmoid function.

Forget Gate Definition

The forget gate is defined above. It is the sigmoid of the weight matrix (W_f) times the concatenation of the previous hidden state and the current input (Z_t), plus the bias (b_f).
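Written out (my reconstruction of the formula in the figure, where Z_t = [h_{t-1}, x_t] is the concatenation of the previous hidden state and the current input):

f_t = \sigma(W_f Z_t + b_f)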

If you remember, the sigmoid function squeezes all numbers into the range between 0 and 1. This is why, when the forget gate is multiplied by the cell state, it “chooses” which information to remember: any data in the cell state that is not considered worth keeping will be multiplied by a number close to zero, and any memory considered important will be multiplied by a number close to one.

For those with an understanding of linear algebra, you may notice that this is not a standard matrix multiplication, and you would be right: this is a Hadamard (element-wise) product. This is not particularly important to the network, but it is nice to note that each individual “memory” in the cell state gets its own level of importance this way.
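As a quick illustration of the difference (this toy snippet is mine, not part of the network’s code), NumPy’s * operator performs exactly this element-wise product:

```python
import numpy as np

f = np.array([[0.9], [0.1]])        # forget gate output (one value per "memory")
c_prev = np.array([[2.0], [-3.0]])  # previous cell state

# Hadamard product: [[1.8], [-0.3]]; each memory is scaled individually.
print(f * c_prev)
```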

The Candidate Gate

The candidate gate is an activation layer applied to the new information shown to the network.

Candidate Gate Definition

As you can see above, the candidate gate is defined similarly to the forget gate, except it has its own weights and biases (W_c and b_c) and has the hyperbolic tangent function applied instead of sigmoid.
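In symbols (again a reconstruction of the figure):

\tilde{c}_t = \tanh(W_c Z_t + b_c)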

The Input Gate

Like the forget gate, the input gate decides how much information to retain between time steps of the network. The difference is that the input gate decides which information coming into the network to remember as opposed to how much information from memory to remember.

Input Gate Definition

As shown above, the input gate is defined identically to the forget gate, except, of course, it has its own weights and biases (W_i and b_i). Therefore, the input gate has the same effect when it is multiplied by the candidate gate as when the forget gate is multiplied by the cell state: it “chooses” which information to remember by multiplying important things by numbers closest to one and unimportant things by numbers closest to zero.
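Reconstructing the figure, the input gate, and the cell state update it feeds into (listed again later among the per-time-step definitions), are:

i_t = \sigma(W_i Z_t + b_i)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t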

The Output Gate

Contrary to its name, the output gate is not the last gate before the output of the network. In fact, it is the gate used to determine what information is remembered between hidden states.

Output Gate Definition

The output gate is defined above. It is the sigmoid of the sum of the bias (b_o) and the product of the weights and the concatenated input (W_o and Z_t, respectively).
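In symbols (a reconstruction of the figure, together with the hidden state update the output gate controls, which is defined formally in the math section below):

o_t = \sigma(W_o Z_t + b_o)
h_t = o_t \odot \tanh(c_t)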

The Final Gate

Unlike the output gate, the final gate is as its name suggests: the final gate. Before the network returns its final answer, it goes through the final gate, which is defined as follows:

The final gate definition.

As there are many activation functions inside the LSTM network, it is not necessary to use an activation function on the final gate. However, applying an activation function will not hurt the network as long as you remember to add the activation function into the backpropagation calculations.
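Reconstructing the figure, and assuming the final gate acts on the hidden state with no activation applied (as the code later does):

\hat{y}_t = W_y h_t + b_y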

A Brief Recap

The cell state is the network’s long-term memory, while the hidden state is short-term. The forget gate, input gate, and output gate are all used to choose which information is important enough to remember, which is done using the sigmoid function. The candidate gate holds the information from the input and the previous hidden state, which is passed on to future versions of the network, and the final gate is applied to the network’s outputs.

The Math

If you’ve been following this series since the beginning, you know what I’m about to say: the easiest way to understand the math behind LSTMs is by using computational graphs (CGs). However, the LSTM is quite a bit larger than the networks we’ve looked at previously, which means something has to change.

Diagram showing the construction of a computational graph.

If you’re unfamiliar with CGs, I suggest you take a look at my previous articles, where they’re explained in more depth. For now, I’ll give you a quick refresher. Each node (or circle) in the graph represents an operation, and each edge (or line) represents a formula. The forward propagation can be read from left to right, in green above each edge. The backpropagation can be read from right to left, in red underneath each edge, and is calculated as follows: the first error is represented by an e, and each subsequent error is the error to its right times the derivative of the function to its right with respect to the function above.

If we were to construct a full CG of the LSTM, it would be so large that the actual calculations would be illegible. For that reason, there’s a simple way of downsizing the CG: divide and conquer.

At each time step t of the network, the following calculations are made in order; I’ve written them out in full below the list.

Forget Gate Definition
Input Gate Definition
Candidate Gate Definition
Output Gate Definition
Cell State Definition
Hidden State Definition
Final Gate Definition
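Assuming the standard LSTM formulation this article has been building up, those seven definitions are:

f_t = \sigma(W_f Z_t + b_f)
i_t = \sigma(W_i Z_t + b_i)
\tilde{c}_t = \tanh(W_c Z_t + b_c)
o_t = \sigma(W_o Z_t + b_o)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)
\hat{y}_t = W_y h_t + b_y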

As always with backpropagation, we start from the back and work our way to the beginning.

Final Gate CG.

So we have found the errors for the weights and biases of the final gate to be the following.

Error of the Final Gate Weights.
Error of the Final Gate Biases.
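In my notation (a reconstruction of the two figures, writing e for the output error at time step t):

\partial W_y = e \, h_t^T
\partial b_y = e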

Next we can build a CG of the Hidden State definition.

Hidden State CG.

So we have found the error of the hidden state… but what is e? For the final gate, e will be the error we calculate between the network’s outputs and the labels, but that won’t be the same for the hidden state, will it? This e will be different, but we’ve actually already calculated it: we found the error for the hidden state in the CG for the final gate.

Error of the Hidden State.

This means that we can substitute this expression for e in the errors inside the hidden state CG.

Error of the Output Gate.
Error of the Cell State.
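Reconstructing these two errors in my notation (with \partial h_t = W_y^T e, the hidden state error found above):

\partial o_t = \partial h_t \odot \tanh(c_t)
\partial c_t = \partial h_t \odot o_t \odot (1 - \tanh^2(c_t))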

After calculating the errors inside the final gate and the hidden state, you should now be able to calculate the errors throughout the LSTM. In an effort to keep this article as concise as possible, I will not be explaining each of the individual gates, as they are all done the same way. If you are unsure whether your calculations are correct, continue on to the coding portion of this article and check your math against the code.

Finally, the Code

As is standard with this series, the only package used for the math behind this network is NumPy. TQDM has also been imported to add a progress bar to the training phase. If you’d prefer to keep this a NumPy-only project, you can easily remove TQDM.
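If you’re following along, the imports amount to just two lines:

```python
import numpy as np
from tqdm import tqdm  # optional: progress bar for the training loop
```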

LSTMs are now common in industry and are used for lots of different things. As they are recurrent neural networks, they are meant for sequential data like time series, so for this article we’ll be predicting Shakespearean text; more specifically, Hamlet.

char_to_idx and idx_to_char will be used later to implement one-hot encoding. As we are working with text, one-hot encoding will give us a way to translate our Shakespearean English into numbers for the network. train_X and train_y will be the training data and labels. For every character in train_X there is exactly one character in train_y. That means the network will be trying to predict the next character in a sentence based solely on the previous character. Sound impossible? That’s the magic of recurrent neural networks.
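A minimal sketch of this setup might look like the following; the file name hamlet.txt is a stand-in for whatever excerpt of the play you train on:

```python
# Load the text and build the character vocabulary.
data = open("hamlet.txt").read().lower()

chars = sorted(set(data))
char_size = len(chars)

char_to_idx = {c: i for i, c in enumerate(chars)}  # character -> index
idx_to_char = {i: c for i, c in enumerate(chars)}  # index -> character

# Shift by one character: each input's label is the character that follows it.
train_X, train_y = data[:-1], data[1:]
```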

Next we define a couple of functions that will make our lives easier down the road, the first being the oneHotEncode function. If you’re unfamiliar with this process, I’d recommend this article. The other function we’ll be defining performs Xavier initialization. Recurrent neural networks are very sensitive to the initialization used, so choosing the right one is important. It’s probably enough to know that Xavier initialization is a good choice; however, you can take a look at the math yourself here.
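Here is one way those two helpers might be written (initWeights is my name for the Xavier initializer; the article only pins down oneHotEncode):

```python
def oneHotEncode(char):
    # Column vector with a single 1 at the character's index.
    output = np.zeros((char_size, 1))
    output[char_to_idx[char]] = 1
    return output

def initWeights(input_size, output_size):
    # Xavier (Glorot) uniform initialization: scale by sqrt(6 / (fan_in + fan_out)).
    return np.random.uniform(-1, 1, (output_size, input_size)) * np.sqrt(
        6 / (input_size + output_size))
```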

There are three activation functions used in LSTMs: sigmoid, tanh and softmax. If you’re familiar with these, you might notice that the derivatives are incorrectly defined, but there’s a reason for this.

Tanh derivative definition.
Sigmoid derivative definition.
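(In standard form, these are \tanh'(x) = 1 - \tanh^2(x) and \sigma'(x) = \sigma(x)(1 - \sigma(x)).)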

When the derivative functions are called later in the code, they will be called on variables that have already had tanh or sigmoid applied. For this reason, the following functions are defined.

New tanh derivative definition.
New sigmoid derivative definition.
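In symbols, the new definitions are \tanh'(u) = 1 - u^2 and \sigma'(u) = u(1 - u), where u is the already-activated value. Concretely, the three activations might be written like this:

```python
def sigmoid(x, derivative=False):
    if derivative:
        return x * (1 - x)  # assumes x is already sigmoid(x)
    return 1 / (1 + np.exp(-x))

def tanh(x, derivative=False):
    if derivative:
        return 1 - x ** 2  # assumes x is already tanh(x)
    return np.tanh(x)

def softmax(x):
    # Numerically stable softmax; turns the final gate's output into probabilities.
    exp = np.exp(x - np.max(x))
    return exp / np.sum(exp)
```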

The LSTM class takes 5 inputs: input_size, hidden_size, output_size, num_epochs, and learning_rate. num_epochs will determine the number of epochs (or training phases) the network goes through, and learning_rate will determine how fast the network adjusts to new information. Each of the network’s weights and biases is initialized to a variable starting with ‘w’ or ‘b’ respectively, and ending with a letter corresponding to ‘forget’, ‘input’, ‘candidate’, ‘output’ or ‘y’ for ‘final’.
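A sketch of the constructor under those naming conventions (initializing the biases to zero is my choice; the original gist may differ):

```python
class LSTM:
    def __init__(self, input_size, hidden_size, output_size, num_epochs, learning_rate):
        self.hidden_size = hidden_size
        self.num_epochs = num_epochs
        self.learning_rate = learning_rate

        # One weight matrix and bias per gate, acting on the concatenated input Z_t.
        self.wf, self.bf = initWeights(input_size, hidden_size), np.zeros((hidden_size, 1))  # forget
        self.wi, self.bi = initWeights(input_size, hidden_size), np.zeros((hidden_size, 1))  # input
        self.wc, self.bc = initWeights(input_size, hidden_size), np.zeros((hidden_size, 1))  # candidate
        self.wo, self.bo = initWeights(input_size, hidden_size), np.zeros((hidden_size, 1))  # output

        # The final gate maps the hidden state to the output.
        self.wy, self.by = initWeights(hidden_size, output_size), np.zeros((output_size, 1))
```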

To prevent the network from using information it shouldn’t have, the network’s memory must be wiped before each forward propagation. This is analogous to taking flashcards from a student before asking a question: you remove the written answer but not their preexisting knowledge. In the same way, wiping the network’s memory prevents it from cheating but not from learning. This memory wiping is done in the reset function. At the same time, we also initialize the hidden_states and cell_states dictionaries with blank entries. The network will still need hidden and cell states when it first starts, even though there is no information to recall from the past. Therefore, we initialize the two memory cells with zeros.
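Continuing the class sketch, reset might look like this (the gate dictionary names beyond hidden_states and cell_states are my guesses):

```python
    def reset(self):
        # Wipe everything remembered from the previous pass, but keep the weights.
        self.concat_inputs = {}

        # Entry -1 gives the network zeroed memory to start from.
        self.hidden_states = {-1: np.zeros((self.hidden_size, 1))}
        self.cell_states = {-1: np.zeros((self.hidden_size, 1))}

        self.forget_gates, self.input_gates = {}, {}
        self.candidate_gates, self.output_gates = {}, {}
```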

The next function to define is the forward function. This will reset the memory of the network and perform each of the calculations described earlier to obtain outputs. Every gate is stored for later use in its respective dictionary.
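Here is a sketch of forward, following the order of calculations from the math section:

```python
    def forward(self, inputs):
        self.reset()

        outputs = []
        for t in range(len(inputs)):
            # Z_t: previous hidden state concatenated with the current input.
            self.concat_inputs[t] = np.concatenate((self.hidden_states[t - 1], inputs[t]))

            self.forget_gates[t] = sigmoid(self.wf @ self.concat_inputs[t] + self.bf)
            self.input_gates[t] = sigmoid(self.wi @ self.concat_inputs[t] + self.bi)
            self.candidate_gates[t] = tanh(self.wc @ self.concat_inputs[t] + self.bc)
            self.output_gates[t] = sigmoid(self.wo @ self.concat_inputs[t] + self.bo)

            # Long-term memory, then short-term memory.
            self.cell_states[t] = (self.forget_gates[t] * self.cell_states[t - 1]
                                   + self.input_gates[t] * self.candidate_gates[t])
            self.hidden_states[t] = self.output_gates[t] * tanh(self.cell_states[t])

            # The final gate, with no activation applied.
            outputs.append(self.wy @ self.hidden_states[t] + self.by)

        return outputs
```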

Now for the backpropagation function, backward. As was explained in the previous article, recurrent neural networks use backpropagation through time. This means the error for each weight and bias is the sum of its error over time.

After initializing each error to zero, we iterate backwards through each time step, adding the errors to our variables. If you’ve done the math correctly in the previous section, this should all make sense. The one thing that may stand out is the cell state error. Remember that our tanh derivative function assumes that the input has already had tanh applied to it. This is not the case for cell_states[q], so tanh has to be applied to it before the derivative can be found.

After calculating each of the errors, we clip them with np.clip. This means any values larger than 1 are set equal to 1, and any values less than -1 are set equal to -1. LSTMs generally don’t suffer from the vanishing gradient problem, but they’re still very susceptible to the exploding gradient problem, which is why we do this.

Finally, we can update the weights and biases of the network using the errors found and the learning rate defined.
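Putting the last three paragraphs together, backward might look like this. This is a sketch of backpropagation through time under the definitions above; variable names beyond those discussed in the article are mine:

```python
    def backward(self, errors, inputs):
        # One accumulator per weight and bias: BPTT sums each error over time.
        d_wf, d_bf = np.zeros_like(self.wf), np.zeros_like(self.bf)
        d_wi, d_bi = np.zeros_like(self.wi), np.zeros_like(self.bi)
        d_wc, d_bc = np.zeros_like(self.wc), np.zeros_like(self.bc)
        d_wo, d_bo = np.zeros_like(self.wo), np.zeros_like(self.bo)
        d_wy, d_by = np.zeros_like(self.wy), np.zeros_like(self.by)

        dh_next = np.zeros_like(self.hidden_states[-1])
        dc_next = np.zeros_like(self.cell_states[-1])

        for q in reversed(range(len(inputs))):
            error = errors[q]

            # Final gate errors.
            d_wy += error @ self.hidden_states[q].T
            d_by += error

            # Hidden state error, plus the error carried back from time step q + 1.
            d_hs = self.wy.T @ error + dh_next

            # Output gate error.
            d_o = tanh(self.cell_states[q]) * d_hs * sigmoid(self.output_gates[q], derivative=True)
            d_wo += d_o @ self.concat_inputs[q].T
            d_bo += d_o

            # Cell state error: cell_states[q] has NOT had tanh applied yet,
            # so tanh is applied before the derivative is taken.
            d_cs = (tanh(tanh(self.cell_states[q]), derivative=True)
                    * self.output_gates[q] * d_hs + dc_next)

            # Forget gate error.
            d_f = d_cs * self.cell_states[q - 1] * sigmoid(self.forget_gates[q], derivative=True)
            d_wf += d_f @ self.concat_inputs[q].T
            d_bf += d_f

            # Input gate error.
            d_i = d_cs * self.candidate_gates[q] * sigmoid(self.input_gates[q], derivative=True)
            d_wi += d_i @ self.concat_inputs[q].T
            d_bi += d_i

            # Candidate gate error.
            d_c = d_cs * self.input_gates[q] * tanh(self.candidate_gates[q], derivative=True)
            d_wc += d_c @ self.concat_inputs[q].T
            d_bc += d_c

            # Errors passed to the previous time step.
            d_z = self.wf.T @ d_f + self.wi.T @ d_i + self.wc.T @ d_c + self.wo.T @ d_o
            dh_next = d_z[:self.hidden_size, :]
            dc_next = self.forget_gates[q] * d_cs

        # Clip every error to [-1, 1] to tame exploding gradients, then update.
        # Adding (rather than subtracting) works because the error is defined
        # as label minus prediction.
        for param, grad in ((self.wf, d_wf), (self.bf, d_bf), (self.wi, d_wi),
                            (self.bi, d_bi), (self.wc, d_wc), (self.bc, d_bc),
                            (self.wo, d_wo), (self.bo, d_bo), (self.wy, d_wy),
                            (self.by, d_by)):
            np.clip(grad, -1, 1, out=grad)
            param += grad * self.learning_rate
```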

The train and test functions are standard. train one-hot encodes each input and uses forward propagation to obtain the probability of each output being correct. The error is calculated so that the prediction plus the error is equal to the desired output, or label. The test function forward propagates the data to obtain the network’s probabilities of each character being correct. Then np.random.choice is used to randomly choose a character using the probabilities given by the network. It then prints the ground truth and the predicted text, along with a percent accuracy.
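A sketch of both functions, consistent with that description (the exact bookkeeping in the original gist may differ):

```python
    def train(self, inputs, labels):
        one_hot_inputs = [oneHotEncode(c) for c in inputs]

        for _ in tqdm(range(self.num_epochs)):
            predictions = self.forward(one_hot_inputs)

            errors = []
            for q in range(len(predictions)):
                # error = label - prediction, so prediction + error = label.
                errors.append(-softmax(predictions[q]))
                errors[-1][char_to_idx[labels[q]]] += 1

            self.backward(errors, self.concat_inputs)

    def test(self, inputs, labels):
        accuracy = 0
        probabilities = self.forward([oneHotEncode(c) for c in inputs])

        output = ""
        for q in range(len(labels)):
            # Sample the next character from the network's probability distribution.
            prediction = idx_to_char[np.random.choice(
                char_size, p=softmax(probabilities[q].reshape(-1)))]
            output += prediction

            if prediction == labels[q]:
                accuracy += 1

        print(f"Ground truth:\n{labels}\n\nPredictions:\n{output}\n")
        print(f"Accuracy: {round(accuracy * 100 / len(inputs), 2)}%")
```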

At last, we can initialize our network. Note that input_size is equal to char_size + hidden_size. This is because our input is the vector we get after concatenating the one-hot encoded character and the hidden state; therefore its size is the sum of the two.
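For example (the hyperparameter values here are illustrative, not tuned):

```python
hidden_size = 64

network = LSTM(
    input_size=char_size + hidden_size,  # concatenated input and hidden state
    hidden_size=hidden_size,
    output_size=char_size,
    num_epochs=1_000,
    learning_rate=0.05,
)

network.train(train_X, train_y)
network.test(train_X, train_y)
```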

It is important to note that we are training and testing with the same data. LSTMs are not good at generalizing, which means it would be hard to train one to predict text it had never seen before. As we built this one from scratch, there are some constraints, the biggest of which is time. NumPy was not meant for machine learning, which means this network is very inefficient. It’s for this reason that ML packages like TensorFlow and PyTorch exist, so if you’re planning on writing neural networks from scratch for deployment… I wouldn’t recommend it. However, if you’re building networks from scratch to learn, I think that’s the best way to do it; maybe the only way.

That concludes the fourth article in my “Building a Neural Network Zoo From Scratch” series. If you’ve been following since the beginning, I hope you’re enjoying the content. If you’re new, have a look at some of my previous articles, where I explained Perceptrons, Multilayer Perceptrons and Vanilla RNNs. If you enjoyed this article, I’d appreciate it if you shared it with your friends and colleagues, and as always, the full code is available on my GitHub.

Thanks again to Emily Hull for her fantastic editing.
