Long Short-Term Memory Decoded

Tarik Irshad
Published in Analytics Vidhya
12 min read · Apr 14, 2020

Unless you’ve been living under a rock, you’ve probably heard of Artificial Intelligence and how it will be taking over the world in the near future. But how exactly can it benefit our daily lives?

If you’re unfamiliar with AI and Machine Learning, take a look at my previous article for some fundamental background knowledge:

https://www.javatpoint.com/subsets-of-ai

Looking at the diagram of AI branches above, there is a path that leads to Deep Learning via Machine Learning. Within this subcategory of a subcategory of AI (yes, the two subcategories are intentional), there is an even more specific field: Long Short-Term Memory (LSTM).

LSTM makes up a very small portion of AI. Before we get there, let’s take a look at the broader subfields of AI that lead to it!

https://blog.knoldus.com/data-processing-and-using-ml-supervised-classification-algorithm-and-find-accuracy/
  1. Machine Learning is the branch of AI based on the idea that machines can learn from the data they are given. They can then identify patterns and make decisions based on that data with minimal human intervention. Put simply, machine learning’s goal is to get machines to act without being explicitly programmed.
  2. Deep Learning is the branch of machine learning where self-teaching systems use existing data to make predictions about new data through patterns found by algorithms. This process is completed by artificial neural networks that mimic the neurons in the human brain. It looks like:
https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6

Neural networks have three layers: the input layer, the hidden layer, and the output layer.

  1. The Input Layer brings the initial data into the system to be processed at later layers.
  2. The Hidden Layer is where all the “deep learning” actually occurs, hidden in between the input layer and the output layer. It performs computations on the weighted inputs to produce a net input, to which a non-linear activation function is then applied to produce the final output. It is important to note that in order for it to be deep learning, there must be more than one hidden layer.
  3. The Output Layer simply produces the results for the given input. (All three layers are sketched in the short example after this list.)
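To make the three layers concrete, here is a minimal sketch of a single forward pass in Python with NumPy. The toy dimensions, random weights, and tanh activation are placeholder assumptions for illustration only, not details from the article:

import numpy as np

# Toy dimensions chosen purely for illustration
n_inputs, n_hidden, n_outputs = 3, 4, 2

x = np.random.randn(n_inputs)              # input layer: the initial data

W1 = np.random.randn(n_hidden, n_inputs)   # weights into the hidden layer
b1 = np.zeros(n_hidden)
hidden = np.tanh(W1 @ x + b1)              # hidden layer: weighted inputs + non-linear activation

W2 = np.random.randn(n_outputs, n_hidden)  # weights into the output layer
b2 = np.zeros(n_outputs)
output = W2 @ hidden + b2                  # output layer: the result for the given input
print(output)

A real deep network would stack several hidden layers and learn the weights from data rather than leaving them random.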

However, when trying to use traditional neural networks, you may encounter a critical problem: they cannot use information learned from earlier inputs to inform their later decisions. Hence, traditional neural networks are not equipped to handle sequence data.

This is why we have recurrent neural networks!

Recurrent Neural Networks (RNNs)

RNNs are a type of neural network that contains loops, allowing previous information to persist, just like in the human brain.

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

The diagram above shows an RNN where the hidden state (A) takes the input (xt) and produces an output value (ht). The difference between an RNN and a traditional neural network is the loop attached to the hidden state.

Grasping the concept of the loop might be hard from this diagram alone, so let’s unroll the loop:

Adding the loop simply turns a single feedforward neural network into a sequence of neural networks, each of which can access information from the previous ones. Essentially, the information gathered in the hidden state of one network gets passed forward to the next, which can then use it when determining the output value it should produce.
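As a rough illustration of the unrolled loop, here is a short NumPy sketch; the sequence length, dimensions, and random weights are made up for the example:

import numpy as np

n_inputs, n_hidden = 3, 4
W_xh = np.random.randn(n_hidden, n_inputs)  # input-to-hidden weights
W_hh = np.random.randn(n_hidden, n_hidden)  # hidden-to-hidden weights (the loop)
b_h = np.zeros(n_hidden)

sequence = [np.random.randn(n_inputs) for _ in range(5)]  # five made-up timesteps

h = np.zeros(n_hidden)        # hidden state starts out empty
for x_t in sequence:
    # the previous hidden state h is fed back in, so information from
    # earlier timesteps influences every later step
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
print(h)                      # the final hidden state summarizes the whole sequence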

https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9

Another way to look at an RNN is to think of the loop as a road allowing information to flow from one hidden state to the next. This is shown in the diagram above, where the moving blue circle is the prior information.

As previously alluded to, RNNs are helpful when dealing with sequential data due to their loop. Because of this, there are a variety of useful RNN applications ranging from language modelling to image captioning. If you want to learn more about using RNNs, visit Andrej Karpathy’s blog post:

However, RNNs are not perfect either. They, too, face a significant problem: short-term memory. Information from early networks in the sequence does not persist and cannot be used by networks much later in the sequence. This is due to the vanishing gradient problem.

Backpropagation and the Vanishing Gradient Problem

To understand the vanishing gradient problem, you must first understand backpropagation, the algorithm used to train RNNs. Backpropagation is an example of supervised learning, where the RNN predicts the output of already labelled data points. A loss function then compares the predictions to the correct outputs and produces an error value. Finally, the RNN uses the error value to backpropagate, calculating the gradient for each node in the network.

The gradient is the value that allows the RNN to learn by adjusting its internal weights. The gradient shrinks as it backpropagates because each gradient is calculated with respect to the gradient of the previous node in the backward pass. If that previous gradient is small, the current gradient will be even smaller. The result is smaller and smaller adjustments to the internal weights, meaning nodes learn less and less as backpropagation proceeds, and the earliest nodes barely learn at all due to the vanishing gradient.
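A quick back-of-the-envelope sketch shows how fast the gradient can vanish; the local gradient value here is made up purely to illustrate the repeated multiplication:

# If each step contributes a local gradient of, say, 0.4, the gradient that
# reaches the earliest node is the product of all of those factors.
local_gradient = 0.4   # made-up value for illustration
gradient = 1.0
for step in range(1, 21):
    gradient *= local_gradient
    if step in (1, 5, 10, 20):
        print(f"after {step:2d} steps the gradient is about {gradient:.1e}")
# By step 20 the gradient is around 1e-8, so the earliest nodes barely learn.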

The vanishing gradient means RNNs cannot handle long-term dependencies. The example below illustrates why this is a problem:

Say, for example, we wanted to predict the last word in the text: “I grew up in France… I speak fluent French.” An RNN would realize the word was a language because of the recent prior information, but what language? The RNN would need more context to connect France with the word being French.

Thankfully, long short-term memory doesn’t have this problem!

Long Short-Term Memory (LSTM)

LSTM networks (LSTMs) are a special type of RNN that mitigates the vanishing gradient problem by manipulating a memory state. Their ability to do so lies in their architecture.

Reexamining RNNs, the architecture of a single cell is shown below:

The previous hidden state (ht-1) is combined (concatenated) with the input (xt) to form a vector [ht-1, xt]. The vector contains information on the previous inputs and the current input. The vector is then passed through a tanh activation function (producing values from -1 to 1) to form the new hidden state (ht). The new hidden state then moves on to the next cell to complete the same process.
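In code, one step of this basic RNN cell might look like the following sketch; the dimensions and random weights are placeholders for illustration:

import numpy as np

n_inputs, n_hidden = 3, 4
h_prev = np.zeros(n_hidden)             # previous hidden state h_{t-1}
x_t = np.random.randn(n_inputs)         # current input x_t

concat = np.concatenate([h_prev, x_t])  # the combined vector [h_{t-1}, x_t]
W = np.random.randn(n_hidden, n_hidden + n_inputs)
b = np.zeros(n_hidden)

h_t = np.tanh(W @ concat + b)           # tanh squashes everything to (-1, 1)
# h_t is passed to the next cell, which repeats the same process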

In contrast, LSTMs have a much more complex architecture:

Core Concept

What allows LSTMs to bypass short-term memory is known as the cell state. This is the horizontal line at the top of the diagram running through the entire cell. It can be thought of as a highway transporting relevant information straight down the entire chain. This means information from the first cell can reach all the way to the end of the chain.

LSTMs have the ability to add information to or remove information from the cell state. This operation occurs through, and is regulated by, gates: small neural networks that decide which information belongs on the cell state. Gates are composed of a sigmoid function layer and a pointwise multiplication layer.

The sigmoid function layer produces a number from 0 to 1 which describes how much of each component should be let through. A value of 0 corresponds to letting nothing through while a value of 1 corresponds to letting everything through.

A single cell in an LSTM contains three types of gates: the forget gate, input gate, and output gate.

Forget Gate

Going back to the architecture of an LSTM, the forget gate is the first step, since it determines what information from the previous cell state (Ct-1) will be thrown away or kept. The previous hidden state (ht-1) is combined with the input (xt) to form the vector [ht-1, xt]. This vector then goes through the sigmoid function layer which, again, produces a value between 0 and 1 for each number in the previous cell state, where 0 means throw that value away and 1 means keep it. Below, the process is summarized as an equation:

ft = sigmoid(Wf * [ht-1, xt] + bf)
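Translating the forget gate equation into a small NumPy sketch (toy sizes and random placeholder weights, not values from the article):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_inputs, n_hidden = 3, 4
h_prev = np.random.randn(n_hidden)      # h_{t-1}
x_t = np.random.randn(n_inputs)         # x_t
concat = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]

W_f = np.random.randn(n_hidden, n_hidden + n_inputs)  # forget-gate weights (placeholders)
b_f = np.zeros(n_hidden)

f_t = sigmoid(W_f @ concat + b_f)  # each entry is between 0 and 1:
print(f_t)                         # 0 means forget that part of C_{t-1}, 1 means keep it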

Input Gate

The vector then moves along to the next step: the input gate. The input gate decides what new information will be stored on the cell state. This happens in two parts. First, the vector goes through the sigmoid function layer to determine which values will be updated, using the aforementioned process. Second, the vector is passed through a tanh function so that the values range from -1 to 1, creating a candidate vector (C̃t). The sigmoid output is then multiplied with the tanh output in the cell state update below, so the sigmoid output determines which candidate values are actually kept. Below, the process is summarized as two equations:

it = sigmoid(Wi * [ht-1, xt] + bi)

C̃t = tanh(WC * [ht-1, xt] + bC)
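The same idea as a NumPy sketch for the input gate and candidate vector (again with placeholder sizes and random weights):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_inputs, n_hidden = 3, 4
concat = np.random.randn(n_hidden + n_inputs)  # stands in for [h_{t-1}, x_t]

W_i = np.random.randn(n_hidden, n_hidden + n_inputs)  # input-gate weights (placeholders)
b_i = np.zeros(n_hidden)
W_C = np.random.randn(n_hidden, n_hidden + n_inputs)  # candidate weights (placeholders)
b_C = np.zeros(n_hidden)

i_t = sigmoid(W_i @ concat + b_i)      # which values to update, between 0 and 1
C_tilde = np.tanh(W_C @ concat + b_C)  # candidate values, between -1 and 1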

Cell State

There is now enough information to update the old cell state to the new cell state (Ct). We first multiply the old cell state by the forget gate output (ft), which forgets the information we decided wasn’t important earlier. We then multiply the input gate output (it) by the candidate values (C̃t), scaling the values by how much we decided to update them. Finally, we add the scaled candidate values to this result to give us the new cell state. Below, the process is summarized as an equation:

Ct = ft * Ct-1 + it * C̃t
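A NumPy sketch of the update itself, using random stand-ins for the gate outputs computed above:

import numpy as np

n_hidden = 4
C_prev = np.random.randn(n_hidden)            # old cell state C_{t-1} (placeholder values)
f_t = np.random.rand(n_hidden)                # forget-gate output, between 0 and 1
i_t = np.random.rand(n_hidden)                # input-gate output, between 0 and 1
C_tilde = np.tanh(np.random.randn(n_hidden))  # candidate values, between -1 and 1

# forget what we decided to drop, then add the scaled candidate values
C_t = f_t * C_prev + i_t * C_tilde
print(C_t)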

Output Gate

Lastly, we need to decide the next hidden state. First, the vector passes through a sigmoid function layer to determine which of the updated cell state values should be output. The result is the output gate output (ot). Then, we pass the cell state through a tanh function that compresses the values between -1 and 1. Finally, we multiply these two outputs to determine the final output and the next hidden state. Below, the process is summarized as two equations:

ot = sigmoid(Wo * [ht-1, xt] + bo)

ht = ot * tanh(Ct)
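Putting all four steps together, here is a hedged sketch of one full LSTM cell step in NumPy. It follows the equations above; the dimensions and random weights are illustrative placeholders, and a real implementation in a deep learning framework would learn these weights from data:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM cell step, following the equations above."""
    concat = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ concat + b_f)       # forget gate
    i_t = sigmoid(W_i @ concat + b_i)       # input gate
    C_tilde = np.tanh(W_C @ concat + b_C)   # candidate values
    C_t = f_t * C_prev + i_t * C_tilde      # new cell state
    o_t = sigmoid(W_o @ concat + b_o)       # output gate
    h_t = o_t * np.tanh(C_t)                # new hidden state
    return h_t, C_t

# Toy usage with random placeholder weights
n_inputs, n_hidden = 3, 4
params = []
for _ in range(4):  # W and b for the forget, input, candidate, and output equations
    params += [np.random.randn(n_hidden, n_hidden + n_inputs), np.zeros(n_hidden)]

h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in [np.random.randn(n_inputs) for _ in range(5)]:
    h, C = lstm_cell_step(x_t, h, C, *params)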

In summary, LSTMs use gates to filter unimportant information out of the cell state’s storage so relevant information can be passed throughout the LSTM’s chain of cells.

Now that we know what LSTMs are and how they work, let’s take a look at some of their applications!

Applications

LSTMs have endless applications in the real world, ranging from handwriting recognition to robot control. Two of their main applications are speech recognition and time series prediction.

1. Speech Recognition

Have you ever wondered how your computer has the ability to translate your words into text or how Siri can understand what you’re saying? Well, it’s partially thanks to LSTMs.

In order to identify and process the human voice, machines use a process known as automatic speech recognition (ASR). A diagram for ASR is shown below:

https://towardsdatascience.com/hello-world-in-speech-recognition-b2f43b6c5871

After the Convolutional Neural Network (CNN) layer extracts features from the spectrogram vector inputs and creates feature vectors, the LSTM layer is necessary to process these feature vectors and provide an output to the Fully Connected Layer.
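As a rough, hedged sketch of that CNN, LSTM, and fully connected pipeline, here is what a minimal Keras model could look like; the spectrogram shape, layer sizes, and number of output classes are placeholder assumptions, not details taken from the diagram:

import tensorflow as tf

# Placeholder shapes: 100 spectrogram frames, 40 frequency features, 30 output classes
time_steps, n_features, n_classes = 100, 40, 30

model = tf.keras.Sequential([
    # CNN layer: extracts feature vectors from the spectrogram input
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu",
                           input_shape=(time_steps, n_features)),
    # LSTM layer: processes the sequence of feature vectors in context
    tf.keras.layers.LSTM(128),
    # Fully connected layer: maps the LSTM output to class scores
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()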

Because an LSTM retains context across the sequence, it can analyze the recognized words and determine whether or not they make sense within their context. This helps speech recognition systems produce proper sentences.

2. Time Series Prediction

A time series is simply a series of data points indexed in time order. In a time series, time is the independent variable. A prominent example of a time series is a stock chart, like the one shown below:

https://www.investopedia.com/ask/answers/081314/whats-most-expensive-stock-all-time.asp

Since the chart illustrates how the stock price of BRK/A has changed over a set period of time, it is a time series.

LSTMs are capable of predicting how the dependent variable, in this case the stock price, will change in the future as time goes on.

Because of the architecture we discussed earlier, LSTMs are able to look for patterns throughout the series of data points they are given. They can use patterns found even at the very beginning of the time series to predict what will happen to the stock price, which is what makes them so effective. If you are interested in looking deeper into the theory and code behind building a time series predictor, check out Marco Peixeiro’s article:
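As a quick, hedged sketch (separate from Marco Peixeiro’s article), here is one common way to set up an LSTM time series predictor in Keras. The “price” series here is randomly generated, and the window size, layer sizes, and training settings are assumptions for illustration:

import numpy as np
import tensorflow as tf

# Made-up "price" series; a real example would load actual stock data
prices = np.cumsum(np.random.randn(500)) + 100.0

# Sliding windows: use the previous 30 prices to predict the next one
window = 30
X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]
X = X[..., np.newaxis]  # LSTMs expect input shaped (samples, timesteps, features)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(window, 1)),
    tf.keras.layers.Dense(1),  # predict the next price
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)

next_price = model.predict(X[-1:])  # forecast one step ahead
print(float(next_price[0, 0]))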

Bonus: Predicting Baseball Signs

For any baseball fans out there, like myself, you may know that while runners are on the base path, a 3rd base coach will give them signs as to whether or not to steal. With the recent news about the Astros stealing signs via cameras, we now can see how much of an edge sign-stealing can give opposing teams.

But simply recording and reviewing the signs isn’t enough. In order to understand the signs, you must decipher which signals correspond to what while ignoring the bluffs.

Fortunately for any aspiring sign-stealers, Mark Rober, along with Jabrils (two YouTubers and engineers), created a machine learning app that uses LSTMs to determine the steal sign, given the coach’s signals and whether or not the player stole across multiple plays.

If you would like to look deeper into their project, the video they created is linked below:

Thank you for taking the time to read my article about LSTMs. I hope you enjoyed it and learned something new! If you are looking to do some research of your own, I would recommend checking out these sites (which I also used as sources):
