LSTM: What’s the fuss about?

LSTM learns. LSTM remembers. Be like LSTM.

Etqad Khan
Analytics Vidhya
3 min read · Jan 17, 2020


Recurrent Neural Networks (RNNs) are good at learning sequences: they carry information from one state to the next across time steps. Alas, they give good results only when the sequences are not too long, and so they suffer from a short-term memory problem.

Moreover, during backpropagation a Recurrent Neural Network can also suffer from the vanishing gradient problem. In the backpropagation algorithm, weights are updated based on the loss calculated at the end; when the gradient contributed by each step is small, the update shrinks further with every additional time step (to almost zero), and the earlier parts of the network stop learning.

Vanishing Gradient Condition in RNN
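As a rough illustration (the per-step factor below is made up, not measured from any real network), you can see how quickly a gradient collapses when it is multiplied by a value below 1 at every time step:

# Rough illustration of the vanishing gradient problem: the gradient that
# reaches the earliest time step is roughly a product of per-step factors.
# With a factor below 1, the product collapses towards zero.
per_step_factor = 0.5  # assumed magnitude of each step's local derivative
for steps in (5, 20, 50):
    gradient = per_step_factor ** steps
    print(f"{steps:>2} time steps -> gradient ~ {gradient:.1e}")
# Output:
#  5 time steps -> gradient ~ 3.1e-02
# 20 time steps -> gradient ~ 9.5e-07
# 50 time steps -> gradient ~ 8.9e-16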

These problems in RNNs are tackled by Long Short-Term Memory (LSTM), a special type of RNN that uses gates and a larger number of interactions to keep the issue at bay. LSTMs also increase a network’s capability of learning longer sequences. In simpler words, an LSTM keeps only the relevant information and forgets the unnecessary information at every step, thus enhancing its knowledge bank with minimal memory usage. We’ll see how when we look at an LSTM’s cell structure.

LSTM Cell Structure

The horizontal line running along the top is the Cell State; you can think of it as the “memory” of the network. All the relevant information moves through the cell state, and at every time step information is added to it or removed from it via gates that regulate the flow of information in an LSTM cell. These gates are the Forget Gate, the Input Gate, and the Output Gate.

Forget Gate: This gate decides which information is still relevant. It takes the hidden state from the previous step and the input at the current time step and passes them through a sigmoid function. The output lies between 0 and 1: values close to 0 mean the corresponding part of the cell state is forgotten, values close to 1 mean it is kept.
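In the notation commonly used for LSTMs (where h_{t-1} is the previous hidden state, x_t the current input, and W_f, b_f the gate’s weights and bias), the forget gate is usually written as:

f_t = \sigma\left( W_f \cdot [h_{t-1}, x_t] + b_f \right)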

Input Gate: This gate calculates the information needed to update the cell state. It works in two steps. The first step is the same as the operation performed at the forget gate: the previous hidden state and the current input are passed through a sigmoid function. The second step passes the same information through a tanh function (whose output ranges from -1 to 1). These two outputs are then multiplied together.
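In the same notation, the two steps are a sigmoid gate i_t and a tanh candidate \tilde{C}_t:

i_t = \sigma\left( W_i \cdot [h_{t-1}, x_t] + b_i \right), \qquad
\tilde{C}_t = \tanh\left( W_C \cdot [h_{t-1}, x_t] + b_C \right)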

The result of the operations above (the input gate’s product) is then added point-wise to the point-wise product of the old cell state and the forget gate’s output, and the result is the new cell state.
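Written out, with \odot denoting point-wise multiplication, the cell state update is:

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t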

Output Gate: The output gate also operates in two steps. The first step is the same as in the forget gate (and as step 1 of the input gate), while in the second step the newly updated cell state (computed above) is passed through a tanh function.

These two outputs are then multiplied to obtain the new hidden state, which is carried forward to the next cell. This is how an LSTM carries information from one time step to the next.
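In the same notation, the output gate and the new hidden state are:

o_t = \sigma\left( W_o \cdot [h_{t-1}, x_t] + b_o \right), \qquad
h_t = o_t \odot \tanh(C_t)

And for readers who prefer code, here is a minimal NumPy sketch of a single LSTM cell step following the equations above (the function name, weight names, and shapes are my own illustrative choices, not taken from any particular library):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    # Concatenate previous hidden state and current input: [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)        # forget gate: what to keep from c_prev
    i_t = sigmoid(W_i @ z + b_i)        # input gate: what to write
    c_tilde = np.tanh(W_c @ z + b_c)    # candidate values in (-1, 1)
    c_t = f_t * c_prev + i_t * c_tilde  # new cell state
    o_t = sigmoid(W_o @ z + b_o)        # output gate: what to expose
    h_t = o_t * np.tanh(c_t)            # new hidden state
    return h_t, c_t

# Tiny usage example with random weights (hidden size 4, input size 3):
hidden, inputs = 4, 3
rng = np.random.default_rng(0)
make_W = lambda: rng.standard_normal((hidden, hidden + inputs))
make_b = lambda: np.zeros(hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
x = rng.standard_normal(inputs)
h, c = lstm_step(x, h, c, make_W(), make_b(), make_W(), make_b(),
                 make_W(), make_b(), make_W(), make_b())

In practice you would let a framework such as Keras or PyTorch manage these weights and the loop over time steps; the sketch is only meant to mirror the gate-by-gate description above.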

Now we have a basic intuition of how an LSTM cell works, which should make us a little more comfortable working with them. I’ll try to come up with a follow-up post with a full code example. Fingers crossed.

If you’ve come all the way down here, let me take a moment to thank you for putting the effort into reading this. I hope it helps someone learn about LSTMs :)
