Why Can’t We All Be More Like LSTMs?

Nifesimi Ademoye · Published in Analytics Vidhya · 10 min read · Jul 31, 2021

Yes, I know: LSTMs are dead, long live Transformers, blah blah blah. All of that might be true, but I still find the logic of LSTMs fascinating, and it is something I have been trying to apply to my life daily. For those of you reading this who have no idea what LSTMs are, well, you are in luck, because we are about to do a deep dive into what LSTM means.

LSTM, or Long Short-Term Memory, is an artificial recurrent neural network architecture used in deep learning. Unlike standard feedforward neural networks, an LSTM has feedback connections. It can process not only single data points but also entire sequences of data points.

But first, to understand Long Short-Term Memory (LSTM), you must understand Recurrent Neural Networks (RNNs), since LSTMs are a special kind of RNN.

The concept of RNNs is to make use of sequential information. Usually, when working with neural networks, we assume that all inputs (and outputs) are independent of each other. But for many tasks, that’s a horrible idea. For example, if you want to predict the next word in a sentence, you had better know which words came before it. An RNN is recurrent in nature: it performs the same function for every input, while the output for the current input depends on past computations. Therefore, RNNs are well suited to sequential data. They can handle arbitrary input/output lengths and use their internal memory to process sequences of inputs. This makes RNNs a natural fit for predicting what comes next in a sequence of words. Like a human brain, particularly in conversation, more weight is given to recent information when anticipating a sentence.
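To make "the same function for every input, carrying memory forward" concrete, here is a minimal numpy sketch of one vanilla RNN step. The sizes and weights are toy values chosen only for illustration; a real RNN learns its weights during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, assumed for illustration: 3-dim inputs, 4-dim hidden state
input_size, hidden_size = 3, 4

# Randomly initialized weights stand in for parameters a real RNN would learn
W_xh = rng.standard_normal((hidden_size, input_size))
W_hh = rng.standard_normal((hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_step(x, h_prev):
    # The new hidden state mixes the current input with the previous state,
    # so the output at each step depends on everything seen so far.
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)                        # the "memory", initially empty
for x in rng.standard_normal((5, input_size)):   # a sequence of 5 inputs
    h = rnn_step(x, h)                           # same function at every step
```

Note that `rnn_step` is applied identically at every position; only the hidden state `h` changes, which is exactly the recurrence the paragraph describes.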

Drawbacks of RNNs

One of the appeals of RNNs is that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the current frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context — it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large. Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs are capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult. Thankfully, LSTMs don’t have this problem.

Core Concepts of LSTMs

Long Short-Term Memory networks, usually just called “LSTMs”, are a special kind of RNN capable of learning long-term dependencies. They work tremendously well on a large variety of problems and are now widely used. An LSTM has a control flow similar to a recurrent neural network’s: it processes data, passing information along as it propagates forward. The difference lies in the operations within the LSTM cells. LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods is practically their default behavior, not something they struggle to learn!

The core concept of LSTMs is the cell state and its various gates. The cell state acts as a transport highway that carries relevant information down the sequence chain. You can think of it as the “memory” of the network. The cell state can, in theory, carry relevant information throughout the processing of the sequence, so even information from earlier time steps can make its way to later time steps, reducing the effects of short-term memory. As the cell state goes on its journey, information gets added to or removed from it via gates. The gates are small neural networks that decide which information is allowed on the cell state; during training they learn what is relevant to keep or forget.

Let’s dig a little deeper into what the various gates are doing, shall we? So we have three different gates that regulate information flow in an LSTM cell. A forget gate, input gate, and output gate.

Forget gate

This gate decides which information is essential and should be stored, and which should be forgotten. By removing unimportant information from the cell, it keeps the network focused, which improves performance. The gate takes two inputs: the hidden state produced by the previous cell and the input to the current cell. These are passed through a sigmoid function, producing values between 0 and 1. The closer to 0, the more is forgotten; the closer to 1, the more is kept.

Forget gate operations
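As a sketch, the forget gate is just a sigmoid over the concatenated previous hidden state and current input. The weights below are random stand-ins for what a trained LSTM would learn; sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
hidden_size, input_size = 4, 3

# Hypothetical weights; a real LSTM learns these during training
W_f = rng.standard_normal((hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

def forget_gate(h_prev, x):
    # Concatenate previous hidden state and current input, then squash
    # through a sigmoid: values near 0 mean forget, near 1 mean keep.
    return sigmoid(W_f @ np.concatenate([h_prev, x]) + b_f)

f = forget_gate(rng.standard_normal(hidden_size),
                rng.standard_normal(input_size))
```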

Input Gate

This gate is used to add information to the cell. It decides which values will be updated by passing the previous hidden state and the current input through a sigmoid, which transforms values to between 0 and 1: 0 means not important, 1 means important. You also pass the hidden state and current input into a tanh function to squash values between -1 and 1, which helps regulate the network. Then you multiply the tanh output by the sigmoid output; the sigmoid output decides which information from the tanh output is important to keep.

Input gate operations
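The two halves described above, a sigmoid "which positions to update" filter and a tanh "candidate values" vector, can be sketched like this. Weights and sizes are again hypothetical random stand-ins, not a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
hidden_size, input_size = 4, 3
combined = rng.standard_normal(hidden_size + input_size)  # [h_prev, x] stacked

# Hypothetical learned parameters for the two halves of the input gate
W_i = rng.standard_normal((hidden_size, hidden_size + input_size))
W_g = rng.standard_normal((hidden_size, hidden_size + input_size))

i = sigmoid(W_i @ combined)   # which positions to update, values in (0, 1)
g = np.tanh(W_g @ combined)   # candidate values to write, values in (-1, 1)
update = i * g                # the sigmoid output filters the tanh output
```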

Cell State

Now we should have enough information to calculate the cell state. First, the cell state gets pointwise multiplied by the forget vector. This can drop values in the cell state if it gets multiplied by values near 0. Then we take the output from the input gate and do a pointwise addition which updates the cell state to new values that the neural network finds relevant. That gives us our new cell state.

Calculating cell state
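The update above is simple enough to run with concrete numbers. The vectors below are made-up toy values for one time step of a 4-unit LSTM, just to show the pointwise multiply and add.

```python
import numpy as np

# Assumed toy values for one time step
c_prev = np.array([ 0.5, -1.2, 0.8,  0.0])   # previous cell state
f      = np.array([ 0.9,  0.1, 1.0,  0.5])   # forget gate output (0..1)
i      = np.array([ 0.2,  0.8, 0.0,  1.0])   # input gate output (0..1)
g      = np.array([ 0.7, -0.3, 0.5, -1.0])   # tanh candidate values (-1..1)

# Pointwise multiply by the forget vector drops values gated near 0,
# then pointwise addition writes in the new candidate information.
c_new = f * c_prev + i * g
```

Note how the second entry of `c_prev` (-1.2) is almost entirely forgotten (`f` is 0.1 there), while the third is kept verbatim (`f` is 1.0 and `i` is 0.0).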

Output Gate

This gate is responsible for selecting important information from the current cell and presenting it as output. The output gate decides what the next hidden state should be. Remember that the hidden state contains information about previous inputs, and it is also used for predictions. First, we pass the previous hidden state and the current input into a sigmoid function. Then we pass the newly modified cell state through the tanh function, producing a vector of values between -1 and 1. Finally, we multiply the tanh output by the sigmoid output to decide what information the hidden state should carry. The result is the new hidden state. The new cell state and the new hidden state are then carried over to the next time step.
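In the same sketch style as before (random stand-in weights, hypothetical sizes), the output gate filters a tanh-squashed copy of the cell state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
hidden_size, input_size = 4, 3

# Hypothetical learned parameters for the output gate
W_o = rng.standard_normal((hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)

h_prev = rng.standard_normal(hidden_size)
x      = rng.standard_normal(input_size)
c_new  = rng.standard_normal(hidden_size)   # the freshly updated cell state

o = sigmoid(W_o @ np.concatenate([h_prev, x]) + b_o)  # what to reveal (0..1)
h_new = o * np.tanh(c_new)   # filtered cell state becomes the new hidden state
```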

To review, the Forget gate decides what is relevant to keep from prior steps. Next, the input gate decides what information is relevant to add from the current step. Finally, the output gate determines what the next hidden state should be.
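Putting the review together, one full LSTM step can be sketched as a single function. Everything here is a minimal illustration with random stand-in weights (and no bias terms, for brevity), not a trained or production implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
hidden_size, input_size = 4, 3
concat_size = hidden_size + input_size

# One hypothetical weight matrix per gate, plus one for the candidate values
W_f, W_i, W_g, W_o = (rng.standard_normal((hidden_size, concat_size))
                      for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z)      # forget gate: what to keep from prior steps
    i = sigmoid(W_i @ z)      # input gate: what to add from the current step
    g = np.tanh(W_g @ z)      # candidate values to add
    o = sigmoid(W_o @ z)      # output gate: what the next hidden state reveals
    c = f * c_prev + i * g    # updated cell state
    h = o * np.tanh(c)        # new hidden state
    return h, c

# Run a short sequence; the cell state carries information across steps
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):
    h, c = lstm_step(x, h, c)
```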

What does this all mean?

Now, at this point, you are probably wondering how all of this relates to you. What makes LSTMs unique is that they can use the valuable parts of the information they remember from long periods without getting bogged down by the weight of past data they don’t need at the moment. Essentially, LSTMs draw context (insight and lessons) from past data and apply it to their present task. Wouldn’t we all be better off if we disciplined our minds to work like that? We so often let our past actions taint or blur our reflections of ourselves instead of drawing lessons from them and moving on.

Think critically about what the forget gate does in an LSTM: it decides which information is important and should be stored, and which should be dismissed. If we aspire to the high emotional intelligence that leads to career success and life satisfaction, we must learn to banish the negative thoughts that come from our past mistakes and focus on not repeating those mistakes in the present. We are our own worst critics, so judgmental of our past decisions. It’s as if we must be harder on ourselves, and less tolerant of our human faults and emotions, than we are with others.

Rumination is the thief of self-compassion. There are so many times I have found myself dwelling on past mistakes, listening to my negative thought chatter. On some days I might be feeling low, and it turns out I have let this negative chatter brew in my unconscious, leaving me in a depressed state of mind. So often, we fail to “forget” the mistakes we have made in the past and simply ruminate and drown in self-pity.

How To Be More Like LSTMs

First, we have to learn to forget and truly forgive ourselves for our past actions and only use the lessons or insights gleaned from our past to better our present and make sure we are not repeating the same mistakes. We all make mistakes. We all fall short. We all can find something in our past that makes us hate ourselves. That’s painfully easy.

The hard part is actually to love ourselves despite our wrongdoings. But, as Mahatma Gandhi once said, forgiveness takes strength, and sometimes there is no greater strength than the forgiveness we extend to ourselves.

“You can sit there forever, lamenting about how bad you’ve been, feeling guilty until you die, and not one tiny slice of that guilt will do anything to change a single thing in the past. Forgive yourself, then move on!” — Wayne Dyer.

Niklas Göke writes that forgiveness is part of what makes true success possible:

  • Forgiveness brings us some release from the pains that would crush us.
  • Forgiveness allows us to heal the wounds we’ve caused ourselves.
  • Forgiveness gives us room to grow again, into a higher expression of ourselves.

The pain won’t always go away, nor will the memories, but we can all choose to let go. And there’s something extremely important in that beyond simply moving on. In self-forgiveness, you usher in a re-establishment of the most important relationship of your life:

Your relationship with yourself.

So as you go on whatever journey you find yourself on, remember that, like an LSTM, to truly reach your goals you have to let go of your guilt. I don’t think life is about finding the straightest path to success. Or the simplest. Or even the smoothest. Maybe it’s about finding one, just one, that allows you to get there at all.

Note that forgiving yourself is not about letting yourself off the hook. It’s not an excuse not to learn from your mistakes. Forgiveness doesn’t guarantee you’ll learn either, but without it you can’t learn anything, because regret is in the way. You must say: “Okay, that’s done; how will I move on, and what will I change?”

References

https://colah.github.io/posts/2015-08-Understanding-LSTMs/
