HOW LSTM TOOK OVER RNN?

Chithra
4 min read · Jan 17, 2024

After the vanishing gradient problem was recognized as a significant limitation, Recurrent Neural Networks (RNNs) were increasingly supplanted by Long Short-Term Memory networks (LSTMs). LSTMs address the vanishing gradient issue, which makes them more effective at sequential data analysis, and they remain pivotal in deep learning, typically delivering better accuracy than vanilla RNNs on tasks with long-range dependencies. This blog explores the differences between RNNs and LSTMs, offering insight into their respective strengths and weaknesses in the context of sequential data analysis.

Long Short-Term Memory (LSTM) stands as a pivotal component in the realm of Artificial Intelligence and deep learning, setting itself apart from standard feedforward neural networks through the integration of feedback connections.

The term “LSTM” reflects its dual-memory nature, encompassing both “long-term memory” and “short-term memory.”

During training, the network's weights and biases change slowly, similar to how our brains store long-term memories, while its activations adapt from step to step, much like how our brains quickly remember things for a short period. An LSTM has specific parts, a memory cell, an input gate, an output gate, and a forget gate, that work together to manage information. This design makes it well suited to tasks involving sequences, especially when there are long delays between important events.

Developed to address the vanishing gradient problem encountered during the training of traditional RNNs, LSTMs exhibit relative insensitivity to gap length, providing a notable advantage over RNNs, hidden Markov models, and other sequential learning methods across various applications.

LSTM ARCHITECTURE

Long Short-Term Memory networks (LSTMs) take a different approach from vanilla RNNs, which rewrite their entire hidden state at every step: instead, LSTMs make small, controlled adjustments to the information they carry through element-wise multiplications and additions.

LSTMs introduce a mechanism called the cell state, which lets information flow through the network selectively. This capability enables LSTMs to remember or forget things as needed.

At each time step, the information held in the cell state is influenced by three dependencies. This is what lets LSTMs retain relevant context over longer sequences, making them more adept at understanding and leveraging intricate dependencies in data.

These dependencies can be generalized to any sequence problem as follows (a toy sketch of how they combine appears after the list):

  1. The previous cell state (i.e. the information that was present in the memory after the previous time step)
  2. The previous hidden state (i.e. this is the same as the output of the previous cell)
  3. The input at the current time step (i.e. the new information that is being fed in at that moment)
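
To make this concrete, here is a toy NumPy sketch of how these three dependencies combine through element-wise multiplications and additions to produce the new cell state. The gate values are hand-picked numbers, not learned ones, and the candidate computation is deliberately simplified; the point is only to show the arithmetic.

```python
import numpy as np

# The three dependencies of a single LSTM step (made-up numbers).
c_prev = np.array([0.8, -0.3, 0.5])   # 1. previous cell state
h_prev = np.array([0.1,  0.4, -0.2])  # 2. previous hidden state (the previous output)
x_t    = np.array([0.6, -0.1, 0.9])   # 3. input at the current time step

# Pretend the gates have already been computed from h_prev and x_t;
# these values are chosen by hand purely for illustration.
f_t = np.array([1.0, 0.0, 0.5])       # forget gate: keep, drop, half-keep
i_t = np.array([0.0, 1.0, 0.5])       # input gate: ignore, admit, half-admit
g_t = np.tanh(x_t + h_prev)           # candidate values (simplified combination)

# Selective memory is just element-wise multiplication followed by addition.
c_t = f_t * c_prev + i_t * g_t        # new cell state
print(c_t)
```
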
LSTM UNIT

Think of LSTMs as assembly line workers in a manufacturing plant.

In an assembly line, products move along a continuous conveyor system, passing through different workstations where workers (LSTMs) contribute to the assembly process. Each worker has a specific task, such as adding components, adjusting details, or even removing unnecessary parts. As the product advances, it undergoes continuous modifications and enhancements.

Similarly, LSTMs operate in a sequential manner, processing information as it moves through various stages, akin to the continuous evolution of a product on an assembly line. This example illustrates how LSTMs excel at managing and adapting information through a continuous flow, making them powerful tools for tasks involving sequential data.

ANALOGY

Let’s explore LSTMs using another analogy of solving puzzles.

Imagine you’re solving a puzzle and you have some strategies in mind. Initially, you attempt to solve the puzzle with a particular set of clues, but suddenly, a new hint emerges, leading you to consider an alternative solution. You immediately forget the previous strategy, reorganizing your potential solutions and discarding the initial line of thinking.

Now, imagine discovering a unique pattern or strategy that proves effective in solving the puzzle. You immediately take this new pattern in and add it to your working set of strategies.

Not all of these scattered pieces of information help you finish the puzzle. So, after a certain time, you are left with a summarized and refined set of strategies: the relevant breakthroughs and the approaches that actually work. This reflects the functioning of Long Short-Term Memory networks (LSTMs) within the context of solving puzzles: forgetting what no longer applies, taking in what is new and useful, and outputting only what matters.

Forget Gate (f): Combining input with the previous output, the forget gate generates a fraction between 0 and 1. This fraction determines how much of the previous state to retain or forget. An output of 1.0 means “remember everything,” and 0.0 means “forget everything,” suggesting an alternative name, the “remember gate.”

Input Gate (i): Similar to the forget gate, the input gate decides which new information enters the LSTM state. The gate’s output, also a fraction between 0 and 1, is multiplied with the tanh block’s output (the candidate values). This gated vector is then added to the forget-gated previous state to create the current cell state.

Output Gate (o): The output gate follows the same pattern, gating the input and previous output to produce a scaling fraction. This fraction is multiplied with the tanh of the current cell state, and the result is released as the block’s output. Both the output and the cell state are looped back into the LSTM block for the next time step.
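
Putting the three gates together, here is a minimal NumPy sketch of a single LSTM step. The weight matrices are random placeholders (in a real network they are learned), and the function and variable names are my own, but the gate arithmetic follows the standard equations described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: returns the new hidden state and the new cell state."""
    z = np.concatenate([h_prev, x_t])   # combine previous output with current input

    f_t = sigmoid(W["f"] @ z + b["f"])  # forget gate: how much of c_prev to keep
    i_t = sigmoid(W["i"] @ z + b["i"])  # input gate: how much new information to admit
    o_t = sigmoid(W["o"] @ z + b["o"])  # output gate: how much of the state to expose
    g_t = np.tanh(W["g"] @ z + b["g"])  # candidate values for the cell state

    c_t = f_t * c_prev + i_t * g_t      # update the cell state (multiply, then add)
    h_t = o_t * np.tanh(c_t)            # new hidden state, i.e. the block's output
    return h_t, c_t

# Toy usage: 4-dimensional inputs, 3-dimensional hidden state, placeholder weights.
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W = {k: rng.normal(size=(hidden_size, hidden_size + input_size)) for k in "fiog"}
b = {k: np.zeros(hidden_size) for k in "fiog"}

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # a short sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, W, b)
print(h, c)
```

The loop at the end simply reuses the same cell for every time step, which is the assembly-line picture from earlier: the same worker processes each item as it comes down the line, with the output and cell state looped back in for the next step.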

CONCLUSION

The key distinction between RNNs and LSTMs (Long Short-Term Memory) lies in their capability to manage long-range dependencies within sequential data. RNNs struggle with vanishing gradient issues, hindering their ability to effectively capture information over extended sequences. In contrast, LSTMs utilize an advanced architecture featuring memory cells and gating mechanisms, enabling them to better capture and retain crucial information across longer sequences.

The primary drawback I’ve observed with LSTMs is their challenging training process. Training even a basic model demands a significant investment of time and system resources. However, it’s essential to note that this challenge often stems from hardware constraints rather than an inherent flaw in the LSTM architecture.
