What is LSTM? Introduction to Long Short-Term Memory

Rebeen Hamad
6 min read · Dec 3, 2023

Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture used in deep learning. Unlike standard feedforward neural networks, LSTMs have feedback connections, which allow them to exploit temporal dependencies across sequences of data. LSTMs were designed to address the vanishing gradient problem that can occur when training traditional RNNs on long sequences. This makes them well suited to tasks involving sequential data, such as natural language processing (NLP), speech recognition, and time series forecasting.

LSTM networks introduce memory cells, which have the ability to retain information over long sequences. Each memory cell has three main components: an input gate, a forget gate, and an output gate. These gates help regulate the flow of information in and out of the memory cell.

The input gate determines how much of the new input should be stored in the memory cell. It takes the current input and the previous hidden state as inputs, and outputs a value between 0 and 1 for each element of the memory cell.

The forget gate decides which information to discard from the memory cell. It takes the current input and the previous hidden state as inputs, and outputs a value between 0 and 1 for each element of the memory cell. A value of 0 means the information is discarded, while a value of 1 means it is retained.

The output gate controls how much of the memory cell’s content should be used to compute the hidden state. It takes the current input and the previous hidden state as inputs, and outputs a value between 0 and 1 for each element of the memory cell.

By using these gates, LSTM networks can selectively store, update, and retrieve information over long sequences. This makes them particularly effective for tasks that require modeling long-term dependencies, such as speech recognition, language translation, and sentiment analysis.

To summarize these gates:

  1. Forget Gate:
  • Determines what information to discard from the cell state.
  • It takes the current input and the previous hidden state and produces a number between 0 and 1 for each number in the cell state. 1 represents “completely keep this” while 0 represents “completely get rid of this.”

  2. Input Gate:
  • Decides what new information to store in the cell state.
  • It consists of two parts:
    a. A sigmoid layer (the “input gate layer”) that decides which values to update.
    b. A tanh layer that creates a vector of new candidate values to add to the cell state.

  3. Output Gate:
  • Determines the next hidden state based on the updated cell state.
  • Filters what the LSTM outputs: the updated cell state is passed through tanh and multiplied element-wise by the gate’s values, so only the selected parts are emitted.
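
The gate descriptions above can be written compactly as equations. What follows is the commonly used formulation, given as a sketch: the symbols are notation assumed here, with $x_t$ the current input, $h_{t-1}$ the previous hidden state, $c_t$ the cell state, $W$, $U$, $b$ the learned weights and biases of each gate, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate values)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

Because each gate passes through a sigmoid, its values lie between 0 and 1, which matches the "keep" versus "get rid of" reading in the list above.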

Key components of LSTM:

  • Cell State: Runs straight down the entire chain of the LSTM, with only some minor linear interactions. It’s the core differentiator in LSTMs that allows them to maintain and control long-term dependencies.
  • Hidden State: The LSTM’s output at a particular time step, computed from the cell state.

Figure: LSTM cell
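
To make the roles of the cell state and the hidden state concrete, here is a minimal NumPy sketch of a single LSTM step applied over a short sequence. The function name `lstm_step`, the stacked weight layout, and the random weights are assumptions made for illustration; this is not how Keras or any other library implements the layer.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step for a single example.

    W: (4 * units, features), U: (4 * units, units), b: (4 * units,).
    Rows are stacked as forget, input, candidate, output
    (an arbitrary ordering chosen for this sketch).
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0:units])                  # forget gate
    i = sigmoid(z[units:2 * units])          # input gate
    g = np.tanh(z[2 * units:3 * units])      # candidate values
    o = sigmoid(z[3 * units:4 * units])      # output gate
    c_t = f * c_prev + i * g                 # updated cell state
    h_t = o * np.tanh(c_t)                   # new hidden state
    return h_t, c_t


rng = np.random.default_rng(0)
features, units = 3, 4
W = rng.normal(scale=0.1, size=(4 * units, features))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

# Both states start at zero and are carried from one time step to the next.
h = np.zeros(units)
c = np.zeros(units)
sequence = rng.normal(size=(5, features))    # 5 time steps, 3 features each
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

print("hidden state shape:", h.shape)
print("cell state shape:", c.shape)
```

Note that the cell state is only scaled by the forget gate and added to, while the hidden state is a gated, squashed view of it.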

LSTMs use these gates to regulate the flow of information, which allows them to learn long-term dependencies in data, making them particularly effective for tasks involving sequential data like time series prediction, natural language processing, speech recognition, and more.

By controlling what is kept in the cell state over long sequences, LSTMs mitigate the vanishing gradient problem, enabling more effective training and better capture of long-term patterns in sequential data. Exploding gradients can still occur and are usually handled separately, for example with gradient clipping.
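
A rough way to see why the cell state helps: during backpropagation through time, the gradient is multiplied by some factor at every step. The sketch below compares two hypothetical per-step factors, one standing in for a plain RNN and one for an LSTM whose forget gate stays close to 1. The numbers are assumptions chosen only for illustration, and the comparison ignores the other paths the gradient can take.

```python
steps = 50

# Hypothetical per-step factors, chosen only for illustration.
rnn_factor = 0.5     # stands in for the repeated weight/activation factor in a plain RNN
forget_gate = 0.99   # stands in for an LSTM forget gate value close to 1

# The gradient reaching a step 50 positions back is scaled by the product of the factors.
rnn_scale = rnn_factor ** steps
lstm_scale = forget_gate ** steps

print(f"plain RNN scale after {steps} steps: {rnn_scale:.3e}")
print(f"LSTM cell-state scale after {steps} steps: {lstm_scale:.3f}")
```

The first product is vanishingly small, while the second stays at a usable magnitude, which is the intuition behind the cell state acting as a long-term memory path.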

Preparing input data for LSTM

Preparing input data for an LSTM involves organizing your data into a format that an LSTM model can ingest and process effectively. LSTMs, being a type of recurrent neural network, are suited for sequence data. The basic steps for preparing input data for an LSTM are as follows:

  • Sequences: LSTM models work with sequences of data. Organize your input data into sequences of fixed length. For instance, with daily time series data, you might use 10 days’ worth of data as one input sequence.
  • Shape: Reshape your data into a 3D format: (samples, time steps, features). For instance, if your data is a 2D matrix (samples, features), you’ll need to reshape it so that the LSTM can interpret it as sequences of data. The three dimensions are:
    • Samples: Number of sequences in your dataset.
    • Time Steps: Number of time steps in each sequence.
    • Features: Number of features at each time step.
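
As a small sketch of the 3D layout described above, the snippet below turns a 2D array with several features per row into overlapping windows. The array `data_2d` and its values are made up for illustration.

```python
import numpy as np

# Hypothetical 2D dataset: 8 daily rows, 2 features per row.
data_2d = np.arange(16, dtype=np.float32).reshape(8, 2)

time_steps = 3

# Stack overlapping windows of `time_steps` consecutive rows.
windows = np.stack(
    [data_2d[i:i + time_steps] for i in range(len(data_2d) - time_steps + 1)]
)

samples, steps, features = windows.shape
print("samples:", samples, "time steps:", steps, "features:", features)
```

Each entry along the first axis is one sample, and each sample holds `time_steps` rows with all features, which is the (samples, time steps, features) layout an LSTM layer expects.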

To convert a normal dataset into a format suitable for input to an LSTM model, you’ll generally need to restructure the data into sequences to form the input features and target values. Here’s a step-by-step guide on how to change a typical dataset into a format appropriate for an LSTM model:

The steps assume that you have a dataset in the form of a Pandas DataFrame or a NumPy array and that you want to predict the next value in a time series based on previous values:

  1. Organize the Data into Sequences:
  • Decide on the sequence length: Determine how many previous time steps you want your LSTM model to consider when predicting the next value.
  • Create sequences by sliding a window of that length over the dataset one step at a time. Each window becomes an input sequence, and the value that immediately follows it becomes the target.

  2. Reshape the Data:
  • Reshape the data to fit the (samples, time steps, features) format expected by the LSTM model.

Here’s an example:

import numpy as np

# Example dataset (1D array representing a time series)
data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]) # Replace this with your dataset

# Parameters
sequence_length = 3 # Define the sequence length (how many time steps to consider)

# Create sequences
sequences = []
target = []
for i in range(len(data) - sequence_length):
    sequences.append(data[i:i + sequence_length])
    target.append(data[i + sequence_length])

# Reshape sequences and target to fit LSTM input format
X = np.array(sequences)
y = np.array(target)

# Reshape for LSTM input (if data is 1D)
X = X.reshape(X.shape[0], sequence_length, 1)

# Output shapes
print("X shape:", X.shape) # Input shape for LSTM
print("y shape:", y.shape) # Target shape
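
The same windows can also be built without an explicit loop. The sketch below uses NumPy's `sliding_window_view` (available from NumPy 1.20) on the same example series; treat it as one possible alternative, not a required step.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
sequence_length = 3

# All windows of length 3; the last one has no following value to predict.
windows = sliding_window_view(data, sequence_length)

X = windows[:-1].copy()        # drop the final window, copy to get a writable array
y = data[sequence_length:]     # the value right after each remaining window

# Add the trailing feature axis: (samples, time steps, 1)
X = X[..., np.newaxis]

print("X shape:", X.shape)
print("y shape:", y.shape)
```

Dropping the last window keeps inputs and targets aligned, since every remaining window has exactly one value that follows it.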
  • Time Steps in LSTM: Time steps refer to the individual points in the sequence. In a time series or sequence data, each element or observation at a specific moment is considered a time step. For instance, if you’re dealing with a time series of stock prices and you’re predicting the price for the next day based on the past five days, each day within those five days represents a time step. In the context of LSTM, the model considers information across these sequential time steps to learn patterns and make predictions.
  • Sequence Data: Sequence data, in the context of LSTM, is a dataset that is structured in a sequential manner. It can be any ordered data in which future values depend on past values. This could include time series data, natural language text, DNA sequences, or any ordered set of data where the order matters. LSTM models are particularly effective for learning and predicting on sequence data because they can retain memory over long sequences, capturing dependencies and patterns over time.

The example below uses a multi-layer LSTM to recognize handwritten digits from the MNIST dataset. Each 28×28 image is treated as a sequence of 28 rows (the time steps), with the 28 pixels in each row as the features. The example uses TensorFlow and Keras:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load and preprocess the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

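# Reshape the data for LSTM input: each image is (time_steps = 28 rows, features = 28 pixels).
# MNIST images already have this shape, so the reshape below only makes the layout explicit.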
time_steps, features = x_train.shape[1], x_train.shape[2]
x_train = x_train.reshape(x_train.shape[0], time_steps, features)
x_test = x_test.reshape(x_test.shape[0], time_steps, features)

# Build the multi-layer LSTM model
model = Sequential()
model.add(LSTM(128, input_shape=(time_steps, features), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128))
model.add(Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=5, batch_size=32, validation_data=(x_test, y_test))

# Evaluate the model on the test set
accuracy = model.evaluate(x_test, y_test)[1]
print(f'Test Accuracy: {accuracy * 100:.2f}%')

In this example, each image is treated as a sequence of 28 rows, and three LSTM layers are stacked on top of each other. The first two LSTM layers set return_sequences=True so that they return the full sequence of outputs rather than only the output at the last time step, which the next LSTM layer needs as its input. Dropout layers are added to reduce overfitting. The final dense layer outputs a probability for each of the 10 digits.
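
To show what return_sequences changes, here is a NumPy-only sketch that traces the output shapes through three stacked layers. `toy_lstm_layer` is a hypothetical stand-in with random, untrained weights; it mimics only the shape behaviour and is not the Keras implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def toy_lstm_layer(x, units, return_sequences, seed):
    """Run a randomly initialised LSTM over x of shape (batch, time, features)."""
    rng = np.random.default_rng(seed)
    batch, time_steps, features = x.shape
    W = rng.normal(scale=0.1, size=(features, 4 * units))
    U = rng.normal(scale=0.1, size=(units, 4 * units))
    b = np.zeros(4 * units)
    h = np.zeros((batch, units))
    c = np.zeros((batch, units))
    outputs = []
    for t in range(time_steps):
        z = x[:, t, :] @ W + h @ U + b
        f, i, g, o = np.split(z, 4, axis=1)   # pre-activations of the gates and candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)      # (batch, time, units)
    return h                                  # (batch, units): last time step only


# Two fake 28x28 "images" standing in for MNIST samples.
x = np.random.default_rng(0).random((2, 28, 28))

a1 = toy_lstm_layer(x, 128, return_sequences=True, seed=1)
a2 = toy_lstm_layer(a1, 128, return_sequences=True, seed=2)
a3 = toy_lstm_layer(a2, 128, return_sequences=False, seed=3)

print(a1.shape, a2.shape, a3.shape)
```

The first two layers keep the time axis so that the next recurrent layer still receives a sequence, while the last layer collapses it to one vector per image for the dense classifier.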

Keep in mind that for digit recognition tasks, CNNs are more commonly used because they capture spatial relationships in image data. The LSTM example above is provided for illustrative purposes; for practical applications, a CNN-based approach is recommended.

Rebeen Hamad

Research Associate @ Newcastle University | PhD in Machine Learning