Long Short-Term Memory (LSTM): Concept
LSTM is a recurrent neural network (RNN) architecture that REMEMBERS values over arbitrary intervals. LSTM is well-suited to classify, process and predict time series given time lags of unknown duration. Relative insensitivity to gap length gives an advantage to LSTM over alternative RNNs, hidden Markov models and other sequence learning methods.
The structure of an RNN is very similar to that of a hidden Markov model (HMM). The main difference lies in how the parameters are calculated and constructed. One advantage of LSTM is its insensitivity to gap length. RNNs and HMMs rely solely on the hidden state carried over from the previous step to produce each emission or output. If we want to make a prediction after 1,000 intervals instead of 10, such a model has largely forgotten the starting point by then. LSTM REMEMBERS.
Which part of the architecture allows LSTM to REMEMBER?
An RNN cell takes two inputs: the previous hidden state and the observation at time t. Apart from the hidden state, it carries no information about the past to REMEMBER.
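To make the forgetting concrete, here is a minimal sketch of a scalar RNN step, assuming the common formulation h_t = tanh(w_h * h_{t-1} + w_x * x_t). The function names (rnn_step, run_rnn) and the weight values are illustrative assumptions, not taken from any library or from a trained model.

```python
import math

def rnn_step(h_prev, x_t, w_h=0.5, w_x=1.0):
    """One step of a scalar vanilla RNN (the weights are illustrative assumptions)."""
    return math.tanh(w_h * h_prev + w_x * x_t)

def run_rnn(inputs):
    """Feed a sequence through the cell, starting from a zero hidden state."""
    h = 0.0
    for x in inputs:
        h = rnn_step(h, x)
    return h

# A single signal at t = 0, followed by silence.
h_after_10 = run_rnn([1.0] + [0.0] * 10)
h_after_1000 = run_rnn([1.0] + [0.0] * 1000)
```

With these weights the trace of the first input shrinks by roughly half at every step, so it is still measurable after 10 steps but effectively gone after 1,000. The hidden state is the only place the past can live, and it is overwritten at every step.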
In an LSTM, the long-term memory is usually called the cell state. The looping arrows indicate the recursive nature of the cell, which allows information from previous intervals to be stored within the LSTM cell. The cell state is modified by the forget gate, placed below the cell state, and adjusted by the input modulation gate. In the cell-state equation, the previous cell state forgets by being multiplied element-wise by the forget gate, and gains new information by adding the output of the input gates.
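In the standard LSTM formulation, the cell-state update described here is usually written as follows. The symbols are assumed notation: f_t is the forget gate output, i_t the input gate output, \tilde{C}_t the output of the input modulation gate, and \odot element-wise multiplication.

```latex
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
```

The first term is the "forget" part (old memory scaled down), and the second term is the "add new information" part.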
The remember vector is usually called the forget gate. The output of the forget gate tells the cell state which information to forget by multiplying the corresponding position in the cell state by a value close to 0. If the output of the forget gate is close to 1, the information is kept in the cell state. In the forget-gate equation, a sigmoid function is applied to the weighted input/observation and previous hidden state.
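The effect of the forget gate can be sketched in a few lines of plain Python. The helper names (forget_gate, apply_forget) and the pre-activation values are hypothetical. In a real LSTM the pre-activations would come from the learned weights applied to the observation and the previous hidden state.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(pre_activations):
    """Squash each pre-activation into (0, 1); one value per cell-state entry."""
    return [sigmoid(z) for z in pre_activations]

def apply_forget(cell_state, gate):
    """Element-wise product: entries gated near 0 are erased, near 1 are kept."""
    return [f * c for f, c in zip(gate, cell_state)]

# Hypothetical pre-activations (weighted sum of x_t and h_{t-1}, plus bias).
gate = forget_gate([-20.0, 0.0, 20.0])
kept = apply_forget([4.0, 4.0, 4.0], gate)
```

Here the first entry is almost entirely erased, the second is halved, and the third is kept almost unchanged. Because the sigmoid never reaches exactly 0 or 1, forgetting and keeping are always a matter of degree.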
The save vector is usually called the input gate. These gates determine which information should enter the cell state / long-term memory. The important parts are the activation functions of each gate. The input gate uses a sigmoid function, which has a range of (0, 1). Because the cell-state equation is a summation of the gated previous cell state and the new input, a sigmoid alone could only add to memory and could never remove/forget it. If you can only add a number between 0 and 1, the added amount is never negative, so it can never reduce what is already stored. This is why the input modulation gate has a tanh activation function. Tanh has a range of (-1, 1), so the update can be negative, which allows the cell state to decrease, i.e. to forget memory.
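The role of the two activations can be illustrated with a scalar sketch. The function cell_update and its argument names are hypothetical, and the forget gate is held at 1 by default so that only the additive term changes the cell state.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cell_update(c_prev, gate_pre, cand_pre, forget=1.0):
    """New cell value = forget * old value + input gate * tanh candidate."""
    i_t = sigmoid(gate_pre)        # in (0, 1): how much to write
    c_tilde = math.tanh(cand_pre)  # in (-1, 1): what to write, with a sign
    return forget * c_prev + i_t * c_tilde

c_prev = 0.8
increased = cell_update(c_prev, gate_pre=3.0, cand_pre=2.0)    # positive candidate
decreased = cell_update(c_prev, gate_pre=3.0, cand_pre=-2.0)   # negative candidate
unchanged = cell_update(c_prev, gate_pre=-30.0, cand_pre=-2.0) # input gate closed
```

The sigmoid decides how much of the candidate is written, while the sign of the tanh candidate decides whether the stored value goes up or down. Without the tanh, every update would push the cell state in the same direction.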
The focus vector is usually called the output gate. Out of all the values held in the cell state, which should move forward to the next hidden state?
The working memory is usually called the hidden state. What information should be carried to the next step in the sequence? This is analogous to the hidden state in an RNN or HMM.
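As a rough sketch of how the output gate produces the hidden state, assuming the standard formulation h_t = o_t * tanh(C_t) applied element-wise: the function name hidden_state and the pre-activation values below are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_state(cell_state, output_pre):
    """h_t = o_t * tanh(C_t), computed element-wise."""
    return [sigmoid(o) * math.tanh(c) for o, c in zip(output_pre, cell_state)]

cell_state = [2.0, -2.0, 2.0]
# Hypothetical output-gate pre-activations: open, open, closed.
h_t = hidden_state(cell_state, [20.0, 20.0, -20.0])
```

The third entry is still stored in the cell state, but the closed output gate keeps it out of the working memory for this step.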
The first sigmoid activation function is the forget gate: which information should be forgotten from the previous cell state (C_{t-1})? The second sigmoid and the first tanh activation function together form the input gate: which information should be saved to the cell state, and which should be left out? The last sigmoid is the output gate, which highlights the information that should go to the next hidden state.
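Putting the gates together, a single LSTM time step can be sketched in dependency-free Python. This follows the standard formulation under stated assumptions: the names lstm_step, init_params and affine, the parameter layout (one weight matrix and bias per gate acting on the concatenated hidden state and observation), and the random initialisation are all illustrative, not a reference implementation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to its (W, b) pair."""
    v = h_prev + x_t  # concatenate previous hidden state and observation
    f_t = [sigmoid(z) for z in affine(*params["forget"], v)]          # first sigmoid
    i_t = [sigmoid(z) for z in affine(*params["input"], v)]           # second sigmoid
    c_tilde = [math.tanh(z) for z in affine(*params["candidate"], v)] # first tanh
    o_t = [sigmoid(z) for z in affine(*params["output"], v)]          # last sigmoid
    # Cell state: forget part of the old memory, then add the gated candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # Hidden state: the output gate selects what leaves the cell.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def init_params(n_in, n_hidden, seed=0):
    """Random illustrative weights; a trained model would learn these."""
    rng = random.Random(seed)
    def gate():
        W = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + n_in)]
             for _ in range(n_hidden)]
        return W, [0.0] * n_hidden
    return {name: gate() for name in ("forget", "input", "candidate", "output")}

params = init_params(n_in=2, n_hidden=3)
h, c = [0.0] * 3, [0.0] * 3
for x_t in ([1.0, 0.0], [0.0, 1.0], [0.5, 0.5]):
    h, c = lstm_step(x_t, h, c, params)
```

The loop carries both h (working memory) and c (long-term memory) from one step to the next. The cell state is only ever rescaled and added to, which is what lets information survive across long gaps.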