Recurrent / LSTM layers explained in a simple way
Part of a series about different types of layers in neural networks
This post is meant to be read after:
For all the previously introduced layers, the same input always produces the same output, no matter how many times we repeat it. For instance, take a linear layer with f(x) = 2x. Each time we ask it to predict f(3) we get 6. So if we ask 10 times in a row what the output is for input 3, the NN will always answer 6:
f(3)=6; f(3)=6; f(3)=6; f(3)=6; f(3)=6; …
Now imagine we are training an algorithm to detect repetitions: we want F(3) = 0 the first time (no repetition detected), then F(3) = 1 the second time. We can't achieve this behavior with non-recurrent layers, since by definition we always get the same output for the same input. One hack is to take a vector of 2 variables, so we can treat the first variable differently from the second: F([3; 0]) = 0 (no repetition detected) but F([3; 3]) = 1 (repetition detected). The downside of this hack is that it only works on a predefined, fixed sequence length.
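The fixed-length hack can be sketched in a few lines of Python. The function name `detect_repetition` is made up for illustration; the point is that a stateless function over a 2-element window can spot a repetition, but only inside that window:

```python
# Toy sketch of the fixed-length "hack": a stateless function over a
# 2-element window can detect a repetition, but only within that window.
def detect_repetition(pair):
    """Return 1 if the second element repeats the first, else 0."""
    first, second = pair
    return 1 if first == second else 0

print(detect_repetition([3, 0]))  # -> 0 (no repetition)
print(detect_repetition([3, 3]))  # -> 1 (repetition detected)
```

A sequence of 3 repeated inputs would need a different 3-element function, which is exactly the limitation described above.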
To solve this problem, recurrent layers were invented. They are a family of layers that contain an internal state. In their simplest form, recurrent layers can be described as follows:
- H is a hidden internal state that usually starts at 0.
- f is a function that updates the internal state between sequence steps.
- g is another function that uses the current internal state to calculate the output.
- After each input X, H is updated using f; then the output Y of the recurrent NN is computed from this updated state using g.
- So when we send the same input X=3 several times, we might get different outputs Y, because the internal state changes after each step.
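The scheme above can be sketched as a minimal Python class. The class name and the particular f and g used in the demo are invented for illustration; in a real recurrent layer, f and g are learned:

```python
# Minimal sketch of the recurrent scheme: H is updated by f at each
# step, and the output Y is computed from H by g.
class SimpleRecurrentCell:
    def __init__(self, f, g):
        self.f = f    # state-update function: H_new = f(H, X)
        self.g = g    # output function:       Y = g(H)
        self.H = 0    # hidden internal state, starts at 0

    def step(self, x):
        self.H = self.f(self.H, x)   # update the internal state
        return self.g(self.H)        # produce the output from it

    def reset(self):
        self.H = 0                   # forget the history

# Demo with a running sum: same input, different outputs.
cell = SimpleRecurrentCell(f=lambda h, x: h + x, g=lambda h: h)
print(cell.step(3))  # -> 3
print(cell.step(3))  # -> 6 (same input, different output)
```

The state H is what lets the cell give a different answer the second time it sees 3.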
For example, consider a recurrent layer with f(H, X) = 2H + X and g(H) = -H + 5:
- First we start with H = 0.
- After the first X=3, we get H = 2*0+3 = 3, and output Y = -3+5 = 2.
- After the second X=3, we get H = 2*3+3 = 9 and Y = -9+5 = -4.
- After the third X=3, we get H = 2*9+3 = 21 and Y = -21+5 = -16.
- If we reset H = 0 and then ask again for X=3, we get H = 3 and Y = 2 again.
We can easily see here how the same repeated input yields different outputs. At any moment, we can reset the internal state (H = 0), and the same output sequence will be generated again.
Basically recurrent networks behave like non-recurrent networks if we reset the internal state after each step (sequence length = 1).
For recurrent networks, the same input sequence yields the same output sequence (the internal state is reset after each sequence, not after each input), while for non-recurrent networks, the same single input yields the same output.
Recurrent layers are very useful for everything related to sequences, where an element's position and context in the sequence matter more than the element itself.
Consider text processing: given the letters H, O, U, S, we predict the next letter to be E. Given the letters L, I, S, we predict T. Both sequences have S as the last known letter before the prediction, yet the history of the sequence matters more for the prediction than the last letter alone. This history is encoded in the internal state H, so when the letter S finally arrives, the two different histories (internal states) let us predict E as the next letter in the first case, and T in the second.
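A toy illustration of this idea: here the "internal state" is simply the letters seen so far, and the prediction comes from a tiny hypothetical lookup table (`COMPLETIONS` is made up for this sketch). A real recurrent layer instead compresses the history into a learned numeric vector:

```python
# Toy illustration: the internal state accumulates the history, so two
# sequences ending in the same letter can still predict differently.
class ToyPredictor:
    COMPLETIONS = {"HOUS": "E", "LIS": "T"}  # hypothetical lookup table

    def __init__(self):
        self.state = ""          # starts empty, like H = 0

    def feed(self, letter):
        self.state += letter     # the state encodes the history

    def predict(self):
        return self.COMPLETIONS.get(self.state, "?")

p = ToyPredictor()
for letter in "HOUS":
    p.feed(letter)
print(p.predict())  # -> E

p = ToyPredictor()
for letter in "LIS":
    p.feed(letter)
print(p.predict())  # -> T
```

Both runs end by feeding the letter S, but the accumulated state differs, so the predictions differ, which is the point the paragraph makes.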