An LSTM Odyssey
This week I read LSTM: A Search Space Odyssey. It’s an excellent paper that systematically evaluates the different internal mechanisms of an LSTM (long short-term memory) block by disabling each mechanism in turn and comparing performance. We’re going to implement each of the variants in TensorFlow and evaluate their performance on the Penn Treebank (PTB) dataset. This obviously won’t be as thorough as the original paper, but it lets us see, and try out, the impact of each variant for ourselves.
TL;DR: Check out the GitHub repo for results and variant definitions.
We’ll start with a setup similar to TensorFlow’s RNN tutorial. The primary difference is that we’re going to use a very simple re-implementation for the LSTM cell defined as follows:
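In the paper’s notation, the vanilla LSTM with peephole connections computes:

```latex
\begin{aligned}
z^t &= g(W_z x^t + R_z y^{t-1} + b_z) && \text{block input} \\
i^t &= \sigma(W_i x^t + R_i y^{t-1} + p_i \odot c^{t-1} + b_i) && \text{input gate} \\
f^t &= \sigma(W_f x^t + R_f y^{t-1} + p_f \odot c^{t-1} + b_f) && \text{forget gate} \\
c^t &= z^t \odot i^t + c^{t-1} \odot f^t && \text{cell state} \\
o^t &= \sigma(W_o x^t + R_o y^{t-1} + p_o \odot c^t + b_o) && \text{output gate} \\
y^t &= h(c^t) \odot o^t && \text{block output}
\end{aligned}
```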
This corresponds to the “vanilla” LSTM from the paper. Each equation defines a particular component of the block: block input (z), input gate (i), forget gate (f), cell state (c), output gate (o) and block output (y). Both g and h represent the hyperbolic tangent function and σ represents the sigmoid activation function. The circle dot (⊙) represents element-wise multiplication.
Here’s the same thing in code:
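The original snippet isn’t reproduced here, so below is a minimal NumPy sketch of the same step. The parameter names follow the equations; the dict-of-arrays layout is just for illustration, not the repo’s actual structure.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, y_prev, c_prev, p):
    """One vanilla LSTM step. p maps names to parameters: input matrices W_*,
    recurrent matrices R_*, peephole vectors p_*, and biases b_*."""
    z = np.tanh(x @ p['W_z'] + y_prev @ p['R_z'] + p['b_z'])                      # block input
    i = sigmoid(x @ p['W_i'] + y_prev @ p['R_i'] + c_prev * p['p_i'] + p['b_i'])  # input gate
    f = sigmoid(x @ p['W_f'] + y_prev @ p['R_f'] + c_prev * p['p_f'] + p['b_f'])  # forget gate
    c = z * i + c_prev * f                                                        # cell state
    o = sigmoid(x @ p['W_o'] + y_prev @ p['R_o'] + c * p['p_o'] + p['b_o'])       # output gate
    y = np.tanh(c) * o                                                            # block output
    return y, c
```

In the repo, these same six lines appear as TensorFlow ops inside the cell’s __call__ method.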
Be sure to check out the full source for the rest of the cell definition. Mostly we create a new class inheriting from RNNCell and use the above code as the body of __call__. The nice part about this setup is that we can utilize MultiRNNCell to stack the LSTMs into multiple layers.
Notice that we initialize all of our parameters using get_variable. This is necessary so that we can reuse these variables for each time step rather than creating new parameters at each step. Also, all parameters are transposed from the paper’s definitions to avoid additional graph operations.
Then we define each equation as operations in the graph. Many of the operations have reversed inputs from the equations so that the matrix multiplications produce the correct dimensionality. Other than these details we’re directly translating the equations.
Note that from a performance perspective, this is a naïve implementation. If you look at the source for TensorFlow’s LSTMCell you’ll see that all of the cell inputs and states are concatenated together before doing any matrix multiplication. This improves performance; however, since we’re more interested in taking the LSTM apart, we’ll keep things simple.
Running this vanilla LSTM on the included notebook we obtain a test perplexity (e^cost) of less than 100. So far so good. This will serve as our baseline to compare to the other variants. Below is the cost (average negative log probability of the target words) on the validation set after each epoch:
The most helpful bits for implementing each of the variants can be found in appendix A3 of the paper. The gate omission variants such as no input gate (NIG), no forget gate (NFG), and no output gate (NOG) simply set their respective gates to 1 (be sure to use floats, not integers, here):
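As an illustrative NumPy sketch (not the repo’s TensorFlow code, and with made-up input values), disabling the forget gate looks like this:

```python
import numpy as np

# Sketch of a gate-omission variant (here NFG): the gate's equation is
# replaced by the constant 1.0 and the rest of the step is unchanged.
z = np.tanh(np.array([0.2, -0.4]))                # block input
i = 1.0 / (1.0 + np.exp(-np.array([0.1, 0.3])))   # input gate (still computed)
c_prev = np.array([0.5, -0.5])
f = 1.0                                           # forget gate omitted: a float, not the int 1
c = z * i + c_prev * f                            # cell state update as before
```

NIG and NOG are analogous: set i or o, respectively, to 1.0 instead.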
The no input activation function (NIAF) and no output activation function (NOAF) variants remove their input or output activation functions, respectively:
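A NumPy sketch of both changes, with illustrative values (again, not the repo’s TensorFlow code):

```python
import numpy as np

# NIAF: the block input passes through without g(.); NOAF: the block
# output is no longer squashed by h(.) into (-1, 1).
x = np.array([0.7, -1.2])
z = x                                       # NIAF: was z = np.tanh(x @ ...)
c = np.array([2.0, -3.0])                   # illustrative cell state
o = np.array([0.5, 0.9])                    # illustrative output gate
y = c * o                                   # NOAF: was y = np.tanh(c) * o
```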
The no peepholes (NP) variant removes peepholes from all three gates:
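Sketched in NumPy with made-up shapes (the repo makes the equivalent change to its TensorFlow ops):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# NP: each gate drops its peephole term.
# vanilla: i = sigmoid(x @ W_i + y_prev @ R_i + c_prev * p_i + b_i)
# NP:      i = sigmoid(x @ W_i + y_prev @ R_i + b_i)
rng = np.random.default_rng(1)
x, y_prev = rng.normal(size=3), rng.normal(size=4)
W_i, R_i, b_i = rng.normal(size=(3, 4)), rng.normal(size=(4, 4)), np.zeros(4)
i = sigmoid(x @ W_i + y_prev @ R_i + b_i)   # no cell-state term in any gate
```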
The coupled input-forget gate (CIFG) variant sets the forget gate like so:
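A NumPy sketch with illustrative values:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# CIFG: only the input gate is computed; the forget gate is coupled to it,
# so the separate W_f, R_f, p_f, and b_f parameters disappear.
i = sigmoid(np.array([0.3, -1.0, 2.0]))
f = 1.0 - i
c_prev = np.array([1.0, 1.0, 1.0])
z = np.array([0.5, 0.5, 0.5])
c = z * i + c_prev * f       # a convex blend of new input and old state
```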
The final variant, full gate recurrence (FGR), is the most complex, essentially allowing each gate’s previous state to interact with each gate’s next state:
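A NumPy sketch of one of the three gates (the matrix names like R_ii are my own labels for illustration; net_i stands in for the gate’s usual input, recurrent, and bias terms):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# FGR: each gate additionally receives the previous time step's gate
# activations through nine extra square recurrent matrices.
H = 4
rng = np.random.default_rng(2)
R = {k: rng.normal(size=(H, H)) * 0.1 for k in
     ('ii', 'fi', 'oi', 'if', 'ff', 'of', 'io', 'fo', 'oo')}
i_prev, f_prev, o_prev = (sigmoid(rng.normal(size=H)) for _ in range(3))
net_i = rng.normal(size=H)   # stand-in for W_i x + R_i y_prev + b_i
i = sigmoid(net_i + i_prev @ R['ii'] + f_prev @ R['fi'] + o_prev @ R['oi'])
# f and o gain the analogous three terms each, using the remaining six matrices.
```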
In many of the variants, we can remove parameters no longer needed to compute the cell. The FGR variant, however, adds significantly more parameters (9 additional square matrices) which also increases training time.
To implement each variant, we’ll simply duplicate our vanilla LSTM cell implementation and make the necessary modifications. There are too many to show here, but you can view the full source for each variant on GitHub. To train each, we’ll use the same hyperparameters as the vanilla LSTM trial. This probably isn’t fair, and a more thorough analysis (as performed in the paper) would search for the best hyperparameters for each variant.
The NFG and NOG variants fail to converge to anything useful, while the NIAF variant diverges sharply after around the 8th epoch. (This divergence could probably be fixed with learning rate decay, which I omitted for simplicity.)
In contrast, the NIG, CIFG, NP and FGR variants all converge. The NIG and FGR variants do not produce great results while the NP and CIFG variants perform similarly to the vanilla LSTM.
Finally, the NOAF variant. Its poor performance is likely due to the missing output activation function, which normally clamps the block output, so its cost explodes:
Here are the test perplexities for each variant:
Overall it’s been fun dissecting the LSTM. Feel free to try out the code yourself, and if you’re interested in taking this further I recommend running comparisons with GRUs, looking at fANOVA, or extending what’s here with a more thorough analysis.