An LSTM Odyssey

Photo from Unsplash

This week I read LSTM: A Search Space Odyssey. It’s an excellent paper that systematically evaluates the different internal mechanisms of an LSTM (long short-term memory) block by disabling each mechanism in turn and comparing their performance. We’re going to implement each of the variants in TensorFlow and evaluate their performance on the Penn Tree Bank (PTB) dataset. This will obviously not be as thorough as the original paper but it allows us to see, and try out, the impact of each variant for ourselves.

TL;DR Check out the Github repo for results and variant definitions.

Vanilla LSTM

We’ll start with a setup similar to TensorFlow’s RNN tutorial. The primary difference is that we’re going to use a very simple re-implementation for the LSTM cell defined as follows:

LSTM equations from section 2.

This corresponds to the “vanilla” LSTM from the paper. Each equation defines a particular component of the block: block input (z), input gate (i), forget gate (f), cell state (c), output gate (o) and block output (y). Both g and h represent the hyperbolic tangent function and sigma represents the sigmoid activation function. The circle dot represents element-wise multiplication.

Here’s the same thing in code:

Be sure to check out the full source for the rest of the cell definition. Mostly we create a new class inheriting from RNNCell and use the above code as the body of __call__. The nice part about this setup is that we can utilize MultiRNNCell to stack the LSTMs into multiple layers.

Notice that we initialize all of our parameters using get_variable. This is necessary so that we can reuse these variables for each time step rather than creating new parameters at each step. Also, all parameters are transposed from the paper’s definitions to avoid additional graph operations.

Then we define each equation as operations in the graph. Many of the operations have reversed inputs from the equations so that the matrix multiplications produce the correct dimensionality. Other than these details we’re directly translating the equations.

Note that from a performance perspective, this is a naïve implementation. If you look at the source for TensorFlow’s LSTMCell you’ll see that all of the cell inputs and states are concatenated together before doing any matrix multiplication. This is to improve performance, however, since we’re more interested in taking the LSTM apart, we’ll keep things simple.

Running this vanilla LSTM on the included notebook we obtain a test perplexity (e^cost) of less than 100. So far so good. This will serve as our baseline to compare to the other variants. Below is the cost (average negative log probability of the target words) on the validation set after each epoch:

Vanilla cost on the validation set

Variants

The most helpful bits for implementing each of the variants can be found in appendix A3 of the paper. The gate omission variants such as no input gate (NIG), no forget gate (NFG), and no output gate (NOG) simply set their respective gates to 1 (be sure to use floats, not integers, here):

NIG sets i to 1, NFG sets f to 1 and NOG sets o to 1.

The no input activation function (NIAF) and no output activation function (NOAF) variants remove their input or output activation functions, respectively:

NIAF removes the g(x) activation function, while NOAF removes the h(x) activation function.

The no peepholes (NP) variant removes peepholes from all three gates:

For all three gates remove the peepholes.

The coupled input-forget gate (CIFG) variant sets the forget gate like so:

The final variant, full gate recurrence (FGR), is the most complex, essentially allowing each gate’s previous state to interact with each gate’s next state:

Recurrent connections are added for each of the gates.

In many of the variants, we can remove parameters no longer needed to compute the cell. The FGR variant, however, adds significantly more parameters (9 additional square matrices) which also increases training time.

To implement each, we’ll simply duplicate our vanilla LSTM cell implementation and make the necessary modifications for the variant. There are too many to show here but you can view the full source for each variant on Github. To train each, we’ll use the same hyperparameters from the vanilla LSTM trial. This probably isn’t fair and a more thorough analysis (as performed in the paper) would try to find the best hyperparameters for each variant.

Training progress of model variants.

Results

The NFG and NOG variants fail to converge to anything useful while the NIAF variant diverges significantly after around the 8th epoch. (This divergence could probably be fixed with learning rate decay which I omitted for simplicity.)

Diverging variants

In contrast, the NIG, CIFG, NP and FGR variants all converge. The NIG and FGR variants do not produce great results while the NP and CIFG variants perform similarly to the vanilla LSTM.

Converging variants

Finally the NOAF variant. Its poor performance is likely due to the lack of clamping from the output activation function so its cost explodes:

Here are the test perplexities for each variant:

Conclusion

Overall it’s been fun dissecting the LSTM. Feel free to try out the code yourself and if you’re interested in taking this further I recommend running comparisons with GRUs, looking at fANOVA or extending what’s here with more thorough analysis.

Follow me on Twitter for more posts like these. If you’d like help with production NLP, I do consulting.