[Tensorflow] Building RNN Models to Solve Sequential MNIST

Understanding Tensorflow Part 2

In this post, we’re going to lay some groundwork for the custom model which will be covered in the next post by familiarizing ourselves with using RNN models in Tensorflow to deal with the sequential MNIST problem. The basic framework of the code used in this post is based on the following two notebooks:

  1. Aymeric Damien’s Recurrent Neural Network Example
  2. Sungjoon’s Sequence classification with LSTM

I’ve put the source code for this post in a notebook hosted on Google Colaboratory, which kindly provides a free GPU runtime for the public to use (I kept getting disconnected from the runtime when running the notebook, so some of the model training was not completed; you can copy the notebook and run it yourself):

LINK TO THE NOTEBOOK ON GOOGLE COLAB

The notebook does most of the talking. The following sections of this post discuss some parts of the notebook in more detail, and also provide some additional information that was left out of the notebook.

20180528 Update (GitHub repo with links to all posts and notebooks):

Overview

Every example from the MNIST dataset is a 28x28 image. We are going to apply recurrent neural networks to it in two ways:

  1. Row-by-row: The RNN cells are seeing the ith row of the image in the ith step, that is, a vector of size 28. The total number of time steps is 28.
  2. Pixel-by-pixel: The RNN cells are seeing the ith pixel (a single number, row-first order) in the ith step. The total number of time steps is 28*28 = 784.
Row-by-row sequential MNIST (plot taken from Sungjoon’s notebook)

The pixel-by-pixel case is a lot harder because a decent model has to keep a very long-term memory.
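To make the two framings concrete, they are just different reshapes of the same flattened 784-dimensional MNIST vector. A minimal sketch (the random array below is a stand-in for a real batch, not the notebook's data pipeline):

import numpy as np

batch_x = np.random.rand(32, 784).astype("float32")  # stand-in for a batch of flattened MNIST images
# Row-by-row: 28 time steps, each step sees one 28-dim row vector
x_rows = batch_x.reshape(-1, 28, 28)
# Pixel-by-pixel: 784 time steps, each step sees a single pixel (row-first order)
x_pixels = batch_x.reshape(-1, 784, 1)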

We’re going to build four models (two models for each case):

  1. First we replicate the exact same model from Aymeric Damien’s notebook, which uses BasicLSTMCell class to build the LSTM layer.
  2. Refactor the first model, replace BasicLSTMCell with LSTMBlockCell, and add some scaffolding that should help us debug and tune the model later.
  3. We can further increase speed by replacing the LSTM layer with CudnnGRU, as running the long sequences from the pixel-by-pixel approach would otherwise drag performance down significantly. Tensorboard support is also added.
  4. Finally we use the exact same model from (3) on the permuted sequential MNIST, which shuffles the order of the pixels and makes the problem even harder.

Improving the BasicLSTMCell model

We’re jumping directly to the second model, which is different from the first model in the following ways:

  1. Use LSTMBlockCell, which should be faster than BasicLSTMCell
  2. Replace rnn.static_rnn with tf.nn.dynamic_rnn. (So there is no need to unstack the tensor.)
  3. Replace manual weight definitions with tf.layers.Dense
  4. Replace tf.nn.softmax_cross_entropy_with_logits with tf.nn.softmax_cross_entropy_with_logits_v2
  5. Group graph definition together
  6. Add a batch_normalization layer between LSTM and Dense layers.
  7. Add gradient clipping for RNN gradient
  8. Add a checkpoint saver
  9. Evaluate test accuracy every N steps (BAD PRACTICE: use a validation set instead) — this will be fixed once we reach the part where we use Dataset APIs to import a new dataset.
  10. Replace GradientDescentOptimizer with RMSPropOptimizer
  11. Use tf.set_random_seed to control randomness

I’m going to discuss some of them in the following sections.
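Two of the smaller items (the checkpoint saver, item 8, and the random seed, item 11) are not covered below; here is a rough sketch of what they look like in isolation (the path, seed value, and dummy variable are arbitrary stand-ins, not the notebook's actual code):

import os
import tensorflow as tf

tf.set_random_seed(42)  # item 11: fix graph-level randomness (seed value is arbitrary)

# stand-in for the model's trainable variables
dummy_weight = tf.get_variable(
    "dummy_weight", shape=[4], initializer=tf.random_normal_initializer())

saver = tf.train.Saver(max_to_keep=3)  # item 8: keep only the most recent checkpoints
os.makedirs("/tmp/seq_mnist_ckpt", exist_ok=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # inside the training loop you would periodically call:
    saver.save(sess, "/tmp/seq_mnist_ckpt/model.ckpt", global_step=0)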

LSTMBlockCell

This Tensorflow LSTM benchmark is very comprehensive:

My takeaways:

  • For plain LSTM, you usually want to use CudnnLSTM, or LSTMBlockFused if you don’t have GPU access.
  • If you want to do some operations between time steps like variational dropout, use LSTMBlock.
  • Use StandardLSTM only if you know what you’re doing.
  • You should probably never have any reason to use BasicLSTM.
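For example, a minimal sketch of the fused variant (assuming the same num_hidden and a batch-major input x as in the snippets below; this is not code from the notebook):

# Fused LSTM: runs the whole sequence in a single op and expects time-major input
fused_cell = tf.contrib.rnn.LSTMBlockFusedCell(num_hidden)
outputs, state = fused_cell(
    tf.transpose(x, (1, 0, 2)), dtype=tf.float32)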

Tensorflow has a nice wrapper that does variational dropout for you:

lstm_cell = rnn.DropoutWrapper(
    rnn.LSTMBlockCell(num_hidden, forget_bias=1.0),
    input_keep_prob=0.5,
    output_keep_prob=0.5,
    state_keep_prob=0.5,
    variational_recurrent=True,
    # input_size is required when variational_recurrent=True and input
    # dropout is applied; num_input is the per-step feature size
    input_size=tf.TensorShape([num_input]),
    dtype=tf.float32
)

That’s probably the main reason why you would sometimes want to use LSTMBlockCell instead of CudnnLSTM. For sequential MNIST the risk of overfitting is relatively low, so we did not use any dropout in the notebook.

Dynamic RNN vs Static RNN

I feel the difference between dynamic_rnn and static_rnn is somewhat vague in the documentation. These two discussion threads (stackoverflow and github) cleared things up a bit for me. The main difference seems to be that dynamic_rnn supports dynamic maximum sequence length in batch level, while static_rnn doesn’t. From what I’ve read, there seems to be little reason not to always use dynamic_rnn.

You simply supply the whole batch of input data as a single tensor to dynamic_rnn instead of slicing it into a list of tensors (one per time step). This is easier to write and read than static_rnn:

# input shape: (batch_size, length, channels)
lstm_cell = rnn.BasicLSTMCell(num_hidden, forget_bias=1.0)

# Static RNN: the input must be unstacked into a list of per-step tensors
x_steps = tf.unstack(x, timesteps, 1)
outputs, _ = rnn.static_rnn(lstm_cell, x_steps, dtype=tf.float32)

# Dynamic RNN: the whole batch tensor is passed in directly
outputs, _ = tf.nn.dynamic_rnn(
    cell=lstm_cell, inputs=x, time_major=False,
    dtype=tf.float32)
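As a side note (not needed for MNIST, where every sequence has the same length), dynamic_rnn also accepts a sequence_length argument so that padded steps within a batch are skipped; the placeholder below is hypothetical:

# hypothetical per-example lengths for variable-length sequences
seq_len = tf.placeholder(tf.int32, [None])
outputs, _ = tf.nn.dynamic_rnn(
    cell=lstm_cell, inputs=x, sequence_length=seq_len,
    time_major=False, dtype=tf.float32)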

tf.layers.Dense

In the first model, you have to define the weight and the bias for the linear (output) layer manually:

weights = {
    'out': tf.Variable(tf.random_normal(
        [num_hidden, num_classes]))
}
biases = {
    'out': tf.Variable(tf.random_normal([num_classes]))
}

And calculate the output logits by doing a matrix multiplication and an addition:

return tf.matmul(outputs[-1], weights['out']) + biases['out']

Albeit very good for educational purposes, you probably don’t want to do this every time you need a linear layer. The abstraction provided by tf.layers.Dense offers an experience similar to the nn.Linear layer in PyTorch:

output_layer = tf.layers.Dense(
    num_classes, activation=None,
    kernel_initializer=tf.orthogonal_initializer()
)
return output_layer(
    tf.layers.batch_normalization(outputs[:, -1, :]))

You can also use the shortcut function, like I just did with tf.layers.batch_normalization:

return tf.layers.dense(
    tf.layers.batch_normalization(outputs[:, -1, :]),
    num_classes, activation=None,
    kernel_initializer=tf.orthogonal_initializer()
)

RMSProp and Gradient Clipping

RMSProp speeds up convergence, and gradient clipping helps deal with the exploding-gradient problem of RNNs.

loss_op = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(
        logits=logits, labels=Y))
optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate)
# Get the gradients
gvs = optimizer.compute_gradients(loss_op)
# Clip gradients (except gradients from the dense layer)
capped_gvs = [
    (tf.clip_by_norm(grad, 2.), var)
    if not var.name.startswith("dense") else (grad, var)
    for grad, var in gvs]
# Apply gradients (update trainable variables)
train_op = optimizer.apply_gradients(capped_gvs)
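One caveat worth flagging (the notebook may handle this differently): if the tf.layers.batch_normalization layer is run with training=True, its moving-average update ops live in tf.GraphKeys.UPDATE_OPS and have to be executed alongside the train op, e.g.:

# make sure batch-norm statistics are updated together with the weights
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.apply_gradients(capped_gvs)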

Pixel-by-Pixel Sequential MNIST

The row-by-row case only involves 28 time steps and is fairly easy to solve with a wide range of hyper-parameters (initialization methods, number of hidden units, learning rate, etc.). The pixel-by-pixel MNIST with 784 time steps is a lot harder to crack. Unfortunately, I could not find a set of hyper-parameters for an LSTM model that could guarantee convergence. Instead, I found GRU models much easier to tune, and they succeeded in reaching 90%+ test accuracy in multiple cases.

CudnnGRU

PyTorch uses cuDNN implementations of RNNs by default, and that’s a large part of what makes its RNNs fast. We can also use those implementations in Tensorflow via tf.contrib.cudnn_rnn:

# x shape: (batch_size, length, channels)
gru = tf.contrib.cudnn_rnn.CudnnGRU(
    1, num_hidden,
    kernel_initializer=tf.orthogonal_initializer())
outputs, _ = gru(tf.transpose(x, (1, 0, 2)))

RNN classes from the tf.contrib.cudnn_rnn module don’t have a time_major parameter, so the input shape is always (length, batch_size, channels). Moreover, if you want to get the most speed, let CudnnGRU run through the whole sequence in a single call (as the code above does) instead of feeding it step by step. It appears to work similarly to dynamic_rnn, meaning the maximum length is allowed to differ between batches.
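The outputs come back time-major as well, so picking the last time step for the classifier looks roughly like this (a sketch reusing the names from the earlier snippets, not the notebook's exact code):

# outputs shape: (length, batch_size, num_hidden); take the last time step
last_output = outputs[-1]
logits = tf.layers.dense(
    tf.layers.batch_normalization(last_output),
    num_classes, activation=None)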

Tensorboard

Grouping variables and operations using tf.variable_scope brought us this modularized graph in Tensorboard:

I’ve also saved the raw and clipped gradients every 250 steps. We can use those histograms to determine which threshold we should use:

A lot of gradients were clipped in the above example. So we might want to move the threshold from 0.5 to 1.0 to speed things up.
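For reference, these histograms can be recorded with tf.summary.histogram. A minimal sketch using the variable names from the gradient-clipping snippet above (the summary tags are hypothetical, not the notebook's):

# log raw and clipped gradients side by side for Tensorboard
for (grad, var), (capped_grad, _) in zip(gvs, capped_gvs):
    if grad is None:
        continue
    name = var.name.replace(":", "_")  # ':' is not allowed in summary names
    tf.summary.histogram(name + "/grad_raw", grad)
    tf.summary.histogram(name + "/grad_clipped", capped_grad)
merged_summary = tf.summary.merge_all()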

Permuted Pixel-by-Pixel Sequential MNIST

This is quite simply applying a fixed permutation to every incoming sequence. We’re no longer able to see a straight horizontal line as an all-ones sub-sequence. The purpose is to make the problem even harder.

The Permutation

By utilizing tf.gather:

# Set seed to ensure we have the same permutation
np.random.seed(100)
permute = np.random.permutation(784)
X = tf.gather(X_, permute, axis=1)
tf.gather [source]

Remember to use a different (Python) variable name, because you’re going to feed the input to the placeholder (previously named X, now X_). Using the same name would make Tensorflow replace the permuted sequences in the graph with your raw input, so the results would not be permuted. (I should probably use a more distinguishable name than X_.)

What’s next

Now we’re familiar with how to deal with sequential MNIST with Tensorflow and the basic use of some RNN classes. In the next post we’ll learn how to use tf.layers APIs to write our customized layers, and implement Temporal Convolutional Networks (TCN) in Tensorflow.

Quick Links