[Tensorflow] Building RNN Models to Solve Sequential MNIST

Understanding Tensorflow Part 2

Ceshine Lee
Mar 30, 2018 · 7 min read

In this post, we’re going to lay some groundwork for the custom model which will be covered in the next post by familiarizing ourselves with using RNN models in Tensorflow to deal with the sequential MNIST problem. The basic framework of the code used in this post is based on the following two notebooks:

  1. Aymeric Damien’s Recurrent Neural Network Example
  2. Sungjoon Choi’s sequential MNIST notebook

I’ve put the source code for this post in a notebook hosted on Google Colaboratory, which kindly provides a free GPU runtime for the public to use. (Note: I kept getting disconnected from the runtime when running the notebook, so some of the model training runs were not completed. You can copy the notebook and run it yourself.):


The notebook does most of the talking. The following sections of this post discuss some parts of the notebook in more detail, and also provide some additional information that was left out of the notebook.

20180528 Update (GitHub repo with links to all posts and notebooks):


Every example from the MNIST dataset is a 28x28 image. We are going to apply recurrent neural networks to it in two ways:

  1. Row-by-row: the RNN cells see the ith row of the image at the ith step, that is, a vector of size 28. The total number of time steps is 28.
  2. Pixel-by-pixel: the RNN cells see one pixel at a time, that is, a vector of size 1. The total number of time steps is 784.

Row-by-row sequential MNIST (plot taken from Sungjoon’s notebook)

The pixel-by-pixel case is a lot harder because a decent model has to keep a very long-term memory.
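To make the two settings concrete, here is a minimal numpy sketch (using a dummy array standing in for one MNIST example) of how a single 28x28 image becomes a sequence in each case:

```python
import numpy as np

# Dummy 28x28 "image" standing in for one MNIST example
image = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)

# Row-by-row: 28 time steps, each step sees a 28-dim vector
rows = image.reshape(28, 28)

# Pixel-by-pixel: 784 time steps, each step sees a single value
pixels = image.reshape(784, 1)
```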

We’re going to build four models (two models for each case):

  1. First we replicate the exact same model from Aymeric Damien’s notebook, which uses the BasicLSTMCell class to build the LSTM layer.

Improving the BasicLSTMCell model

We’re jumping directly to the second model, which is different from the first model in the following ways:

  1. Use LSTMBlockCell, which should be faster than BasicLSTMCell
  2. Use dynamic_rnn instead of static_rnn
  3. Use tf.layers.Dense (with batch normalization) for the output layer instead of manually defined weights and biases
  4. Use RMSProp with gradient clipping

I’m going to discuss some of them in the following sections.


This Tensorflow LSTM benchmark is very comprehensive:

My takeaways:

  • For plain LSTM, you usually want to use CudnnLSTM, or LSTMBlockFused if you don’t have GPU access.

Tensorflow has a nice wrapper that does variational dropout for you:

lstm_cell = rnn.DropoutWrapper(
    rnn.LSTMBlockCell(num_hidden, forget_bias=1.0),
    state_keep_prob=keep_prob,
    variational_recurrent=True, dtype=tf.float32)

That’s probably the main reason why you sometimes want to use LSTMBlockCell instead of CudnnLSTM. For sequential MNIST, overfitting is a relatively minor problem, so we did not use any dropout in the notebook.
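To illustrate what “variational” means here, the sketch below (plain numpy; the variable names are mine) samples one dropout mask per sequence and reuses it at every time step, instead of resampling a fresh mask at each step as standard dropout would:

```python
import numpy as np

rng = np.random.RandomState(0)
keep_prob = 0.8
batch, timesteps, hidden = 4, 28, 16

# One mask per sequence, broadcast over the time dimension
# (divide by keep_prob to keep the expected activation unchanged)
mask = (rng.rand(batch, 1, hidden) < keep_prob) / keep_prob

x = np.ones((batch, timesteps, hidden), dtype=np.float32)
dropped = x * mask  # the SAME units are zeroed at every time step
```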

Dynamic RNN vs Static RNN

I feel the difference between dynamic_rnn and static_rnn is somewhat vague in the documentation. These two discussion threads (stackoverflow and github) cleared things up a bit for me. The main difference seems to be that dynamic_rnn supports dynamic maximum sequence length in batch level, while static_rnn doesn’t. From what I’ve read, there seems to be little reason not to always use dynamic_rnn.

You simply supply the whole batch of input data as a single tensor to dynamic_rnn instead of slicing it into a list of tensors (one per time step). This is easier to write and read than static_rnn:

# input shape: (batch_size, length, channels)
# Static RNN: unstack the time dimension into a list of tensors
x = tf.unstack(x, timesteps, 1)
lstm_cell = rnn.BasicLSTMCell(num_hidden, forget_bias=1.0)
outputs, states = rnn.static_rnn(lstm_cell, x, dtype=tf.float32)
# Dynamic RNN: feed the whole batch as a single tensor
outputs, _ = tf.nn.dynamic_rnn(
    cell=lstm_cell, inputs=x, time_major=False,
    dtype=tf.float32)


In the first model, you have to define the weights and biases for the linear (output) layer manually:

weights = {
    'out': tf.Variable(tf.random_normal(
        [num_hidden, num_classes]))
}
biases = {
    'out': tf.Variable(tf.random_normal([num_classes]))
}

And calculate the output logits by doing a matrix multiplication and an addition:

return tf.matmul(outputs[-1], weights['out']) + biases['out']
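In plain numpy terms (using the same shapes as above), that line is just a matrix multiplication plus a bias:

```python
import numpy as np

num_hidden, num_classes, batch_size = 128, 10, 32
rng = np.random.RandomState(0)

last_output = rng.randn(batch_size, num_hidden)  # last RNN time step
W = rng.randn(num_hidden, num_classes)           # weights['out']
b = rng.randn(num_classes)                       # biases['out']

logits = last_output @ W + b                     # (batch_size, num_classes)
```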

While very good for educational purposes, you probably don’t want to do this every time you need a linear layer. The abstraction provided by tf.layers.Dense offers an experience similar to PyTorch’s nn.Linear layer:

output_layer = tf.layers.Dense(num_classes, activation=None)
return output_layer(
    tf.layers.batch_normalization(outputs[:, -1, :]))

You can also use the shortcut function, as I just did with tf.layers.batch_normalization:

return tf.layers.dense(
    tf.layers.batch_normalization(outputs[:, -1, :]),
    num_classes, activation=None)

RMSProp and Gradient Clipping

RMSProp speeds up convergence, and gradient clipping helps deal with the exploding gradient problem of RNNs.

loss_op = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
        logits=logits, labels=Y))
optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate)
# Get the gradients
gvs = optimizer.compute_gradients(loss_op)
# Clip gradients (except gradients from the dense layer)
capped_gvs = [
    (tf.clip_by_norm(grad, 2.), var)
    if not var.name.startswith("dense") else (grad, var)
    for grad, var in gvs]
# Apply gradients (update trainable variables)
train_op = optimizer.apply_gradients(capped_gvs)
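For reference, tf.clip_by_norm rescales a gradient tensor only when its L2 norm exceeds the threshold, preserving its direction. A minimal numpy equivalent (the function name is mine):

```python
import numpy as np

def clip_by_norm(grad, clip_norm):
    # Rescale grad so its L2 norm is at most clip_norm,
    # mirroring the behavior of tf.clip_by_norm
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        return grad * (clip_norm / norm)
    return grad

g = np.array([3.0, 4.0])        # L2 norm = 5
clipped = clip_by_norm(g, 2.0)  # rescaled to norm 2, direction kept
```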

Pixel-by-Pixel Sequential MNIST

The row-by-row case only involves 28 time steps, and is fairly easy to solve with a wide range of hyper-parameters (initialization methods, number of hidden units, learning rate, etc.). The pixel-by-pixel case, with 784 time steps, is a lot harder to crack. Unfortunately, I could not find a set of hyper-parameters for an LSTM model that could guarantee convergence. GRU models, on the other hand, were much easier to tune and reached 90%+ test accuracy in multiple runs.


PyTorch uses CuDNN implementations of RNNs by default, and that’s what makes it faster. We could also utilize those implementations in Tensorflow via tf.contrib.cudnn_rnn:

# x shape: (batch_size, length, channels)
gru = tf.contrib.cudnn_rnn.CudnnGRU(1, num_hidden)
# Cudnn RNNs expect time-major input: (length, batch_size, channels)
outputs, _ = gru(tf.transpose(x, (1, 0, 2)))

RNN classes from the tf.contrib.cudnn_rnn module don’t have a time_major parameter, so the input shape is always (length, batch_size, channels). Moreover, to get the most speed, let CudnnGRU run through the whole sequence in a single call (as the code above does) instead of feeding it step by step. It appears to work similarly to dynamic_rnn, meaning the maximum length is allowed to differ between batches.
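The batch-major to time-major conversion is just a transpose of the first two axes; in numpy terms:

```python
import numpy as np

batch_size, length, channels = 32, 784, 1
x = np.arange(batch_size * length * channels,
              dtype=np.float32).reshape(batch_size, length, channels)

# Swap batch and time axes:
# (batch_size, length, channels) -> (length, batch_size, channels)
x_time_major = np.transpose(x, (1, 0, 2))
```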


Grouping variables and operations using tf.variable_scope brought us this modularized graph in Tensorboard:

I’ve also saved the raw and clipped gradients every 250 steps. We can use those histograms to determine which threshold to use:

A lot of gradients were clipped in the above example, so we might want to raise the threshold from 0.5 to 1.0 to speed things up.

Permuted Pixel-by-Pixel Sequential MNIST

This simply applies a fixed permutation to every incoming sequence, so we can no longer see a straight horizontal line as an all-one sub-sequence. The purpose is to make the problem even harder.

The Permutation

By utilizing tf.gather :

# Set a seed so we always get the same permutation
np.random.seed(0)  # the seed value is arbitrary
permute = np.random.permutation(784)
X = tf.gather(X_, permute, axis=1)

Remember to use a different (Python) variable name, because you’re going to feed the input to the placeholder (previously named X, now X_). Using the same name will make Tensorflow replace the permuted sequences in the graph with your input, and the results will not be permuted. (I should probably have used a more distinguishable name than X_.)
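A numpy sketch of the same idea (np.take plays the role of tf.gather here; the seed and names are mine): because the permutation is sampled once, every sequence in every batch is shuffled in exactly the same way.

```python
import numpy as np

rng = np.random.RandomState(42)
permute = rng.permutation(784)   # one fixed permutation, reused everywhere

# Fake batch of flattened images: (batch_size, length, channels)
batch = np.arange(2 * 784, dtype=np.float32).reshape(2, 784, 1)

# np.take along axis 1 mirrors tf.gather(X_, permute, axis=1)
permuted = np.take(batch, permute, axis=1)
```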

What’s next

Now we’re familiar with how to deal with sequential MNIST with Tensorflow and the basic use of some RNN classes. In the next post we’ll learn how to use tf.layers APIs to write our customized layers, and implement Temporal Convolutional Networks (TCN) in Tensorflow.
