[Tensorflow] Fashion-MNIST with Dataset API

Understanding Tensorflow Part 4

Fashion-MNIST intends to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It addresses the problem of MNIST being too easy for modern neural networks, along with some other issues.

Samples from Fashion-MNIST

20180528 Update (GitHub repo with links to all posts and notebooks):

Previous parts of this series:

We’re going to continue using the models from Part 2 (GRU) and Part 3 (TCN), but replace MNIST with Fashion-MNIST and load it through the Dataset API.

Overview of the Dataset API

Previously we looped through the MNIST data batches with the mnist.train.next_batch method provided by tf.contrib.learn.python.learn.datasets.mnist and fed the data to the graph via the feed_dict parameter of Session.run. By switching to the Dataset API, we get:

  1. A generic API that works not only with MNIST, but with any dataset. The code in this post can be reused for other image classification tasks.
  2. Datasets and their iterators defined within the graph. So we can use tf.placeholder and iterator initializers to customize them per session, or even per session run.
  3. Support for tensors, Numpy arrays, TFRecords, and text files (including CSV files) as inputs.
  4. Support for multi-process data preprocessing.
  5. Support for multi-worker distributed training via sharding.

The two fundamental abstractions of the Dataset API are:

  1. tf.data.Dataset: represents a sequence of elements, in which each element contains one or more Tensor objects. Some manipulations of the datasets (e.g. batch, shuffle) also return a Dataset object. You can also define data transformations via the map method.
  2. tf.data.Iterator: extracts elements from the Dataset via the Iterator.get_next() method. There are several kinds of iterators to choose from. The simplest one is the “one-shot iterator”, which just iterates through a Dataset once.

They are roughly analogous to torch.utils.data.Dataset and torch.utils.data.DataLoader in PyTorch, although in PyTorch batching, shuffling, and parallelism are configured in DataLoader rather than in Dataset.
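
As a minimal sketch of these two abstractions (not taken from the notebooks), the snippet below builds a Dataset from two small NumPy arrays and pulls batches through a one-shot iterator. It is written against tf.compat.v1 so that it should also run under TensorFlow 2.x; the toy arrays and all variable names are assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1  # TF 1.x-style graph API, also available in TF 2.x

# Toy data (hypothetical): 6 samples with 2 features each, plus integer labels
features = np.arange(12, dtype=np.float32).reshape(6, 2)
labels = np.arange(6, dtype=np.int32)

graph = tf.Graph()
with graph.as_default():
    # Dataset: a sequence of (feature, label) elements, grouped into batches of 4
    dataset = tf1.data.Dataset.from_tensor_slices((features, labels)).batch(4)
    # Iterator: get_next() returns tensors that yield a new batch on every run
    iterator = tf1.data.make_one_shot_iterator(dataset)
    next_x, next_y = iterator.get_next()

with tf1.Session(graph=graph) as sess:
    first_x, first_y = sess.run([next_x, next_y])    # samples 0-3
    second_x, second_y = sess.run([next_x, next_y])  # samples 4-5 (partial batch)
```

Note that nothing is fed through feed_dict here: the data enters the graph through the iterator's output tensors.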

Resources

These are the main resources I used when researching for this post:

Interestingly, in “How to use Dataset in Tensorflow”, the author did not cover the feedable iterator because he did not think it was useful. However, I found it quite useful in our situation, where we need to evaluate the validation set once every few hundred steps. So this post could be used to fill in the missing part of that post.

Importing Fashion-MNIST

Now comes the real deal. As always, the code is hosted on Google Colab:

LINK TO THE CUDNN GRU NOTEBOOK
LINK TO THE TCN NOTEBOOK

Download the Dataset

We use the CSV files from Kaggle Dataset. To download it to the Google Colab environment, I used gsutil to download from a Google Cloud Storage bucket I created (you have to create your own to run). If you don’t have Google Cloud access, I suggest uploading from your local filesystem.

Read the Dataset and Create Train/Validation Split

Because this is a small dataset, we can safely read everything into memory:

import pandas as pd

df_train = pd.read_csv("fashion-mnist_train.csv")
df_test = pd.read_csv("fashion-mnist_test.csv")

And choose 10,000 images randomly as the validation set (note that the test set also has 10,000 images):

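import numpy as np

idx = np.arange(60000)
np.random.shuffle(idx)  # shuffle the row positions in place
print(idx[:5])
df_val = df_train.iloc[idx[:10000]]    # first 10,000 shuffled rows
df_train = df_train.iloc[idx[10000:]]  # remaining 50,000 rows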
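
The split logic can be checked on a small synthetic frame. The sketch below uses a made-up 60-row DataFrame as a stand-in for the real CSV (the column names and sizes are assumptions) and seeds the shuffle so that the same validation rows are picked on every run, which the snippet above does not do.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for fashion-mnist_train.csv (hypothetical, 60 rows)
df_all = pd.DataFrame({
    "label": np.arange(60) % 10,
    "pixel1": np.arange(60),
})

rng = np.random.RandomState(42)     # fixed seed -> reproducible split
idx = rng.permutation(len(df_all))  # shuffled row positions
n_val = 10                          # plays the role of the 10,000 validation images

df_val = df_all.iloc[idx[:n_val]]
df_train = df_all.iloc[idx[n_val:]]

# The two parts are disjoint and together cover every row exactly once
overlap = set(df_val.index) & set(df_train.index)
```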

Define the Dataset in the Graph

First of all, we group all dataset-related definitions into one scope:

import tensorflow as tf

with tf.variable_scope("datasets"):
    ...

So they are displayed nicely as one block in Tensorboard:

Next we use tf.placeholder to define configurable batch sizes (or you can use fixed batch sizes as in the comment):

training_batch_size = tf.placeholder(tf.int64) 
# tf.constant(32, dtype="int64")
inference_batch_size = tf.placeholder(tf.int64)
# tf.constant(500, dtype="int64")

Then we directly create the dataset from the Pandas data frames (the back-end Numpy arrays, to be precise):

def process_batch(batch_x, batch_y):
    return (
        tf.expand_dims(batch_x, -1),
        tf.one_hot(batch_y, num_classes))

fminst_ds_train = tf.data.Dataset.from_tensor_slices(
    (df_train.iloc[:, 1:].astype("float32") / 255,
     df_train.iloc[:, 0].astype("int32"))
).shuffle(
    50000, reshuffle_each_iteration=True
).repeat().batch(training_batch_size).map(process_batch)

fminst_ds_val = tf.data.Dataset.from_tensor_slices(
    (df_val.iloc[:, 1:].astype("float32") / 255,
     df_val.iloc[:, 0].astype("int32"))
).repeat().batch(inference_batch_size).map(process_batch)

fminst_ds_test = tf.data.Dataset.from_tensor_slices(
    (df_test.iloc[:, 1:].astype("float32") / 255,
     df_test.iloc[:, 0].astype("int32"))
).repeat().batch(inference_batch_size).map(process_batch)

Again, passing the arrays in directly works only because this dataset is very small. For medium-size datasets, you might want to use tf.placeholder to create datasets with Numpy arrays. For bigger datasets, you’ll have to use tf.data.TextLineDataset or tf.data.TFRecordDataset.
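
For the medium-size case, a rough sketch of the placeholder approach could look like the following. The arrays are only passed in when the iterator is initialized, so they are not embedded in the graph definition as constants. The snippet is written against tf.compat.v1, and the array contents and names are illustrative assumptions rather than code from the notebooks.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1

# Toy arrays standing in for a medium-size dataset (hypothetical values)
features = np.random.rand(10, 784).astype(np.float32)
labels = np.random.randint(0, 10, size=10).astype(np.int32)

graph = tf.Graph()
with graph.as_default():
    # Placeholders keep the arrays out of the serialized graph
    features_ph = tf1.placeholder(tf.float32, shape=[None, 784])
    labels_ph = tf1.placeholder(tf.int32, shape=[None])
    dataset = tf1.data.Dataset.from_tensor_slices(
        (features_ph, labels_ph)).batch(4)
    iterator = tf1.data.make_initializable_iterator(dataset)
    next_x, next_y = iterator.get_next()

with tf1.Session(graph=graph) as sess:
    # The data is fed once, when the iterator is initialized
    sess.run(iterator.initializer,
             feed_dict={features_ph: features, labels_ph: labels})
    batch_x, batch_y = sess.run([next_x, next_y])
```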

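The training set is reshuffled every time the dataset is iterated over, i.e. once per epoch rather than at every step (this is what reshuffle_each_iteration=True does). Set the buffer size to be greater than or equal to the size of the dataset to make sure it is completely shuffled.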
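
To see why the buffer size matters, here is a pure-Python simulation of a shuffle buffer. It is a simplified model of the documented behavior (fill a buffer, emit a random buffered element, replace it with the next input), not Tensorflow's actual implementation, and the function name is made up for this example.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Simulate a shuffle buffer: emit a random buffered item, then refill."""
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    # Fill the buffer with the first `buffer_size` items
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    output = []
    for item in source:
        # Emit a random buffered item and put the next input in its slot
        pos = rng.randrange(len(buffer))
        output.append(buffer[pos])
        buffer[pos] = item
    # Input exhausted: drain what is left in random order
    rng.shuffle(buffer)
    output.extend(buffer)
    return output

data = list(range(1000))
small = buffered_shuffle(data, buffer_size=10)
full = buffered_shuffle(data, buffer_size=1000)

# How far ahead of its original position an item was emitted, at most
max_early = max(value - position for position, value in enumerate(small))
```

With a buffer of 10, no item can be emitted more than 9 positions earlier than where it started, so the output stays close to the input order. With a buffer as large as the dataset, every permutation is possible.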

We use .repeat() to make the dataset repeat indefinitely. We’ll control how many iterations/steps we need outside of the graph. To make it iterate only for N epochs, use .repeat(N) and catch tf.errors.OutOfRangeError to detect when the data is depleted.
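
A small sketch of the .repeat(N) variant, assuming a toy dataset of six integers and the tf.compat.v1 API (both are choices made for this example, not the notebook code):

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1

graph = tf.Graph()
with graph.as_default():
    # 6 samples, 2 epochs, batches of 4 -> 12 samples in 3 batches
    dataset = tf1.data.Dataset.from_tensor_slices(
        np.arange(6, dtype=np.int32)).repeat(2).batch(4)
    next_batch = tf1.data.make_one_shot_iterator(dataset).get_next()

batches = []
with tf1.Session(graph=graph) as sess:
    while True:
        try:
            batches.append(sess.run(next_batch))
        except tf.errors.OutOfRangeError:
            break  # the dataset is exhausted after the two epochs
```

Because .repeat() comes before .batch(), a batch can straddle an epoch boundary: the second batch here mixes the end of the first epoch with the start of the second.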

We use the process_batch function to transform the imported tensors. The first transformation, tf.expand_dims, reshapes the features from (batch_size, length) to (batch_size, length, 1). The second transformation, tf.one_hot, one-hot encodes the target labels. Note that both transformations are Tensorflow functions (their names start with tf.), as recommended by the official documentation. If you want to apply a transformation that depends on third-party libraries (e.g. OpenCV), you need to wrap the call with tf.py_func.
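
To make the shapes concrete, here is a NumPy counterpart of the two transformations. It only mirrors what tf.expand_dims and tf.one_hot do to the shapes; the batch of 32 zero-valued images and the label values are made up for the example.

```python
import numpy as np

num_classes = 10  # Fashion-MNIST has 10 classes
# Hypothetical batch of 32 flattened 28x28 images and their labels
batch_x = np.zeros((32, 784), dtype=np.float32)
batch_y = np.array([3, 0, 9] + [1] * 29, dtype=np.int32)

# Counterpart of tf.expand_dims(batch_x, -1): append a trailing axis
x = np.expand_dims(batch_x, -1)
# Counterpart of tf.one_hot(batch_y, num_classes): pick rows of the identity matrix
y = np.eye(num_classes, dtype=np.float32)[batch_y]
```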

Define Feedable Iterator in the Graph

It is pretty much the same as in the documentation:

handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle,
    fminst_ds_train.output_types,
    fminst_ds_train.output_shapes)
train_iterator = fminst_ds_train.make_initializable_iterator()
val_iterator = fminst_ds_val.make_initializable_iterator()
test_iterator = fminst_ds_test.make_initializable_iterator()

To use this iterator in a session, we need to initialize the base iterators first. This also sets the batch size for each iterator:

sess.run([
    train_iterator.initializer, val_iterator.initializer,
    test_iterator.initializer],
    feed_dict={
        training_batch_size: batch_size,
        inference_batch_size: 500})

And collect the string handles:

train_handle, val_handle, test_handle = sess.run([
    train_iterator.string_handle(),
    val_iterator.string_handle(),
    test_iterator.string_handle()])

Then tell Tensorflow which iterator you want to use when training or testing:

# Training
sess.run([train_op], feed_dict={handle: train_handle})
# Testing (on validation set)
sess.run([accuracy, loss], feed_dict={handle: val_handle})
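
Putting the pieces of this section together, the following self-contained sketch switches between two toy datasets through a single handle placeholder. It uses tf.compat.v1 and stand-in data (all zeros for "train", all ones for "validation"), so the names and values are assumptions for illustration only.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1

graph = tf.Graph()
with graph.as_default():
    batch_size = tf1.placeholder(tf.int64, shape=[])
    # Stand-in datasets: zeros for training, ones for validation
    train_ds = tf1.data.Dataset.from_tensor_slices(
        np.zeros(8, dtype=np.float32)).repeat().batch(batch_size)
    val_ds = tf1.data.Dataset.from_tensor_slices(
        np.ones(8, dtype=np.float32)).repeat().batch(batch_size)

    # Feedable iterator: the handle decides which base iterator is read
    handle = tf1.placeholder(tf.string, shape=[])
    iterator = tf1.data.Iterator.from_string_handle(
        handle,
        tf1.data.get_output_types(train_ds),
        tf1.data.get_output_shapes(train_ds))
    batch_mean = tf.reduce_mean(iterator.get_next())  # stands in for the model

    train_iterator = tf1.data.make_initializable_iterator(train_ds)
    val_iterator = tf1.data.make_initializable_iterator(val_ds)
    train_handle_op = train_iterator.string_handle()
    val_handle_op = val_iterator.string_handle()

with tf1.Session(graph=graph) as sess:
    sess.run([train_iterator.initializer, val_iterator.initializer],
             feed_dict={batch_size: 4})
    train_handle, val_handle = sess.run([train_handle_op, val_handle_op])
    # The same graph node reads from a different dataset depending on the handle
    train_mean = sess.run(batch_mean, feed_dict={handle: train_handle})
    val_mean = sess.run(batch_mean, feed_dict={handle: val_handle})
```

Each base iterator keeps its own position, so evaluating on the validation set every few hundred steps does not reset or disturb the training iterator.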

Fetch the Next Batch in the Graph

This final step connects the dataset to the rest of the graph:

X_0, Y = iterator.get_next()
X = tf.reshape(X_0, (-1, timesteps, num_input))

And we’re done! The model is ready to be trained.

Learning Curves in Tensorboard

Here’s a trick to track a metric in both the training and validation stages in one plot: create two tf.summary.FileWriter instances that write to two different sub-folders:

from datetime import datetime

# Format the timestamp once so both writers end up under the same run folder
run_id = datetime.now().strftime("%Y%m%d_%H%M")
train_writer = tf.summary.FileWriter(
    "logs/fminst_gru/%s/train" % run_id, graph)
val_writer = tf.summary.FileWriter(
    "logs/fminst_gru/%s/val" % run_id)

And use the same metric name for both writers:

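# In Graph (train)
tf.summary.scalar('Loss', ema_loss)
# Not in Graph (validation)
val_loss = np.mean(val_loss)
val_acc = np.mean(val_acc)
summary = tf.Summary()
summary.value.add(tag='Loss', simple_value=val_loss)
# step: the current training step (assumed name, not shown in this excerpt)
val_writer.add_summary(summary, step)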

(I keep an exponential moving average of the training loss in the graph. It is a leftover from my experiments with the Estimator API. Spoiler: I don’t like that API.) The latter part shows how to add values computed outside of the graph to Tensorboard.
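
The exact smoothing used in the notebooks is not shown in this post. For reference, a common form of the exponential moving average is sketched below in plain Python; the decay of 0.9, the initialization with the first value, and the loss values are all assumptions.

```python
def ema_update(previous, value, decay=0.9):
    """Return the updated moving average; the first value initializes it."""
    if previous is None:
        return value
    return decay * previous + (1.0 - decay) * value

losses = [1.0, 0.5, 0.5, 0.5]  # hypothetical per-step training losses
ema = None
history = []
for loss in losses:
    ema = ema_update(ema, loss)
    history.append(ema)
```

The smoothed value moves only a fraction of the way toward each new loss, which is why the training curve looks less noisy than the raw per-batch loss would.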

Then you’ll have both curves in one plot:

Orange: train; Blue: validation

We can also compare curves from different runs. For example, we can see that permuted sequential Fashion-MNIST is harder from the following plot:

Dark blue: validation (raw); Light blue: validation (permuted)

(Permuted) Sequential Fashion-MNIST

Sample results (accuracies) of the CudnnGRU models taken from the notebook:

  • Raw: 0.887
  • Permuted: 0.850

Sample results (accuracies) of the TCN models taken from the notebook:

  • Raw: 0.895
  • Permuted: 0.881

Generally TCN still performs better than GRU. But bear in mind that these models are not really tuned, so there might be some room for improvement. As suggested by the submitted benchmarks in the project README, adding dropout to the GRU is likely to help with the accuracy. You can also explore more benchmarks with scikit-learn models here:

Thank You

Thank you very much for reading. This is the last part of this series and the end of my Tensorflow crash course. There are still some missing pieces of the puzzle, e.g. higher-level training APIs other than Keras. I played with the Estimator and Experiment APIs a bit and found them really restricting. I’d rather write my own training process. For more layer abstractions and data manipulation helpers, TensorLayer seems to be a good medium-level Tensorflow library that is quite popular. I recommend quickly browsing through its official examples to see if it fits your needs.