Exploring Neural Networks with fashion MNIST

Jan 29, 2019

Full code available in this notebook.

In this post, we’ll introduce the fashion MNIST dataset, show how to train simple 3, 6 and 12-layer neural networks, then compare the results with different epochs and finally, visualize the predictions.

Introducing fashion MNIST

The MNIST database of handwritten digits is one of the most widely used datasets for exploring neural networks and has become a benchmark for model comparison. More recently, Zalando Research published a new dataset with 10 different fashion products. Called fashion MNIST, this dataset is meant to be a replacement for the original MNIST, which turned out to be too easy for machine learning folks; even linear classifiers were able to achieve high classification accuracy. The new dataset promises to be more challenging, so that machine learning algorithms have to learn more advanced features to correctly classify the images.

The fashion MNIST dataset can be accessed from the GitHub repository here. It contains 70,000 greyscale images in 10 categories. The images show individual articles of clothing at low resolution (28x28 pixels). Below we can see a sample of 25 images with their labels.

What the first 25 images of the training set look like.

For this experiment, I’ll be using TensorFlow and Keras — a machine learning framework and a high-level API to build and train models. If you haven’t installed TensorFlow yet and set up your environment, their instructions are easy to follow.

Loading and Exploring the Data

The data can be loaded straight from Keras into a Training set (60,000 images) and Testing set (10,000 images). The images are 28x28 arrays with pixel values 0 to 255, and the labels are an array of integers 0 to 9, representing 10 classes of clothing.

import tensorflow as tf
from tensorflow import keras

fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# class names are not included, so we need to create them to plot the images
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

We can see the training data was stored in an array of shape (60000,28,28) and testing data in an array of shape (10000,28,28).

print("train_images:", train_images.shape)
print("test_images:", test_images.shape)

train_images: (60000, 28, 28)
test_images: (10000, 28, 28)

We can also take a closer look at one of the images, say the first one, which looks like an ankle boot.

import matplotlib.pyplot as plt

# Visualize the first image from the training dataset
plt.figure()
plt.imshow(train_images[0])
plt.colorbar()
plt.grid(False)
plt.show()

The next step is to normalize the pixel values so that they all fall on the same scale, from 0 to 1.

# scale the values to a range of 0 to 1 of both data sets
train_images = train_images / 255.0
test_images = test_images / 255.0

Training the first NN model

Training a Neural Network (NN) requires 4 steps:

Step 1 — Build the architecture
Step 2 — Compile the model
Step 3 — Train the model
Step 4 — Evaluate the model

Step 1 — Build the architecture

First, we’ll design the NN architecture by deciding the number of layers and activation functions. We’ll start with a simple 3-layer Neural Network. In the first layer we ‘flatten’ the data, so that each (28x28) image becomes a vector of 784 values. The second layer is a dense layer with 128 neurons and a ReLU activation function. The last layer is a dense layer with 10 neurons and a softmax activation function, which outputs a probability for each of the 10 categories.

# Model a simple 3-layer neural network
model_3 = keras.Sequential([
    keras.layers.Flatten(input_shape=(28,28)),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])
model_3.summary()

The model summary table provides a nice visualization of the network architecture and parameters.

Summary table of the network architecture and parameters of a 3-layer neural network.

Step 2 — Compile the model

Next, we compile the model with the following settings:

Loss function — calculates the difference between the output and the target variable. It measures how far the model's predictions are from the targets during training, and we want to minimize it. In this example, we chose the sparse_categorical_crossentropy loss function. Cross-entropy is the standard loss function for multi-class classification problems, and it's sparse because our targets are integers rather than one-hot encodings.
Optimizer — determines how the model's weights are updated based on the data and the loss function. Adam is an extension of classic stochastic gradient descent and is popular because it has been shown to be effective and efficient.
Metrics — monitors the training and testing steps. Accuracy is a common metric and it measures the fraction of images that are correctly classified.

model_3.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

For a full list of model settings, see the Keras documentation.
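
To make the loss function a bit more concrete, here is a minimal pure-Python sketch of what softmax and sparse categorical cross-entropy compute for a single image. The logits are made-up numbers, and the helper functions are illustrative stand-ins rather than the Keras implementation (which, among other things, works on batches and guards against taking the log of zero).

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(true_label, probabilities):
    """Loss for one example: negative log of the probability given to the true class."""
    return -math.log(probabilities[true_label])

# Made-up output scores for one image; class 9 (Ankle boot) scores highest
logits = [0.5, 0.1, 0.2, 0.0, 0.3, 0.1, 0.4, 0.2, 0.1, 3.0]
probs = softmax(logits)

loss_if_boot = sparse_categorical_crossentropy(9, probs)    # true label matches the top score
loss_if_tshirt = sparse_categorical_crossentropy(0, probs)  # true label got a low probability

print("loss when the label is 9:", loss_if_boot)
print("loss when the label is 0:", loss_if_tshirt)
```

The label is passed as a plain integer index, which is exactly what 'sparse' refers to. The loss is small when the model puts most of its probability on the true class and grows as that probability shrinks.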

Step 3 — Train the model

Next, we train the model by fitting it to the training data, so we give it the input (images) and expected output (labels). Here, an important step for detecting overfitting is validation. There are a few ways to validate; in this case, we use the automatic validation built into the fit function by setting validation_split on the training data. Here we use an 80/20 split: 80% for training and 20% for validating.

model_3.fit(train_images, train_labels, epochs=5, validation_split=0.2)
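
As a rough sketch of what an 80/20 split means for our 60,000 training images, the snippet below holds out the last 20% of a list of sample indices. Keras also takes the validation samples from the end of the arrays, before any shuffling; the helper function here is a hypothetical illustration, not the library's own code.

```python
def split_train_validation(samples, validation_split):
    """Hold out the last `validation_split` fraction of the samples for validation."""
    n_validation = int(len(samples) * validation_split)
    split_at = len(samples) - n_validation
    return samples[:split_at], samples[split_at:]

# Stand-in for the 60,000 training images: just their indices
sample_ids = list(range(60000))
train_ids, val_ids = split_train_validation(sample_ids, 0.2)

print("training samples:", len(train_ids))
print("validation samples:", len(val_ids))
```

Because the held-out part is simply the tail of the data, this kind of split only gives a fair validation set when the classes are not sorted in the arrays.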

We also need to define how many times the network will see the full training set; each full pass is called an epoch. The number is a somewhat arbitrary cutoff, and here we choose 5 epochs.

Wait, do you mean an iteration? What’s an epoch?
Epoch — one forward pass and one backward pass of all the training examples
Iteration — one forward pass and one backward pass of a single batch of training examples
Example: if you have 1,000 training examples, and your batch size is 500, then it will take 2 iterations to complete 1 epoch.
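
The same arithmetic can be sketched in a few lines. The second part assumes the Keras default batch size of 32 (we never set batch_size in fit) and the 48,000 images left for training after the 20% validation split; the function name is just for illustration.

```python
import math

def iterations_per_epoch(n_examples, batch_size):
    """Number of batches needed to see every training example once."""
    return math.ceil(n_examples / batch_size)

# The example from the text: 1,000 examples in batches of 500
print(iterations_per_epoch(1000, 500))   # 2

# Our setup, assuming the default batch size of 32
n_train = 60000 - int(60000 * 0.2)       # 48,000 images after the validation split
steps = iterations_per_epoch(n_train, 32)
total_iterations = steps * 5             # 5 epochs

print(steps, total_iterations)           # 1500 7500
```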

Training will show you the following results per epoch. Note that with each epoch, the loss decreases and the accuracy increases, meaning our model is improving.

Step 4 — Evaluate the model

Now that we’ve set up and trained our model, we need to evaluate its performance. This is done on a test dataset, new data that the model hasn’t seen yet. We have to make sure to separate our training and validating dataset from our testing dataset.

test_loss, test_acc = model_3.evaluate(test_images, test_labels)

Here we can print the evaluation metrics of our model - loss and accuracy.

Evaluating a NN-3 with 5 epochs

How good is our model? We can see that the test loss is 35.7 and the accuracy is 88.1 for this neural network (both values are scaled by 100, so a loss of about 0.357 and an accuracy of 88.1%), which is pretty close to the training metrics at the 5th epoch.
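
For reference, the accuracy metric itself is easy to reproduce by hand. Here is a small sketch with made-up labels for 10 images; the lists and the function are illustrative and not taken from the notebook.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of examples where the predicted class equals the true class."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

# Made-up labels for 10 images; the model gets one of them wrong (a 6 predicted as 4)
true_labels      = [9, 2, 1, 1, 6, 1, 4, 6, 5, 7]
predicted_labels = [9, 2, 1, 1, 6, 1, 4, 4, 5, 7]

test_acc_example = accuracy(true_labels, predicted_labels)
print("accuracy:", test_acc_example * 100)
```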

Is a deeper neural network more accurate?

Next, we’ll compare the classification accuracy across three depths, a 3-layer Neural Network (NN-3), a 6-layer Neural Network (NN-6) and a 12-layer Neural Network (NN-12), to see if more layers mean higher accuracy.

Let’s build a 6-layer network by adding 3 more hidden layers while keeping the same activation functions, shapes and settings, so the only difference is the depth of the network. Here we’ll combine all our steps into the following code block:

# Model a simple 6-layer neural network
model_6 = keras.Sequential([
    keras.layers.Flatten(input_shape=(28,28)),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])

model_6.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model_6.fit(train_images, train_labels, epochs=5, validation_split=0.2)

test_loss, test_acc = model_6.evaluate(test_images, test_labels)
print("Model - 6 layers - test loss:", test_loss * 100)
print("Model - 6 layers - test accuracy:", test_acc * 100)

Training and Evaluating results of a NN-6 with 5 epochs

With the NN-6, our test loss is slightly higher at 37.5 and the test accuracy is slightly lower at 86.6 than with the NN-3. So our model got a little worse.

Now, let’s try a 12-layer Neural Network. The code is the same; we just add 6 more hidden layers, keeping all the other variables the same. After compiling and training for 5 epochs, we get the following results:

Training and Evaluating results of a NN-12 with 5 epochs
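
One way to get a feel for how the three networks differ is to count their trainable parameters. The sketch below assumes the layouts described in this post: a 784-value flattened input, hidden dense layers of 128 neurons (1 for NN-3, 4 for NN-6, 10 for NN-12) and a 10-neuron output layer. The helper names are made up for this illustration.

```python
def count_parameters(layer_widths):
    """Sum weights and biases over consecutive pairs of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_widths, layer_widths[1:]):
        total += n_in * n_out + n_out   # weight matrix plus one bias per neuron
    return total

def widths_for(n_hidden):
    """Flattened 28x28 input, n_hidden layers of 128 neurons, 10 output classes."""
    return [28 * 28] + [128] * n_hidden + [10]

for name, n_hidden in [("NN-3", 1), ("NN-6", 4), ("NN-12", 10)]:
    print(name, count_parameters(widths_for(n_hidden)))
```

The first dense layer accounts for most of the parameters in every model, so going from 3 to 12 layers adds depth without multiplying the model size by anything close to four.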

With the NN-12, our test loss increased to 39.3 and our accuracy is pretty much the same at 86.2. Remember that we want the loss to be small, and the accuracy to be as large as possible, so this network performed a little worse than the two above.

The overall trend as the number of layers increases seems to be that the loss goes up and the accuracy goes down slightly.

Will our predictions improve as the epochs increase?

Of course, it's tempting to say that to improve our accuracy we need many more than 5 epochs, but will that actually improve our models? After re-training our models with 50 epochs, we see the following loss and accuracy for each model.

For all 3 models, the general trend on the training data set as the epochs increase is that the loss decreases towards 0 and the accuracy increases towards 1, both representing the ‘perfect score’. This is a sign of overfitting, which is the motivation behind validating the model. On the validation set, the loss increases past 0.50 for NN-3 and NN-6, and stabilizes at around 0.40 for the NN-12. The validation accuracy stabilizes between 0.88 and 0.90 for all 3 models. So it looks like the 12-layer NN is performing better on the validation set. These plots show why the validation step is important.

Regarding the number of epochs, we can see that the validation trend either stabilizes or gets worse after about 5 epochs, so for a simple problem like this one, 5 epochs should be enough to train the models. Getting worse with more training is a well-known problem in machine learning called overfitting: the model adapts too well to a specific dataset and thus does not generalize well to new data. Techniques that counter it are collectively called regularization, and the one we’ve just illustrated, halting training once validation performance stops improving, is called early stopping.
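
The early stopping idea can be sketched as a simple rule: keep track of the best validation loss seen so far and stop once it has failed to improve for a few epochs in a row. The validation losses and the patience value below are made up for illustration and are not the results of the models above. In practice, Keras provides a keras.callbacks.EarlyStopping callback that can be passed to fit to do this during training.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the best epoch and its loss, stopping after `patience` epochs without improvement."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # validation loss has stalled, stop training here

    return best_epoch, best_loss

# Made-up validation losses: improving at first, then creeping back up
val_losses = [0.45, 0.39, 0.36, 0.34, 0.33, 0.34, 0.35, 0.37, 0.40]
best_epoch, best_loss = early_stopping_epoch(val_losses, patience=3)

print("best epoch:", best_epoch, "validation loss:", best_loss)
```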

Visualizing the predictions

Now we can apply the trained models to classify new fashion images — the test data set. We give it an unseen set of images and ask it to apply one of the 10 labels, identifying the piece of clothing.

predictions = model_12.predict(test_images)

Here we visualize 15 data points, with the labelled images, and the probability graph beside them. If the label is red, that means the prediction did not match the true label; otherwise, it’s purple. It looks like our NN-12 model got 2 images wrong and 13 images correct.
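
Each row of predictions holds 10 probabilities, one per class, and the predicted label is simply the class with the highest probability. Here is a small sketch of that step using a made-up probability vector instead of real model output; the predicted_class helper is hypothetical.

```python
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

def predicted_class(probabilities):
    """Return the index, name and probability of the most likely class."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best_index, class_names[best_index], probabilities[best_index]

# Made-up softmax output for one test image
sample_prediction = [0.01, 0.0, 0.02, 0.0, 0.01, 0.05, 0.01, 0.10, 0.0, 0.80]
index, name, confidence = predicted_class(sample_prediction)

print(index, name, confidence)
```

Comparing this index with the true label for each image is what decides whether a label in the plot is drawn in red or purple.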