Estimating Fashion MNIST classes using a simple neural network

Pradeep Adhokshaja
The Startup
Published in
10 min read · May 9, 2020


Neural Network

Neural networks are a group of algorithms made up of computational nodes that take in an input, perform mathematical computations on it, and return an output. Complex mathematical operations can be carried out depending on the functions we choose to apply at these computational nodes. These functions are also called “activations”.

Neural networks can be used for estimating real values (regression), estimating categorical variables (classification) and generating data (generative models).

Simple Neural Network

The Fashion MNIST data set and data processing

The Fashion MNIST data set, created by Zalando Research, consists of 60,000 training images (plus a separate test set of 10,000 images), each a 28 pixel by 28 pixel grayscale image. Each image belongs to one of the following 10 categories, encoded numerically from 0 to 9.

  • 0 T-shirt/top
  • 1 Trouser
  • 2 Pullover
  • 3 Dress
  • 4 Coat
  • 5 Sandal
  • 6 Shirt
  • 7 Sneaker
  • 8 Bag
  • 9 Ankle boot
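For reference while inspecting predictions, this mapping can be kept as a small lookup table (the variable name label_names below is just an illustrative helper, not part of the original code):

label_names = {0: 'T-shirt/top', 1: 'Trouser', 2: 'Pullover', 3: 'Dress', 4: 'Coat',
               5: 'Sandal', 6: 'Shirt', 7: 'Sneaker', 8: 'Bag', 9: 'Ankle boot'}
print(label_names[9])   # Ankle boot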

Flattened Data Set

The image data is converted to a flattened CSV table. The flattened table consists of 60,000 rows (one per image), 784 pixel columns (28 pixels by 28 pixels gives 784 pixels per image) and a single additional column (“label”) that records the class each image belongs to. The image data set, together with the flattened table, can be found on Kaggle.

Simple Illustration of flattening an image

The code to convert image data to a flattened csv table can be found here.
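As a minimal sketch of what flattening means, assuming a single image is stored as a 28 × 28 NumPy array called img (a made-up placeholder, not the dataset code):

import numpy as np
img = np.random.randint(0, 256, size=(28, 28))   # stand-in for one 28 x 28 grayscale image
row = img.reshape(-1)                            # flatten the 2-D grid into one row of pixels
print(row.shape)                                 # (784,)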

The code for importing the flattened data is as follows

import pandas as pd
import numpy as np


np.random.seed(42)
data_train = pd.read_csv('../input/fashionmnist/fashion-mnist_train.csv')
data_test = pd.read_csv('../input/fashionmnist/fashion-mnist_test.csv')

Converting Classes to one hot encoded vectors

Before constructing the neural network, it is necessary to convert the classes to a numeric form in a way that does not introduce any artificial order. One way to do this is one-hot encoding. Each class is converted to a vector of zeros whose length equals the number of classes, and the position unique to that class is set to 1.

For example,

Let’s say that there are three classes

  • Dog
  • Cat
  • Bird

Using one hot encoding, this can be converted to

  • Dog -> [1,0,0]
  • Cat -> [0,1,0]
  • Bird -> [0,0,1]
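A minimal NumPy sketch of the same idea, assuming the three classes are already encoded as the integers 0, 1 and 2 (this is only an illustration, not the code used below):

import numpy as np
classes = np.array([0, 1, 2])          # Dog, Cat, Bird as integer codes
one_hot_vectors = np.eye(3)[classes]   # each row is the one-hot vector of one class
print(one_hot_vectors)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]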

The following code converts the cloth categories to one-hot encoded vectors and prepares the training and test sets.

one_hot = pd.get_dummies(data_train['label'].unique())
one_hot['label'] = one_hot.index
data_train = pd.merge(data_train,one_hot)
data_test = pd.merge(data_test,one_hot)
data_train.drop('label',axis=1,inplace=True)
data_test.drop('label',axis=1,inplace=True)

## Create the train and test set
X_train = np.array(data_train.drop([0,1,2,3,4,5,6,7,8,9],axis=1).values)/255
y_train = np.array(data_train[[0,1,2,3,4,5,6,7,8,9]].values)
X_test = np.array(data_test.drop([0,1,2,3,4,5,6,7,8,9],axis=1).values)/255
y_test = np.array(data_test[[0,1,2,3,4,5,6,7,8,9]].values)
X_train = X_train.T
y_train = y_train.T
X_test = X_test.T
y_test = y_test.T

Neural Network Architecture

For the purpose of estimating the classes, a 3-layer neural network is used. The layers have the following numbers of units and activations.

  • Input layer : 784 units (for the 784 pixels)
  • Hidden layer : 128 units with sigmoid activation
  • Output layer : 10 units (for the 10 categories of clothes) with softmax activation
Representation of the Architecture Used

The weight matrices are used to map information from one layer to the other. As seen from the above diagram, the matrices

  • W1, which has dimension 128 × 784, maps the input data to the hidden layer
  • W2, which has dimension 10 × 128, maps the hidden layer to the output layer.

The code for initializing the weight matrices:

def sigmoid(x):
    return(1./(1+np.exp(-x)))

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return (e_x / e_x.sum(axis=0))

np.random.seed(42)  # seed NumPy's generator; random.seed() would not affect np.random.rand
w1 = np.random.rand(128,784)/np.sqrt(784)  # 128 x 784: input -> hidden layer
b0 = np.zeros((128,1))                     # hidden-layer biases
w2 = np.random.rand(10,128)/np.sqrt(128)   # 10 x 128: hidden -> output layer
b1 = np.zeros((10,1))                      # output-layer biases

Forward Pass

The weight matrices are randomly initialized and the input data is passed through the network to make a prediction. The equations for this forward pass are as follows.

Forward pass equations: X1 = sigmoid(W1·X + b0), X2 = softmax(W2·X1 + b1)

where

  • X is the input matrix, with dimension 784 × 60,000
  • b0 is the vector of biases added to W1·X before the sigmoid is applied. It has dimension 128 × 1
  • X1 is the output of the hidden layer. It has dimension 128 × 60,000
  • X2 is obtained by multiplying W2 and X1, adding the bias vector b1 (dimension 10 × 1) and scaling the result with softmax. It has dimension 10 × 60,000
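As a quick sanity check of these dimensions, the sketch below runs the forward pass on random placeholder data (1,000 columns stand in for the 60,000 training images; none of this is the actual data set):

import numpy as np
n = 1000                                   # stand-in for the 60,000 training images
X = np.random.rand(784, n)                 # input matrix: one 784-pixel column per image
w1, b0 = np.random.rand(128, 784)/np.sqrt(784), np.zeros((128, 1))
w2, b1 = np.random.rand(10, 128)/np.sqrt(128), np.zeros((10, 1))
x1 = 1./(1 + np.exp(-(w1@X + b0)))         # hidden layer output, shape (128, n)
z2 = w2@x1 + b1                            # output-layer scores before softmax, shape (10, n)
print(x1.shape, z2.shape)                  # (128, 1000) (10, 1000)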

Sigmoid

The sigmoid function, sigmoid(x) = 1 / (1 + e^(−x)), squashes any real number into the range (0, 1). The function looks as follows.

A “rough” sketch of the sigmoid function with the equation

Softmax

The softmax is a normalization procedure given by softmax(x_i) = e^(x_i) / Σ_j e^(x_j), where the sum runs over all units of the output layer.

Softmax Normalization

Softmax converts the raw outputs into class probabilities. The unit with the highest softmax probability is set to 1 and the rest to 0, and this one-hot vector is then mapped back to a category.
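For instance, here is a small sketch of softmax applied to three arbitrary scores:

import numpy as np
scores = np.array([2.0, 1.0, 0.1])
e_x = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
probs = e_x / e_x.sum()
print(probs)                            # roughly [0.66 0.24 0.10]; sums to 1
print(np.argmax(probs))                 # 0 -> index of the predicted class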

For the forward pass, the code can be written in the following way.

x1 = sigmoid(w1@X+b0)
x2 = softmax(w2@x1+b1)

The @ operator performs matrix multiplication.

Backpropagation

Cross Entropy

The results are estimated using the parameters W1, W2, b0 and b1. To measure how far the estimated results deviate from the actual ones, the cross entropy loss is used. It is defined as follows.

Cross entropy loss for a single observation: L = −Σ_i y_i · log(p_i)

where,

  • y_i is the i-th entry of the one-hot encoded vector of the actual class for the given observation
  • p_i is the i-th softmax probability obtained from the output layer of the neural network

For example,

Let the actual class for a given observation be

  • Dog -> [1,0,0]

And let the output of the neural network be

[1,0,0]

Cross Entropy Loss will then be -1*log(1)-0*log(0)-0*log(0) = 0

But if the output of the neural network is [0,1,0] then the

Cross Entropy loss will be -1*log(0)-0*log(1)-0*log(0) = Infinity

The cross entropy loss heavily penalizes classifiers that are confident but wrong about the actual class.
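The same comparison in code (a sketch; the probabilities are made up, and a small value is used in place of an exact 0 because log(0) is undefined):

import numpy as np
y_true = np.array([1, 0, 0])                 # actual class: Dog
p_good = np.array([0.90, 0.05, 0.05])        # confident and correct prediction
p_bad  = np.array([0.01, 0.98, 0.01])        # confident but wrong prediction
print(-np.sum(y_true * np.log(p_good)))      # ~0.105 -> small loss
print(-np.sum(y_true * np.log(p_bad)))       # ~4.605 -> large loss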

Extending the above formula to the entire data set we get,

Cross Entropy Loss for entire Data Set

The code for cross entropy is as follows

cross_entropy = -np.mean(np.multiply(y,np.log(x2)))

Minimizing Cross Entropy Using Gradient Descent

In order to estimate the classes correctly, the neural network has to adjust the weights W1, W2 and the biases b0 and b1 so that the cross entropy loss is minimized.

One way to minimize this loss is gradient descent. With gradient descent, a function's minimum can be approached using the following procedure.

Gradient descent update rule: w ← w − α · ∂L/∂w

Using the above rule, we construct the following update procedure to find the weights for which the loss is minimal.

Gradient Descent for Neural Network

Alpha, the learning rate, controls the size of the steps the gradient descent algorithm takes toward the minimum.
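As a toy illustration of this rule (minimizing f(w) = (w − 3)², which has nothing to do with the network itself):

# Gradient descent on f(w) = (w - 3)**2, whose gradient is 2*(w - 3)
w = 0.0
alpha = 0.1                    # learning rate
for step in range(100):
    grad = 2 * (w - 3)
    w = w - alpha * grad       # step in the direction opposite to the gradient
print(w)                       # ~3.0, the value that minimizes f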

To run gradient descent, the gradient of the loss function needs to be found with respect to the weights W1, W2, b0 and b1. This involves calculating the “errors” of the loss function with respect to each layer, from the last layer back to the first. Because this process runs in the opposite direction to the forward pass, it is called “backpropagation”: the “errors” are propagated backward. The following screenshot gives an overview of backpropagation.

Screenshot of backpropagation taken from Michael Nielsen’s blog on Neural Networks

This screenshot is taken from Michael Nielsen’s blog on Neural Networks http://neuralnetworksanddeeplearning.com/chap2.html. This blog covers backpropagation in detail.

Based on the above screenshot, we get the gradient with respect to W2 as follows.

Derivative of Loss function With respect to W2.

Similarly, the derivative with respect to W1 will be as follows

Derivative of loss function with respect to W1

For the biases b0 and b1, we get

Derivative of Loss functions with respect to biases b0 and b1

The code in python for the derivatives is as follows

delta_2 = (x2-Y)
delta_1 = np.multiply(w2.T@delta_2, np.multiply(x1,1-x1))
dW1 = delta_1@X.T
dW2 = delta_2@x1.T
db0 = np.sum(delta_1,axis=1,keepdims=True)
db1 = np.sum(delta_2,axis=1,keepdims=True)

The function np.multiply() performs element-wise multiplication, while the @ operator performs matrix multiplication.
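A tiny illustration of the difference, using two arbitrary 2 × 2 arrays:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.multiply(A, B))   # element-wise product: [[ 5 12] [21 32]]
print(A @ B)               # matrix product:       [[19 22] [43 50]]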

Overall Process of Learning Data

Now that we have the forward pass and backpropagation, the learning process is as follows:

In each epoch, the entire data set is covered. The data can be processed either as a whole or in mini-batches; here it is split into mini-batches, as sketched below.
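A compact, self-contained sketch of that loop is shown below. It uses small random stand-in data instead of Fashion MNIST and leaves out momentum (which is introduced in the next section); the complete, runnable version used for the results appears later in the article.

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((784, 1000))                   # stand-in for the pixel data
y_train = np.eye(10)[rng.integers(0, 10, 1000)].T   # stand-in one-hot labels, shape (10, 1000)

def sigmoid(x):
    return 1./(1 + np.exp(-x))

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=0))
    return e_x / e_x.sum(axis=0)

w1, b0 = rng.random((128, 784))/np.sqrt(784), np.zeros((128, 1))
w2, b1 = rng.random((10, 128))/np.sqrt(128), np.zeros((10, 1))
lr, batch_size, epochs = 0.1, 200, 5

for epoch in range(epochs):
    perm = rng.permutation(X_train.shape[1])             # shuffle the images each epoch
    for start in range(0, X_train.shape[1], batch_size):
        idx = perm[start:start + batch_size]             # indices of one mini-batch
        X, Y = X_train[:, idx], y_train[:, idx]
        x1 = sigmoid(w1@X + b0)                          # forward pass
        x2 = softmax(w2@x1 + b1)
        delta_2 = x2 - Y                                 # backpropagate the errors
        delta_1 = (w2.T@delta_2) * x1 * (1 - x1)
        m = X.shape[1]
        w1 -= (lr/m) * (delta_1@X.T)                     # gradient descent updates
        b0 -= (lr/m) * delta_1.sum(axis=1, keepdims=True)
        w2 -= (lr/m) * (delta_2@x1.T)
        b1 -= (lr/m) * delta_2.sum(axis=1, keepdims=True)
    loss = -np.mean(y_train * np.log(softmax(w2@sigmoid(w1@X_train + b0) + b1)))
    print('epoch', epoch, 'cross entropy', round(loss, 4))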

Momentum

To help the algorithm converge to a minimum more smoothly, one more parameter is used. This parameter is called momentum (beta), and it is applied when calculating the weight gradients: each new gradient is blended with the gradient from the previous step, dW ← beta · dW_old + (1 − beta) · dW.

The subscript “old” refers to the gradient calculated in the previous step.
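In code, this smoothing looks like the sketch below (the gradient values are made up purely for illustration; in the full training loop the same formula is applied to dW1, dW2, db0 and db1):

import numpy as np
beta = 0.9                                        # momentum parameter used in the article
grad_old = np.array([1.0, -2.0])                  # gradient from the previous step
grad_new = np.array([0.5, -1.0])                  # gradient from the current mini-batch
grad = beta * grad_old + (1. - beta) * grad_new   # smoothed gradient used for the update
print(grad)                                       # [ 0.95 -1.9 ]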

Putting the code together

Data Import and Processing

import pandas as pd
import numpy as np


np.random.seed(42)
data = pd.read_csv('../input/fashionmnist/fashion-mnist_train.csv')
print(data.shape)
data = data.sample(frac=1)
print(data[['label']].groupby('label').size().reset_index())

one_hot = pd.get_dummies(data['label'].unique())
one_hot['label'] = one_hot.index

data = pd.merge(data,one_hot)
#data = data.drop('label',axis=1)
data = data.sample(frac=1)

data_train = data
data_test = pd.read_csv('../input/fashionmnist/fashion-mnist_test.csv')
data_test = pd.merge(data_test,one_hot)
data_train.drop('label',axis=1,inplace=True)

data_test.drop('label',axis=1,inplace=True)

## Create the train and test set
X_train = np.array(data_train.drop([0,1,2,3,4,5,6,7,8,9],axis=1).values)/255
y_train = np.array(data_train[[0,1,2,3,4,5,6,7,8,9]].values)
X_test = np.array(data_test.drop([0,1,2,3,4,5,6,7,8,9],axis=1).values)/255
y_test = np.array(data_test[[0,1,2,3,4,5,6,7,8,9]].values)

Preparing Input Data and Initializing the Weight Matrices

X_train = X_train.T
y_train = y_train.T
print(X_train.shape)
print(y_train.shape)
X_test = X_test.T
y_test = y_test.T

def sigmoid(x):
    return(1./(1+np.exp(-x)))

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return (e_x / e_x.sum(axis=0))

np.random.seed(42)  # seed NumPy's generator; random.seed() would not affect np.random.rand
w1 = np.random.rand(128,784)/np.sqrt(784)
b0 = np.zeros((128,1))
w2 = np.random.rand(10,128)/np.sqrt(128)
b1 = np.zeros((10,1))
loss=[]
batches = 1000

lr = 0.1
batch_size = 200
beta = 0.9
count = 0
epochs = 500

Running Forward Pass and Backpropagation

loss_weight_dict = {}

### Training loop: forward pass, backpropagation and weight updates
for i in range(epochs):
    # if i%100==0:
    #     print('Epoch :',i)
    permutation = np.random.permutation(X_train.shape[1])
    X_train_shuffled = X_train[:, permutation]
    Y_train_shuffled = y_train[:, permutation]

    for j in range(batches):

        begin = j * batch_size
        end = min(begin + batch_size, X_train.shape[1] - 1)
        if begin > end:
            continue
        X = X_train_shuffled[:, begin:end]
        Y = Y_train_shuffled[:, begin:end]
        m_batch = end - begin

        ## Forward pass
        x1 = sigmoid(w1@X+b0)
        x2 = softmax(w2@x1+b1)

        ## Backpropagation
        delta_2 = (x2-Y)
        delta_1 = np.multiply(w2.T@delta_2, np.multiply(x1,1-x1))

        if i==0:
            dW1 = delta_1@X.T
            dW2 = delta_2@x1.T
            db0 = np.sum(delta_1,axis=1,keepdims=True)
            db1 = np.sum(delta_2,axis=1,keepdims=True)
        else:
            dW1_old = dW1
            dW2_old = dW2
            db0_old = db0
            db1_old = db1
            dW1 = delta_1@X.T
            dW2 = delta_2@x1.T
            db0 = np.sum(delta_1,axis=1,keepdims=True)
            db1 = np.sum(delta_2,axis=1,keepdims=True)
            ## Using the past gradients to calculate the present gradients
            dW1 = (beta * dW1_old + (1. - beta) * dW1)
            db0 = (beta * db0_old + (1. - beta) * db0)
            dW2 = (beta * dW2_old + (1. - beta) * dW2)
            db1 = (beta * db1_old + (1. - beta) * db1)

        ## Gradient descent update
        w1 = w1 - (1./m_batch)*(dW1)*lr
        b0 = b0 - (1./m_batch)*(db0)*(lr)
        w2 = w2 - (1./m_batch)*(dW2)*lr
        b1 = b1 - (1./m_batch)*(db1)*(lr)

    ## Training accuracy at the end of the epoch
    x1 = sigmoid(w1@X_train+b0)
    x2_train = softmax(w2@x1+b1)
    x2_train_df = pd.DataFrame(x2_train)
    x2_train_df = (x2_train_df == x2_train_df.max()).astype(int)
    x2_train_df = x2_train_df.transpose()
    x2_train_df = pd.merge(x2_train_df,one_hot)
    x2_train_df = x2_train_df[['label']]
    y_train_df = pd.merge(pd.DataFrame(y_train.T),one_hot)
    x2_train_df['label_actual'] = y_train_df['label']
    train_accuracy = np.sum(x2_train_df['label_actual']==x2_train_df['label'])/x2_train_df.shape[0]

    # print('Training Loss...')
    # print(-np.mean(np.multiply(y_train,np.log(x2_train))))
    add_loss = {
        'loss' : -np.mean(np.multiply(y_train,np.log(x2_train))),
        'weight_1' : w1,
        'weight_2': w2,
        'b0' : b0,
        'b1': b1,
        'train_accuracy': train_accuracy
    }

    ## Test accuracy at the end of the epoch
    x1 = sigmoid(w1@X_test+b0)
    x2_test = softmax(w2@x1+b1)
    x2_test_df = pd.DataFrame(x2_test)
    x2_test_df = (x2_test_df == x2_test_df.max()).astype(int)
    x2_test_df = x2_test_df.transpose()
    x2_test_df = pd.merge(x2_test_df,one_hot)
    x2_test_df = x2_test_df[['label']]
    y_test_df = pd.merge(pd.DataFrame(y_test.T),one_hot)
    x2_test_df['label_actual'] = y_test_df['label']
    test_accuracy = np.sum(x2_test_df['label_actual']==x2_test_df['label'])/x2_test_df.shape[0]

    print('Epoch: ',i)
    print('Testing Accuracy :',test_accuracy)
    print('Training Accuracy :',train_accuracy)
    print('----------------------------------------')

    # print('Testing Loss...')
    # print(-np.mean(np.multiply(y_test,np.log(x2_test))))
    add_loss['testing_loss'] = -np.mean(np.multiply(y_test,np.log(x2_test)))
    add_loss['test_accuracy'] = test_accuracy
    loss_weight_dict[count] = add_loss
    count = count + 1

Plotting Train Accuracy

train_accuracy = []

for i in range(len(loss_weight_dict)):
    train_accuracy.append(loss_weight_dict[i]['train_accuracy'])
import matplotlib.pyplot as plt
plt.plot(train_accuracy)
plt.xlabel('Epochs')
plt.ylabel('Training Accuracy')
plt.show()
Max Train Accuracy ~96%

Plotting Test Accuracy

test_accuracy = []

for i in range(len(loss_weight_dict)):
    test_accuracy.append(loss_weight_dict[i]['test_accuracy'])
import matplotlib.pyplot as plt
plt.plot(test_accuracy)
plt.xlabel('Epochs')
plt.ylabel('Testing Accuracy')
plt.show()
Max Test Accuracy ~69%

Checking Test Accuracy for Epoch where Test Accuracy is max

index_max = test_accuracy.index(max(test_accuracy))
weight_1 = loss_weight_dict[index_max]['weight_1']
weight_2 = loss_weight_dict[index_max]['weight_2']
b0 = loss_weight_dict[index_max]['b0']
b1 = loss_weight_dict[index_max]['b1']
test_data = pd.read_csv('../input/fashionmnist/fashion-mnist_test.csv')
test_data_mod = pd.merge(test_data,one_hot)
test_data_algo = test_data_mod.drop(['label',0,1,2,3,4,5,6,7,8,9],axis=1)
test_vector = np.array(test_data_algo.values)
test_vector = test_vector.T
test_vector = test_vector/255
x1 = sigmoid(weight_1@test_vector+b0)
x2 = softmax(weight_2@x1+b1)
x2_df = pd.DataFrame(x2)
x2_df = (x2_df == x2_df.max()).astype(int)
x2_df = x2_df.transpose()
x2_df = pd.merge(x2_df,one_hot)
x2_df['label_actual'] = test_data_mod['label']
print('Test Accuracy :',np.sum(x2_df['label_actual']==x2_df['label'])/x2_df.shape[0])
Test Accuracy is ~69%

Final Results

We achieve a training accuracy of ~96% and a test accuracy of ~69%.

Resources that helped me with this article

  • Michael Nielsen’s Neural Networks and Deep Learning, Chapter 2 (http://neuralnetworksanddeeplearning.com/chap2.html)
  • The Fashion MNIST data set and flattened CSV tables on Kaggle

Final Remarks

Thanks for reading! Please feel free to comment below in case of doubts/recommendations and I will do my best to get back to them. You can also email me at padhokshaja@gmail.com. If you liked this blog post, feel free to buy me a cup of coffee.
