Fashion product image classification using Neural Networks | Machine Learning from Scratch (Part VI)

Learn how to process image data and classify fashion products by building a Neural Network from scratch in Python

Venelin Valkov
Towards Data Science


TL;DR: Build a Neural Network in Python from scratch. Use the model to classify images of fashion products into 1 of 10 classes.

We live in the age of Instagram, YouTube, and Twitter. Images and video (a sequence of images) dominate the way millennials and other weirdos consume information.

Having models that understand what images show can be crucial for understanding your emotional state (yes, you might get a personalized Coke ad right after you post your breakup selfie on Instagram), location, interests and social group.

Predominantly, the models used in practice to understand image data are (Deep) Neural Networks. Here, we’ll implement a Neural Network image classifier from scratch in Python.

Image Data

Hopefully, it’s not a complete surprise to you that computers can’t actually see images as we do. Each image on your device is represented/stored as a matrix, where each pixel is one or more numbers.

Reading the fashion products data

Fashion-MNIST is a dataset of Zalando’s article images — consisting of a training set of _60,000_ examples and a test set of _10,000_ examples. Each example is a _28x28_ grayscale image, associated with a label from _10_ classes. Its creators intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and the same structure of training and testing splits.
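How the original loads the data is not shown here; one convenient way to get the same dataset is through Keras (an assumption, not necessarily the loader used in the article):

```python
import numpy as np
from tensorflow.keras.datasets import fashion_mnist

# 60,000 training and 10,000 test images, each a 28x28 grayscale array,
# with integer labels from 0 to 9
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

print(X_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
```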

Here is a sample of the images:

You might be familiar with the original handwritten digits MNIST dataset and be wondering why we’re not using it. Well, it might be too easy to make predictions on. And of course, fashion is cooler, right?

Exploration

The product images are grayscale, 28x28 pixels and look something like this:

Here are the first 3 rows from the pixel matrix of the image:

Note that the values are in the 0–255 range (grayscale).

We have 10 classes of possible fashion products:

  • 0: T-shirt/top
  • 1: Trouser
  • 2: Pullover
  • 3: Dress
  • 4: Coat
  • 5: Sandal
  • 6: Shirt
  • 7: Sneaker
  • 8: Bag
  • 9: Ankle boot

Let’s have a look at a lower dimensional representation of some of the products using t-SNE. We’ll transform the data into two dimensions using the implementation from scikit-learn:
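A minimal sketch of that projection, assuming the data is loaded as in the snippet above (the sample size of 1,000 is an arbitrary choice to keep t-SNE fast):

```python
from sklearn.manifold import TSNE

# t-SNE is slow on all 60,000 images, so project a random sample of 1,000
sample_idx = np.random.choice(len(X_train), size=1000, replace=False)
X_sample = X_train[sample_idx].reshape(1000, -1)  # flatten 28x28 -> 784

tsne = TSNE(n_components=2, random_state=42)
X_2d = tsne.fit_transform(X_sample.astype(np.float64))  # shape (1000, 2)
```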

You can observe a clear separation between some classes and significant overlap between others. Let’s build a Neural Network that can try to separate the different fashion products!

Neural Networks

Neural Networks (NNs), and Deep Neural Networks in particular, have been all the rage in the Machine Learning realm over the last couple of years. That’s hardly a surprise, since most state-of-the-art (SOTA) results on various Machine Learning problems are obtained with Neural Nets.

Browse papers with source code achieving SOTA

The Artificial Neuron

The goal of modeling the biological neuron has led to the invention of the artificial neuron. Here is what a single neuron in your brain looks like:

source: CS231n

On the other hand, we have a vastly simplified mathematical model that turns out to be extremely useful in practice (as evidenced by the success of Neural Nets):

source: CS231n

The idea of the artificial neuron is simple — you have a data vector X coming from somewhere, a vector of parameters W and a bias b. The output of the neuron is given by:

y = f(W · X + b)

where f is an activation function that controls how strong the output signal of the neuron is.

Architecting Neural Networks

You can use a single neuron as a classifier, but the fun part begins when you group them into layers. Concretely, the neurons are connected into an acyclic graph with the data flowing between layers:

source: CS231n

This simple Neural Network contains:

  • Input layer — 3 neurons that should match the size of your input data
  • Hidden layer — 4 neurons with weights W that your model should learn during training
  • Output layer — 2 neurons that provide the predictions of your model

Want to build a Deep Neural Network? Just add at least one more hidden layer:

source: CS231n

Sigmoid

The sigmoid function is a commonly used activation function, at least it was until recently. It has a distinct S shape, it is a differentiable real function for any real input value, and its output values lie between 0 and 1. Additionally, it has a positive derivative at every point. We will use it as the activation function for the hidden layer of our model.

Here’s how it is defined:

σ(x) = 1 / (1 + e^(−x))

Here is how we can implement it:
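A minimal NumPy version might look like this:

```python
import numpy as np

def sigmoid(x):
    # works element-wise on scalars and NumPy arrays
    return 1 / (1 + np.exp(-x))
```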

Its first derivative (which we will use during the backpropagation step of our training algorithm) has the following formula:

σ′(x) = σ(x) · (1 − σ(x))

Our implementation reuses the sigmoid implementation itself:
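Reusing sigmoid, it might look like this:

```python
def sigmoid_prime(x):
    # derivative of the sigmoid, expressed through the sigmoid itself
    return sigmoid(x) * (1 - sigmoid(x))
```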

Softmax

The softmax function is easily differentiable, it is pure (its output depends only on its input), and the elements of the resulting vector sum to 1. Here it is:

softmax(x)_i = e^(x_i) / Σ_j e^(x_j)

Here is the Python implementation:
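A possible NumPy sketch (subtracting the maximum before exponentiating is an extra numerical-stability step that may not be in the original):

```python
def softmax(z):
    # shifting by the max doesn't change the result but prevents overflow in np.exp
    z = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)
```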

In probability theory, the output of the softmax function is sometimes used as a representation of a categorical distribution. Let’s see an example result:
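The exact numbers used in the original aren't reproduced here, so the input below is a stand-in with 8 as its largest element:

```python
print(softmax(np.array([2.0, 4.0, 6.0, 8.0])))
# approx. [0.0021 0.0158 0.1171 0.865]
```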

The output has most of its weight corresponding to the input 8. The softmax function highlights the largest value(s) and suppresses the smaller ones.

Backpropagation

Backpropagation is the backbone of almost anything we do when using Neural Networks. The algorithm consists of 3 subtasks:

  1. Make a forward pass
  2. Calculate the error
  3. Make a backward pass (backpropagation)

In the first step, backprop uses the data and the weights of the network to compute a prediction. Next, the error is computed based on the prediction and the provided labels. The final step propagates the error through the network, starting from the final layer. Thus, the weights get updated based on the error, little by little.

Let’s build more intuition about what the algorithm is actually doing:

Solving XOR

We will try to create a Neural Network that can properly predict values from the XOR function. Here is its truth table:

  x1 | x2 | XOR(x1, x2)
   0 |  0 |      0
   0 |  1 |      1
   1 |  0 |      1
   1 |  1 |      0

Here is a visual representation:

Let’s start by defining some parameters:
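The exact values aren't reproduced here, so the ones below are reasonable stand-ins:

```python
import numpy as np

np.random.seed(42)  # for reproducibility (an assumption, not from the article)

epochs = 50000                          # how many passes over the data
n_input, n_hidden, n_output = 2, 5, 1   # layer sizes
learning_rate = 0.1
```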

The epochs parameter controls how many times our algorithm will “see” the data during training. Then we set the number of neurons in the input, hidden, and output layers — we have 2 numbers as input and 1 number as output. The learning rate parameter controls how quickly our Neural Network will learn from new data and forget what it already knows.

Our training data (from the table) looks like this:
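In NumPy form, taken directly from the truth table:

```python
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

y = np.array([[0],
              [1],
              [1],
              [0]])
```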

The weight matrices in our NN need some initial values. We’ll sample a uniform distribution, with the proper sizes:
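A sketch using NumPy's default [0, 1) interval (the interval used in the original isn't shown here):

```python
W_hidden = np.random.uniform(size=(n_input, n_hidden))
W_output = np.random.uniform(size=(n_hidden, n_output))
```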

Finally, here is the implementation of the Backprop algorithm:
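Below is a sketch of the training loop, reusing the sigmoid and sigmoid_prime functions defined earlier; the exact form of the updates in the original may differ slightly, but the idea is the same:

```python
for epoch in range(epochs):
    # forward pass
    hidden_in = np.dot(X, W_hidden)
    act_hidden = sigmoid(hidden_in)         # hidden layer output
    output = np.dot(act_hidden, W_output)   # predictions (linear output layer)

    # error: difference between the true and predicted values
    error = y - output

    if epoch % 5000 == 0:
        print(f'epoch {epoch}: error sum {np.sum(np.abs(error)):.4f}')

    # backward pass: propagate the error and nudge the weights
    d_output = error * learning_rate
    d_hidden = np.dot(d_output, W_output.T) * sigmoid_prime(hidden_in)

    W_output += np.dot(act_hidden.T, d_output)
    W_hidden += np.dot(X.T, d_hidden)
```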

That error seems to be decreasing! Yay! And the implementation is not that scary, is it?

During the forward step, we take the dot product of the data X and W_hidden​ and apply our activation function to obtain the output of our hidden layer. We obtain the predictions by taking the dot product of the hidden layer output and W_output​.

To obtain the error, we calculate the difference between the true values and the predicted ones. Note that this is a very crude metric, but it works fine for our example.

Finally, we use the calculated error to adjust the weights. Note that we need the result of the forward pass, act_hidden, to update W_output, and the first derivative (computed with sigmoid_prime) to update W_hidden.

To make an inference (prediction), we do just the forward step (since we won’t adjust W based on the result):
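For example, checking XOR(1, 0), which should be 1:

```python
X_test = np.array([[1, 0]])

act_hidden = sigmoid(np.dot(X_test, W_hidden))
prediction = np.dot(act_hidden, W_output)

print(np.round(prediction))  # expected: [[1.]]
```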

Our sorcery seems to be working! The prediction is correct!

Classifying Images

Building a Neural Network

Our Neural Network will have only 1 hidden layer. We will implement a somewhat more sophisticated version of our training algorithm shown above along with some handy methods.

Initializing the weights

We’ll sample a uniform distribution with values between -1 and 1 for our initial weights. Here is the implementation:
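A functional sketch (the article wraps this logic in its network class, so the function name and signature here are assumptions):

```python
import numpy as np

def init_weights(n_features, n_hidden, n_classes):
    # initial weights drawn uniformly from [-1, 1]
    W_hidden = np.random.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    W_output = np.random.uniform(-1.0, 1.0, size=(n_hidden, n_classes))
    return W_hidden, W_output
```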

Training

Let’s have a look at the training method:

For each epoch, we apply the backprop algorithm, evaluate the error and the gradient with respect to the weights. We then use the learning rate and gradients to update the weights.

Doing a backprop step is a bit more complicated than in our XOR example. We do an additional step before returning the gradients — applying L1 and L2 regularization. Regularization is used to guide our training towards simpler models by penalizing large values for our parameters W.
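The article implements all of this inside its network class; the functional sketch below shows the same idea, reusing sigmoid, sigmoid_prime and softmax from above. The names, the one-hot label matrix y_onehot, and the default hyperparameter values are assumptions:

```python
def backprop_step(X, y_onehot, W_hidden, W_output, l1=0.0, l2=0.1):
    # forward pass
    hidden_in = np.dot(X, W_hidden)
    act_hidden = sigmoid(hidden_in)
    probs = softmax(np.dot(act_hidden, W_output))

    # gradient of the cross-entropy loss w.r.t. the softmax input
    d_output = probs - y_onehot

    # gradients w.r.t. the weights, with the L1/L2 penalties added
    grad_output = np.dot(act_hidden.T, d_output) + l1 * np.sign(W_output) + l2 * W_output
    d_hidden = np.dot(d_output, W_output.T) * sigmoid_prime(hidden_in)
    grad_hidden = np.dot(X.T, d_hidden) + l1 * np.sign(W_hidden) + l2 * W_hidden

    return grad_hidden, grad_output


def train(X, y_onehot, W_hidden, W_output, epochs=300, learning_rate=1e-5, l1=0.0, l2=0.1):
    # for each epoch: compute the gradients via backprop, then step the weights
    for epoch in range(epochs):
        grad_hidden, grad_output = backprop_step(X, y_onehot, W_hidden, W_output, l1, l2)
        W_hidden -= learning_rate * grad_hidden
        W_output -= learning_rate * grad_output
    return W_hidden, W_output
```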

Our forward and backward steps are very similar to the ones in our previous example. But what about the error?

Measuring the error

We’re going to use the Cross-Entropy loss (also known as log loss) function to evaluate the error. This function measures the performance of a classification model whose output is a probability. It penalizes (harshly) predictions that are confident but wrong. Here is the definition:

CE = −Σ_{c=1..C} y_{o,c} · log(p_{o,c})

where C is the number of classes, y_{o,c} is a binary indicator (0 or 1) of whether class c is the correct classification for observation o, and p_{o,c} is the predicted probability that observation o belongs to class c.

The implementation in Python looks like this:
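A possible NumPy version (the small epsilon is an extra guard against log(0) that may not be in the original):

```python
def cross_entropy(y_onehot, probs):
    # per-example cross-entropy between one-hot labels and predicted probabilities
    return -np.sum(y_onehot * np.log(probs + 1e-12), axis=1)
```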

Now that we have our loss function, we can finally define the error for our model:

After computing the Cross-Entropy loss, we add the regularization terms and calculate the mean error. Here is the implementation for L1 and L2 regularizations:
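A sketch of the penalties and the combined error; the 1/2 factor in the L2 term is a common convention (so its gradient is simply lambda times W) and the article's exact form may differ:

```python
def l1_penalty(lambda_, W_hidden, W_output):
    return lambda_ * (np.abs(W_hidden).sum() + np.abs(W_output).sum())

def l2_penalty(lambda_, W_hidden, W_output):
    return lambda_ * (np.square(W_hidden).sum() + np.square(W_output).sum()) / 2.0

def error(y_onehot, probs, W_hidden, W_output, l1=0.0, l2=0.1):
    # mean cross-entropy plus the regularization terms
    loss = np.mean(cross_entropy(y_onehot, probs))
    return loss + l1_penalty(l1, W_hidden, W_output) + l2_penalty(l2, W_hidden, W_output)
```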

Making predictions

Now that our model can learn from data, it is time to make predictions on data it hasn’t seen before. We’re going to implement two methods for prediction — predict and predict_proba:

Recall that making predictions with a NN (generally) involves applying a forward step to the data. The result is a vector of values representing how strong the belief in each class is for that input. We’ll use Maximum Likelihood Estimation (MLE) to obtain our final predictions:

MLE works by picking the class with the highest value and returning it as the prediction for the input.

The method predict_proba returns a probability distribution over all classes, representing how likely each class is to be correct. Note that we obtain it by applying the softmax function to the result of the forward step.
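The article implements these as methods on its network class; a functional sketch of the same logic looks like this:

```python
def predict_proba(X, W_hidden, W_output):
    # forward pass followed by softmax: a probability distribution over the classes
    act_hidden = sigmoid(np.dot(X, W_hidden))
    return softmax(np.dot(act_hidden, W_output))

def predict(X, W_hidden, W_output):
    # pick the class with the highest probability for each example
    return np.argmax(predict_proba(X, W_hidden, W_output), axis=1)
```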

Evaluation

Time to put our NN model to the test. Here’s how we can train it:
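A hypothetical end-to-end run built from the functional sketches above (the article's actual API is a network class with fit and predict methods, and the hyperparameter values here are assumptions):

```python
# flatten the images and one-hot encode the labels
X_flat = X_train.reshape(len(X_train), -1).astype(np.float64)
X_test_flat = X_test.reshape(len(X_test), -1).astype(np.float64)
y_onehot = np.eye(10)[y_train]

W_hidden, W_output = init_weights(n_features=28 * 28, n_hidden=50, n_classes=10)
W_hidden, W_output = train(X_flat, y_onehot, W_hidden, W_output,
                           epochs=300, learning_rate=1e-5, l2=0.1)
```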

The training might take some time, so please be patient. Let’s get the predictions:

First, let’s have a look at the training error:

Something looks fishy here. It seems our model can’t continue to reduce the error after 150 epochs or so. Let’s have a look at a single prediction:

That one seems correct! Let’s have a look at a few more:

Not too good. How about the training & testing accuracy:
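Accuracy here is simply the fraction of correctly predicted labels; with the sketches above:

```python
train_acc = np.mean(predict(X_flat, W_hidden, W_output) == y_train)
test_acc = np.mean(predict(X_test_flat, W_hidden, W_output) == y_test)
print(f'train acc: {train_acc:.3f}, test acc: {test_acc:.3f}')
```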

Well, those don’t look that good. While a random classifier will return ~10% accuracy, ~50% accuracy on the test dataset will not make a practical classifier either.

Improving the accuracy

That “jagged” line on the training error chart shows the inability of our model to converge. Recall that we use the Backpropagation algorithm to train our model. Training of Neural Nets converges much faster when the data is normalized.

We’ll use scikit-learn’s scale to normalize our data. The documentation states that:

Center to the mean and component wise scale to unit variance.

Here is the new training method:
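In the sketch below the scaling happens just before training rather than inside the training method itself; the effect is the same, and everything else stays unchanged:

```python
from sklearn.preprocessing import scale

# standardize each pixel column to zero mean and unit variance
X_scaled = scale(X_flat)
X_test_scaled = scale(X_test_flat)

W_hidden, W_output = init_weights(n_features=28 * 28, n_hidden=50, n_classes=10)
W_hidden, W_output = train(X_scaled, y_onehot, W_hidden, W_output,
                           epochs=300, learning_rate=1e-5, l2=0.1)
```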

Let’s have a look at the error:

The error seems a lot more stable and settles at a lower point — ~200 vs ~400. Let’s have a look at some predictions:

Those look much better, too! Finally, the accuracy:

~87% (vs ~50%) on the training set is a vast improvement over the unscaled method. Finally, your hard work paid off!

Conclusion

What a ride! I hope you had a blast working on your first Neural Network from scratch, too!

You learned how to process image data, transform it, and use it to train your Neural Network. We used some handy tricks (scaling) to vastly improve the performance of the classifier.

Originally published at https://www.curiousily.com.
