In the previous post I demonstrated the basic intuition behind deep neural networks. I also showed how to write an image classification system that achieves over 95% accuracy at classifying handwritten digits. However, to keep that introduction simple, I deliberately avoided the full details of how neural networks work. I shall make up for that in this post.

Above is a neural network. As explained in the previous post, each neuron computes its output based on two parameters: W, the weight, and b, the bias.

These weights and biases are initialized randomly at the start, but through a sophisticated learning process their values are adjusted to maximise the network's ability to classify accurately on a given task.

The sole purpose of training a neural network for many epochs is to discover the best values for these parameters.

At the start of training, the parameters are randomly initialized. The neural network uses these weights and biases to predict the class of each image, and most likely the results will be largely incorrect. The parameters are then adjusted and used to predict the images again. This process is repeated, with the parameters adjusted each time in the direction that leads to higher prediction accuracy. That direction is determined by a loss function, which tells us how well we are doing with our current set of weights.

First I shall explain how the loss is computed, then I shall explain how the parameters are updated.


The ultimate goal of training our network is to find a set of parameters that results in the lowest possible loss. Computing this loss takes a bit of math.

First, the standard loss function for classification in modern neural networks is softmax cross entropy.

Here is how it works. The parameters used to compute the activations of the network yield a set of scores, which I shall denote as the vector S.

Note that S is equal in length to the final layer of our network. For example, when classifying handwritten digits there are 10 classes, 0–9, so S has 10 elements, each one the score for one class. For a given image sample, the network outputs a score for each of the classes 0, 1, 2, 3, …, 9. Think of it as 10 classifiers working in parallel, where the one with the highest score is the one we believe to be correct.

Let's say S = (2,8,5,3,1,1,4,2,6,3). This corresponds to scores for the classes [0,1,2,3,4,5,6,7,8,9].

Here we can see that class 1 has the highest score. But it is not clear how to interpret these numbers or what exactly they mean. Softmax converts these scores into probabilities, which are much easier to interpret.

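Softmax takes the form

$$P_i = \frac{e^{S_i}}{\sum_{k} e^{S_k}}$$

where $S_i$ is the score for class $i$ and $P_i$ is the probability assigned to that class.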

The above equation might seem intimidating. I must admit that it took me a while to understand it; thanks to Stanford's CS231n, I finally did.

The equation can be simplified as follows:

Step 1: Take the exponential of each score

The output of this would be the vector V = [7.39,2980.96,148.41,20.09,2.72,2.72,54.60,7.39,403.43,20.09]

Step 2: Compute the sum of the exponentials

The output of this would be 3647.8

Step 3: Divide each exponential by the sum of the exponentials

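The output of this, rounded to three decimal places, is:

[0.002, 0.817, 0.041, 0.006, 0.001, 0.001, 0.015, 0.002, 0.111, 0.006]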


When you sum these up, you get approximately 1; if I hadn't rounded each output, the total would be exactly 1. This shows us that the outputs of the softmax operation can indeed be read as the probabilities of the classes.
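The three steps above can be sketched in a few lines of plain Python. This is only a minimal illustration using the example scores from this post; the function name `softmax` is illustrative and this is not how Keras implements it internally.

```python
import math

def softmax(scores):
    """Turn a list of raw class scores into probabilities."""
    exps = [math.exp(s) for s in scores]   # Step 1: exponential of each score
    total = sum(exps)                      # Step 2: sum of the exponentials
    return [e / total for e in exps]       # Step 3: divide each by the sum

scores = [2, 8, 5, 3, 1, 1, 4, 2, 6, 3]   # the example vector S
probs = softmax(scores)

# The predicted class is the index with the highest probability.
predicted_class = probs.index(max(probs))

print([round(p, 3) for p in probs])
print("sum of probabilities:", sum(probs))
print("predicted class:", predicted_class)
```

Without rounding, the probabilities sum to 1 up to floating point error. Real implementations usually subtract the largest score from every score before exponentiating, which avoids overflow for large scores and leaves the result unchanged.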

Remember that what we are after is the cross entropy loss. Now that we have our output probabilities, we can compute it.

The cross entropy loss is defined as

$$L = -\log(P_j)$$

where P is the output of the softmax and j is the index of the correct class.

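Notice that the softmax outputs a probability of about 0.8 for the digit 1. Let's imagine the actual digit, as provided by the training data, is 3. The probability assigned to 3 is 20.09 / 3647.8 ≈ 0.0055, hence, using the natural logarithm, the loss is

$$-\log(0.0055) \approx 5.2$$

The higher the probability of the correct class, the lower this error would be.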

This loss is then averaged over all the training images in a single batch, as defined by the batch size we set.
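As a rough sketch of the loss computation and the batch averaging, assuming the natural logarithm and some made-up softmax outputs for a tiny batch of two samples with three classes each (the helper names `cross_entropy` and `batch_loss` are hypothetical):

```python
import math

def cross_entropy(probs, correct_index):
    """Loss for one sample: negative log of the probability of the correct class."""
    return -math.log(probs[correct_index])

def batch_loss(batch_probs, batch_labels):
    """Average the per-sample losses over a batch."""
    losses = [cross_entropy(p, j) for p, j in zip(batch_probs, batch_labels)]
    return sum(losses) / len(losses)

# Made-up softmax outputs, purely for illustration.
batch_probs = [[0.7, 0.2, 0.1],
               [0.1, 0.1, 0.8]]
batch_labels = [0, 1]  # index of the correct class for each sample

right_loss = cross_entropy(batch_probs[0], batch_labels[0])  # correct class got 0.7
wrong_loss = cross_entropy(batch_probs[1], batch_labels[1])  # correct class got only 0.1
mean_loss = batch_loss(batch_probs, batch_labels)

print(right_loss, wrong_loss, mean_loss)
```

The first sample gives the correct class a high probability and so gets a small loss; the second gives it a low probability and gets a much larger one.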

So, now we have the softmax function for computing the probabilities as well as a loss function that tells us how well we did. We know we are doing very badly in our early epochs because our parameters are guesses. How do we adjust these parameters to give us a lower loss? This is where Stochastic Gradient Descent comes in.


Look at the valley above. Let's imagine that at the initialization of our parameters we are at the top, where the height is the loss: the higher we are, the higher the loss. Imagine that our parameters dictate our position in this space. What we want is to be at the bottom of the valley, where our loss is lowest. So how do we get there? Forget momentum, friction and the laws of physics, at least for the moment. To adjust millions of parameters so that we eventually reach the bottom of the valley, we have to continually alter the parameters of the network so that they move in the downhill direction, that is, the direction of the negative gradient of the loss. Essentially, we gradually subtract some value from each parameter in such a way that we converge at the bottom. The size of the steps we take down this valley is controlled by the learning rate.

Stochastic Gradient Descent solves this problem using a very simple technique.

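For a given parameter W, SGD performs the following adjustment

$$W \leftarrow W - \eta \frac{\partial L}{\partial W}$$

where $L$ is the loss and $\eta$ is the learning rate. It essentially computes the partial derivative of the loss with respect to each parameter, then moves that parameter a small step in the opposite direction.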

These derivatives are computed via Backpropagation.

Consult Michael Nielsen's tutorial if you want to go deeper into SGD and backpropagation.
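To make the update rule concrete, here is a toy sketch with a single parameter. It assumes a made-up loss, (w - 3)², whose derivative can be written by hand instead of being computed by backpropagation. It also leaves out the "stochastic" part (sampling random mini-batches), so it only illustrates the repeated update step.

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (an assumption for illustration)."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate, steps):
    """Repeatedly move w a small step against the gradient."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

w_start = 10.0  # stands in for a random initial guess
w_final = gradient_descent(w_start, learning_rate=0.1, steps=100)

print("start:", w_start, "loss:", loss(w_start))
print("final:", w_final, "loss:", loss(w_final))
```

Each step shrinks the distance to the bottom of this one-dimensional valley; with a learning rate above 1.0 on this particular loss, the steps would overshoot and the loss would grow instead.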

The most important thing to understand here is how your learning rate affects learning. When training on real-world datasets, you should adjust the learning rate as training proceeds, using a higher learning rate initially and reducing it as training goes on. I shall demonstrate how to do this in Keras in upcoming tutorials.
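One common way to reduce the learning rate over time is a step decay schedule. The sketch below is plain Python, not Keras code, and the function name `step_decay` and its default values are assumptions chosen only for illustration.

```python
def step_decay(initial_rate, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return initial_rate * (drop ** (epoch // epochs_per_drop))

# Learning rate at a few selected epochs, starting from 0.1.
schedule = {epoch: step_decay(0.1, epoch) for epoch in (0, 9, 10, 25)}

for epoch, rate in schedule.items():
    print("epoch", epoch, "learning rate", rate)
```

The rate stays at its initial value for the first ten epochs, then halves at epoch 10 and again at epoch 20, so the later steps down the valley become smaller.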

One last component I promised to explain is one hot encoding. Keras provides a simple function to do this, but you need to understand how it works.

It is a simple method of converting string labels into arrays in which all elements are zero except for a one at the index of the class.

For example, suppose there are three classes, [Madrid,Barcelona,Milan].

The one hot encoding of Madrid would be [1,0,0]

Barcelona = [0,1,0]

Milan = [0,0,1]

Hence [Madrid,Barcelona,Milan] = [[1,0,0],[0,1,0],[0,0,1]]
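The encoding above can be reproduced with a small helper. This is a minimal sketch rather than the Keras function itself; the name `one_hot` is hypothetical.

```python
def one_hot(label, classes):
    """Return a list of zeros with a single 1 at the index of `label`."""
    vector = [0] * len(classes)
    vector[classes.index(label)] = 1
    return vector

classes = ["Madrid", "Barcelona", "Milan"]
encoded = [one_hot(label, classes) for label in classes]

print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```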

Softmax cross entropy works naturally with this format. To compute the loss, we need to pick out the probability of the correct class; when the target is provided in one hot form, cross entropy simply looks up the index of the 1 and uses it to pick the probability whose loss we shall find:

$$L = -\log(P_j)$$

where P is the output of the softmax and j is the index of the 1 in the one hot target.
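A short sketch of that lookup, assuming the natural logarithm and made-up probabilities for three classes. It also shows that summing over every class, weighted by the one hot target, gives the same number, because every term except the one at index j is multiplied by zero. Both function names are hypothetical.

```python
import math

def cross_entropy_lookup(probs, target):
    """Find the index of the 1 in the one hot target and use that probability."""
    j = target.index(1)
    return -math.log(probs[j])

def cross_entropy_sum(probs, target):
    """Equivalent form: weight each log-probability by the target entry."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))

probs = [0.2, 0.7, 0.1]   # made-up softmax output
target = [0, 1, 0]        # one hot target: the correct class is index 1

loss_lookup = cross_entropy_lookup(probs, target)
loss_sum = cross_entropy_sum(probs, target)

print(loss_lookup, loss_sum)
```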

That concludes this post. In the next post, I will explain how to build an even more advanced image classification system using convolutional neural networks. So stay tuned.

Questions are welcome. Just comment below and I will get back to you.

You can always reach me on Twitter via @johnolafenwa
