SGD_MNIST Putting it all together

Badreldin Mostafa · Published in unpack · 5 min read · Apr 5, 2021

This is a summary of Chapter 4 of the fastai book (Practical Deep Learning for Coders): MNIST Basics.

The full version is available here:

The summary version of the code (Github) for this article is available here:

This code builds gradient descent from scratch on the MNIST data set. Gradient descent is the core algorithm that enables neural networks to do their magic and perform feats like image recognition.

The MNIST dataset is a collection of handwritten digits; more details are available here:

http://yann.lecun.com/exdb/mnist/

In this article, we will build a representation of a neural network that can differentiate between the handwritten digits three (3) and seven (7), so a subset of the MNIST set.

Presented with a picture, our system needs to decide whether it is more likely that the number is a 3 or a 7.

The dependent variable Y (the output) is the prediction for each picture the system is presented with. The independent variable X is the picture itself, and we need to find an equation that links the two. For simplicity, we will opt for a linear equation of the form Y = wX + b.

We present the system with a picture X and it should tell whether it is a 7 or a 3. The link between them is w and b, which we call the equation's parameters: w stands for weights and b for bias. Since the Xs are pictures, they take a matrix format (e.g. 28×28 pixels), and we usually flatten them into a vector (e.g. of length 28*28 = 784). We will therefore need a vector of weights (w) of the same size and a scalar bias (b).
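As a concrete sketch (in PyTorch, which the fastai book builds on), flattening one image and applying Y = wX + b looks like this; the image here is random noise standing in for a real MNIST digit:

```python
import torch

# A random 28x28 tensor stands in for one MNIST image.
image = torch.rand(28, 28)

# Flatten the 28x28 matrix into a vector of length 28*28 = 784.
x = image.view(-1)

# One weight per pixel, plus a single scalar bias.
w = torch.randn(784)
b = torch.randn(1)

# The linear model Y = wX + b gives one number per picture.
y = x @ w + b
print(x.shape, y.shape)
```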

The goal of gradient descent is to identify the best values of the parameters w and b, so that they predict the correct answer as accurately as possible.

We start with random assignments for the (w) and (b) values and see how well we do on the validation dataset. Obviously, it won't do very well on the first take, as we have chosen random parameters.

Gradient descent is a strategy that allows us to get to the optimum parameters as fast as possible, by smartly modifying them.

We calculate what is called an error function (or loss function), which is basically the difference between the results obtained with our current parameters and the real expected results. We need to make this number as small as possible: the smaller it gets, the better our predictions are. This is where the gradients come in. We move our parameters (w) and (b) in the direction opposite to the current gradient. From high school math, to reach a minimum of a function we look for a point where the gradient equals zero. By moving against the gradient's direction we head towards a local minimum of the function, and since it is a loss function, the smaller its value, the better the whole system gets at each prediction. I would strongly recommend reading further on gradient descent in the fastai book, as it goes in-depth.
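To make "move against the gradient" concrete, here is a minimal one-parameter example (my own illustration, not from the book): minimizing loss(w) = (w − 3)², whose minimum sits at w = 3:

```python
import torch

# Start from an arbitrary value and repeatedly step against the gradient.
w = torch.tensor(10.0, requires_grad=True)
lr = 0.1  # learning rate

for _ in range(50):
    loss = (w - 3) ** 2      # loss is smallest at w = 3
    loss.backward()          # compute d(loss)/dw
    with torch.no_grad():
        w -= lr * w.grad     # move opposite to the gradient
        w.grad.zero_()       # reset the gradient for the next step

print(w.item())  # very close to 3.0
```

Each step shrinks the distance to the minimum by a constant factor, which is exactly the behavior we want from the full training loop later on.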

Below is the detailed process: initialize the parameters randomly, use them to make predictions, calculate the loss, calculate the gradients, step the parameters in the opposite direction, and repeat until we decide to stop.

Now let’s walk through the code (I would recommend using the GitHub link provided to follow along):

The first step is to get and prepare the data from the MNIST Data set.
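A rough sketch of this step, assuming PyTorch: the random tensors below stand in for the real stacked images of 3s and 7s (which the book loads via fastai's URLs.MNIST_SAMPLE download), and the image counts 6131 and 6265 are illustrative:

```python
import torch

# Random tensors stand in for the stacked images of 3s and 7s
# (in the book these come from fastai's MNIST_SAMPLE download).
stacked_threes = torch.rand(6131, 28, 28)
stacked_sevens = torch.rand(6265, 28, 28)

# Flatten every image into a 784-long row vector.
train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28 * 28)

# Label each 3 as 1 and each 7 as 0; unsqueeze makes the labels a column.
train_y = torch.tensor(
    [1] * len(stacked_threes) + [0] * len(stacked_sevens)
).unsqueeze(1)

# A dataset is just a list of (picture, label) pairs; valid_dset is built
# the same way from the held-out validation images.
dset = list(zip(train_x, train_y))
print(train_x.shape, train_y.shape)
```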

In the previous code snippet, the 3s and 7s were selected from the MNIST data set and split into a training dataset (dset) and a validation dataset (valid_dset). Both contain the picture (x) along with the output, whether it is a 3 or a 7 (y). The training dataset will be used to identify the parameters (w) and (b).
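The three functions could be sketched roughly as follows, modeled on the book's init_params, linear1 and mnist_loss (the demo call at the end uses fake pictures):

```python
import torch

def init_params(size, std=1.0):
    # Random starting values, with gradient tracking switched on.
    return (torch.randn(size) * std).requires_grad_()

weights = init_params((28 * 28, 1))  # one weight per pixel
bias = init_params(1)                # a single scalar bias

def linear1(xb):
    # The equation we are optimizing: Y = wX + b, for a whole batch at once.
    return xb @ weights + bias

def mnist_loss(predictions, targets):
    # Squash raw predictions into (0, 1) with sigmoid, then measure the
    # distance from the target: 1 for a 3, 0 for a 7. Smaller is better.
    predictions = predictions.sigmoid()
    return torch.where(targets == 1, 1 - predictions, predictions).mean()

preds = linear1(torch.rand(4, 28 * 28))  # 4 fake pictures
loss = mnist_loss(preds, torch.tensor([[1.0], [0.0], [1.0], [0.0]]))
print(preds.shape, float(loss))
```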

In the previous code snippet we defined 3 functions. The first helps create the parameters (w and b). The second is the function we are trying to optimize, Y = wX + b. And the third takes the predictions made on the data and compares them to the targets; this is the number we want to minimize by changing the parameters (w and b).
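A sketch of the batching step: the book uses fastai's DataLoader, but torch's works the same way for this purpose, and the (x, y) pairs here are synthetic:

```python
import torch
from torch.utils.data import DataLoader

# A synthetic dataset of (picture, label) pairs standing in for dset.
dset = [(torch.rand(784), torch.tensor([float(i % 2)])) for i in range(1000)]

# The DataLoader cuts the dataset into shuffled mini-batches of 256.
dl = DataLoader(dset, batch_size=256, shuffle=True)

xb, yb = next(iter(dl))
print(xb.shape, yb.shape)  # one batch of inputs with matching labels
```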

In this bit of code, we split the data into batches using the data loader functions.

The following 2 functions take the data, apply the model function to get the predictions while maintaining the gradients, check the difference between the predictions and the targets, and modify the parameters (w and b) by applying a learning rate (LR) in the direction opposite to the gradient.
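They could look roughly like the book's calc_grad and train_epoch; the model, loss, and batch below are minimal stand-ins so the sketch runs on its own:

```python
import torch

# Minimal stand-ins for pieces from the earlier snippets.
weights = torch.randn(784, 1, requires_grad=True)
bias = torch.randn(1, requires_grad=True)

def linear1(xb):
    return xb @ weights + bias

def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets == 1, 1 - predictions, predictions).mean()

# One synthetic batch standing in for the DataLoader.
dl = [(torch.rand(256, 784), (torch.rand(256, 1) > 0.5).float())]

def calc_grad(xb, yb, model):
    # Predict while maintaining gradients, then backpropagate the loss.
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()

def train_epoch(model, lr, params):
    # For every batch: compute gradients, then move each parameter a small
    # step (scaled by lr) opposite to the direction of its gradient.
    for xb, yb in dl:
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad * lr
            p.grad.zero_()

train_epoch(linear1, lr=1.0, params=(weights, bias))
```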

These 2 functions are designed to actually determine how accurate the results are.
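They might look like the book's batch_accuracy and validate_epoch; the demo call at the end uses made-up predictions:

```python
import torch

def batch_accuracy(xb, yb):
    # A prediction above 0.5 (after sigmoid) counts as a "3"; compare
    # those calls with the true labels.
    preds = xb.sigmoid()
    correct = (preds > 0.5) == yb
    return correct.float().mean()

def validate_epoch(model, valid_dl):
    # Average the per-batch accuracies across the whole validation set.
    accs = [batch_accuracy(model(xb), yb) for xb, yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)

# Demo with raw model outputs: two of the three predictions match the labels.
preds = torch.tensor([[2.0], [-3.0], [1.0]])
labels = torch.tensor([[1.0], [0.0], [0.0]])
print(batch_accuracy(preds, labels))  # 2 out of 3 correct
```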

So we test the model with the first iteration of parameters and look at the results.

As we can see, the accuracy we got is quite low, around 73%. But as we go through multiple iterations, the overall accuracy of the system improves.
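Putting the pieces together, the training loop looks roughly like this. The data here is synthetic and linearly separable, so the exact numbers will not match the article's 73% and 97.6% (those come from the real MNIST images), but the same pattern holds: a poor score from the random parameters, improving over the epochs:

```python
import torch

torch.manual_seed(42)

# Two synthetic, well-separated classes stand in for the 3s and 7s.
x = torch.cat([torch.randn(500, 784) + 1.0,   # "3s" centered at +1
               torch.randn(500, 784) - 1.0])  # "7s" centered at -1
y = torch.cat([torch.ones(500, 1), torch.zeros(500, 1)])

# Random starting parameters, as in the first iteration of the article.
weights = (torch.randn(784, 1) * 0.01).requires_grad_()
bias = torch.zeros(1, requires_grad=True)

def model(xb): return xb @ weights + bias

def mnist_loss(p, t):
    p = p.sigmoid()
    return torch.where(t == 1, 1 - p, p).mean()

def accuracy():
    with torch.no_grad():
        return ((model(x).sigmoid() > 0.5) == y).float().mean().item()

print("before training:", accuracy())

lr = 1.0
for epoch in range(30):
    loss = mnist_loss(model(x), y)
    loss.backward()
    with torch.no_grad():
        for p in (weights, bias):
            p -= p.grad * lr   # step against the gradient
            p.grad.zero_()

print("after 30 epochs:", accuracy())
```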

As you can see, after running 30 epochs we reached an accuracy of 97.6%.
