Neural Networks 101

Let’s unravel the working principle behind the neural network.

Photo by Salmen Bejaoui on Unsplash

The term neural network hardly needs an introduction, yet only a few people know the real power of a neural net, and a lot of people want to learn this extraordinary tech. So what are you going to read here? Rather than discussing the types and applications of neural networks, I will go through the seven-step mechanism that makes a neural network powerful and versatile.

When do we call a ‘program’ a machine learning model?

This is an important question to answer before diving in. If we identify the limitations of a traditional program and replace them with some cool tweaks, we can call the result a machine learning model.

In traditional programming, we write out the steps for the program to run, one after another. A machine learning model is really different: rather than giving it the steps to solve a problem, we show it examples of the problem and let it figure out how to solve the problem by itself.

We said some tweaks make a traditional program a machine learning model and they are:

  • The idea of “weight assignment”.
  • The fact that every weight assignment has some “actual performance”.
  • The requirement that there be an “automatic means” of testing the performance of the model.
  • The need for a “mechanism” for improving the performance by changing the weight assignments.

All right, the above points might seem confusing, but trust me, we will get there. They come from Arthur Samuel, an IBM researcher who began working on machine learning in 1949. I have summarized his description below:

Take a traditional program and give it a weight assignment, where every input gets multiplied by a weight. We then need a performance metric to check whether the weights are helpful and, if they are not, a “mechanism” that automatically updates the weights. This process repeats until we get the desired output. This is only a short explanation; we will dig into each of these steps in more detail.

Converting traditional program into a full-blown machine learning model

Our goal is to create a program that can recognize images of 3s and 7s and, remember, to do it using only traditional programming.

How about finding the average value of every pixel across the 3s, then doing the same for the 7s? This gives us two groups of averages, which we can call an ideal 3 and an ideal 7. Now we compare our images of 3s and 7s with these ideal images, and this helps us classify each image as one digit or the other.

So what we basically do is measure the difference between the pixels of an image and the pixels of each ideal image, and pick the closer one. Coming to the main theme, how do we convert this into a fully functional machine learning model?
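To make the pixel similarity idea concrete, here is a minimal sketch in PyTorch. It does not use MNIST; the fake "3s" and "7s" below are made-up tensors (bright on the left vs. bright on the right), and the helper names `make_fake_digits`, `distance`, and `is_3` are my own assumptions for illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in data: 28x28 "images" with pixel values in [0, 1).
# Fake 3s are bright in the left half, fake 7s are bright in the right half.
def make_fake_digits(n, bright_left):
    imgs = torch.rand(n, 28, 28) * 0.2
    if bright_left:
        imgs[:, :, :14] += 0.8
    else:
        imgs[:, :, 14:] += 0.8
    return imgs

threes = make_fake_digits(50, bright_left=True)
sevens = make_fake_digits(50, bright_left=False)

# The "ideal" digit is the per-pixel average over all examples of that digit.
ideal_3 = threes.mean(dim=0)
ideal_7 = sevens.mean(dim=0)

def distance(imgs, ideal):
    # Mean absolute pixel difference between each image and the ideal image.
    return (imgs - ideal).abs().mean(dim=(-1, -2))

def is_3(imgs):
    # An image is called a 3 when it is closer to the ideal 3 than to the ideal 7.
    return distance(imgs, ideal_3) < distance(imgs, ideal_7)

accuracy_3s = is_3(threes).float().mean().item()
accuracy_7s = (~is_3(sevens)).float().mean().item()
print(accuracy_3s, accuracy_7s)
```

Notice that nothing in this program is learned: there are no weights to adjust, which is exactly what the rest of this section sets out to change.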

There are no parameters available for our pixel similarity program and we clearly don’t have the following things:

  • any kind of weight assignment of course
  • any way of improving based on testing the effectiveness of the weight assignment.

If we get to incorporate these tweaks into our pixel similarity program, this could be called a machine learning model indeed. Let’s have this high-level overview of the conversion,

  • We could look at each pixel and come up with a set of weights for each, such that the highest weights are associated with those pixels most likely to be black for a particular category.
  • For example, assume that pixels toward the bottom right are not very likely to be activated for a 7, so they would get a low weight for a 7; they are likely to be activated for an 8, so they would get a high weight for an 8.
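As a rough illustration of per-pixel weights, here is a tiny sketch with a made-up 4-pixel image. The pixel and weight values are assumptions chosen by hand; in a real model the weights would be learned.

```python
import torch

# A tiny 2x2 "image" flattened to 4 pixels (hypothetical values, 1.0 = ink).
pixels = torch.tensor([0.0, 1.0, 1.0, 0.0])

# One weight per pixel: high where we expect ink for this category,
# low (negative) where we do not.
weights = torch.tensor([-1.0, 2.0, 2.0, -1.0])

# The score is the weighted sum of the pixels; a higher score means the
# image looks more like this category.
score = (pixels * weights).sum()
print(score.item())  # 4.0

# An image with ink in the "wrong" pixels scores lower.
other_pixels = torch.tensor([1.0, 0.0, 0.0, 1.0])
other_score = (other_pixels * weights).sum()
print(other_score.item())  # -2.0
```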

So far we have kept calling this a tweak, but that is not the right name for it. It is an optimization function, which helps with both the weight assignment and its improvement by testing the effectiveness of the current weights. To be more specific, we will look at Stochastic Gradient Descent (SGD), the most common optimization function used in neural networks.

In short, searching for the best weight for each pixel is a way of searching for the best function for recognizing 3s and 7s.

Below are the steps we are going to perform to convert our program into a machine learning model:

  • Step 1: Find a way to initialize random weights.
  • Step 2: And for each image, use these weights to predict whether it appears to be a `3` or a `7`.
  • Step 3: Based on the above predictions calculate how good the model is. This is where we introduce the term called loss function.
  • Step 4: Calculate the gradient, which plays a crucial role in weight assignment. It will tell us how to change the weights so that our loss would change.
  • Step 5: Step the weights, that is change the weights based on the gradients calculated.
  • Step 6: Go back to Step 2 and repeat the process.
  • Step 7: Iterate until we decide to stop the training process.

All right, there are some jargon terms hidden up there, so let’s clear up some of them.


To initialize, we set the parameters (weights) to random values at first. Starting with random weights works perfectly well in practice.


The loss is a function that returns a number that is small when the performance of the model is good. The standard approach is to treat a small loss as a good sign and a large loss as a bad one.


To step the weights, a simple way to figure out whether a weight should be increased or decreased a bit would be to just try increasing it by a small amount and observe whether the loss goes up or down. We would repeat these increments and decrements until we find an amount that satisfies us.

However, we use calculus to take care of this. It finds in which direction, and roughly how much, to change each weight without trying out all those small adjustments. We do this by calculating gradients. This is just a performance optimization.
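Here is a small sketch of the "nudge the weight and watch the loss" idea, next to what calculus gives us directly. The loss function is a made-up example (smallest at `w == 2`), not the loss of our digit model.

```python
def loss(w):
    # Hypothetical loss: smallest when w == 2.
    return (w - 2.0) ** 2

w = 5.0
eps = 0.01

# Manual approach: nudge the weight up a little and watch the loss.
loss_before = loss(w)
loss_after = loss(w + eps)
went_up = loss_after > loss_before  # True here, so w should be decreased

# The same nudge also gives a rough slope (rise over run).
estimated_slope = (loss_after - loss_before) / eps

# Calculus gives the exact slope without any trial nudges:
# the derivative of (w - 2)^2 is 2 * (w - 2).
exact_slope = 2.0 * (w - 2.0)

print(went_up, estimated_slope, exact_slope)
```

With a single weight the manual nudge is cheap, but a model with thousands of weights would need thousands of extra loss evaluations per step, which is why we let calculus do the work.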


To stop, we choose the number of epochs to train the model for. We would keep training until the accuracy of the model starts getting worse or we run out of time.

What are gradients all about?

We have been talking about performance optimization for so long, but what does it even mean?

Machine learning is built on math, and by leveraging it we can automate processes more efficiently. In our SGD optimization process, we use calculus to improve the performance of the optimization step.

Calculus helps us quickly calculate whether our loss will go up or down as we adjust the parameters of our model. The gradient is the key quantity that helps us find the best parameters for our model.

In simple words, gradients tell us how changing each weight would change the loss, and therefore in which direction to change each weight to make our model better. Here is the most general definition of the gradient:

The gradient is defined as rise/run, that is, the change in the value of the function divided by the change in the value of the parameter.

When we talk about calculus, we shouldn’t forget the term derivative, which measures how a function’s output changes as its input changes. For instance, the derivative of a quadratic function at the value 3 tells us how rapidly the function changes at the value 3. In short, a derivative is the rate of change of a function.
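The rise/run definition can be checked by hand. This sketch assumes the quadratic is `f(x) = x ** 2` (the text does not say which quadratic) and shrinks the run to show the ratio settling toward the derivative at 3, which for this function is 6.

```python
def f(x):
    # Assumed quadratic function for illustration.
    return x ** 2

x = 3.0
slopes = []
for run in (1.0, 0.1, 0.001):
    rise = f(x + run) - f(x)   # change in the value of the function
    slopes.append(rise / run)  # rise divided by run

# The ratios get closer to 6 as the run gets smaller.
print(slopes)
```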

The idea here is that when we know how our loss function will change, we know what we need to do to make it smaller. The important mechanism in machine learning is having a way to change the parameters so that the loss becomes smaller, and gradients give us that mechanism.

The loss function depends on not one but a lot of weights (parameters), and when we calculate the derivative we get a gradient for every weight (parameter).
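A quick sketch of "one gradient per weight" in PyTorch. The three weights and the sum-of-squares loss are made up for illustration.

```python
import torch

# Three hypothetical weights that we want gradients for.
weights = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)

# A made-up loss: it takes many weights but returns a single number.
loss = (weights ** 2).sum()

# Backpropagation fills in one gradient per weight.
loss.backward()

print(loss.item())   # 14.0
print(weights.grad)  # tensor([ 2., -4.,  6.])
```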

And the process of calculating the gradients for a model is also known as backpropagation.

We’ve covered a fair bit of jargon so far, and in the next part of this series we will jump right into some code, where we will implement the above 7 steps in PyTorch.

Getting hands dirty with code

We’ve played enough with the theory, now it’s time to get our hands dirty with code. In this part, we will write PyTorch and fastai code to represent the 7 steps with actual data called MNIST.

Note: The notebook version of this blog is available here Neural Networks 101 Google Colab, feel free to run the cells and visualize the results.

Enough talking, let’s jump right in. Before getting started we have to take care of the data; even though we are going to deal with it later, let’s load it now.

# Load the MNIST data: download the archive and untar it
from fastai.vision.all import *
data_path = untar_data(URLs.MNIST)

In the above code cell, we have just downloaded the MNIST data, and the function untar_data takes care of the download and returns us a path of where the files are stored.

We will leave it here for now and see how to calculate gradients with PyTorch.

Calculating gradients with PyTorch

Now that we know what gradients are and why they are important, let’s see how to code them using PyTorch. We will first take a look at the whole code, then gradually break down every line and see what it does.
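The original snippet is not reproduced here, so the following is a minimal sketch of what the lines discussed in this section look like together. Two assumptions: the function `f` is taken to be `x ** 2` (the text does not show it), and `torch.tensor` is used in place of the `tensor` shortcut that fastai exports.

```python
import torch

def f(x):
    # Assumed function for illustration; the text does not show the original.
    return x ** 2

# Create a tensor and ask PyTorch to track gradients for it.
xt = torch.tensor(8.).requires_grad_()

# Apply some computation to the tensor.
yt = f(xt)

# Activate backpropagation to calculate the gradients.
yt.backward()

# The gradient of x**2 at x = 8 is 2 * 8 = 16.
print(xt.grad)  # tensor(16.)
```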

  • xt = tensor(8.).requires_grad_() : creates a tensor first; calling requires_grad_() on any tensor in PyTorch makes it automatically track the operations applied to that tensor so that gradients can be calculated for it.
  • In the next step we apply some computation to our tensor (giving us yt) and then, at last, activate backpropagation, which gets us the gradients.
  • yt.backward() : this calculates the gradients by activating backpropagation.
  • xt.grad : gives us the gradients calculated for this variable during the computation.

But you might be wondering why we use requires_grad_() and what the underlying mechanism behind this calculation is. In my recent blog, I explained auto differentiation and the mechanism that powers this whole thing.

Glimpse on Auto Differentiation

Well, the first time we compute our loss function with our parameters, we get the value of the loss, and we call this process the forward pass.

The forward pass is responsible for computing the loss from our parameters. But we know that a neural network has to optimize its parameters to achieve the best results, which means minimizing the loss.

But how do we find the values that will help the neural network to find the best parameters to minimize the loss?


We have to get the gradients by activating backpropagation (the backward pass). First we perform a forward pass to compute the loss; then backpropagation applies the chain rule to the partial derivatives of each recorded operation to compute the gradients for us.

But what do all of these things have to do with auto differentiation?

Auto differentiation keeps track of these computations, so during backpropagation it only has to walk back through the recorded operations to compute the gradients. With the help of partial derivatives alone, it can compute the gradients of the trainable variables (weights and biases) while keeping a record of thousands of derivatives and gradients.

Note: The gradients will tell us only the slope of our function, they don’t really say how far we should adjust the parameters.

So how to tell our parameters the way they should move to minimize the loss?

We will use something called the learning rate.

The gradients tell us the direction, but not how big a step we should take in that direction. This is where the learning rate helps: it tells us how large each step should be, or in other words, it scales how much we trust the gradient when stepping in its direction.

So we will multiply the gradient by a small number (learning rate) to step the weights.
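To see why the learning rate should be a small number, here is a sketch of a single step with two different learning rates. The loss and the two rates (0.1 and 1.5) are made-up values for illustration.

```python
def loss(w):
    # Hypothetical loss: smallest when w == 2.
    return (w - 2.0) ** 2

def grad(w):
    # Derivative of the loss above with respect to w.
    return 2.0 * (w - 2.0)

w = 5.0

# One step with a small learning rate moves w toward the minimum.
small_step = w - 0.1 * grad(w)

# One step with a learning rate that is too large jumps past the minimum
# and ends up with a higher loss than where it started.
large_step = w - 1.5 * grad(w)

print(loss(w), loss(small_step), loss(large_step))
```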

An End-to-End SGD Example

We are at the fun part now. Let’s code the seven steps we discussed in the previous part of the blog. Before jumping into the code, let’s revisit the seven steps:

  • Step 1: Find a way to initialize random weights.
  • Step 2: And for each image, use these weights to predict whether it appears to be a 3 or a 7.
  • Step 3: Based on the above predictions calculate how good the model is. This is where we introduce the term called loss function.
  • Step 4: Calculate the gradient, which plays a crucial role in weight assignment. It will tell us how to change the weights so that our loss would change.
  • Step 5: Step the weights, that is change the weights based on the gradients calculated.
  • Step 6: Go back to Step 2 and repeat the process.
  • Step 7: Iterate until we decide to stop the training process.

Now comes the code!
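The original code cell is not reproduced here, so below is a minimal sketch of one training step written to match the lines discussed in this section. The names `f`, `mse`, and `apply_step`, the quadratic model, and the target curve are assumptions for illustration, not necessarily what the notebook uses.

```python
import torch

def f(x, params):
    # Assumed model: a quadratic whose three coefficients are the weights.
    a, b, c = params
    return a * x**2 + b * x + c

def mse(preds, targets):
    # Mean squared error: small when the predictions are close to the targets.
    return ((preds - targets) ** 2).mean()

def apply_step(params, x, y, lr):
    preds = f(x, params)                  # Step 2: predict
    loss = mse(preds, y)                  # Step 3: calculate the loss
    loss.backward()                       # Step 4: calculate the gradients
    params.data -= lr * params.grad.data  # Step 5: step the weights
    params.grad = None                    # reset so gradients don't add up
    return loss.item()

torch.manual_seed(0)
x = torch.linspace(-1, 1, 20)
y = 3 * x**2 + 2 * x + 1                  # hypothetical target curve

params = torch.randn(3).requires_grad_()  # Step 1: random weights
# Steps 6 and 7: repeat, and stop after a fixed number of iterations.
losses = [apply_step(params, x, y, lr=0.1) for _ in range(10)]
print(losses[0], losses[-1])
```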

As we know, the first 4 steps are very similar to what we have already seen and are straightforward, so I won’t talk about them.

  • -= lr * : this is the update line, where we multiply the learning rate by the gradients and subtract the result from the parameter values. Separately, requires_grad_() is the special method that tells PyTorch we want to calculate gradients with respect to a variable at a given value (for example, xt → variable, 3 → value).
  • params.grad = None : resets the gradients so they won't add up with the previously computed gradients.

Let’s create some dummy data and use our above function for the training.
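Here is one way the dummy data and the training loop could look. The data (a noisy straight line), the learning rate, and the number of epochs are all made up for illustration, and the step is written inline so the snippet runs on its own.

```python
import torch

torch.manual_seed(42)

# Dummy data: points on the line y = 2x + 1 with a little noise added.
x = torch.linspace(0, 1, 50)
y = 2.0 * x + 1.0 + 0.05 * torch.randn(50)

# Step 1: random starting weights (slope and intercept).
params = torch.randn(2).requires_grad_()
lr = 0.5

# Steps 6 and 7: repeat the step for a fixed number of epochs.
for epoch in range(200):
    preds = params[0] * x + params[1]     # Step 2: predict
    loss = ((preds - y) ** 2).mean()      # Step 3: loss
    loss.backward()                       # Step 4: gradients
    params.data -= lr * params.grad.data  # Step 5: step the weights
    params.grad = None                    # reset the gradients

# The learned weights should end up close to the slope 2 and intercept 1.
print(params.data, loss.item())
```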

It’s fine if some of the code doesn’t make sense, because the whole point of this blog is the gradients and the workflow that takes place during training. Like I said before, the notebook contains all the code, so you can execute it sequentially and visualize the results.

Wrapping up with Fastai

Let’s give this blog a final touch by wrapping up with actual data and training a model that recognizes digits. Rather than using the mid-level components of fastai, we will stick strictly with the low-level API and create a model with that.

Also, it’s fine if the code doesn’t make sense; I just want to show how you can use fastai + PyTorch to build models.

Let’s break down the above code,


The Datasets expects:

  • the items we want to use
  • the transforms (how the inputs and outputs should be constructed)
  • the type of split (train and test)

Decoding the dsets:

  • PILImageBW -> creates a black-and-white PIL image (accepts a file path)
  • .create -> takes care of the preprocessing before the data goes into the model. This applies to both X and y, more like a custom implementation for the various inputs.
  • splits itself doesn't do the splitting; we've just created an instance of the splitter object, and passing the items to it later gives us the train and test sets.


We got our filenames converted into images, but for a machine learning model, we have to convert our images into tensors (numerical representation) and make it easy for our model to learn patterns on it.

We need to give ourselves some transforms on the data! These will need to:

  • Ensure our images are all the same size
  • Make sure our output is the tensor our models are wanting
  • Give some image augmentation

# Creating transforms for our data by hand (applied left to right)
tfms = [ToTensor(), CropPad(size=34, pad_mode=PadMode.Zeros), RandomCrop(size=28)]
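To get a feel for what padding to 34 and then randomly cropping back to 28 does to an image, here is a plain PyTorch sketch of the same idea. This is not how fastai implements CropPad and RandomCrop; the random image and the variable names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# A hypothetical single-channel 28x28 image (channels, height, width).
img = torch.rand(1, 28, 28)

# Zero-pad 3 pixels on every side: 28 + 3 + 3 = 34.
padded = F.pad(img, (3, 3, 3, 3), mode="constant", value=0.0)

# Pick a random top-left corner so that a 28x28 window still fits.
top = torch.randint(0, 34 - 28 + 1, (1,)).item()
left = torch.randint(0, 34 - 28 + 1, (1,)).item()
cropped = padded[:, top:top + 28, left:left + 28]

print(padded.shape, cropped.shape)
```

The result has the same size as the input, but the digit is shifted by a few pixels in a random direction, which is a cheap form of image augmentation.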

We need one more thing: the transforms applied on the GPU, or in other words, the transforms applied to every batch.

An important reason for having mini-batches is that they can run on the GPU, so the computations happen much faster. Batching also tends to give more stable gradient estimates than single examples, which helps the training converge.

We have to load our Datasets into a DataLoaders, which batches our data and sends one batch at a time to the model during training.

# Creating the batch transforms
gpu_tfms = [IntToFloatTensor() , Normalize()]
# Building our dataloaders
dls = dsets.dataloaders(bs = 128 , after_item= tfms , after_batch= gpu_tfms)
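If batching feels abstract, this sketch shows the same idea with PyTorch's own DataLoader on random stand-in data. The 1000 fake images and labels are assumptions; only the batch size of 128 matches the code above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-in data: 1000 single-channel 28x28 images with labels 0-9.
images = torch.rand(1000, 1, 28, 28)
labels = torch.randint(0, 10, (1000,))

# The loader hands out the data in shuffled batches of 128.
loader = DataLoader(TensorDataset(images, labels), batch_size=128, shuffle=True)

# Grab one batch: 128 images and their 128 labels.
xb, yb = next(iter(loader))
print(xb.shape, yb.shape, len(loader))
```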

Let’s visualize our images.

Look at that, how beautiful is it? From file paths to actual images, we’ve come a long way!

We’ve reached our goal for this blog, and the next step is creating and fitting the model. This wrapping up section is more like a shoutout to the amazing fastai people; without them this blog wouldn’t have been possible in the first place.

After building and training our model for 3 epochs (3 full passes over the training data), we get around 98% accuracy, which means our model is doing a pretty good job of recognizing the digits.

Our model’s result

It’s advisable to look into the notebook version of this blog to get your hands dirty with the code. Links for the resource are given below. Until then,

Happy Learning!




Ashik Shaffi

Machine Learning Practitioner
