Neural Networks 101 — Part 1
The term neural network hardly needs an introduction, yet only a few people understand what makes a neural net powerful, and many want to learn this extraordinary technology. So what will you read here? Rather than discussing the types and applications of neural networks, I will go through the seven mechanisms that make a neural network powerful and versatile.
When do we call a ‘program’ a machine learning model?
This is an important question to answer before diving in. If we identify what a traditional program lacks and add a few cool tweaks to it, we can call the result a machine learning model.
In traditional programming, we write out the steps the program should run, one after another. A machine learning model is really different: rather than giving it the steps to solve a problem, we show it examples of the problem and let it figure out how to solve it by itself.
We said some tweaks make a traditional program a machine learning model and they are:
- The idea of “weight assignment”.
- The fact that every weight assignment has some “actual performance”.
- The requirement that there be an “automatic means” of testing the performance of the model.
- The need for a “mechanism” for improving the performance by changing the weight assignments.
Alright, the above ideas might seem confusing, but trust me, we will get there. They come from Arthur Samuel, an IBM researcher who began working on machine learning in 1949. I have summarized his description below.
Take a traditional program and give it weight assignments, where every input gets multiplied by a weight. We need a performance metric that tells us whether the current weights are helpful, and if they are not, we need a “mechanism” that automatically updates the weights. This process repeats until we get the desired output. Remember that this is only a short explanation; we will dig into each of these ideas in more detail.
Converting a traditional program into a full-blown machine learning model
Our goal is to create a program that can recognize images of 3s and 7s. To begin with, remember, we use only traditional programming.
How about finding the average value of each pixel across all the 3s, then doing the same for the 7s? This gives us two groups of averages, which we can call an ideal 3 and an ideal 7. We then compare a new image with both ideal images and classify it as whichever digit it is closer to.
So what we do is basically measure the difference between the pixels of an image and the pixels of each ideal image. Coming to the main theme, how do we convert this into a fully functional machine learning model?
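The pixel similarity idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 4-pixel "images" are made-up data, the helper names (`mean_image`, `distance`, `classify`) are hypothetical, and the mean absolute difference is just one reasonable way to measure how far apart two images are.

```python
# Toy pixel-similarity classifier. The 4-pixel "images" below are made-up
# illustrative data, not real digit images.

def mean_image(images):
    """Average each pixel position across a list of equally sized images."""
    n = len(images)
    return [sum(pixels) / n for pixels in zip(*images)]

def distance(a, b):
    """Mean absolute difference between two images, pixel by pixel."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def classify(image, ideal_3, ideal_7):
    """Return 3 or 7, whichever ideal image is closer."""
    return 3 if distance(image, ideal_3) < distance(image, ideal_7) else 7

# Hypothetical training examples
threes = [[0.9, 0.1, 0.8, 0.2], [1.0, 0.0, 0.7, 0.1]]
sevens = [[0.1, 0.9, 0.2, 0.8], [0.0, 1.0, 0.1, 0.9]]

ideal_3 = mean_image(threes)
ideal_7 = mean_image(sevens)

new_image = [0.8, 0.2, 0.9, 0.0]
print(classify(new_image, ideal_3, ideal_7))
```

Notice that nothing in this program can be adjusted to make it better: there are no weights, only fixed averages.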
Our pixel similarity program has no parameters, and it clearly lacks the following things:
- any kind of weight assignment
- any way of improving based on testing the effectiveness of the weight assignment
If we incorporate these tweaks into our pixel similarity program, it can indeed be called a machine learning model. Here is a high-level overview of the conversion:
- We could look at each pixel and come up with a set of weights for each, such that the highest weights are associated with those pixels most likely to be black for a particular category.
- For instance, pixels towards the bottom right aren’t very likely to be activated for a 7, so they should have a low weight for a 7. They are likely to be activated for an 8, so they should have a high weight for an 8.
So far we have been calling this a tweak, but that’s not the ideal name for it. We can call it an optimization function, which handles both assigning the weights and improving them by testing how effective they are. To be more specific, we will look at Stochastic Gradient Descent (SGD), the most common optimization function used in neural networks.
In short, searching for the best weight assignment for the pixels is a way to search for the best function for recognizing 3s and 7s.
Below are the steps we are going to perform to convert our program into a machine learning model:
- Step 1: Find a way to initialize random weights.
- Step 2: For each image, use these weights to predict whether it appears to be a `3` or a `7`.
- Step 3: Based on the above predictions, calculate how good the model is. This is where we introduce the loss function.
- Step 4: Calculate the gradient, which plays a crucial role in weight assignment. For each weight, it tells us how changing that weight would change the loss.
- Step 5: Step the weights, that is, change the weights based on the gradients calculated.
- Step 6: Go back to Step 2 and repeat the process.
- Step 7: Iterate until we decide to stop the training process.
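The seven steps above can be sketched as a small loop in plain Python. Treat this as a rough sketch, not the implementation used later in the series: the 4-pixel images, the linear model, the mean squared error loss, the learning rate of 0.1, and the fixed 200 rounds are all assumptions chosen to keep the example short.

```python
import random

random.seed(0)

# Made-up 4-pixel "images"; label 1.0 means "3", label 0.0 means "7".
images = [[0.9, 0.1, 0.8, 0.2], [1.0, 0.0, 0.7, 0.1],
          [0.1, 0.9, 0.2, 0.8], [0.0, 1.0, 0.1, 0.9]]
labels = [1.0, 1.0, 0.0, 0.0]

def predict(weights, image):
    # Step 2: the weighted sum of the pixels is the prediction
    return sum(w * x for w, x in zip(weights, image))

def loss(weights):
    # Step 3: mean squared error, small when predictions match the labels
    errors = [predict(weights, img) - y for img, y in zip(images, labels)]
    return sum(e * e for e in errors) / len(errors)

def gradients(weights):
    # Step 4: derivative of the loss with respect to each weight
    n = len(images)
    grads = [0.0] * len(weights)
    for img, y in zip(images, labels):
        error = predict(weights, img) - y
        for j, x in enumerate(img):
            grads[j] += 2 * error * x / n
    return grads

# Step 1: random initial weights
weights = [random.uniform(-1, 1) for _ in range(4)]
initial_loss = loss(weights)

learning_rate = 0.1  # assumed step size
for epoch in range(200):  # Steps 6 and 7: repeat, then stop after 200 rounds
    grads = gradients(weights)
    # Step 5: move each weight a little against its gradient
    weights = [w - learning_rate * g for w, g in zip(weights, grads)]

final_loss = loss(weights)
print(initial_loss, final_loss)
```

The second printed number should be smaller than the first, which is the whole point of the loop: each pass nudges the weights towards a lower loss.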
Alright, there is some jargon hidden up there; let’s clear some of it up.
In Step 1, we initialize the parameters (or weights) to random values. Starting with random weights works perfectly well in practice.
In Step 3, the loss function returns a number that is small when the performance of the model is good. The standard convention is to treat a small loss as a good sign and a large loss as a bad one.
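As a small illustration of a loss function, here is mean squared error in plain Python. The choice of mean squared error and the numbers below are assumptions made for this example; other loss functions follow the same "smaller is better" convention.

```python
def mse_loss(predictions, targets):
    """Mean squared error: the average of the squared prediction errors."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

targets = [1.0, 1.0, 0.0, 0.0]            # made-up labels
good_predictions = [0.9, 0.8, 0.1, 0.2]   # close to the targets
bad_predictions = [0.2, 0.1, 0.9, 0.7]    # far from the targets

# The good predictions produce the smaller loss
print(mse_loss(good_predictions, targets))
print(mse_loss(bad_predictions, targets))
```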
In Step 5, a simple way to figure out whether a weight should be increased or decreased a bit is to try increasing it by a small amount and observe whether the loss goes up or down. We would repeat this increment and decrement until we find an amount that satisfies us.
However, doing this for every weight is slow, so we use calculus instead. Calculating gradients tells us in which direction, and roughly by how much, to change each weight, without trying out those small adjustments one by one. This is just a performance optimization: we would get the same result with the slower manual process.
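To make the comparison concrete, here is a sketch of both approaches for a model with a single weight. The one-weight model, the example values (`x = 2.0`, `y = 6.0`), and the function names are hypothetical and exist only for this illustration.

```python
def loss(w):
    """Squared error of a one-weight model on one made-up example."""
    x, y = 2.0, 6.0  # hypothetical input and target
    return (w * x - y) ** 2

def nudge_slope(w, step=0.001):
    """Try a small increase in w and see how the loss responds."""
    return (loss(w + step) - loss(w)) / step

def analytic_slope(w):
    """Derivative of (w*x - y)**2 with respect to w, from calculus."""
    x, y = 2.0, 6.0
    return 2 * (w * x - y) * x

w = 1.0
numeric = nudge_slope(w)
exact = analytic_slope(w)
# A negative slope means that increasing w lowers the loss
direction = "increase" if exact < 0 else "decrease"
print(numeric, exact, direction)
```

Both approaches agree on the direction and roughly on the size of the change, but the nudging approach needs an extra loss evaluation for every weight, while the calculus version gives the answer directly.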
In Step 7, we choose how many epochs to train the model for. We would keep training until the accuracy of the model starts getting worse or we run out of time.
What are gradients all about?
We have been talking about performance optimization for so long, but what does it even mean?
Machine learning is all about math; by leveraging the power of math, we can automate some processes more efficiently. In SGD, we use calculus to speed up our optimization step.
Calculus helps us quickly calculate whether our loss will go up or down as we adjust the parameters of our model. The gradient is the key quantity that helps us find the best parameters for our model.
In simple words, gradients tell us how much we have to change each weight to make our model better. Here is the most general definition of a gradient:
A gradient is defined as rise/run, that is, the change in the value of the function divided by the change in the value of the parameter.
When we talk about calculus, we shouldn't forget the derivative, which measures how a function changes. For instance, the derivative of a quadratic function at the value 3 tells us how rapidly the function changes at 3. In short, a derivative is the rate of change of a function.
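Here is a short sketch of rise/run for a quadratic, assuming the function is `f(x) = x ** 2` (the text does not name a specific quadratic, so this choice is mine). Calculus gives its derivative as `2 * x`, so the exact rate of change at 3 is 6.

```python
def f(x):
    """A simple quadratic function, chosen for illustration."""
    return x ** 2

def slope(func, x, run):
    """Rise over run: change in the function divided by the change in x."""
    rise = func(x + run) - func(x)
    return rise / run

# Shrinking the run brings the slope closer to the exact derivative at 3
estimates = [slope(f, 3.0, run) for run in (1.0, 0.1, 0.001)]
print(estimates)
```

As the run gets smaller, the estimates approach 6, which is what the derivative reports directly.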
The idea here is that when we know how our loss function will change, we know what to do to make it smaller. The key mechanism in machine learning is a way to change the parameters of a function so that the loss gets smaller, and gradients give us that mechanism.
Our function has not one but many weights (parameters), so when we calculate the derivative we get not a single number but a gradient for every weight (parameter).
The process of calculating the gradients for a model is also known as backpropagation.
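A minimal sketch of "one gradient per weight", assuming a two-weight linear model with a squared error loss on a single made-up example. The gradients are worked out by hand with the chain rule here; a framework such as PyTorch computes them automatically through backpropagation.

```python
def predict(weights, inputs):
    """Weighted sum of the inputs."""
    return sum(w * x for w, x in zip(weights, inputs))

def loss(weights, inputs, target):
    """Squared error between the prediction and the target."""
    return (predict(weights, inputs) - target) ** 2

def gradient(weights, inputs, target):
    """One partial derivative of the loss per weight (chain rule by hand)."""
    error = predict(weights, inputs) - target
    return [2 * error * x for x in inputs]

inputs, target = [1.0, 2.0], 3.0  # made-up example
weights = [0.5, 0.5]

grads = gradient(weights, inputs, target)
learning_rate = 0.05  # assumed step size
new_weights = [w - learning_rate * g for w, g in zip(weights, grads)]

print(grads)
print(loss(weights, inputs, target), loss(new_weights, inputs, target))
```

There are two weights, so there are two gradients, and stepping each weight against its own gradient lowers the loss.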
We’ve covered a fair bit of jargon so far. In the next post of this series, we will jump right into some code and implement the above seven steps in PyTorch.
Now you can read Part 2 of this blog: Neural Networks 101 Part 2.
Special thanks to Fastai and the community for their great course, without it I couldn’t have written this.