In this article we will cover the following topics:
- Bias and weights in a neuron.
- The basic difference between the step and sigmoid functions.
- Multilayer perceptron.
- Forward and backward propagation.
- Gradient descent.
So without further delay, let’s dive in and understand the concepts behind these neural network terms.
“A neuron with a step function as its activation function is a perceptron.” Let us understand this with the following example.
INTRODUCTION: WHAT IS A NEURON?
Let us start with a simple example of a classification problem: loan prediction. Our aim is to decide whether or not to approve a loan based on the person’s salary. To do that, we have to build a model that takes salary as input and predicts whether the loan should be approved.
Suppose your bank wants to reduce the risk of loan default and hence decides to offer loans only to people whose salary is greater than 50,000 per month.
The tasks for our model are as follows:
1. Take salary as input.
2. Check whether the salary is > 50,000.
3. Output a “Yes” only if the condition is true.
The above steps resemble what happens in a biological neuron, which takes input through its dendrites, processes it, and produces an output. Thus, the model we are talking about can also be called a neuron.
BIAS & WEIGHTS CONCEPT:
WHAT IS BIAS?
Coming back to our loan example, let’s look at the inputs more closely. In general, the applicant’s salary, the father’s salary, and the spouse’s salary (call them X1, X2 and X3) can all be deciding factors in approving the loan. Our neuron takes all these features as inputs and makes a decision, much like the multiple dendrites we saw in the biological neuron. We can sum up all the incomes and check whether the total crosses the threshold (benchmark) set by the bank, that is, whether X1 + X2 + X3 > threshold. If it does, then based on previous history we can approve the loan.
Now if we bring the threshold to the left-hand side of the inequality, it looks like this:
X1 + X2 + X3 - threshold > 0
If we replace -threshold with a new term called bias, the inequality becomes a sum of 4 quantities, where the bias is the negative of the threshold:
X1 + X2 + X3 + bias > 0
Here the bias is set arbitrarily by us, but in practice it is learned from the underlying data. If the sum of the inputs exceeds the magnitude of the bias, we want the neuron to output ‘Yes’. This event is known as the firing of the neuron.
Written as equations, the relationship above is:
If X1 + X2 + X3 + bias > 0 then output should be 1
If X1 + X2 + X3 + bias <= 0 then output should be 0
Consider the equation Z = X1 + X2 + X3 + bias. The output is 1 if Z > 0, else 0. This is called a Step Function, i.e. step(Z). The step function maps the raw value Z of our neuron to an output of either 0 or 1. In deep learning we can choose among several such functions to apply to the output of the neuron. These are called Activation Functions.
So, a neuron that uses the step function as its activation function is called a PERCEPTRON. Hopefully the definition of a perceptron is now clear.
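The perceptron described above can be sketched in a few lines of Python. This is only a minimal illustration: the function names `step` and `perceptron` and the salary figures are made up for the example, and the bias is set to -50,000 on the assumption that the bank's threshold on total income is 50,000.

```python
def step(z):
    """Step activation: 1 if z is strictly positive, otherwise 0."""
    return 1 if z > 0 else 0


def perceptron(x1, x2, x3, bias):
    """Sum the three incomes, add the bias, and apply the step function."""
    z = x1 + x2 + x3 + bias
    return step(z)


# Assumed threshold of 50,000, so bias = -threshold = -50,000.
bias = -50_000

# Total income 55,000 exceeds the threshold, so the neuron fires.
print(perceptron(30_000, 15_000, 10_000, bias))

# Total income 45,000 is below the threshold, so the neuron does not fire.
print(perceptron(20_000, 15_000, 10_000, bias))
```

Note that a total income of exactly 50,000 gives Z = 0, which the step function maps to 0, matching the "<= 0" case above.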
WHAT ARE WEIGHTS IN A PERCEPTRON?
In the previous example we saw how the neuron takes the sum of all inputs along with the bias and decides whether the loan should be approved. This means that each feature was given the same importance. We can also denote this in the following manner, where the 1’s on the input connections represent the weights given to the input features.
Let’s take the following example to understand this better. To calculate the output Z in this case, we multiply each input feature by its corresponding weight and finally add the bias.
As we see in the above example, all input features are given equal weights. Ideally, the applicant’s salary plays a vital role, so it should be given more weight. Considering that scenario, we can look at the example below, where the applicant’s salary has been given more weight than the father’s and spouse’s salaries. We calculate the output Z and check whether it is greater than 0.
In the above example, since the output Z was greater than 0, the neuron fires and the loan is likely to be approved.
Now that you have an intuition for how a neuron works, let’s formalize it through math. First, we take the inputs and multiply them by their weights. Second, we add the bias to the sum of these products. The final value is denoted Z. Once we have Z, we pass it through the step function, which converts it to 1 if Z is positive and 0 otherwise; this becomes the final output of the neuron.
We should note that until now we were choosing the weights and bias arbitrarily, but in practice these are learned during the training process.
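To make the effect of weights concrete, here is a small sketch of the weighted sum followed by the step function. All numbers are hypothetical and are not taken from the article's figures: the incomes, the bias of -80,000, and the weights (2.0, 0.5, 0.5) are chosen only to show how giving the applicant's salary more weight can flip the decision.

```python
def step(z):
    """Step activation: 1 if z is strictly positive, otherwise 0."""
    return 1 if z > 0 else 0


def neuron(inputs, weights, bias):
    """Return Z = sum(weight * input) + bias together with step(Z)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return z, step(z)


# Hypothetical applicant, father, and spouse salaries.
inputs = [40_000, 20_000, 10_000]
bias = -80_000  # assumed value, chosen only for illustration

# Equal weights: every income counts the same.
z_equal, out_equal = neuron(inputs, [1.0, 1.0, 1.0], bias)

# Unequal weights: the applicant's salary counts more than the others.
z_weighted, out_weighted = neuron(inputs, [2.0, 0.5, 0.5], bias)

print(z_equal, out_equal)        # Z is negative, so the output is 0
print(z_weighted, out_weighted)  # Z is positive, so the output is 1
```

With the same inputs and the same bias, only the weights differ between the two calls, which is exactly the knob the training process will later adjust.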
DIFFERENCE BETWEEN STEP & SIGMOID FUNCTION:
In the step activation function the output is either 0 or 1, whereas the sigmoid function, as its graph shows, is a continuous function that can return any value between 0 and 1 (0.2, 0.7, 0.6, etc.). These values can be treated as probabilities: the higher the probability, the higher the chance that the loan will be approved.
If we change the activation function of the perceptron from step to sigmoid, we get the logistic regression model.
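The difference between the two activations can be seen by evaluating both on the same values of Z. The sketch below uses the standard sigmoid formula 1 / (1 + e^(-Z)), which the text does not spell out; the sample Z values are arbitrary and kept small.

```python
import math


def step(z):
    """Step activation: jumps from 0 to 1 as z becomes positive."""
    return 1 if z > 0 else 0


def sigmoid(z):
    """Sigmoid activation: a smooth value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Compare the two activations on a few sample values of Z.
for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(z, step(z), round(sigmoid(z), 3))
```

The step output only ever shows 0 or 1, while the sigmoid output rises gradually and equals 0.5 at Z = 0, which is why it can be read as a probability.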
MULTILAYER PERCEPTRON:
Instead of a single neuron, suppose we have two neurons, each of which assigns its own weights to our input features. Each neuron also has its own bias. Let us take an example for better understanding.
In the above example we see that the 1st neuron gives more weight to the applicant’s income, whereas the 2nd neuron gives it less importance and focuses more on the family’s income by giving high weights to the father’s and spouse’s salaries. In other words, these neurons are creating new features (call them F1 and F2) from the existing features in the data. This happens throughout the training process. Remember that in classical ML we had to create our own features; here the DL model creates features on its own. Once we have F1 and F2, we apply the activation function to these results. To calculate the final output, we add one more neuron, which also has its own weights and bias. This is called a MULTILAYER PERCEPTRON.
Consider the following architecture:
We have 2 layers. Layer 1 is called the HIDDEN LAYER, which performs the calculations that generate features, and Layer 2 is called the OUTPUT LAYER. The number of neurons in the output layer depends on the number of classes in the target. Since we are focusing on loan prediction, which is a binary classification problem, we can have a single neuron in the final layer.
We can also have more INPUT FEATURES, which can help us generate more features. We can increase the size of the hidden layer so that it extracts more features from the existing data. These features can in turn be used to extract further new features; to do so, we add more layers. In the example below we have 2 hidden layers.
Link: TensorFlow Playground (you can visit TensorFlow Playground to visualize a neural network and play with its values)
SOME BACKGROUND: TENSORFLOW PLAYGROUND
A neural network works differently when identifying decision boundaries. Initially, the weights and biases are randomly initialized. In TensorFlow Playground, X1 and X2 represent the input features: X1 on its own gives a vertical decision boundary and X2 on its own gives a horizontal one. If we consider only X1 and X2, we arrive at a naive decision boundary that will not fit the data the way we want.
The neural network takes the vertical and horizontal decision boundaries, along with the diagonal decision boundaries made by the neurons in the hidden layers, combines all of them, and makes a final decision.
FORWARD & BACKWARD PROPAGATION: INTUITION
In a neural network, the weights of the layers are constantly updated to improve the predictions. Let us understand this with a simple example.
Let us assume we have 3 features: X1, X2 and X3. These features are sent to the first hidden layer, and the connections between the inputs and the hidden layer are assigned some weights and biases. Based on these weights and biases we get some outputs. Previously, in the perceptron, we denoted the output as Z, but since we now have more neurons we write Z11 for the output of the 1st neuron in the 1st layer. We then apply the activation function and denote the result H11. The same is done for the 2nd neuron, which gives Z12 and H12. The outputs (H11 and H12) are then sent to the next layer. Again, some weights and a bias are randomly initialized, the same process is repeated, and Z21 is calculated. Finally, we apply the activation function to generate the final output, denoted ‘O’. This complete process is called FORWARD PROPAGATION.
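The forward pass just described can be traced step by step in code. The sketch below follows the same naming (Z11, H11, Z12, H12, Z21, O) and uses sigmoid as the activation. The input values and all weights and biases are hypothetical; the inputs are assumed to be already scaled to small numbers rather than raw salaries.

```python
import math


def sigmoid(z):
    """Sigmoid activation, squashing z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def weighted_sum(inputs, weights, bias):
    """Z = w1*x1 + w2*x2 + ... + bias for a single neuron."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: 3 inputs -> 2 hidden neurons -> 1 output neuron."""
    # Layer 1: one Z and one H per hidden neuron.
    z11 = weighted_sum(x, hidden_weights[0], hidden_biases[0])
    z12 = weighted_sum(x, hidden_weights[1], hidden_biases[1])
    h11, h12 = sigmoid(z11), sigmoid(z12)
    # Layer 2: the hidden outputs become the inputs of the output neuron.
    z21 = weighted_sum([h11, h12], output_weights, output_bias)
    o = sigmoid(z21)
    return {"Z11": z11, "Z12": z12, "H11": h11, "H12": h12, "Z21": z21, "O": o}


# Hypothetical, already-scaled inputs and arbitrary starting parameters.
x = [0.6, 0.3, 0.1]
hidden_weights = [[0.9, 0.1, 0.1], [0.1, 0.8, 0.8]]
hidden_biases = [0.0, 0.0]
output_weights = [1.0, -1.0]
output_bias = 0.0

result = forward(x, hidden_weights, hidden_biases, output_weights, output_bias)
for name, value in result.items():
    print(name, round(value, 4))
```

Each intermediate value is returned so that the chain from inputs to O can be inspected; a real implementation would typically use matrix operations instead of per-neuron calls.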
The output ‘O’ that we obtained may or may not be correct. Let’s say the actual value was 1 (Loan Approved) but our model predicted 0.3 (Loan Not Approved). The prediction is incorrect, so we need to measure the error, e.g. with MSE (Mean Squared Error), which is based on the squared difference between the actual and predicted values. When this error is large, it implies that the prediction is far from the actual value. This is the error for a single observation. We calculate this error over all the observations, take the mean, and call the result the COST FUNCTION / LOSS FUNCTION / ERROR FUNCTION.
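As a quick illustration of the error calculation, the sketch below computes the squared error for the single observation mentioned above (actual 1, predicted 0.3) and then the mean over a few observations. The extra actual and predicted values in the lists are made up for the example.

```python
def squared_error(actual, predicted):
    """Squared difference between actual and predicted value for one observation."""
    return (actual - predicted) ** 2


def mean_squared_error(actuals, predictions):
    """Average of the squared errors over all observations (the cost function)."""
    errors = [squared_error(y, o) for y, o in zip(actuals, predictions)]
    return sum(errors) / len(errors)


# Single observation from the text: actual 1, predicted 0.3.
print(squared_error(1, 0.3))  # (1 - 0.3)^2, about 0.49

# Hypothetical batch of observations.
actuals = [1, 0, 1, 0]
predictions = [0.3, 0.2, 0.9, 0.6]
print(mean_squared_error(actuals, predictions))
```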
Our task is to minimize this error function. The error clearly depends on the actual value Y and the predicted output O. We cannot change Y, as these are the actual target values, so we are left with changing O. The value of O depends on several factors: the input features, the weights, the biases, and the activation function. Of these 4 factors, we cannot change the input features, as they are fixed, and the activation function has to be defined before training starts. Choices like the activation function, which we fix before the start of the training process, are called hyperparameters. The only things we can change during training are the weights and biases, which are known as parameters. So to reduce the error we update the weights and biases of the network and run the forward pass again with the updated values, exactly as we saw in TensorFlow Playground. This process of updating the weights in a neural network is called BACKWARD PROPAGATION.
One question that arises is how we should update the weights and biases to improve the model’s predictions. To do that we introduce GRADIENT DESCENT.
To see how we update the weights, let us first understand the methodology. One approach would be to update the weights and biases manually and plot the cost function against the weight or bias values. For simplicity, in the figure below we plot the cost against a single weight and omit the rest of the weights and biases. There are also 2 important things we must consider when choosing the error or cost function: first, it should be continuous, and second, it should be differentiable at each point.
Suppose we initialize our neural network with some weights, perform forward propagation, and get the error highlighted in figure 1 below. Our aim is to reach the minimum point. Now suppose we randomly update the weight value and get the new error shown in figure 2. We see that the error has been reduced. Suppose we update the weight again and this time see an increase in the cost function, as in figure 3. With random updates, reaching the lowest error is purely a matter of chance. What should the right approach be?
Considering the figure below, we need to decide 2 things to reach the lowest error:
1) Direction of movement.
2) Magnitude of the movement.
In our case, we need to move towards the left, and we need to know by how much. This is where GRADIENT DESCENT comes into the picture.
The gradient descent equation is given as: w = w - alpha * dE/dw.
dE/dw -> rate of change of the error with respect to the weight.
The sign of the partial derivative tells us in which direction we should move so that the error is minimized. This is why we need a continuous and differentiable cost function: we calculate its partial derivative to update the weights and bias. The magnitude of the partial derivative tells us how much we should move in that direction.
alpha -> learning rate.
This controls the magnitude of the parameter update. Whether we update the value of w by 0.1 times dE/dw or by 10 times dE/dw depends on the value of alpha that we set. If we set the learning rate too high, the error will fluctuate a lot and we might never end up at the minimum. If we set alpha too low, the updates will be very slow and we will need more iterations to reach the minimum. So choosing the right value of alpha is crucial when defining your model. It is often set to a value between 0.001 and 0.01.
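To see the update rule and the role of alpha in action, the sketch below applies a single update to one weight. It assumes a toy error function E(w) = (w - 3)^2, whose minimum is at w = 3 and whose derivative is 2 * (w - 3); this function and the alpha values tried are illustrative choices, not part of the loan model.

```python
def error(w):
    """Toy cost function with its minimum at w = 3 (assumed for illustration)."""
    return (w - 3.0) ** 2


def gradient(w):
    """dE/dw for the toy cost function above."""
    return 2.0 * (w - 3.0)


def update(w, alpha):
    """One gradient descent step: w = w - alpha * dE/dw."""
    return w - alpha * gradient(w)


w = 5.0  # starting weight, to the right of the minimum, so the slope is positive

for alpha in (0.01, 0.1, 1.1):
    new_w = update(w, alpha)
    print(alpha, new_w, error(new_w))
```

Starting from w = 5 the slope is positive, so every update moves w to the left. A small alpha barely moves the weight, a moderate alpha reduces the error noticeably, and the deliberately large alpha of 1.1 jumps past the minimum and ends with a higher error than it started with.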
STEPS TO PERFORM GRADIENT DESCENT.
1) Take the current values of w and b.
2) Take a step in the steepest downhill direction.
3) Repeat the previous step until the minimum is reached.
At the initial value of w, we can observe the value of the cost function as shown below and calculate the partial derivative at that point. The slope turns out to be positive, so the updated weight is the previous weight minus a positive value, since alpha and dE/dw are both positive. This means the weight is reduced and we move to the left (if the slope is positive, we reduce the value of the weight). This is how gradient descent minimizes the cost function by updating the parameters.
We can also stop updating the parameters (weights) once a set number of iterations is reached. For example, if the number of epochs is set to 100, we stop the gradient descent algorithm after 100 epochs even if the minimum loss has not been achieved. We can also stop once the error has stopped changing.
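Both stopping conditions can be combined in one loop. The following is a minimal sketch under stated assumptions: it reuses a toy error function E(w) = (w - 3)^2 for a single weight, and the names `gradient_descent`, `max_epochs` and `tolerance`, as well as the tolerance value, are hypothetical choices for the example.

```python
def gradient_descent(error, gradient, w, alpha=0.1, max_epochs=100, tolerance=1e-6):
    """Repeat w = w - alpha * dE/dw until the error stops changing or epochs run out."""
    previous_error = error(w)
    epochs_run = 0
    for _ in range(max_epochs):
        w = w - alpha * gradient(w)
        epochs_run += 1
        current_error = error(w)
        # Stop early once the error has practically stopped changing.
        if abs(previous_error - current_error) < tolerance:
            break
        previous_error = current_error
    return w, epochs_run


def error(w):
    """Toy cost function with its minimum at w = 3 (assumed for illustration)."""
    return (w - 3.0) ** 2


def gradient(w):
    """dE/dw for the toy cost function above."""
    return 2.0 * (w - 3.0)


# Runs until the error stops changing, well before the 100-epoch limit.
w, epochs = gradient_descent(error, gradient, w=5.0)
print(w, epochs)

# With a small epoch budget the loop stops at the limit instead.
w_short, epochs_short = gradient_descent(error, gradient, w=5.0, max_epochs=5)
print(w_short, epochs_short)
```

In the first call the loop ends because the error change falls below the tolerance; in the second it ends because the epoch budget is used up, even though the minimum has not been reached.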
Here is a quick recap of what we have done so far. We had a set of inputs. These inputs were sent to the hidden layer, where we calculated the Z value: the sum of the inputs multiplied by the weights, plus the bias. We then used an activation function, in our case sigmoid, to calculate the output H. The results were then sent to the next layer, which had its own weights and bias, and there we calculated the output O.
Hope you liked it. If so, please give a thumbs up :)
Do connect with me on LinkedIn : https://www.linkedin.com/in/gaurav-rajpal/
Stay tuned for further updates on Activation Function / Optimizers and demo projects on Deep Learning.
Gaurav Rajpal (email@example.com)