Linear Regression using Gradient Descent Algorithm
Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model.
When there is more than one input, you can optimize the values of the coefficients by iteratively minimizing the error of the model on your training data. This procedure is called Gradient Descent. It works by starting with random values for each coefficient, and the sum of squared errors is then calculated for each pair of input and output values.
A learning rate is used as a scale factor, and the coefficients are updated in the direction that reduces the error. The process is repeated until a minimum sum of squared errors is achieved or no further improvement is possible.
When using this method, the learning rate (alpha) determines the size of the improvement step taken on each iteration of the procedure. In practice, Gradient Descent is useful when the dataset is large, either in the number of rows or the number of columns.
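In symbols (standard gradient descent notation, not taken from this article's figures), each coefficient θ is nudged a small step against the gradient of the cost J:

$$\theta \;\leftarrow\; \theta - \alpha\,\frac{\partial J}{\partial \theta}$$

where α is the learning rate.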
In short, it is a minimization algorithm meant for minimizing a given cost (loss) function.
To understand this in simple terms, let us see an example of how it works…
Imagine a valley and a person with no sense of direction who wants to get to the bottom of the valley. He goes down the slope, taking large steps when the slope is steep and small steps when it is less steep. He decides his next position based on his current position and stops when he reaches the bottom of the valley, which was his goal.
Coming to our topic, the Gradient Descent algorithm determines the values of m and c such that the line corresponding to those values is the best-fitting line, i.e. the one that gives the minimum error.
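The line being fitted is the usual straight-line model (standard notation, assumed here since no formula appears in the text):

$$\hat{y} = m\,x + c$$

where x is the input, ŷ is the predicted output, m is the slope and c is the y-intercept.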
First we need to define the loss function. The loss function measures the error of the predictions made with the current values of m and c. Our aim is to minimize this error to obtain the most accurate values of m and c. To calculate the loss, the Mean Squared Error function is used (written out in full after the list):
· The difference between the actual and predicted y value is found.
· The difference is squared.
· Then, the mean of the squares over all values of x is calculated.
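Putting those three steps together, the Mean Squared Error (in standard notation) is:

$$E = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2 = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (m x_i + c)\bigr)^2$$

where n is the number of training points and y_i, ŷ_i are the actual and predicted values.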
Now, we have to find the values of “m” and “c” that minimize this loss. To do that, the Gradient Descent Algorithm is used, step by step:
1. Initially, let m = 0, c = 0
Let L be the learning rate, which controls how much the values of “m” and “c” change with each step. The smaller L is, the more stable the updates, at the cost of slower convergence; a small value such as L = 0.001 typically gives good accuracy.
2. Calculate the partial derivative of the loss function with respect to “m”, and plug the current values of x, y, m and c into it to obtain the derivative value D_m (see the formulas after the list).
3. Similarly, calculate the partial derivative with respect to c to obtain D_c.
4. Now, update the current values of m and c using D_m, D_c and the learning rate L.
5. Repeat this process until the loss function is very small, i.e. ideally 0% error (100% accuracy).
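With the MSE loss defined above, the standard partial derivatives and update rules used in steps 2 to 4 are (D_m and D_c are the derivative values, L the learning rate):

$$D_m = \frac{\partial E}{\partial m} = -\frac{2}{n}\sum_{i=1}^{n} x_i\,\bigl(y_i - \hat{y}_i\bigr), \qquad D_c = \frac{\partial E}{\partial c} = -\frac{2}{n}\sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)$$

$$m \;\leftarrow\; m - L \cdot D_m, \qquad c \;\leftarrow\; c - L \cdot D_c$$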
Going back to the analogy: m is the current position, D is the steepness of the slope, and L is the speed at which the person moves (the learning rate). Every new value of “m” that is calculated is the next position.
When the learning rate is too high, the steps overshoot and the process becomes haphazard, and it may never settle at the bottom. A smaller learning rate gives a smoother, more reliable descent, at the cost of more iterations.
After repeating this process many times, the person arrives at the bottom of the valley. With the optimum values of m and c, the model is ready.
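To make the whole procedure concrete, here is a minimal sketch in Python of steps 1 to 5 with the MSE loss. It is an illustrative sketch, not the article's own code; the function name, hyperparameter values and synthetic data are assumptions.

```python
import numpy as np

def gradient_descent(x, y, lr=0.001, epochs=20000):
    """Fit y ~ m*x + c by gradient descent on the MSE loss (illustrative sketch)."""
    m, c = 0.0, 0.0                                   # step 1: start with m = 0, c = 0
    n = len(x)
    for _ in range(epochs):
        y_pred = m * x + c                            # predictions with current m, c
        d_m = (-2.0 / n) * np.sum(x * (y - y_pred))   # step 2: partial derivative wrt m
        d_c = (-2.0 / n) * np.sum(y - y_pred)         # step 3: partial derivative wrt c
        m -= lr * d_m                                 # step 4: update m
        c -= lr * d_c                                 # step 4: update c
    return m, c                                       # step 5: stop after many iterations

# Illustrative usage on synthetic data (assumed, not from the article): y = 3x + 2 + noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 3 * x + 2 + rng.normal(scale=0.5, size=x.shape)
m, c = gradient_descent(x, y)
print(f"m = {m:.3f}, c = {c:.3f}")
```

Running this recovers values close to the slope 3 and intercept 2 used to generate the toy data, which is the sense in which the fitted line is the “best-fitting” one.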
NOTE: In the case of a linear regression model with MSE loss, there is only one minimum and it is the global minimum. Hence gradient descent is guaranteed to find the required coefficients for the line.
For loss surfaces with multiple minima, however (as in the example figure above), the local minimum reached depends on the initial coefficients chosen. There, points A and B are termed local minima and point C is the global minimum.