Comparison of Algorithms for Linear Regression
To train a regression model we need to find the optimum θ, i.e. the value of θ that minimizes our MSE (mean squared error).
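For reference, here is the cost function being minimized, written out for linear regression with m training instances (my own addition in standard notation, not a formula from the original post):

```latex
\mathrm{MSE}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta^{\mathsf{T}} \mathbf{x}^{(i)} - y^{(i)} \right)^{2}
```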
Note: This blog simply puts together information from various resources.
This θ can be found in four ways:
1. Finding θ using the normal equation, which is essentially what the Scikit-Learn library’s LinearRegression class does.

Although the computational complexity of this algorithm with respect to the number of instances is O(m), where m is the number of instances, its complexity with respect to the number of features is O(n³). This means that if the number of features doubles, the computation time increases by roughly a factor of 8, so this algorithm is best suited to problems with a relatively small number of features.
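As a rough illustration (my own sketch, not from the original post), the normal equation θ̂ = (XᵀX)⁻¹Xᵀy can be computed directly with NumPy and compared against Scikit-Learn’s LinearRegression; the synthetic data below is a made-up assumption, used only so the example runs end to end.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (made up for illustration): y = 4 + 3x + noise
np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(m)

# Normal equation: theta_hat = (X_b^T X_b)^(-1) X_b^T y
X_b = np.c_[np.ones((m, 1)), X]            # add bias column x0 = 1
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print("Normal equation:", theta_hat)       # roughly [4, 3]

# Scikit-Learn's LinearRegression recovers the same least-squares solution
lin_reg = LinearRegression()
lin_reg.fit(X, y)
print("Scikit-Learn:", lin_reg.intercept_, lin_reg.coef_)
```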
2. Finding θ using gradient descent / batch gradient descent.

Batch gradient descent (or simply gradient descent) computes the gradient of the cost function with respect to the parameters θ over the entire training dataset, and then updates the parameters as θ ← θ − η · ∇MSE(θ), where η (eta) is the learning rate.
If the learning rate is too small, it may take a long time to reach the optimum solution, but if the learning rate is too high the algorithm diverges and moves further away from the optimum solution. Hence we need a good value of η, which can be chosen using grid search or a cross-validation set.
Gradient Descent scales well with the number of features; training a Linear Regression model when there are hundreds of thousands of features is much faster using Gradient Descent than using the Normal Equation.
Please refer to this link for a better understanding: http://ruder.io/optimizing-gradient-descent/index.html#gradientdescentvariants
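Here is a minimal batch gradient descent sketch (again my own illustration; the synthetic data, the learning rate eta = 0.1 and the iteration count are arbitrary assumptions, not values from the original post):

```python
import numpy as np

# Synthetic data (made up for illustration): y = 4 + 3x + noise
np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(m)
X_b = np.c_[np.ones((m, 1)), X]            # add bias column x0 = 1

eta = 0.1                                   # learning rate (arbitrary choice)
n_iterations = 1000

theta = np.random.randn(2)                  # random initialization
for iteration in range(n_iterations):
    # Gradient of the MSE computed over the full training set
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

print("Batch GD:", theta)                   # close to the normal-equation solution
```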
3. Finding θ using stochastic gradient descent.
The main problem with batch gradient descent is that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. Stochastic gradient descent, on the other hand, just picks a random instance from the training set at every step and computes the gradients based only on that single instance. This makes stochastic gradient descent much faster, since it only considers a single instance at every iteration.
Hence, it is better to use stochastic gradient descent for huge training sets.
On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than batch gradient descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down. So once the algorithm stops, the final parameter values are good, but not optimal. This is one motivation for the mini-batch gradient descent method.
Please refer to this link for a better understanding: http://ruder.io/optimizing-gradient-descent/index.html#gradientdescentvariants
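A minimal stochastic gradient descent sketch along the same lines (my own illustration; the simple decaying learning-rate schedule and its hyperparameters are assumptions, not something prescribed in the original post):

```python
import numpy as np

# Same kind of synthetic data as above (made up for illustration)
np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(m)
X_b = np.c_[np.ones((m, 1)), X]             # add bias column x0 = 1

n_epochs = 50
t0, t1 = 5, 50                              # learning-schedule hyperparameters (arbitrary)

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2)                   # random initialization
for epoch in range(n_epochs):
    for i in range(m):
        index = np.random.randint(m)         # pick one random instance
        xi = X_b[index:index + 1]            # shape (1, 2)
        yi = y[index:index + 1]              # shape (1,)
        gradients = 2 * xi.T @ (xi @ theta - yi)
        theta = theta - learning_schedule(epoch * m + i) * gradients

print("SGD:", theta)                         # bounces around the optimum, ends up close to it
```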
4. Finding θ using the mini-batch gradient descent method.
Instead of using a single instance or the entire dataset to update θ, we compute the gradients on small random sets of instances called mini-batches.
Please refer to this link for a better understanding: http://ruder.io/optimizing-gradient-descent/index.html#gradientdescentvariants
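And a matching mini-batch gradient descent sketch (my own illustration; the batch size of 20 and the constant learning rate are arbitrary assumptions):

```python
import numpy as np

# Same kind of synthetic data as above (made up for illustration)
np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(m)
X_b = np.c_[np.ones((m, 1)), X]             # add bias column x0 = 1

n_epochs = 50
batch_size = 20                             # arbitrary choice
eta = 0.1                                   # learning rate (arbitrary choice)

theta = np.random.randn(2)                  # random initialization
for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)     # reshuffle the training set each epoch
    for start in range(0, m, batch_size):
        batch = shuffled[start:start + batch_size]
        xi, yi = X_b[batch], y[batch]
        # Gradient of the MSE computed on this mini-batch only
        gradients = (2 / len(batch)) * xi.T @ (xi @ theta - yi)
        theta = theta - eta * gradients

print("Mini-batch GD:", theta)
```

With a moderate batch size, each update is still cheap, but the gradient estimate is less noisy than in plain stochastic gradient descent, which is why the parameters settle closer to the optimum.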

Summary:
- The computational complexity of the normal equation is very high when the number of features is large.
- Though batch gradient descent gives us the optimum solution, it iterates over the entire dataset at every step and hence takes a lot of time to reach the optimum value.
- Though stochastic gradient descent is much faster than batch gradient descent, it only gives us answers close to the optimum value, not the optimum itself.
- Mini-batch gradient descent has the advantages of both batch and stochastic gradient descent and gives answers that are much closer to the optimum value than those given by stochastic gradient descent.


References:
Book: Hands-On Machine Learning with Scikit-Learn and TensorFlow, Aurélien Géron.
