Linear Regression In Depth

Deeksha Goplani
5 min read · Aug 24, 2022


Linear Regression is one of the fundamental algorithms of machine learning. In this blog, I try to cover this algorithm, along with gradient descent, in depth.

What is Regression?

Regression is the problem of predicting a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.). To train the system, you need data for both the features and the labels (prices).

Types of Regression Algorithms

Linear Regression is a popular regression learning algorithm that learns a model which is a linear combination of the features of the input example. Some regression algorithms can also be used for classification; Logistic Regression, for example, outputs the probability of belonging to a given class.

Linear Regression

The first step in any machine learning problem is model selection. Say that, in your data, you see a trend where the price of a car goes up as its mileage (considering one attribute for now) increases. So you decide to model the price of a car as a linear function of mileage.

The second step is building, or training, the model. We want to build a model y = f_w,b(x) that is a linear combination of the features of example x:

f_w,b(x) = w · x + b

where x is the D-dimensional feature vector of example i = 1, …, N (N is the size of the collection), y is the target, w is a D-dimensional vector of parameters, and b is a real number.

We will use the model to predict y for a given x. To do that, we first need to find the optimal values w* and b*. The optimal values of the parameters define the model that makes the most accurate predictions.
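To make the model concrete, here is a minimal NumPy sketch of the prediction step; the weights, bias, and example below are illustrative values, not learned ones.

import numpy as np

w = np.array([0.5, -1.2, 3.0])   # illustrative D-dimensional weight vector (D = 3)
b = 4.0                          # illustrative bias

def predict(x, w, b):
    # f_w,b(x) = w . x + b : a linear combination of the features plus a bias
    return np.dot(w, x) + b

x = np.array([2.0, 0.5, 1.0])    # one example with D = 3 features
print(predict(x, w, b))          # 7.4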

To find the optimal w* and b*, we minimize the objective function below, also called the cost function or mean squared error (MSE). It is the average of all penalties, or squared losses (y′ − y)², obtained by applying the model to the training data:

L(w, b) = (1/N) * Σ_i ( f_w,b(x_i) − y_i )²

The idea is to minimize the distance between the linear model's predictions and the training samples, i.e. to find the best regression line.

We need to find the best regression line, that is, the one with the minimum distance between the predicted values and the training samples.

Note: why is the loss a quadratic function? Why not the absolute value of the difference between the true target and the predicted value, or a cube instead of a square? The answer is that squaring the error before summing is convenient: it gives a continuous derivative, which makes the function smooth. Smooth functions make it easier to find a closed-form solution to the optimization problem.
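As a small sketch of this cost function (the tiny dataset below is an illustrative stand-in for a real training set):

import numpy as np

def mse_loss(w, b, X, y):
    # Mean squared error: average of (prediction - target)^2 over all N examples
    predictions = X.dot(w) + b
    return np.mean((predictions - y) ** 2)

X = np.array([[1.0], [2.0], [3.0]])   # N = 3 examples, D = 1 feature (e.g. mileage)
y = np.array([3.0, 5.0, 7.0])
print(mse_loss(np.array([2.0]), 1.0, X, y))   # a perfect fit gives 0.0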

Now, to learn the weight and the bias, you could start with a guess and iteratively adjust it until you find the weight and bias with minimum loss. If you plot the loss for every value of w1, the plot will be convex (bowl shaped). This implies there is only one minimum, i.e. one place where the slope is 0 (regression problems yield convex loss-versus-weight plots). Calculating the loss for every value of w1 to find this convergence point would be inefficient, and that is why we need gradient descent.

Gradient descent is an iterative optimization algorithm for finding the minimum of a function. Note that gradient descent is not a machine learning algorithm; it is a solver for a minimization problem in which the function to minimize has a gradient (slope of a curve) or a derivative, i.e. is differentiable. Below are the major steps:

  1. It starts at an arbitrary point (initialization), say w = 0 and b = 0.
  2. It then calculates the partial derivative, or gradient, for every parameter. We can find the partial derivatives of the loss function with the help of the chain rule:

∂L/∂w = (2/N) * Σ_i x_i * ( f_w,b(x_i) − y_i )
∂L/∂b = (2/N) * Σ_i ( f_w,b(x_i) − y_i )

3. It proceeds in epochs. An epoch consists of using the entire training set to update each parameter. At each epoch, we update w and b using the partial derivatives; the learning rate α controls the size of an update:

w ← w − α * ∂L/∂w
b ← b − α * ∂L/∂b

We subtract the partial derivatives from the parameter values because derivatives indicate the growth of a function. If the derivative is positive, the function grows at this point, so to minimize we move to the left, and vice versa (a negative derivative means the function is decreasing and we need to move to the right); a small worked example follows the list below.

4. At the next epoch, we recalculate the partial derivatives using the equations above with the updated values of w and b; we continue the process until convergence.

5. When we see that the values of w and b no longer change much after each epoch, we stop.
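For a quick sanity check on the sign logic in step 3, take the toy one-parameter loss L(w) = w² (an illustrative function, not the cost function above). At w = 3 the derivative is dL/dw = 2w = 6, which is positive, so with a learning rate α = 0.1 the update w ← 3 − 0.1 * 6 = 2.4 moves w to the left, toward the minimum at w = 0.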

The learning rate is the hyperparameter here. A small value will make the training process slow, while a higher value might overshoot the minimum.
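Putting steps 1–5 together, below is a minimal from-scratch sketch of gradient descent for simple linear regression with one feature. The dataset, learning rate, and epoch count are illustrative assumptions, not values from the article.

import numpy as np

def gradient_descent(X, y, alpha=0.05, epochs=1000):
    # Step 1: start from an arbitrary point
    w, b = 0.0, 0.0
    N = len(y)
    for _ in range(epochs):
        # Step 2: partial derivatives of the MSE loss (via the chain rule)
        error = (w * X + b) - y
        dw = (2.0 / N) * np.sum(error * X)
        db = (2.0 / N) * np.sum(error)
        # Step 3: move against the gradient; alpha controls the step size
        w -= alpha * dw
        b -= alpha * db
        # Steps 4-5: in practice, also check whether w and b have stopped changing and stop early
    return w, b

# Illustrative data generated from y = 2x + 1
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
print(gradient_descent(X, y))   # approximately (2.0, 1.0)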

For a fuller implementation of gradient descent from scratch, refer to https://www.geeksforgeeks.org/linear-regression-implementation-from-scratch-using-python/

Gradient descent is sensitive to the choice of α (the learning rate) and is also slow on large datasets. Hence, variations such as stochastic gradient descent (SGD) and mini-batch stochastic gradient descent (mini-batch SGD) have been proposed. SGD uses only a single example (a batch size of 1) per iteration. Mini-batch SGD is a compromise between full-batch iteration and SGD; a mini-batch is typically between 10 and 1,000 examples, chosen at random.
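To make the distinction concrete, here is a rough sketch of one epoch of mini-batch SGD on the same toy setup as the earlier sketch; the batch size and learning rate are assumptions chosen for illustration. With batch_size = 1 it reduces to plain SGD, and with batch_size = len(y) it becomes full-batch gradient descent.

import numpy as np

def minibatch_sgd_epoch(X, y, w, b, alpha=0.05, batch_size=2):
    # Shuffle the examples, then update the parameters once per mini-batch
    indices = np.random.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = indices[start:start + batch_size]
        error = (w * X[batch] + b) - y[batch]
        w -= alpha * (2.0 / len(batch)) * np.sum(error * X[batch])
        b -= alpha * (2.0 / len(batch)) * np.sum(error)
    return w, b

# Run several epochs on the illustrative data from the previous sketch
w, b = 0.0, 0.0
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
for _ in range(1000):
    w, b = minibatch_sgd_epoch(X, y, w, b)
print(w, b)   # approaches roughly w = 2, b = 1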

ML engineers don’t implement gradient descent, instead use open source libraries like below Scikit learn Linear regression model :

from sklearn.linear_model import LinearRegression

# Fit a linear regression model on the training data
model = LinearRegression().fit(X_train, y_train)

Now you know what this one line of code is doing behind the scenes!
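If you want to connect that one line back to the math, the fitted scikit-learn model exposes the learned parameters as attributes; here X_train and y_train are assumed to be your existing training arrays:

print(model.coef_)                 # the learned weight vector w (one entry per feature)
print(model.intercept_)            # the learned bias b
print(model.predict(X_train[:5]))  # predictions f_w,b(x) for the first five training examples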

References:

  1. The Hundred-Page Machine Learning Book
  2. Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow
  3. Google Machine Learning Crash Course
  4. GeeksforGeeks
  5. https://www.javatpoint.com/gradient-descent-in-machine-learning (gradient descent diagram)
