All you need to know about Linear Regression and Gradient Descent in 7 minutes

Parichay Pothepalli
6 min read · Oct 26, 2022


This is the first deep dive in the Machine_Learning_A_to_Z series, which focuses primarily on the supervised learning task of Regression.

Please note 📝:- This series can be used as a quick refresher for Data Science interview preparation, as we go from ground-up to intermediate-level concepts. The blogs aim 🎯 at a fundamental, conceptual understanding of various ML algorithms and do not emphasize code implementations, as there are plenty of excellent resources out there for practicing on real-world data. Let's get started.

Linear Regression

Fig 1: A basic representation of a Regression line

The above image shows a fitted regression line. Linear Regression is a model that tries to find a linear (direct) relationship between two variables and to describe that relationship mathematically. This allows us to predict the value of ‘y’ given any ‘x’.

Note 📝:- Regression tasks deal with continuous target variables, i.e. variables that can take infinitely many values.

A sneak peek at the data:

Fig 2: Insurance dataset where x: [age, sex…]; y: [charges]

The intuition is to fit a line through the datapoints that minimizes the distance between the fitted line and the original datapoints, as shown in Fig 1. Our goal is to find a single line that is representative of the underlying data.

The different ways of training a Linear Regression:-

  1. Direct closed-form equation :- The familiar equation y = mx + c, whose parameters are computed directly (analytically) by minimizing the cost function over the training set.
Fig 3: A general regression line, y = β1·x + β0, where β0 = -2.29 and β1 = 4634.4

Interpretation of the Regression: (the most important yet often neglected part of regression is understanding what the coefficients in the equation tell us about the output/dependent variable ‘y’)

Assume the above regression is how age describes charges for each individual. An increase in ‘age’ by 1 will lead to an increase in ‘charges’ by $4,634. We shall discuss R² in detail in the upcoming blogs. For now, we can think of R² as how well ‘age’ is able to describe ‘charges’: variation in ‘age’ describes 70.2% of the variation in ‘charges’.
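To make the interpretation concrete, here is a quick check using the coefficients from Fig 3 (β0 = -2.29, β1 = 4634.4); the numbers and the helper function are purely illustrative:

```python
# Illustrative check of the interpretation, using the Fig 3 coefficients
b0, b1 = -2.29, 4634.4

def predicted_charges(age):
    # Prediction from the fitted line y = b1 * x + b0
    return b0 + b1 * age

# Increasing age by 1 increases the predicted charges by exactly b1
print(predicted_charges(31) - predicted_charges(30))  # ≈ 4634.4
```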

Note 📝:- The cost function measures the error (the difference between the original datapoints and the fitted line); training tries to minimize it.

  1. The β(0) term is what is called the bias/intercept term. β(1) is the coefficient of ‘x’, the input variable used to predict the output variable ‘y’. The equation is always of the form y = mx + c.
  2. To measure the performance of a regression model we use the RMSE (Root Mean Squared Error).

🎯 :- The main aim of Linear Regression is to find the values of β(0) and β(1) that minimize the RMSE.

RMSE is a frequently used measure of the difference between the values predicted by a model or estimator (Y’) and the observed values (Y).

Fig 4: RMSE-Performance Evaluation metric
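A minimal sketch of the RMSE computation with NumPy (the arrays below are made-up placeholder values, not the insurance data):

```python
import numpy as np

# y: observed values, y_pred: values predicted by the model (placeholder numbers)
y = np.array([1200.0, 4500.0, 3100.0])
y_pred = np.array([1500.0, 4200.0, 2900.0])

mse = np.mean((y - y_pred) ** 2)   # mean squared error
rmse = np.sqrt(mse)                # root mean squared error
print(rmse)
```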

To find the closed-form solution we use the Normal Equation, which we shall not cover in detail as it is not widely used in practice. The same can be implemented using the LinearRegression class from sklearn.linear_model, as sketched below.
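A minimal sketch of that implementation, assuming the insurance data has been saved as a CSV with ‘age’ and ‘charges’ columns (the file name and column names are assumptions based on Fig 2):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumed file and column names, based on the dataset shown in Fig 2
df = pd.read_csv("insurance.csv")
X = df[["age"]]      # input variable (2-D, as scikit-learn expects)
y = df["charges"]    # output variable

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])   # β0 and β1 of the fitted line
```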

Limitation:- The computational complexity of the closed-form solution grows roughly quadratically with the number of features, so it does not scale well to datasets with very many features. Hence we look to other solutions such as Gradient Descent and its variants.

2. Gradient Descent:

The idea of Gradient Descent is to tweak the parameters iteratively in order to minimize a cost function, which is usually MSE/MAE for regression and cross-entropy for classification problems.

Major steps involved:-

  1. Fill the parameter vector with random values (random initialization). Then improve it gradually until the algorithm converges at the minimum.
  2. An important parameter of Gradient Descent is the learning rate, which defines the size of the step the algorithm takes towards the minimum at each iteration.
Fig 5: Random initialization followed by decreasing error with each learning step.

In simple words, it is Linear Regression trained without the closed form: the solution is optimized iteratively, step by step, with respect to a cost function (MSE/RMSE/MAE in our case).
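As a rough illustration, here is a minimal from-scratch Gradient Descent loop for the line y = β1·x + β0 on synthetic data (the data, the learning rate, and the iteration count are all arbitrary choices for the sketch):

```python
import numpy as np

# Synthetic data: y ≈ 4 + 3x plus noise
rng = np.random.default_rng(42)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(size=100)

b0, b1 = 0.0, 0.0          # initial guess for the parameters
eta = 0.1                  # learning rate (step size)
n = len(x)

for _ in range(1000):
    y_pred = b0 + b1 * x
    # Partial derivatives of the MSE cost with respect to b0 and b1
    grad_b0 = (2 / n) * np.sum(y_pred - y)
    grad_b1 = (2 / n) * np.sum((y_pred - y) * x)
    b0 -= eta * grad_b0
    b1 -= eta * grad_b1

print(b0, b1)   # should end up close to 4 and 3
```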

Two problems you may face when applying Gradient descent:

  1. Learning rate is too small:
Fig 6: A low learning rate

We see that a very low learning rate means the algorithm takes a very long time to reach the global minimum; if training stops early, we end up with a poorly optimized solution.

2. Learning rate is too high:

Fig 7: A high learning rate

We see that a very high learning rate makes the algorithm bounce around the cost surface, and it may never settle at the global minimum.

Note:- Gradient Descent requires feature scaling: it is very sensitive to features with very different scales (and to outliers), which distort the cost surface and slow down convergence. A quick scaling sketch follows.
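A minimal sketch of scaling the inputs before running Gradient Descent, using scikit-learn's StandardScaler (the feature matrix here is just a placeholder):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrix; in practice these would be columns such as age, bmi, ...
X = np.array([[18.0, 20.1], [40.0, 31.5], [64.0, 27.3]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has zero mean and unit variance
print(X_scaled)
```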

3. Batch Gradient Descent:

So far we could compute the partial derivatives one datapoint at a time (which is the optimization used by SGD, Stochastic Gradient Descent); if instead we use the whole batch of data at each training step, what we get is Batch Gradient Descent.

For obvious reasons, each training step takes much longer on large datasets than a step that uses only a single instance.

A hyperparameter that has to be tuned per dataset is the learning rate. We can try to find the ideal value using Grid Search, which shall be discussed in future blogs.

When to stop training?

A very important question is when to stop fitting the data, i.e. how to control the number of steps for which we train on the entire batch.

A simple way is to use an early-stopping check (a callback): allow a large number of iterations, but stop as soon as the gradient becomes extremely small, indicating convergence to a minimum.
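One way to sketch this stopping rule, continuing from the earlier from-scratch loop (the tolerance value is an arbitrary choice):

```python
# Reuses x, y, n, eta, and np from the earlier Gradient Descent sketch
b0, b1 = 0.0, 0.0
tol = 1e-6   # arbitrary tolerance: stop once the gradient is effectively zero

for step in range(100_000):
    y_pred = b0 + b1 * x
    grad_b0 = (2 / n) * np.sum(y_pred - y)
    grad_b1 = (2 / n) * np.sum((y_pred - y) * x)
    if np.sqrt(grad_b0**2 + grad_b1**2) < tol:   # gradient ~ 0 -> converged
        break
    b0 -= eta * grad_b0
    b1 -= eta * grad_b1
```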

4. Stochastic Gradient Descent:

Now imagine Linear Regression trained with Gradient Descent, but where we do not use the entire batch at each step as Batch Gradient Descent does.

SGD picks a random instance from the training data at each step and computes the gradients from that single instance. The randomness can be good, as it helps escape local minima, but can be bad, as the solution bounces around and may take longer to converge.

Fig 8: A comparison between Gradient Descent and Stochastic Gradient Descent

As the figure above shows, the right plot bounces around much more but still ends up close to the same solution as Gradient Descent.

A good rule of thumb is to start with a relatively large learning rate and then gradually decrease it (a learning schedule), so that there is less bouncing around as we converge.
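A minimal sketch of the per-instance update, reusing the synthetic x, y, n, and rng from the earlier batch sketch; the particular decreasing learning-rate schedule below is just an illustrative choice:

```python
b0, b1 = 0.0, 0.0

for epoch in range(50):
    for i in range(n):
        j = rng.integers(n)                    # pick one random instance
        y_pred = b0 + b1 * x[j]
        grad_b0 = 2 * (y_pred - y[j])          # gradients from a single datapoint
        grad_b1 = 2 * (y_pred - y[j]) * x[j]
        eta = 0.1 / (1 + 0.01 * (epoch * n + i))   # gradually decreasing learning rate
        b0 -= eta * grad_b0
        b1 -= eta * grad_b1

print(b0, b1)   # noisier path, but should still land near 4 and 3
```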

Implementation: We can implement Stochastic GD using the SGDRegressor class from sklearn.linear_model, which minimizes the squared error loss by default.
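A minimal sketch using SGDRegressor, with feature scaling included since SGD is sensitive to it (the file and column names are again assumptions based on Fig 2):

```python
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed file and column names, based on the dataset shown in Fig 2
df = pd.read_csv("insurance.csv")
X = df[["age"]]
y = df["charges"]

# Scale the feature, then fit SGD with its default squared-error loss
sgd = make_pipeline(StandardScaler(),
                    SGDRegressor(max_iter=1000, tol=1e-3, random_state=42))
sgd.fit(X, y)

print(sgd.predict(pd.DataFrame({"age": [40]})))   # predicted charges for age 40
```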

I have tried to keep this first article short and crisp. I hope it has given you a good idea of what Linear Regression (with one variable) is and of the different ways to train a Linear Regression model.

The next article focuses on the 4 most overlooked yet important assumptions of Linear Regression.

Thanks for reading, and stay tuned for more content. If you like it, please drop a clap or a comment. Any feedback or suggestions for improvement would be highly appreciated.

