All you need to know about Linear Regression (Theory)

Sachin Dev
5 min read · Dec 21, 2022


Linear Regression is a Machine Learning (ML) algorithm used to estimate the relationship between quantitative variables. It is based on the concept of finding a linear relationship between the input variables and the output variable.

We can use Linear Regression to:

  1. Understand the strength of the relationship between the dependent and independent features.
  2. Predict the value of the dependent variable for a given value of the independent variable.

Examples of Linear Regression:

  1. Let’s say we have a dataset with only two columns (height and weight). The aim is to create a model that takes height as input and predicts the weight (a short code sketch of this follows the list).
  2. Create a model that takes the number of rooms as input and predicts the price of the house.
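As one possible sketch of the first example, a model like this can be fit in a few lines with scikit-learn. The numbers here are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: heights in cm, weights in kg
heights = np.array([[150], [160], [170], [180], [190]])  # 2-D: (samples, features)
weights = np.array([50, 56, 63, 71, 80])

model = LinearRegression()
model.fit(heights, weights)

print(model.coef_[0], model.intercept_)  # learned slope and intercept
print(model.predict([[175]]))            # predicted weight for a height of 175 cm
```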

Graphical Intuition of Linear Regression

The main aim of Linear Regression is to find the best fit line such that the sum of the squared differences between the actual points and the predicted points is minimal. The differences between the predicted and the actual points are called residuals or errors.

Linear Regression

Here, in the above graph, the blue points are the predicted points, and the differences between the predicted points and the real points (the points on the red line) are the residuals or errors. Linear Regression tries to reduce these residuals; in other words, the sum of the squared errors should be minimal.
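To make this concrete, here is a small sketch (with made-up points and an assumed candidate line y = 2x + 1) of how residuals are computed:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y_actual = np.array([3.2, 4.8, 7.1, 8.9])  # made-up observations

y_pred = 2 * x + 1            # predictions from the candidate line y = 2x + 1

residuals = y_actual - y_pred
print(residuals)              # [ 0.2 -0.2  0.1 -0.1]
print(np.sum(residuals ** 2)) # ~0.1 -- the sum of squared residuals to minimize
```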

Equation of a Straight Line

To understand linear regression, it is important to first understand the concept of linear equations or equations of a straight line.

  • y = mx + b, where m is the slope and b is the intercept.
  • In most research papers it is written with theta notation instead: y = θ0 + θ1x, where θ0 is the intercept and θ1 is the slope.
  1. Intercept: when the value of the independent variable (x) is zero, the point where the best fit line meets the y-axis is called the intercept.
  2. Slope: it tells us how much the dependent variable (y) changes for every unit change in the independent variable (x) (see the short sketch below).
Image Source: w3schools
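As a quick sketch, NumPy’s polyfit can recover the slope and intercept from points that lie on a known line (the line y = 2x + 3 here is made up for illustration):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 3                               # points exactly on the line y = 2x + 3

slope, intercept = np.polyfit(x, y, deg=1)  # fit a degree-1 polynomial, i.e. a line
print(slope, intercept)                     # -> 2.0, 3.0
```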

Cost Function

Linear regression’s cost function evaluates how well the model predicts the dependent variable from the values of the independent variable. It is also what we minimize to find the optimal values of the model’s parameters (slope and intercept), i.e. the values that make the errors (the differences between the predicted and actual values of the dependent variable) as small as possible. In short, the cost function is a measure of the model’s performance.

Example: Let’s say we have a dataset with only two columns (Height and Age). We want to predict people’s heights based on their ages. We can use linear regression by fitting a line to the data.

The cost function would then be used to measure the difference between the predicted heights and the actual heights of the people. The goal would be to find the values of the model’s parameters that result in the smallest difference between the predicted and actual heights. Here, θ0 (the intercept) and θ1 (the slope) are those parameters.

Cost Function: J(θ0, θ1) = (1/2m) · Σ (hθ(xᵢ) − yᵢ)², where hθ(x) = θ0 + θ1x and m is the number of training examples.
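A minimal sketch of this cost function in Python, assuming the 1/(2m) convention written above (the ages and heights are made up):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    # J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2), with h(x) = theta0 + theta1*x
    predictions = theta0 + theta1 * x
    return np.mean((predictions - y) ** 2) / 2

ages = np.array([10.0, 12.0, 14.0, 16.0])         # made-up ages
heights = np.array([140.0, 150.0, 160.0, 170.0])  # made-up heights in cm

print(cost(90.0, 5.0, ages, heights))  # 0.0  -- this line fits the made-up data exactly
print(cost(80.0, 5.0, ages, heights))  # 50.0 -- a worse line gives a higher cost
```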

Gradient Descent

As mentioned, the cost function measures the difference between the predicted and the actual points (the residuals), so we need to minimize it. This can be done using Gradient Descent.

Gradient Descent is an optimization algorithm used to find the values of the model’s parameters (like slope and intercept) that minimize the errors between predicted and actual data points. It works by starting with initial values of the parameters and then repeatedly adjusting them in the direction that reduces the error, i.e. toward the global minimum. Only when the parameters come close to the global minimum (the point of minimum cost) is the cost function as small as it can be; for linear regression with a squared-error cost, the cost surface is convex, so there is a single global minimum.

Source Image1: Saugatbhattarai, Source Image2: Javapoint

Example: Suppose you are standing at the top of a mountain and trying to find the quickest way to the bottom. You could take a direct path straight down the mountain, but that might not be the quickest or safest route. Instead, you might take a series of smaller steps, each time adjusting your direction slightly based on the terrain in front of you. This is similar to how gradient descent works.
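Here is a minimal sketch of gradient descent for the two parameters θ0 and θ1, using the update rules that follow from the cost function above (the learning rate and iteration count are arbitrary choices for illustration):

```python
import numpy as np

def gradient_descent(x, y, lr=0.01, epochs=2000):
    theta0, theta1 = 0.0, 0.0              # start from initial parameter values
    for _ in range(epochs):
        error = (theta0 + theta1 * x) - y  # predicted minus actual
        # Partial derivatives of the cost J with respect to theta0 and theta1
        theta0 -= lr * error.mean()
        theta1 -= lr * (error * x).mean()
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1                              # synthetic data on the line y = 2x + 1

theta0, theta1 = gradient_descent(x, y)
print(theta0, theta1)                      # should approach 1.0 and 2.0
```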

Learning Rate

The learning rate decides the speed of convergence: it determines the step size the optimization algorithm takes when adjusting the model’s parameters. If the learning rate is too high, the updates are drastic and the parameters can overshoot the minimum, so convergence may never happen. If the learning rate is too small, it will take a huge number of steps to reach convergence.

Source: Niser.ac.in
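To see this behaviour, one can rerun the gradient-descent sketch above with a few different learning rates on the same made-up data (the specific values are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1                                  # same made-up data as before

for lr in (0.2, 0.01, 0.0001):                 # too large, reasonable, too small
    theta0 = theta1 = 0.0
    for _ in range(100):                       # a fixed budget of 100 updates
        error = (theta0 + theta1 * x) - y
        theta0 -= lr * error.mean()
        theta1 -= lr * (error * x).mean()
    mse = np.mean(((theta0 + theta1 * x) - y) ** 2)
    print(f"lr={lr}: MSE after 100 steps = {mse:.6g}")
```

With these numbers, the largest rate overshoots and the cost blows up, the middle one converges quickly, and the smallest barely makes progress within the 100 steps.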

Cost Functions

  • Mean Squared Error (MSE): it is calculated by taking the sum of the squared differences between the predicted and actual points and then dividing it by the total number of observations n: MSE = (1/n) Σ (yᵢ − ŷᵢ)²
  • Root Mean Squared Error (RMSE): it is given by taking the square root of MSE: RMSE = √MSE
  • Mean Absolute Error (MAE): the formula of MAE is very similar to MSE, but instead of squaring the difference between the actual and predicted values we take its absolute value (all three metrics are sketched in code below): MAE = (1/n) Σ |yᵢ − ŷᵢ|
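As a quick sketch (with made-up numbers), all three metrics can be computed directly with NumPy:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # made-up actual values
y_pred = np.array([2.5, 5.5, 6.0, 9.5])  # made-up predictions

mse = np.mean((y_true - y_pred) ** 2)    # average squared difference
rmse = np.sqrt(mse)                      # square root of MSE
mae = np.mean(np.abs(y_true - y_pred))   # average absolute difference
print(mse, rmse, mae)                    # 0.4375  ~0.6614  0.625
```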

Thanks for reading this article! Leave a comment below if you have any questions. You can follow me on LinkedIn and GitHub.
