Understanding Linear Regression

Anuj Vyas · Published in Analytics Vidhya · Jul 12, 2020 · 7 min read

Introduction

This article attempts to be the reference you need when it comes to understanding the Linear Regression algorithm using Gradient Descent. Although this algorithm is simple, only a few truly understand its mathematics and underlying principles.

I hope this article finds its way to your bookmarks! For now, let’s dive deep into it!

Table of Contents

  • What is meant by Linear Regression?
  • Working of Simple Linear Regression
  • What is a Loss function?
  • What is Gradient Descent?
  • Overview of Multiple Linear Regression
  • Assumptions of Linear Regression

What is meant by Linear Regression?

Linear means along a straight line, and Regression means a measure of the relationship between variables. Linear Regression is therefore a linear relationship between the input data (independent variables) and the output (target variable).

Types of Linear Regression

  • Simple Linear Regression
  • Multiple Linear Regression

Working of Simple Linear Regression

Figure 1: Simple Linear Regression

Simple Linear Regression is used for finding the relationship between two continuous variables i.e. finding the relationship between the independent variable (predictor) and the dependent variable (response).

The crux of the Simple Linear Regression algorithm is to obtain a line that best fits the data. This is done by minimizing the loss function, which will be discussed later in this blog.
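For a quick taste, here is a minimal scikit-learn sketch of fitting such a best-fit line on a made-up Years of Experience vs. Salary dataset (note that scikit-learn's LinearRegression solves for the line with ordinary least squares rather than the gradient descent approach described below):

```python
# Minimal sketch: fitting a best-fit line with scikit-learn (toy, made-up numbers)
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])             # Years of Experience (predictor)
y = np.array([35000, 42000, 50000, 59000, 66000])   # Salary (response)

model = LinearRegression().fit(X, y)
print("slope m:", model.coef_[0])                   # slope of the fitted line
print("intercept c:", model.intercept_)             # y-intercept of the fitted line
print("predicted salary for 6 years:", model.predict([[6]])[0])
```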

Figure 2 shows the equation of the regression line, y = m·x + c, where c is the y-intercept, m is the slope of the line with respect to the independent feature x, and y is the predicted value (also denoted as ŷ).

Figure 2: Regression Line Equation for Simple Linear Regression

What does ‘m’ denote?

  • If m > 0, then X (predictor) and Y (target) have a positive relationship. This means the value of Y will increase with an increase in the value of X.
  • If m < 0, then X (predictor) and Y (target) have a negative relationship. This means the value of Y will decrease with an increase in the value of X.

What does ‘c’ denote?

  • It is the value of Y when X = 0. For example, suppose we plot a graph in which the X-axis is Years of Experience (independent feature) and the Y-axis is Salary (dependent feature). The Salary when Years of Experience = 0 is what ‘c’ denotes.

Now that you have understood the theory about the regression line, let’s discuss how we can select the best-fit regression line for a particular model using loss functions.

What is a Loss function?

Figure 3: Simple Linear Regression with Loss

The loss function is the function that computes the distance between the current output of the algorithm and the expected output. It is a way to evaluate how well your algorithm models the data. Loss functions can be categorized into two groups: those for classification (discrete values, e.g. 0, 1, 2, …) and those for regression (continuous values).

Examples of Loss functions

The terms cost function and loss function refer to almost the same thing. The loss function is a value calculated for every single instance, whereas the cost function is the average of the loss over all instances. So, within a single training cycle the loss is calculated numerous times, but the cost function is only calculated once.
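As a tiny sketch of this distinction, using squared error as the loss:

```python
# Minimal sketch: loss per instance vs. cost as the average of those losses
y_true = [3.0, 5.0, 7.0]   # actual values (made-up)
y_pred = [2.5, 5.5, 6.0]   # predicted values (made-up)

losses = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]  # squared-error loss, one per instance
cost = sum(losses) / len(losses)                         # cost = average of the losses
print(losses)  # [0.25, 0.25, 1.0]
print(cost)    # 0.5
```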

Note: For algorithms relying on Gradient descent to optimize model parameters, every loss function selected has to be differentiable.

Consider the loss function to be Mean Squared Error (MSE) in our case. Figure 4 shows the formula of this loss function, where n is the number of samples in the dataset, Yᵢ is the actual value and Ŷᵢ is the corresponding predicted value for the iᵗʰ data point.

Figure 4: Formula for Mean Squared Error
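As a minimal sketch (assuming NumPy array or list inputs), the MSE from Figure 4 can be computed as:

```python
# Minimal sketch: Mean Squared Error, MSE = (1/n) * sum((Y_i - Yhat_i)^2)
import numpy as np

def mean_squared_error(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mean_squared_error([3.0, 5.0, 7.0], [2.5, 5.5, 6.0]))  # 0.5, same data as above
```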

What is Gradient Descent?

Gradient descent is an iterative optimization algorithm to find the minimum of a function. Here, that function refers to the Loss function.

Also, when using Gradient descent, the term learning rate comes into the picture. The learning rate is denoted by α, and this parameter controls how much the values of m and c should change after each iteration/step. Selecting the proper value of α is also important, as shown in Figure 5.

Figure 5: Cases of Learning Rate

We start with initial values of m and c of 0.0 and set a small value for the learning rate, e.g. α = 0.001. For these values we calculate the error using our loss function. For different values of m and c, we will get different error values, as shown in Figure 6.

Figure 6: Working of Gradient Descent (Simple Linear Regression)

Once the initial values are selected, we then find the partial derivatives of the loss function with respect to m and c by applying the chain rule.

Figure 7: Partial derivatives (Slopes) of loss function w.r.t m and c
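As a minimal sketch, assuming the loss E is the MSE defined earlier with Ŷ = m·x + c, the two partial derivatives (slopes) work out to ∂E/∂m = (−2/n)·Σ xᵢ(Yᵢ − Ŷᵢ) and ∂E/∂c = (−2/n)·Σ (Yᵢ − Ŷᵢ), which can be computed as:

```python
# Minimal sketch: partial derivatives of the MSE loss w.r.t. m and c
import numpy as np

def gradients(x, y, m, c):
    """Return dE/dm and dE/dc for E = (1/n) * sum((y_i - (m*x_i + c))^2)."""
    y_pred = m * x + c
    error = y - y_pred
    dm = (-2.0 / len(x)) * np.sum(x * error)   # slope of the loss w.r.t. m
    dc = (-2.0 / len(x)) * np.sum(error)       # slope of the loss w.r.t. c
    return dm, dc
```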

Once the slopes are calculated, we update the values of m and c using the formula shown in Figure 8: m = m − α·(∂E/∂m) and c = c − α·(∂E/∂c).

Figure 8: Formula for updating m and c

  • If the slope at the particular point is negative then the value of m and c increases and the point shifts towards the right side by a small distance as seen in Figure 6.
  • If the slope at the particular point is positive then the value of m and c decreases and the point shifts towards the left side by a small distance.

The Gradient descent algorithm keeps on changing the values of m and c until the loss becomes very small or (ideally) zero, and this is how we find our Best Fit Regression Line.
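Putting the pieces together, here is a minimal from-scratch sketch of the whole loop (the data is made up, and a fixed number of iterations is used instead of checking whether the loss has stopped shrinking):

```python
# Minimal sketch: gradient descent for simple linear regression, y ≈ m*x + c
import numpy as np

def fit_simple_linear_regression(x, y, alpha=0.001, epochs=100_000):
    m, c = 0.0, 0.0                            # initial values of slope and intercept
    n = len(x)
    for _ in range(epochs):
        y_pred = m * x + c
        error = y - y_pred
        dm = (-2.0 / n) * np.sum(x * error)    # dE/dm
        dc = (-2.0 / n) * np.sum(error)        # dE/dc
        m -= alpha * dm                        # m = m - alpha * dE/dm
        c -= alpha * dc                        # c = c - alpha * dE/dc
    return m, c

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])       # roughly y = 2x + 1
m, c = fit_simple_linear_regression(x, y)
print(m, c)                                    # should end up close to m ≈ 2 and c ≈ 1
```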

Graphical representation of Gradient descent

Now that you have understood the complete working of the Simple Linear Regression algorithm, let’s see how the Multiple Linear Regression algorithm works.

Overview of Multiple Linear Regression

Figure 9: Multiple Linear Regression (Graph)

In real-life scenarios, there is rarely a single feature that predicts the target on its own. Hence, we perform Multiple Linear Regression.

The equation shown in Figure 10 is very similar to the equation for simple linear regression; we simply add more independent features/predictors and their corresponding coefficients, e.g. ŷ = c + m₁x₁ + m₂x₂ + … + mₙxₙ.

Figure 10: Multiple Linear Regression (Regression Plane Equation)

Note: The working of the algorithm remains the same, the only thing that changes is the Gradient Descent graph. In Simple Linear Regression, the gradient descent graph was in 2D form, but as the number of independent features/predictors increases, the gradient descent graph’s dimensions also keep on increasing.
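As a minimal sketch, the same gradient descent loop can be written in vectorized NumPy form so that it handles any number of independent features (the toy data and the learning rate here are illustrative assumptions):

```python
# Minimal sketch: vectorized gradient descent for multiple linear regression
import numpy as np

def fit_multiple_linear_regression(X, y, alpha=0.01, epochs=10_000):
    n, d = X.shape
    w = np.zeros(d)                       # one coefficient per independent feature
    b = 0.0                               # intercept
    for _ in range(epochs):
        y_pred = X @ w + b
        error = y - y_pred
        dw = (-2.0 / n) * (X.T @ error)   # gradient w.r.t. each coefficient
        db = (-2.0 / n) * np.sum(error)   # gradient w.r.t. the intercept
        w -= alpha * dw
        b -= alpha * db
    return w, b

# Toy example: y = 3*x1 + 2*x2 + 5 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + 5 + rng.normal(0, 0.01, size=100)
w, b = fit_multiple_linear_regression(X, y)
print(w, b)                               # should end up close to [3, 2] and 5
```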

Figure 11 shows a Gradient Descent graph in a 3D format, where A is the initial weight/starting point and B is the global minimum.

Figure 11: Gradient Descent (Multiple Linear Regression)

Figure 12 shows the complete working of the Gradient Descent in 3D format from A to B.

Figure 12: Working of Gradient Descent (Multiple Linear Regression)

Assumptions of Linear Regression

The following are the fundamental assumptions of Linear Regression, which help answer the question of whether we can use a linear regression algorithm on a particular dataset at all (a rough code sketch for checking some of them follows the list).

  • A linear relationship between features and the target variable:
    Linear Regression assumes that the relationship between the independent features and the target is linear; it does not support anything else. You may need to transform the data to make the relationship linear (e.g. a log transform for an exponential relationship).
Example of Linear Relationship
  • Little or No Multicollinearity between features:
    Multicollinearity exists when the independent variables are found to be moderately or highly correlated. In a model with correlated variables, it becomes a tough task to figure out the true relationship of predictors with the target variable. In other words, it becomes difficult to find out which variable is actually contributing to predict the response variable.
Example of Multicollinearity
  • Little or No Autocorrelation in residuals:
    The presence of correlation in error terms drastically reduces the model’s accuracy. This usually occurs in time series models where the next instant is dependent on the previous instant. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard error.
Example of Autocorrelation
  • No Heteroscedasticity:
    The presence of non-constant variance in the error terms results in heteroscedasticity. Generally, non-constant variance arises in the presence of outliers; such values get too much weight and thereby disproportionately influence the model’s performance.
Example of Heteroscedasticity
  • Normal distribution of error terms:
    If the error terms are not normally distributed, confidence intervals may become too wide or too narrow. Once the confidence intervals become unstable, it becomes difficult to estimate coefficients based on least squares. The presence of a non-normal distribution suggests that there are a few unusual data points that must be studied closely to build a better model.
Example of Normal distribution of error terms
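As a rough sketch, some of these assumptions can be checked in code with statsmodels and SciPy (the toy data, thresholds, and choice of tests are illustrative assumptions, not the only way to check them):

```python
# Minimal sketch: rough checks for multicollinearity, autocorrelation and normality of residuals
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                        # toy predictors
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=200)     # toy target

X_const = sm.add_constant(X)                         # add the intercept column
model = sm.OLS(y, X_const).fit()                     # ordinary least squares fit
residuals = model.resid

# Multicollinearity: VIF near 1 is good; values above roughly 5-10 signal trouble
vif = [variance_inflation_factor(X_const, i) for i in range(1, X_const.shape[1])]
print("VIF per feature:", vif)

# Autocorrelation in residuals: a Durbin-Watson statistic near 2 means little autocorrelation
print("Durbin-Watson:", durbin_watson(residuals))

# Normality of error terms: a Shapiro-Wilk p-value above 0.05 is consistent with normality
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)
```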

That’s all folks! Click Here to read my blog on Logistic Regression.

If you learned something from this blog, make sure you give it a 👏🏼
Will meet you in some other blog, till then Peace ✌🏼
