Understanding Linear Regression and its Python implementation

Bhartendu Dubey
Published in Analytics Vidhya · Jan 5, 2020

What is Linear Regression?

Linear regression is a basic and commonly used type of predictive analysis. It is used for finding a linear relationship between a target and one or more predictors. The overall idea of regression is to examine two things:

  1. Does a set of predictor variables do a good job of predicting an outcome (dependent) variable?
  2. Which variables in particular are significant predictors of the outcome variable, and in what way do they impact the outcome variable?

These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables. The simplest form of the regression equation with one dependent and one independent variable is defined by the formula

y = b1*x + b0,

where

y = estimated dependent variable,

b0 = constant,

b1 = regression coefficient,

x = independent variable.
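As a quick illustration, the equation can be evaluated directly in Python (the coefficient values below are made up, not fitted from any data):

```python
# Simple linear regression prediction: y = b1*x + b0
# b0 and b1 here are illustrative values, not estimates from a dataset.
def predict(x, b0=1.0, b1=2.0):
    """Return the estimated dependent variable for a given x."""
    return b1 * x + b0

print(predict(3.0))  # 2.0*3.0 + 1.0 = 7.0
```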

The general equation for multiple linear regression with ‘p’ independent variables looks like:

y = b0 + b1*x1 + b2*x2 + … + bp*xp

There are two types of linear regression:

  • Simple
  • Multiple

The difference between simple linear regression and multiple linear regression is that multiple linear regression has more than one independent variable, whereas simple linear regression has only one.

Simple Linear Regression

Simple linear regression is used for finding the relationship between two continuous variables: one is the predictor/independent variable and the other is the response/dependent variable. It looks for a statistical relationship, not a deterministic one (a relationship between two variables is deterministic if one variable can be expressed exactly in terms of the other).

For example, a temperature in degrees Celsius can be converted exactly to degrees Fahrenheit, so that relationship is deterministic. A statistical relationship, by contrast, does not determine one variable exactly from the other; the relationship between height and weight is a typical example.
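The Celsius-to-Fahrenheit case is deterministic because the conversion formula is exact, with no error term:

```python
# Deterministic relationship: Fahrenheit is fully determined by Celsius.
def celsius_to_fahrenheit(c):
    return c * 9.0 / 5.0 + 32.0

print(celsius_to_fahrenheit(100.0))  # 212.0, exactly, every time
```

Height vs. weight has no such formula; any line we fit will leave some unexplained scatter.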

The core idea is to obtain a line that best fits the data, i.e. one for which the total prediction error is as small as possible (error = the distance between a point and the regression line).

Now, the question is: how do we obtain the best-fit line?

We know that Y(pred) = b0 + b1*x.

The values b0 and b1 must be chosen so that they minimize the error. If the sum of squared errors is taken as the metric to evaluate the model, then the goal is to obtain the line that minimizes that sum.

Error Calculation

If we did not square the errors, then positive and negative errors would cancel each other out.
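A tiny numeric sketch of why squaring matters: two equal and opposite errors sum to zero, but their squares do not.

```python
errors = [2.0, -2.0]              # one under-prediction, one over-prediction
print(sum(errors))                # 0.0: raw errors cancel, hiding both mistakes
print(sum(e**2 for e in errors))  # 8.0: squared errors accumulate
```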

For a model with one predictor, the coefficient and intercept are calculated as:

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

b0 = ȳ − b1*x̄

where x̄ and ȳ are the means of x and y.
  • If b1 > 0, then x (predictor) and y (target) have a positive relationship (an increase in x increases y).
  • If b1 < 0, then x (predictor) and y (target) have a negative relationship (an increase in x decreases y).
  • If the model is evaluated at x = 0, the prediction is just b0, which may be meaningless if x = 0 lies outside the scope of the data.
  • If there is no b0 term, the regression line is forced through the origin; both the regression coefficient and the predictions will be biased in this case.

E.g. consider a dataset that relates height (x) and weight (y). Taking x = 0 (that is, a height of zero), the equation reduces to the b0 value alone, which is completely meaningless, as in reality height and weight can never be zero. This comes from applying the model beyond the scope of its data.
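The intercept and coefficient formulas above can be sketched in plain Python. The data points below are made up to lie exactly on y = 2x + 1, so the fit should recover those values:

```python
# Least-squares estimates for simple linear regression:
#   b1 = sum((xi - mean_x) * (yi - mean_y)) / sum((xi - mean_x)^2)
#   b0 = mean_y - b1 * mean_x
def fit_simple(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Made-up points lying exactly on y = 2x + 1
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
b0, b1 = fit_simple(x, y)
print(b0, b1)  # 1.0 2.0
```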


Randomness and unpredictability are the two main components of a regression model:

Prediction = Deterministic + Stochastic

The deterministic part is covered by the predictor variables in the model.

The stochastic part reflects the fact that the difference between the expected and observed value is unpredictable.

There will always be some information the model fails to capture. This information can be found in the residuals.

Let’s understand the concept of a residual through an example. Consider a dataset that predicts juice sales given the temperature of a place. The value predicted from the regression equation will always differ somewhat from the actual value; sales will not match the true output exactly. This difference is called the residual.

Characteristics of residuals:

  • Residuals should not exhibit any pattern.
  • Adjacent residuals should not be correlated; if they are, there is some information the model has missed.

A residual plot helps in analyzing the model using the residual values. It plots the (standardized) residuals against the predicted values. The distance of a point from 0 specifies how bad the prediction was for that value: if the residual is positive, the prediction was too low; if it is negative, the prediction was too high; a value of 0 indicates a perfect prediction. Therefore, detecting a residual pattern can help improve the model.
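Residuals can be computed directly as observed minus predicted values. The line and observations below are illustrative, not a real fit:

```python
# Residual = observed y - predicted y
# Illustrative line y_hat = 2*x + 1 against some made-up observations.
x      = [1.0, 2.0, 3.0, 4.0]
y_obs  = [3.2, 4.8, 7.1, 8.9]
y_pred = [2.0 * xi + 1.0 for xi in x]
residuals = [yo - yp for yo, yp in zip(y_obs, y_pred)]
print(residuals)  # approximately [0.2, -0.2, 0.1, -0.1]
```

Plotting these residuals against `y_pred` (e.g. with matplotlib) is what the residual plot described above does; here they alternate sign with no trend, which is what a healthy fit looks like.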

Residual analysis

A non-random pattern in the residual plot indicates that the model is:

  • missing a variable that contributes significantly to the target;
  • failing to capture non-linearity (consider adding a polynomial term);
  • missing an interaction between terms in the model.

Applications:

  • It is used in the capital asset pricing model for analyzing and quantifying the systematic risk of an investment.
  • Linear regression is the predominant empirical tool in economics; for example, it is used to predict labor demand and labor supply.
  • In Canada, the Environmental Effects Monitoring Program uses statistical analyses on fish and benthic surveys to measure the effects of pulp mill or metal mine effluent on the aquatic ecosystem.

Python Implementation

Dataset: the data set contains information about money spent on advertising and the sales it generated (money was spent on TV, radio and newspaper ads).

link : https://github.com/bhartendudubey/Supervised-Learning-Algorithms/blob/master/dataset.csv

Aim : The objective is to use linear regression to understand how advertisement spending impacts sales.
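A minimal sketch of the fitting step using scikit-learn, assuming the dataset has columns named `TV`, `radio`, `newspaper` and `sales` (the column names are an assumption based on the dataset description). For real use, replace the inline frame with `pd.read_csv(...)` on the linked file:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in for: df = pd.read_csv("dataset.csv")
# These few rows and column names are illustrative, not the linked file.
df = pd.DataFrame({
    "TV":        [230.1, 44.5, 17.2, 151.5, 180.8, 8.7],
    "radio":     [37.8, 39.3, 45.9, 41.3, 10.8, 48.9],
    "newspaper": [69.2, 45.1, 69.3, 58.5, 58.4, 75.0],
    "sales":     [22.1, 10.4, 9.3, 18.5, 12.9, 7.2],
})

X = df[["TV", "radio", "newspaper"]]  # advertising spend (predictors)
y = df["sales"]                       # generated sales (target)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # fitted b0 and [b1, b2, b3]
```

`model.predict(X)` then gives the sales estimates, and `y - model.predict(X)` gives the residuals discussed above.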


Here’s the Jupyter notebook for the Python implementation of linear regression.

link: https://github.com/bhartendudubey/Supervised-Learning-Algorithms/blob/master/Linear_regression.ipynb
