Linear Regression

Kaustubh N
SomX Labs
Feb 1, 2017 · 3 min read

Linear Regression is one of the oldest and simplest techniques, and it is still used heavily to this day. Let's see what linear regression is.

Linear regression is a statistical technique that helps us summarize and analyze the relationship between two continuous variables: an independent variable and a dependent variable.

  1. The input variable is called the predictor variable or the independent variable. Generally denoted by the letter X or x.
  2. The output variable is called the response variable or the dependent variable. Generally denoted by the letter Y or y.

Equation:

Linear regression is described by the equation below:

Y = X * W + B

If you observe carefully, this is the equation of a line with slope ‘W’ and intercept ‘B’.
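
As a quick illustration, here is a minimal NumPy sketch of this equation; the values of W and B are made up purely for the example:

```python
import numpy as np

# Hypothetical slope and intercept, chosen just for illustration
W = 2.0  # slope
B = 1.0  # intercept

x = np.array([0.0, 1.0, 2.0, 3.0])  # predictor values

# Y = X * W + B: the model's prediction for each x
y = x * W + B
print(y)  # [1. 3. 5. 7.]
```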

There are two types of linear regressions:

  1. Simple Linear Regression
  2. Multiple Linear Regression

Simple Linear Regression

In simple linear regression there is a single predictor variable and a response variable.

Multiple Linear Regression

In multiple linear regression there are multiple predictor variables and a single response variable.

Simple Linear Regression. Source: Wikipedia
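
To make the two flavours concrete, here is a minimal sketch using NumPy's least-squares solver on made-up data; the numbers are illustrative, not from any real dataset:

```python
import numpy as np

# --- Simple linear regression: one predictor, one response ---
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Design matrix [x, 1] so the solver estimates slope W and intercept B together
A = np.column_stack([x, np.ones_like(x)])
W, B = np.linalg.lstsq(A, y, rcond=None)[0]
print("simple:   W =", round(W, 2), " B =", round(B, 2))

# --- Multiple linear regression: several predictors, one response ---
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y2 = np.array([5.0, 5.5, 11.0, 11.5, 16.0])

A2 = np.column_stack([X, np.ones(len(X))])
coef = np.linalg.lstsq(A2, y2, rcond=None)[0]
W_vec, B2 = coef[:-1], coef[-1]  # one weight per predictor, plus intercept
print("multiple: W =", W_vec, " B =", B2)
```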

Line of best fit

What is a line of best fit? Let's find out.

In linear regression we can fit many different lines to the data; among them, the line that best summarizes the trend of the data is called the line of best fit.

Have a look at the graph below.

Source: Wikipedia

There are three trend lines predicting the trend here: the green, the black, and the blue. All of them are valid linear models, but which one is the line of best fit?

The line that minimizes the squared error between the points and the line is the line of best fit.

In linear regression we define the error as the vertical distance between a point and the line (the residual). The error is squared so that negative distances do not cancel out positive ones in the error metric.
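
To ground this, here is a small sketch with made-up data and three hypothetical candidate lines (standing in for the coloured lines above); it computes the sum of squared errors for each and picks the smallest:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# A few candidate (slope, intercept) pairs, like the coloured lines above
candidates = {"green": (1.5, 1.0), "black": (2.0, 0.0), "blue": (2.5, -1.0)}

def squared_error(w, b):
    # Residual = vertical distance between each point and the line
    residuals = y - (w * x + b)
    return np.sum(residuals ** 2)

for name, (w, b) in candidates.items():
    print(name, round(squared_error(w, b), 2))

best = min(candidates, key=lambda k: squared_error(*candidates[k]))
print("line of best fit among candidates:", best)
```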

Assumptions of Linear Regression

There are 5 assumptions that are made about the data when developing a linear regression model.

  1. Linear Relationship
  2. Multivariate Normality
  3. No Multicollinearity
  4. No Auto-Correlation
  5. Homoscedasticity

Linear Relationship

By “data” here we mean the X and Y variables on which the model is to be built; they should have some kind of linear relationship, not a random one. A scatter plot of X against Y is a quick way to check this, as in the sketch below.
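
A minimal sketch of such a check on made-up data: the Pearson correlation coefficient is close to +1 or -1 when the relationship is strongly linear, and near 0 when it is random.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)

# Roughly linear data: y = 3x + 2 plus some noise
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, 100)

# |r| near 1 suggests a linear relationship; near 0 suggests none
r = np.corrcoef(x, y)[0, 1]
print("Pearson r =", round(r, 3))
```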

Multivariate Normality

The second assumption requires all variables to be multivariate normal, i.e., the data should be normally distributed. This can be checked using a histogram with a fitted normal curve, as sketched below.
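
A minimal sketch of that histogram check, using Matplotlib and SciPy on made-up data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=500)  # made-up variable

# Histogram of the data, normalised so it is comparable to a density
plt.hist(data, bins=30, density=True, alpha=0.6)

# Normal curve fitted to the sample mean and standard deviation
mu, sigma = data.mean(), data.std()
xs = np.linspace(data.min(), data.max(), 200)
plt.plot(xs, norm.pdf(xs, mu, sigma))
plt.title("Histogram with fitted normal curve")
plt.show()
```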

No Multicollinearity

When one variable is directly dependent on another, we say they are collinear; the strength of this dependence can vary, and with it the degree of collinearity. The assumption when building a linear model is that there is no, or very little, collinearity among the independent variables.
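
One simple way to check this is a correlation matrix of the predictors. Here is a sketch on made-up predictors, where x2 is deliberately built to be nearly collinear with x1:

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                        # independent of both

# Pairwise correlations between predictors; values near +/-1 flag collinearity
X = np.column_stack([x1, x2, x3])
print(np.corrcoef(X, rowvar=False).round(2))
```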

No Autocorrelation

Autocorrelation occurs when the residuals are not independent of each other, i.e., successive values of the series are related: say y(x+1) depends on y(x). To successfully build linear models one has to make sure there is no autocorrelation in the data.
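
A minimal sketch of a check for this, on a made-up residual series that is deliberately autocorrelated; the lag-1 autocorrelation should be near 0 for well-behaved residuals:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up residual series with a deliberate dependence on the previous value
e = np.zeros(200)
for t in range(1, 200):
    e[t] = 0.8 * e[t - 1] + rng.normal()

# Lag-1 autocorrelation; values near 0 suggest no autocorrelation
r1 = np.corrcoef(e[:-1], e[1:])[0, 1]
print("lag-1 autocorrelation =", round(r1, 3))
```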

Homoscedasticity

The last assumption is homoscedasticity, i.e., the error terms have the same variance along the regression line. In simple words, if the residuals have the same finite variance at every level of the input, the data has the property of homoscedasticity.

Data showing Homoscedasticity. Source: Wikipedia
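
A minimal sketch contrasting the two cases on simulated residuals; in a real check one would plot the model's residuals against its fitted values and look for a constant spread:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 200)

# Homoscedastic: residual spread is constant across x
res_equal = rng.normal(scale=1.0, size=200)
# Heteroscedastic: residual spread grows with x
res_growing = rng.normal(scale=0.2 * x + 0.1)

fig, axes = plt.subplots(1, 2, sharey=True)
axes[0].scatter(x, res_equal, s=8)
axes[0].set_title("Homoscedastic")
axes[1].scatter(x, res_growing, s=8)
axes[1].set_title("Heteroscedastic")
for ax in axes:
    ax.axhline(0.0)
    ax.set_xlabel("fitted value")
plt.show()
```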

In the coming post we will implement a linear model in TensorFlow, so keep checking this space or follow dcrucs.co .

To get in touch, follow me at @kaustubhn


Kaustubh N
SomX Labs

Tinkerer, Machine Learning, Technology and Passion. Figuring out life!