Intuition and Implementation of Linear Regression

Sachin D N · Published in Analytics Vidhya · Sep 9, 2020

Contents

  1. Geometric Intuition for Linear Regression
  2. Linear Regression using Loss-Minimization
  3. Assumptions of Linear Regression
  4. Implementation of Linear Regression using Python

What is Regression?

Regression analysis is a form of predictive modeling that investigates the relationship between a dependent variable and one or more independent variables.

Geometric Intuition for Linear Regression

Linear regression is perhaps one of the most well-known and well-understood algorithms in statistics and machine learning. Linear regression was developed in the field of statistics and is studied as a model for understanding the relationship between input and output numerical variables, but over time it has become an integral part of the modern machine learning toolbox.

Let’s consider the image below:

So, in the above image, X is the set of values corresponding to the living areas of various houses and y is the price of the respective houses; note that these prices are predicted by h. h is the function that maps the X values to y (often called a predictor). For historical reasons, this h is referred to as a hypothesis function. Keep in mind that this dataset has only one feature, the living area of the house, and consider it a toy dataset for the sake of understanding.

Linear Regression is all about finding a line (or) plane that fits the given data as well as possible.

y = mx + b, where m is the slope of the line and b is the y-intercept. This is the familiar equation of a line from algebra. But in statistics, the points do not lie perfectly on a line; the line models the trend around which the data lie, if a strong linear pattern exists.

What is the best fit?

The best fit is the line that minimizes the sum of the errors over all points in our training data.

Mathematical Formulation

The line seen in the graph is the relationship we are trying to capture, and we want to minimize the error of our model. The best-fit line passes as close as possible to the scatter points and reduces the error, which is the distance from each point to the line itself, as illustrated below.

The image below shows the actual and predicted values for the given points in the dataset.

When we compute the errors, some are positive and some are negative, so we square them before summing so that they do not cancel each other out.

Linear Regression is also called the Ordinary Least Squares (OLS) or Linear Least Squares method.

It is a linear model that minimizes the sum of squared errors.

The function of Linear Regression is given by

ŷ = w·x + b

where w is the vector of weights (one per feature) and b is the intercept. The final optimization problem is given by

minimize over w, b:   Σᵢ ( yᵢ − (w·xᵢ + b) )²

i.e., find the weights and intercept that minimize the sum of squared errors over all training points.

We can also use regularization. Regularization methods work by penalizing coefficients that take extremely large values and thereby help reduce overfitting. This not only improves the error on unseen data but also reduces model complexity. It is particularly useful when you are dealing with a dataset that has a large number of features and your baseline model cannot distinguish between the importance of those features.

There are two common variants of regularization for linear regression:

Lasso Regression: adds a penalty term equivalent to the sum of the absolute values of the coefficients (also called L1 regularization). The penalty term looks like: λ Σⱼ |wⱼ|

Ridge Regression: adds a penalty term equivalent to the sum of the squares of the coefficients (also called L2 regularization). The penalty term looks like: λ Σⱼ wⱼ²

λ is a constant factor that controls the strength of the penalty (the regularization strength): a larger λ shrinks the coefficients more aggressively, while λ = 0 recovers plain linear regression.
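As an illustration, here is a minimal sketch of both variants with scikit-learn (the toy data and the penalty strength alpha = 1.0 are arbitrary choices; in scikit-learn the parameter alpha plays the role of λ):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Toy data: 100 samples, 5 features, y depends mostly on the first two features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# alpha plays the role of lambda (the regularization strength)
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: lambda * sum(|w_j|)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: lambda * sum(w_j^2)

print("Lasso coefficients:", lasso.coef_)   # L1 tends to drive some weights to exactly 0
print("Ridge coefficients:", ridge.coef_)   # L2 shrinks weights but rarely zeroes them
```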

How to solve the above optimization problem?

First, write down the loss function we want to minimize. With the squared loss it is

L(W) = Σᵢ ( yᵢ − Wᵀxᵢ )²

The derivative of the above loss function with respect to each weight Wⱼ is given by

∂L/∂Wⱼ = −2 Σᵢ ( yᵢ − Wᵀxᵢ ) xᵢⱼ

Let’s use a search algorithm that starts with some “initial guess” for the weights W and iteratively changes W to make the loss L(W) smaller, until hopefully we converge to a value of W that minimizes L(W). Specifically, let’s consider the gradient descent algorithm, which starts with some initial weight W and repeatedly performs the update:

Wⱼ := Wⱼ − α · ∂L/∂Wⱼ

(This update is performed simultaneously for all j = 0, . . . , n.) Here, α is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of the steepest decrease of the loss. The term α effectively controls how large a step the algorithm takes toward decreasing the loss.

Put briefly, the procedure starts with random values for each coefficient. The sum of the squared errors is calculated over the input-output pairs, a learning rate is used as a scale factor, and the coefficients are updated in the direction that minimizes the error. The process is repeated until a minimum sum of squared errors is achieved or no further improvement is possible.

The term α (learning rate) is very important here since it determines the size of the improvement step to take on each iteration of the procedure.

Now there are commonly two variants of gradient descent:

  • The method that looks at every example in the entire training set on every step, called batch gradient descent (a minimal sketch is given after this list).
  • The method where you repeatedly run through the training set and, each time you encounter a training example, update the parameters according to the gradient of the error with respect to that single training example only. This algorithm is called stochastic gradient descent (also incremental gradient descent).
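To make the procedure concrete, here is a minimal NumPy sketch of batch gradient descent for linear regression (the toy data, learning rate, and iteration count are arbitrary choices for illustration):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Fit y ≈ X @ w + b by minimizing the mean squared error with batch gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        y_pred = X @ w + b
        error = y_pred - y
        # Gradients of the mean squared error with respect to w and b
        grad_w = (2.0 / n_samples) * (X.T @ error)
        grad_b = (2.0 / n_samples) * error.sum()
        # Step in the direction of steepest decrease, scaled by the learning rate
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

# Toy example: one feature, y = 4x + 3 plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 4.0 * X[:, 0] + 3.0 + rng.normal(scale=0.5, size=200)
w, b = batch_gradient_descent(X, y, alpha=0.02, n_iters=5000)
print("learned slope:", w[0], "learned intercept:", b)
```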

Linear Regression using Loss-Minimization

With the squared loss, errors on either side of the hyperplane are penalized equally, and the penalty grows as data points move further away from the hyperplane.

Using the squared loss in this loss-minimization framework gives us Linear Regression.

We need to be able to measure how good our model is (its accuracy). There are many ways to do this, but here we will use the Root Mean Squared Error (RMSE) and the coefficient of determination (R² score).
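As a small illustration, both metrics can be computed with scikit-learn (the arrays below are placeholder values for the sketch):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# y_test: true target values, y_pred: model predictions (placeholder arrays here)
y_test = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.3, 7.0, 10.4])

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # root mean squared error
r2 = r2_score(y_test, y_pred)                        # coefficient of determination

print(f"RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```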

The limitation of R-squared is that it will either stay the same or increase with the addition of more variables, even if the new variables have no relationship with the output variable.

To overcome this limitation, Adjusted R-square comes into the picture as it penalizes you for adding the variables which do not improve your existing model.

Adjusted R² conveys the same meaning as R² but improves on it. R² suffers from the problem that its score improves as more terms are added, even when the model is not actually improving, which can misguide the researcher. Adjusted R² is always lower than R², as it adjusts for the growing number of predictors and only increases when there is a real improvement.
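For reference, adjusted R² is computed from R², the number of samples n, and the number of predictors p as:

Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − p − 1)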

Hence, if you are building Linear regression on multiple variables, it is always suggested that you use Adjusted R-squared to judge the goodness of the model.

Assumptions of Linear Regression

Linear Regression mainly has five assumptions listed below.

  • Linear relationship
  • Multivariate normality
  • No or little multicollinearity
  • No auto-correlation
  • Homoscedasticity

Linear relationship: First, linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers since linear regression is sensitive to outlier effects. The linearity assumption can best be tested with scatter plots.

Multivariate normality: The linear regression analysis requires all variables to be multivariate normal. This assumption can best be checked with a histogram or a Q-Q-Plot. Normality can be checked with a goodness of fit test, e.g., the Kolmogorov-Smirnov test. When the data is not normally distributed a non-linear transformation (e.g., log-transformation) might fix this issue.
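As a sketch, both checks can be run with statsmodels and SciPy (the array below is placeholder data standing in for the variable or residuals being checked):

```python
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Placeholder data standing in for the residuals / variable being checked
resid = np.random.default_rng(0).normal(size=500)

# Q-Q plot against the 45-degree line: points should hug the line if the data is normal
sm.qqplot(resid, line="45", fit=True)
plt.show()

# Kolmogorov-Smirnov goodness-of-fit test against a standard normal
statistic, p_value = stats.kstest((resid - resid.mean()) / resid.std(), "norm")
print(f"KS statistic: {statistic:.3f}, p-value: {p_value:.3f}")
```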

No or little multicollinearity: linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other.

How to check?

Using the Variance Inflation Factor (VIF). But what is VIF?

VIF is a metric computed for every X variable that goes into a linear model. If the VIF of a variable is high, it means the information in that variable is already explained by the other X variables present in the model, i.e., the variable is largely redundant. So the lower the VIF (< 2), the better. The VIF for the i-th X variable is calculated as

VIF_i = 1 / (1 − R_i²)

where R_i² is the R² obtained by regressing that variable on all the other X variables.
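A minimal sketch of computing VIFs with statsmodels (the toy DataFrame below deliberately includes a nearly duplicated column, which should show up with a high VIF):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Toy predictor DataFrame; x3 is deliberately almost a copy of x1, so it should get a high VIF
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
})
X["x3"] = X["x1"] + rng.normal(scale=0.05, size=200)

X_const = add_constant(X)  # include an intercept column before computing VIFs
vif = pd.DataFrame({
    "feature": X_const.columns,
    "VIF": [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
})
print(vif)
```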

No auto-correlation: This is applicable especially for time series data. Autocorrelation is the correlation of a Time Series with lags of itself. When the residuals are autocorrelated, it means that the current value is dependent on the previous (historic) values and that there is a definite unexplained pattern in the Y variable that shows up in the disturbances.

Homoscedasticity: Linear regression assumes homoscedasticity, i.e., that the residuals have roughly constant variance across the regression line. A scatter plot of residuals versus fitted values is a good way to check whether the data are homoscedastic (meaning the residuals are spread evenly around the regression line).

The Goldfeld-Quandt test can also be used to test for heteroscedasticity. The test splits the data into two groups and checks whether the variances of the residuals are similar across the groups. If heteroscedasticity is present, a non-linear correction (such as a log transformation of the target) might fix the problem.
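A small sketch of the Goldfeld-Quandt test with statsmodels (the toy data below is generated with noise that grows with x, so the test should flag heteroscedasticity):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

# Toy data where the noise variance grows with x (heteroscedastic on purpose)
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, size=300))
y = 2.0 * x + 1.0 + rng.normal(scale=0.2 * (1 + x), size=300)

X = sm.add_constant(x)  # design matrix with an intercept column
f_stat, p_value, _ = het_goldfeldquandt(y, X)
print(f"F statistic: {f_stat:.2f}, p-value: {p_value:.4f}")  # small p-value suggests heteroscedasticity
```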

To know more about Linear Regression assumptions visit here.

Implementation of the Linear Regression using Python

Housing Case Study

Problem Statement: Consider a real estate company that has a dataset containing the prices of properties in the Delhi region. It wishes to use the data to optimize the sale prices of the properties based on important factors such as area, bedrooms, parking, etc.

Essentially, the company wants:

  • To identify the variables affecting house prices, e.g. area, number of rooms, bathrooms, etc.
  • To create a linear model that quantitatively relates house prices with variables such as the number of rooms, area, number of bathrooms, etc.
  • To know the accuracy of the model, i.e. how well these variables can predict house prices.

Data Preparation

  • You can see that the dataset has many columns whose values are ‘Yes’ or ‘No’.
  • We need to convert them to 1s and 0s, where 1 is a ‘Yes’ and 0 is a ‘No’, as sketched below.
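A minimal sketch of this step with pandas (the file name ‘Housing.csv’ and the column names in binary_cols are assumptions; adjust them to the actual dataset):

```python
import pandas as pd

# Load the housing data (file name assumed; adjust the path to your copy of the dataset)
housing = pd.read_csv("Housing.csv")

# Columns assumed to hold 'yes'/'no' values; replace with the binary columns in your data
binary_cols = ["mainroad", "guestroom", "basement", "hotwaterheating", "airconditioning", "prefarea"]

# Map 'yes' -> 1 and 'no' -> 0 in every binary column
housing[binary_cols] = housing[binary_cols].apply(
    lambda col: col.str.lower().map({"yes": 1, "no": 0})
)
```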

One Hot Encoding for Categorical variables
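If the dataset also has columns with more than two categories, they can be turned into dummy (indicator) variables. A sketch with pandas, continuing from the block above and assuming a categorical column named ‘furnishingstatus’:

```python
# Create dummy (indicator) variables for a multi-level categorical column,
# dropping the first level to avoid perfect multicollinearity
status = pd.get_dummies(housing["furnishingstatus"], drop_first=True)

# Attach the dummies and drop the original categorical column
housing = pd.concat([housing, status], axis=1).drop(columns=["furnishingstatus"])
```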

Data Normalization

Data normalization (or) data standardization should be done before building a Linear Regression model, so that features measured on very different scales (for example, area versus number of bedrooms) contribute comparably.
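One common way to do this is min-max scaling with scikit-learn. A sketch continuing the example above (the list of numeric columns is an assumption; in practice the scaler is usually fit on the training split only, to avoid leakage):

```python
from sklearn.preprocessing import MinMaxScaler

# Numeric columns assumed to need scaling; adjust to your dataset
num_cols = ["area", "bedrooms", "bathrooms", "stories", "parking", "price"]

scaler = MinMaxScaler()
housing[num_cols] = scaler.fit_transform(housing[num_cols])  # rescale each column to [0, 1]
```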

Data Splitting
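A typical 70/30 train-test split with scikit-learn, continuing the sketch (‘price’ is assumed to be the target column; the ratio and random seed are arbitrary):

```python
from sklearn.model_selection import train_test_split

# Separate the target from the predictors ('price' assumed to be the target column)
y = housing.pop("price")
X = housing

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=100
)
```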

Model Building
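Finally, a minimal sketch of fitting and evaluating an ordinary least squares model with scikit-learn, continuing the example above (the full notebook on GitHub may follow a different workflow):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fit ordinary least squares on the training split
lr = LinearRegression()
lr.fit(X_train, y_train)

# Evaluate on the held-out test split
y_pred = lr.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.4f}, Test R^2: {r2:.4f}")

# Inspect the learned coefficients
for name, coef in zip(X_train.columns, lr.coef_):
    print(f"{name}: {coef:.4f}")
```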

To see the full code, please visit my GitHub link.

I have also implemented Linear Regression on different datasets; to see that code as well, please visit my GitHub link.

References

  • Applied AI
  • Wikipedia
  • Coursera
  • Data Camp

Thanks for reading and your patience. I hope you liked the post, let me know if there are any errors in my post. Let’s discuss in the comments if you find anything wrong in the post or if you have anything to add…

Happy Learning!!
