Linear Regression

ROHITH RAMESH
Published in Analytics Vidhya
Nov 1, 2019 · 8 min read

What is Regression?

Regression analysis is a powerful statistical method that allows you to examine the relationship between two or more variables of interest. It is a reliable way of identifying which variables have an impact on a topic of interest. Performing a regression lets you confidently determine which factors matter most, which factors can be ignored, and how these factors influence the outcome.

Linear Regression

It is the study of linear, additive relationships between variables. The linear regression model has a dependent variable that is a continuous variable, while the independent variables can take any form (continuous, discrete, or indicator variables).

In order to understand regression analysis fully, it’s essential to comprehend the following terms:

  • Dependent Variable: This is the main factor that you’re trying to understand or predict.
  • Independent Variables: These are the factors that you hypothesize have an impact on your dependent variable.

Simple linear regression:

A simple linear regression model has only one independent variable, while a multiple linear regression model has two or more. Simple linear regression is useful for finding a linear (straight-line) relationship between two continuous variables, i.e. a cause-and-effect relationship.

The model takes the form Y = b0 + b1·X, and the values b0 and b1 must be chosen so that they minimize the error.

In Linear Regression,
Predictor -> X -> Features.
Predicted -> Y -> Target Variable.
Coefficients -> b0, b1.

Sum of squared errors:

The sum of squared errors is used as the metric, and the goal is to obtain the best line, i.e. the one that minimizes this error.

If we don’t square the errors, the positive and negative deviations will cancel each other out.

• The coefficient b1 is the rate at which a unit change in X influences the predicted Y.

The coefficients are calculated using the OLS (ordinary least squares) technique. OLS finds the straight line that best captures the relationship between the predicted (Y) and predictor (X) variables, i.e. the line that is as close to as many points as possible. For each point, calculate its vertical distance from the line; some of these distances are positive and some are negative, so square them and then sum them up. The line with the minimum sum is the best possible line.

• So the line with the least squared error is selected. This total is also known as the residual sum of squares (RSS); dividing it by the number of points gives the mean squared error (MSE).

Yi = training-set target value; Ŷi = (W·Xi + b) = target value predicted by the model. Summing the squared differences gives the residual sum of squares: RSS(W, b) = Σi (Yi − (W·Xi + b))².
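As a rough standalone illustration of this idea (the toy arrays x and y below are made up and unrelated to the Boston data used later), the closed-form least-squares estimates and the resulting residual sum of squares can be computed directly:

import numpy as np

# toy data: one predictor x and one target y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# closed-form OLS estimates for simple linear regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# residual sum of squares for the fitted line
y_hat = b0 + b1 * x
rss = np.sum((y - y_hat) ** 2)
print(b0, b1, rss)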

Python implementation

Let us take the Boston house prices dataset as an example:

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.data.shape)
boston.keys()

Data

pd.DataFrame(boston.data).head()

Target

pd.DataFrame(boston.target).head()

Feature names and Description

print(boston.feature_names)
print(boston.DESCR)

Creating a dataframe

boston_data = pd.DataFrame(boston.data)
boston_data.columns = boston.feature_names
boston_data['MEDV']=boston.target
print(boston_data.head())

Visualization

import matplotlib.pyplot as plt
plt.hist(boston.target)
plt.title('Boston Housing Prices and Count Histogram')
plt.xlabel('price ($1000s)')
plt.ylabel('count')
plt.show()

The code below creates a scatterplot of each of the feature names versus the price of Boston housing.

import seaborn as sns
sns.pairplot(boston_data, x_vars=['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT'], y_vars=["MEDV"])
#separating the dependent and independent variables
X = boston_data.drop('MEDV', axis = 1)
Y = boston_data['MEDV']

Least squares linear regression in scikit-learn:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state = 0)
Linreg = LinearRegression().fit(x_train, y_train)
print("Linear model intercept(b):{}".format(Linreg.intercept_))
print("Linear model coeff (w):{}".format(Linreg.coef_))
print("R-squared score(training): {:.3f}".format(Linreg.score(x_train,y_train)))
print("R-squared score(test): {:.3f}".format(Linreg.score(x_test,y_test)))

Y = W·X + b
W = Linreg.coef_ , b = Linreg.intercept_

Linreg.coef_ and Linreg.intercept_: the trailing underscore denotes a quantity derived from the training data, as opposed to a user setting.
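As a quick sanity check (this snippet is not part of the original walkthrough), the predictions can be reconstructed by hand from these two learned quantities:

# manual prediction X·w + b should match the model's own predict()
manual_pred = np.dot(x_test, Linreg.coef_) + Linreg.intercept_
print(np.allclose(manual_pred, Linreg.predict(x_test)))   # expected: True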

y_pred = Linreg.predict(x_test)
plt.scatter(y_test, y_pred)
plt.xlabel("Prices: $Y_i$")
plt.ylabel("Predicted prices: $\hat{Y}_i$")
plt.title("Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$")
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print(mse)

This MSE, together with the R-squared scores above, indicates that this isn’t a really great linear model.

Let us now try to improve the R-squared score with the help of regularization.

Variations in linear regression:

1. Ridge Linear Regression:

It uses the same least-squares criterion but adds a penalty on large values of the ‘w’ parameters (the feature weights), in order to avoid overfitting and overly complex models.
The penalty term is called regularization.
Ridge regression uses L2 regularization.

• A model with larger feature weights (w) adds more to the objective function’s overall value.
• Because our goal is to minimize the overall objective function, the regularization term acts as a penalty on models with many large feature weight values.
• Higher alpha means more regularization (a small sketch of the objective follows below).
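To make the penalty term concrete, here is a minimal hand-written sketch of the quantity Ridge regression minimizes (scikit-learn computes this internally; the function name and arguments here are just illustrative):

import numpy as np

# ridge objective: sum of squared errors plus alpha times the sum of squared weights
def ridge_objective(w, b, X, y, alpha):
    residuals = y - (np.dot(X, w) + b)                      # prediction errors
    return np.sum(residuals ** 2) + alpha * np.sum(w ** 2)  # RSS + L2 penalty

Larger feature weights, or a larger alpha, inflate the second term, which is exactly the penalty described above.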

The need for feature normalization:

When the input features have very different scales, then, when regularizing the coefficients, features on different scales make very different contributions to the L2 penalty.
Thus, all the input features have to be on the same scale.

Using a scaler object: the fit and transform methods

from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge
Scaler = MinMaxScaler()
X_train_Scaled = Scaler.fit_transform(x_train)
X_test_Scaled = Scaler.transform(x_test)
clf = Ridge().fit(X_train_Scaled, y_train)
R2_score = clf.score(X_test_Scaled, y_test)

Ridge regression with regularization parameter: alpha

from sklearn.linear_model import Ridge
print('Ridge regression: effect of alpha regularization parameter\n')
for this_alpha in [0, 1, 10, 20, 50, 100, 1000]:
    linridge = Ridge(alpha=this_alpha).fit(X_train_Scaled, y_train)
    r2_train = linridge.score(X_train_Scaled, y_train)
    r2_test = linridge.score(X_test_Scaled, y_test)
    num_coeff_bigger = np.sum(abs(linridge.coef_) > 1.0)
    print('Alpha = {:.2f}\nnum abs(coeff) > 1.0: {}, r-squared training: {:.2f}, r-squared test: {:.2f}\n'.format(this_alpha, num_coeff_bigger, r2_train, r2_test))

What is Ridge regression doing?

It regularizes the linear regression by imposing a sum-of-squares penalty on the size of the ‘w’ coefficients. So the effect of alpha is to shrink the ‘w’ coefficients towards zero and towards each other.

2. Lasso Regression:

It is another form of regularized linear regression that uses L1 regularization.
L1 penalty: minimize the sum of the absolute values of the coefficients.

This has the effect of setting the parameter weights in ‘w’ to exactly zero for the least influential variables. This is called a sparse solution, and it is a kind of feature selection.
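In the same spirit as the Ridge sketch above, the quantity Lasso minimizes can be written out by hand to make the L1 penalty explicit (illustrative only; note that scikit-learn's Lasso additionally scales the squared-error term by 1/(2·n_samples), which does not change the idea):

import numpy as np

# lasso objective: squared-error term plus alpha times the sum of absolute weights
def lasso_objective(w, b, X, y, alpha):
    residuals = y - (np.dot(X, w) + b)
    return np.sum(residuals ** 2) + alpha * np.sum(np.abs(w))  # RSS + L1 penalty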

from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler
Scaler = MinMaxScaler()
X_train_Scaled = Scaler.fit_transform(x_train)
X_test_Scaled = Scaler.transform(x_test)
linlasso = Lasso(alpha=0.0001, max_iter=1000000).fit(X_train_Scaled, y_train)
print('lasso regression linear model intercept: {}'.format(linlasso.intercept_))
print('lasso regression linear model coeff:\n{}'.format(linlasso.coef_))
print('Non-zero features: {}'.format(np.sum(linlasso.coef_ != 0)))
print('R-Squared score (training): {:.3f}'.format(linlasso.score(X_train_Scaled, y_train)))
print('R-Squared score (test): {:.3f}\n'.format(linlasso.score(X_test_Scaled, y_test)))

When to use Ridge vs Lasso regression?
• Many small/medium-sized effects: use Ridge.
• Only a few variables with medium/large effects: use Lasso.

3. Polynomial Features with Linear Regression:

• Generate new features consisting of all polynomial combinations of the original two features (X0, X1); see the short example after this list.
• The degree of the polynomial specifies how many variables participate at a time in each new feature (in our example, degree 2).
• This is still a weighted linear combination of features, so it is still a linear model, and we can use the same least-squares estimation method for the coefficients ‘w’ and ‘b’.
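As a quick illustration of what this expansion produces, here is a standalone toy example with two made-up features (separate from the Boston data used below):

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# a single sample with two original features X0 = 2 and X1 = 3
X_toy = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X_toy))
# [[1. 2. 3. 4. 6. 9.]]  ->  1, X0, X1, X0^2, X0*X1, X1^2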

Why should we transform our data?
• To capture interactions between the original features by adding them as features to the linear model.
• To make a classification problem easier.

For example (a theoretical one): housing prices may vary as a quadratic function of both the price of the land a house sits on and the amount of taxes paid on the property.
A simple linear model cannot capture this non-linear relationship, but by adding non-linear features, such as polynomial features, to the linear regression model, we can capture this non-linearity.

• Beware of polynomial feature expansion with higher degrees, as this can lead to complex models that overfit.
• Thus, polynomial feature expansion is often combined with a regularized learning method like ridge regression.

Now let us run all three variants and compare:

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
Poly = PolynomialFeatures(degree =2)
X_F1_poly = Poly.fit_transform(X)
x_poly_train, x_poly_test, y_train, y_test = train_test_split(X_F1_poly, Y, random_state = 0)
linreg = LinearRegression().fit(x_poly_train, y_train)
print('(poly deg 2) Linear model coeff(W):\n{}'.format(linreg.coef_))
print('(poly deg 2) Linear model intercept(b):{:.3f}'.format(linreg.intercept_))
print('(poly deg 2) R-Squared score (training):{:.3f}'.format(linreg.score(x_poly_train, y_train)))
print('(poly deg 2) R-Squared score (test):{:.3f}'.format(linreg.score(x_poly_test, y_test)))

Adding many polynomial features often leads to overfitting, so we often use polynomial features in combination with a regression method that has a regularization penalty, like ridge regression.

Ridge regression avoids overfitting:

Poly = PolynomialFeatures(degree =2)
X_F1_poly1 = Poly.fit_transform(X)
x_poly_train1, x_poly_test1, y_train, y_test = train_test_split(X_F1_poly1, Y, test_size=0.4, random_state = 0)
linrid = Ridge(alpha = 0.001).fit(x_poly_train1, y_train)
print('(poly deg 2) Linear model coeff(W):\n{}'.format(linrid.coef_))
print('(poly deg 2) Linear model intercept(b):{:.3f}'.format(linrid.intercept_))
print('(poly deg 2) R-Squared score (training):{:.3f}'.format(linrid.score(x_poly_train1, y_train)))
print('(poly deg 2) R-Squared score (test):{:.3f}'.format(linrid.score(x_poly_test1, y_test)))

As we can see from the above outputs, polynomial features with ridge regression have given the best possible result without overfitting.

1. The first regression just uses least-squares regression without polynomial feature transformation.
2. The second regression creates the PolynomialFeatures object with degree set to 2, calls its ‘fit_transform’ method, and then runs ordinary least-squares linear regression. We can see indications of overfitting on this expanded feature representation, as the model’s R-squared score on the training set is close to 1 but much lower on the test set.
3. In the third regression, the training and test scores are basically the same, and the test score of the regularized polynomial regression is the best of all three regressions.

There are other evaluation metrics as well, but in this article I have mainly concentrated on the R-squared score and MSE (mean squared error) to explain linear regression and its variations, their effect on accuracy, and how to avoid overfitting.

For the implementation code, kindly refer to https://github.com/rohithramesh1991/Linear-Regression

Thank you for reading…
