Ridge Regression with Multicollinearity in Python

Imam Muhajir · Published in Analytics Vidhya · Apr 27, 2019 · 8 min read

Definition of Ridge Regression

Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity. The particular kind of regularization it uses is known as L2 regularization: the penalty is the sum of the squares of the coefficients. In other words, L2 (ridge) regularization adds a penalty term to the model that is a function of the squares of the coefficients. Under this penalty, coefficients can approach zero but never become exactly zero.
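
Written out, ridge regression minimizes the residual sum of squares plus an α-weighted penalty on the squared coefficients. In LaTeX notation (a standard formulation, shown here for reference):

% Ridge objective: RSS plus the L2 penalty; alpha >= 0 controls the penalty strength
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \alpha \sum_{j=1}^{p} \beta_j^{2}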

Multicollinearity

Multicollinearity, or collinearity, is the existence of near-linear relationships among the independent variables.

Effects of Multicollinearity

Multicollinearity can create inaccurate estimates of the regression coefficients, inflate the standard errors of the regression coefficients, deflate the partial t-tests for the regression coefficients, give false, nonsignificant p-values, and degrade the predictability of the model (and that’s just for starters).

Sources of Multicollinearity

To deal with multicollinearity, you must be able to identify its source. The source of the multicollinearity impacts the analysis, the corrections, and the interpretation of the linear model. There are five sources (see Montgomery [1982] for details):

  1. Data collection. In this case, the data have been collected from a narrow subspace of the independent variables. The multicollinearity has been created by the sampling methodology; it does not exist in the population. Obtaining more data on an expanded range would cure this multicollinearity problem. The extreme example of this is when you try to fit a line to a single point.
  2. Physical constraints of the linear model or population. This source of multicollinearity will exist no matter what sampling technique is used. Many manufacturing or service processes have constraints on independent variables (as to their range), either physically, politically, or legally, which will create multicollinearity.
  3. Over-defined model. Here, there are more variables than observations. This situation should be avoided.
  4. Model choice or specification. This source of multicollinearity comes from using independent variables that are powers or interactions of an original set of variables. It should be noted that if the sampling subspace of independent variables is narrow, then any combination of those variables will increase the multicollinearity problem even further.
  5. Outliers. Extreme values or outliers in the X-space can cause multicollinearity as well as hide it. We call this outlier-induced multicollinearity. This should be corrected by removing the outliers before ridge regression is applied.

Detection of Multicollinearity

There are several methods of detecting multicollinearity. We mention a few.

  1. Begin by studying pairwise scatter plots of pairs of independent variables, looking for near-perfect relationships. Also glance at the correlation matrix for high correlations. Unfortunately, multicollinearity does not always show up when considering the variables two at a time.
  2. Consider the variance inflation factors (VIF). VIFs over 10 indicate collinear variables (a quick way to compute them is sketched after this list).
  3. Eigenvalues of the correlation matrix of the independent variables near zero indicate multicollinearity. Instead of looking at the numerical size of the eigenvalue, use the condition number. A large condition number indicates multicollinearity (see the sketch after this list).
  4. Investigate the signs of the regression coefficients. Variables whose regression coefficients are opposite in sign from what you would expect may indicate multicollinearity.
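
As a minimal sketch of points 2 and 3 (assuming statsmodels is installed and X is a pandas DataFrame of independent variables; the function names here are just illustrative):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    # VIF of each column; values above about 10 suggest collinearity
    values = X.values
    vifs = [variance_inflation_factor(values, i) for i in range(values.shape[1])]
    return pd.DataFrame({"feature": X.columns, "VIF": vifs})

def condition_number(X):
    # Condition number of the standardized predictors; large values indicate multicollinearity
    Z = (X - X.mean()) / X.std()
    return np.linalg.cond(Z.values)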

Correction for Multicollinearity

Depending on what the source of multicollinearity is, the solutions will vary. If the multicollinearity has been created by the data collection, collect additional data over a wider X-subspace. If the choice of the linear model has increased the multicollinearity, simplify the model by using variable selection techniques. If an observation or two has induced the multicollinearity, remove those observations. Above all, use care in selecting the variables at the outset. When these steps are not possible, you might try ridge regression.

Ridge Regression Models
Following the usual notation, suppose our regression equation is written in matrix form as

Y = XB + e

where Y is the dependent variable, X represents the independent variables, B is the vector of regression coefficients to be estimated, and e represents the errors or residuals.
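
For reference, ridge regression replaces the ordinary least squares estimate with a shrunken version obtained by adding α to the diagonal of X'X (a standard result, stated here without derivation):

% Ordinary least squares estimate
\hat{B}_{\text{OLS}} = (X^{\top} X)^{-1} X^{\top} Y
% Ridge estimate: adding \alpha I makes X^{\top} X + \alpha I better conditioned,
% which is why ridge regression tolerates multicollinearity
\hat{B}_{\text{ridge}} = (X^{\top} X + \alpha I)^{-1} X^{\top} Y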

Hands-on with Ridge Regression

import mglearn
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
ridge = Ridge().fit(X_train, y_train)  # default alpha=1.0
print("Training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge.score(X_test, y_test)))

Out:

Training set score: 0.89
Test set score: 0.75

Ridge is a more restricted model, so it is less likely to overfit. A less complex model means worse performance on the training set, but an overly complex model is also bad because it can overfit. How much importance the model places on simplicity versus training set performance can be specified by the user, using the alpha parameter. In the previous example, we used the default parameter alpha=1.0. There is no reason why this will give us the best trade-off, though.
The optimum setting of alpha depends on the particular dataset we are using.
Increasing alpha forces coefficients to move more toward zero, which decreases training set performance but might help generalization. For example:

ridge10 = Ridge(alpha=10).fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge10.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge10.score(X_test, y_test)))

Out:

Training set score: 0.79
Test set score: 0.64

Decreasing alpha allows the coefficients to be less restricted. For very small values of alpha, coefficients are barely restricted at all.

ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge01.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge01.score(X_test, y_test)))

Out:

Training set score: 0.93
Test set score: 0.77

We can get a more qualitative insight into how the alpha parameter changes the model by inspecting the coef_ attribute of models with different values of alpha. A higher alpha means a more restricted model, so we expect the entries of coef_ to have smaller magnitude for a high value of alpha than for a low value of alpha. For our comparison we also include LinearRegression. This is confirmed in the plot below:

from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(X_train, y_train)
import matplotlib.pyplot as plt
plt.plot(ridge.coef_, 's', label="Ridge alpha=1")
plt.plot(ridge10.coef_, '^', label="Ridge alpha=10")
plt.plot(ridge01.coef_, 'v', label="Ridge alpha=0.1")
plt.plot(lr.coef_, 'o', label="LinearRegression")
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.hlines(0, 0, len(lr.coef_))
plt.ylim(-25, 25)
plt.legend()

Out:

Figure 1. Comparing coefficient magnitudes for ridge regression with different values of alpha and linear regression

Here, the x-axis enumerates the entries of coef_: x=0 shows the coefficient associated with the first feature, x=1 the coefficient associated with the second feature, and so on up to x=100. The y-axis shows the numeric values of the corresponding coefficients. The main takeaway here is that for alpha=10, the coefficients are mostly between around –3 and 3. The coefficients for the Ridge model with alpha=1 are somewhat larger. The dots corresponding to alpha=0.1 have larger magnitude still, and many of the dots corresponding to linear regression without any regularization (which would be alpha=0) are so large they are outside of the chart.

Another way to understand the influence of regularization is to fix a value of alpha but vary the amount of training data available. For Figure 2, we subsampled the Boston Housing dataset and evaluated LinearRegression and Ridge(alpha=1) on subsets of increasing size (plots that show model performance as a function of dataset size are called learning curves):

mglearn.plots.plot_ridge_n_samples()
Figure 2. Learning curves for ridge regression and linear regression on the Boston Housing dataset

As one would expect, the training score is higher than the test score for all dataset sizes, for both ridge and linear regression. Because ridge is regularized, the training score of ridge is lower than the training score for linear regression across the board. However, the test score for ridge is better, particularly for small subsets of the data. For less than 400 data points, linear regression is not able to learn anything. As more and more data becomes available to the model, both models improve, and linear regression catches up with ridge in the end. The lesson here is that with enough training data, regularization becomes less important, and given enough data, ridge and linear regression will have the same performance (the fact that this happens here when using the full dataset is just by chance). Another interesting aspect of Figure 2 is the decrease in training performance for linear regression. If more data is added, it becomes harder for a model to overfit, or memorize the data.
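
If you want to reproduce a learning curve like Figure 2 without mglearn, a minimal sketch using scikit-learn's learning_curve on the X, y loaded above could look like this (the training sizes and cross-validation settings here are illustrative, not the exact ones mglearn uses):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import learning_curve

# Mean cross-validated score as a function of training set size
for model, name in [(Ridge(alpha=1), "Ridge alpha=1"), (LinearRegression(), "LinearRegression")]:
    sizes, train_scores, test_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5)
    plt.plot(sizes, train_scores.mean(axis=1), '--', label="train " + name)
    plt.plot(sizes, test_scores.mean(axis=1), label="test " + name)
plt.xlabel("Training set size")
plt.ylabel("R^2 score")
plt.legend()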

Coming back to the objective that ridge regression minimizes:

Objective = RSS + α * (sum of square of coefficients)

Here, α (alpha) is the parameter that balances the emphasis given to minimizing the RSS versus minimizing the sum of squares of the coefficients. α can take various values:

  1. α = 0:
  • The objective becomes the same as simple linear regression.
  • We’ll get the same coefficients as simple linear regression.
  2. α = ∞:
  • The coefficients will be zero. Why? Because of the infinite weight on the squares of the coefficients, any coefficient other than zero makes the objective infinite.
  3. 0 < α < ∞:
  • The magnitude of α decides the weight given to the different parts of the objective.
  • The coefficients will be somewhere between zero and the ones from simple linear regression.

I hope this gives some sense of how α impacts the magnitude of the coefficients. One thing is for sure: any non-zero value of α gives coefficients smaller in magnitude than those of simple linear regression. By how much depends on the dataset; we already saw ridge regression in action on the same problem above.
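
To see this effect numerically on the dataset used above, a short loop over a few alpha values (chosen here only for illustration) prints how the overall coefficient magnitude shrinks as alpha grows:

import numpy as np
from sklearn.linear_model import Ridge

# Larger alpha -> stronger shrinkage -> smaller sum of squared coefficients
for alpha in [0.01, 0.1, 1, 10, 100]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    print("alpha={:>6}: test R^2={:.2f}, sum of squared coefficients={:.1f}".format(
        alpha, model.score(X_test, y_test), np.sum(model.coef_ ** 2)))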


I hope this is useful. Don’t forget to applaud, and if anything is unclear, please leave a comment.

References

Andreas C. Müller and Sarah Guido. 2017. Introduction to Machine Learning with Python. O’Reilly Media.

NCSS Statistical Software. Chapter 335: Ridge Regression.
