Ridge and Lasso Regression: An illustration and explanation using Sklearn in Python

David Sotunbo
7 min read · Oct 17, 2019


Ridge and Lasso Regression

When looking into supervised machine learning in Python, the first point of contact is usually linear regression. A model is linear if it uses a linear function of the input features.

Eqn 1.1 (linear model): ŷ = w[0]·x[0] + w[1]·x[1] + … + w[p]·x[p] + b

Eqn 1.1 above is a sample linear model with p features. The w values are the learned weights (in the single-feature case w[0] is the slope) and b is the intercept. The aim of linear regression is to optimize w and b so that the cost function is minimal. The cost is given below:

Eqn 1.2 (cost function for linear regression): Cost(w, b) = Σ_{i=1..M} ( y_i − ŷ_i )²

In the equation above, we assume the data set under consideration has M instances and p features.
To use linear regression on a data set, we split it into a train set and a test set, then compute the score on each to check whether we are over-fitting or under-fitting. If we have very few features and both scores are poor, we are most likely under-fitting, i.e. the model is too simple to capture the signal. On the other hand, if we have a large number of features and the test score is noticeably worse than the train score, we are over-fitting, i.e. the model fails to generalize. This is where ridge and lasso come in:
Ridge and Lasso regression are some of the simple techniques to minimize model complexity and circumvent over-fitting which may result from simple linear regression.
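To make the train/test comparison above concrete, here is a minimal sketch on a small synthetic data set (not the data used later in this article); with few samples and many features, a large gap between the train and test scores is the tell-tale sign of over-fitting:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(50, 20)                   # few samples, many features
y = X[:, 0] + rng.normal(0, 0.5, 50)    # only the first feature actually matters

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
lr = LinearRegression().fit(X_train, y_train)

print("train score:", lr.score(X_train, y_train))
print("test score:", lr.score(X_test, y_test))   # a much lower test score indicates over-fitting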
Let us take a look at ridge regression first.

Ridge Regression: when using ridge regression, the cost function is altered by adding a penalty equivalent to the square of the magnitude of the coefficients:

Eqn 1.3 (ridge regression cost): Cost(w, b) = Σ_{i=1..M} ( y_i − ŷ_i )² + λ · Σ_j w[j]²

Ridge regression simply puts constraints on the coefficients (w). The term that penalizes the coefficients helps to regularize the optimization function.
In short, ridge regression tries to shrink the coefficients, which helps to reduce model complexity and the impact of multi-collinearity. Examining eqn 1.3 above, we observe that as λ → 0 the cost function reduces to the linear regression cost function of eqn 1.2.
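As a quick sanity check of that limit (a small addition, not part of the original walkthrough), fitting ridge with a tiny alpha on some synthetic data gives essentially the same coefficients as ordinary least squares:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

ols = LinearRegression().fit(X, y)
ridge_tiny = Ridge(alpha=1e-10).fit(X, y)   # alpha plays the role of λ in eqn 1.3

print(ols.coef_)
print(ridge_tiny.coef_)   # practically identical to the OLS coefficients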
Consider a simple scenario of modelling the cosine function. Below I simulate the cosine function with some noise; we will use this data for the rest of the analysis.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 14, 12
import random

Creating the data:

Here I take a cosine function of angles between 60 and 300 degrees. The corresponding cosine values are modified with some random noise. The aim is to get as close to the true cosine function as possible using regression models.


x = np.array([i*np.pi/180 for i in range(60,300,4)])
np.random.seed(10)  #Setting seed for reproducibility
y = np.cos(x) + np.random.normal(0, 0.15, len(x))
data = pd.DataFrame(np.column_stack([x, y]), columns=['x', 'y'])
plt.plot(data['x'], data['y'], '+')

When we fit a simple linear regression using the following code:

#Import Linear Regression model from scikit-learn.
from sklearn.linear_model import LinearRegression

def linear_regression(data, power, models_to_plot):
    #Initialize predictors:
    predictors = ['x']
    if power >= 2:
        predictors.extend(['x_%d' % i for i in range(2, power+1)])

    #Fit the model
    # note: normalize= was removed in recent scikit-learn releases; there, drop it
    # and scale the features yourself (e.g. with StandardScaler)
    linreg = LinearRegression(normalize=True)
    linreg.fit(data[predictors], data['y'])
    y_pred = linreg.predict(data[predictors])

    #Check if a plot is to be made for the entered power
    if power in models_to_plot:
        plt.subplot(models_to_plot[power])
        plt.tight_layout()
        plt.plot(data['x'], y_pred)
        plt.plot(data['x'], data['y'], '.')
        plt.title('Plot for power: %d' % power)

    #Return the result in pre-defined format: [rss, intercept, coefficients...]
    rss = sum((y_pred - data['y'])**2)
    ret = [rss]
    ret.extend([linreg.intercept_])
    ret.extend(linreg.coef_)
    return ret

It returns the RSS, the intercept and the fitted coefficients for the chosen power, and plots the fit for the requested powers.



Determining over-fitting: we can create polynomial features up to the 15th power and fit a model for each power.
#Create powers up to 15:
for i in range(2,16):
    colname = 'x_%d' % i
    data[colname] = data['x']**i
print(data.head())



#Initialize a dataframe to store the results:
col = ['rss','intercept'] + ['coef_x_%d' % i for i in range(1,16)]
ind = ['model_pow_%d' % i for i in range(1,16)]
coef_matrix_simple = pd.DataFrame(index=ind, columns=col)

#Define the powers for which a plot is required:
models_to_plot = {1:231, 3:232, 6:233, 9:234, 12:235, 15:236}

#Iterate through all powers and assimilate results
for i in range(1,16):
    coef_matrix_simple.iloc[i-1, 0:i+2] = linear_regression(data, power=i, models_to_plot=models_to_plot)

Fig 1.1: linear regression fits for powers 1, 3, 6, 9, 12 and 15.

#Set the display format to be scientific for ease of analysis
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_simple

NOTE: although the RSS keeps going down, the coefficients keep increasing in magnitude.
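One way to see this directly (an extra check, not part of the original code) is to print the largest coefficient magnitude for each model; it grows rapidly with the polynomial power:

coef_cols = ['coef_x_%d' % i for i in range(1, 16)]
print(coef_matrix_simple[coef_cols].astype(float).abs().max(axis=1))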

Now let us see what happens when we use ridge modelling:

RIDGE MODELLING SAMPLE

from sklearn.linear_model import Ridge

def ridge_regression(data, predictors, alpha, models_to_plot={}):
    #Fit the model
    # note: normalize= was removed in recent scikit-learn releases; there, drop it
    # and scale the features yourself instead
    ridgereg = Ridge(alpha=alpha, normalize=True)
    ridgereg.fit(data[predictors], data['y'])
    y_pred = ridgereg.predict(data[predictors])

    #Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data['x'], y_pred)
        plt.plot(data['x'], data['y'], '.')
        plt.title('Plot for alpha: %.3g' % alpha)

    #Return the result in pre-defined format: [rss, intercept, coefficients...]
    rss = sum((y_pred - data['y'])**2)
    ret = [rss]
    ret.extend([ridgereg.intercept_])
    ret.extend(ridgereg.coef_)
    return ret

Then we iterate over a range of alpha values:

predictors = ['x']
predictors.extend(['x_%d' % i for i in range(2,16)])

alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3, 1e-2, 1, 5, 10, 20]

col = ['rss','intercept'] + ['coef_x_%d' % i for i in range(1,16)]
ind = ['alpha_%.2g' % alpha_ridge[i] for i in range(0,10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)

models_to_plot = {1e-15:231, 1e-10:232, 1e-4:233, 1e-3:234, 1e-2:235, 5:236}

for i in range(10):
    coef_matrix_ridge.iloc[i, :] = ridge_regression(data, predictors, alpha_ridge[i], models_to_plot)

Fig 1.2: ridge regression fits for alpha = 1e-15, 1e-10, 1e-4, 1e-3, 1e-2 and 5.

#Set the display format to be scientific for ease of analysis
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_ridge

Finally, we check how many coefficients are exactly zero for each alpha:

coef_matrix_ridge.apply(lambda x: sum(x.values==0),axis=1)

alpha_1e-15     0
alpha_1e-10     0
alpha_1e-08     0
alpha_0.0001    0
alpha_0.001     0
alpha_0.01      0
alpha_1         0
alpha_5         0
alpha_10        0
alpha_20        0
dtype: int64

The smoothing effect of ridge regression is evident from the plots for increasing alpha and from the coefficient matrix, compared with plain linear regression: the coefficients shrink, but none of them become exactly zero.
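Repeating the magnitude check from before (again, an addition to the original code) makes the contrast explicit: the largest coefficient per row now shrinks as alpha grows instead of exploding:

coef_cols = ['coef_x_%d' % i for i in range(1, 16)]
print(coef_matrix_ridge[coef_cols].astype(float).abs().max(axis=1))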

Lasso Modeling:

We can express the cost function for Lasso (Least Absolute Shrinkage and Selection Operator) regression as:

Cost function for lasso regression: Cost(w, b) = Σ_{i=1..M} ( y_i − ŷ_i )² + λ · Σ_j |w[j]|

Just like the ridge regression cost function, when lambda is zero this reduces to eqn 1.2. The difference is that instead of squaring the coefficients, their absolute magnitudes are penalized. This type of regularization can drive some coefficients exactly to zero, which means some features are completely eliminated from the prediction. Lasso therefore also helps with feature selection, in addition to reducing over-fitting.
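Before moving to a larger data set, here is a minimal synthetic sketch of that zeroing behaviour (not part of the original example): only two of ten features carry signal, and lasso sets most of the remaining coefficients exactly to zero.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 100)   # only two informative features

lasso_demo = Lasso(alpha=0.1).fit(X, y)
print(lasso_demo.coef_)                                   # most entries are exactly 0.0
print("non-zero features:", np.sum(lasso_demo.coef_ != 0))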

To show this, we will use the breast cancer data set, which has 30 features, so that we can see how lasso helps with feature selection.

Fig 2.1: lasso regression and the dependence of feature selection on the value of the regularization parameter.

The code that produced this plot is given below:

import math
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# The difference between lasso and ridge regression is that with lasso some of the
# coefficients can be exactly zero, i.e. some of the features are completely neglected.
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
print(cancer.keys())

cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
#print(cancer_df.head(3))

X = cancer.data
Y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=31)

lasso = Lasso()
lasso.fit(X_train, y_train)
train_score = lasso.score(X_train, y_train)
test_score = lasso.score(X_test, y_test)
coeff_used = np.sum(lasso.coef_ != 0)
print("training score:", train_score)
print("test score: ", test_score)
print("number of features used: ", coeff_used)

lasso001 = Lasso(alpha=0.01, max_iter=1000000)
lasso001.fit(X_train, y_train)
train_score001 = lasso001.score(X_train, y_train)
test_score001 = lasso001.score(X_test, y_test)
coeff_used001 = np.sum(lasso001.coef_ != 0)
print("training score for alpha=0.01:", train_score001)
print("test score for alpha =0.01: ", test_score001)
print("number of features used: for alpha =0.01:", coeff_used001)

lasso00001 = Lasso(alpha=0.0001, max_iter=1000000)
lasso00001.fit(X_train, y_train)
train_score00001 = lasso00001.score(X_train, y_train)
test_score00001 = lasso00001.score(X_test, y_test)
coeff_used00001 = np.sum(lasso00001.coef_ != 0)
print("training score for alpha=0.0001:", train_score00001)
print("test score for alpha =0.0001: ", test_score00001)
print("number of features used: for alpha =0.0001:", coeff_used00001)

lr = LinearRegression()
lr.fit(X_train, y_train)
lr_train_score = lr.score(X_train, y_train)
lr_test_score = lr.score(X_test, y_test)
print("LR training score:", lr_train_score)
print("LR test score: ", lr_test_score)

plt.subplot(1,2,1)
plt.plot(lasso.coef_, alpha=0.7, linestyle='none', marker='*', markersize=5, color='red', label=r'Lasso; $\alpha = 1$', zorder=7)  # alpha here is the transparency, not the regularization strength
plt.plot(lasso001.coef_, alpha=0.5, linestyle='none', marker='d', markersize=6, color='blue', label=r'Lasso; $\alpha = 0.01$')
plt.xlabel('Coefficient Index', fontsize=16)
plt.ylabel('Coefficient Magnitude', fontsize=16)
plt.legend(fontsize=13, loc=4)

plt.subplot(1,2,2)
plt.plot(lasso.coef_, alpha=0.7, linestyle='none', marker='*', markersize=5, color='red', label=r'Lasso; $\alpha = 1$', zorder=7)
plt.plot(lasso001.coef_, alpha=0.5, linestyle='none', marker='d', markersize=6, color='blue', label=r'Lasso; $\alpha = 0.01$')
plt.plot(lasso00001.coef_, alpha=0.8, linestyle='none', marker='v', markersize=6, color='black', label=r'Lasso; $\alpha = 0.0001$')
plt.plot(lr.coef_, alpha=0.7, linestyle='none', marker='o', markersize=5, color='green', label='Linear Regression', zorder=2)
plt.xlabel('Coefficient Index', fontsize=16)
plt.ylabel('Coefficient Magnitude', fontsize=16)
plt.legend(fontsize=13, loc=4)

plt.tight_layout()
plt.show()

OUTPUT:

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
training score: 0.5600974529893081
test score: 0.5832244618818156
number of features used: 4
training score for alpha=0.01: 0.7037865778498829
test score for alpha =0.01: 0.6641831577726227
number of features used: for alpha =0.01: 10
training score for alpha=0.0001: 0.7754092006936697
test score for alpha =0.0001: 0.7318608210757911
number of features used: for alpha =0.0001: 22
LR training score: 0.7842206194055069
LR test score: 0.7329325010888683

Looking at the code and output above, we can see the following:

1. The default value of the regularization parameter alpha in Lasso regression is 1.

2. With that default, only 4 of the 30 features end up with non-zero coefficients.

3. Both the training and test scores are low, so the model is under-fitting.

4. We can reduce this under-fitting by reducing alpha and increasing the number of iterations. With alpha = 0.01, the number of non-zero features rises to 10 and the scores increase; reducing alpha further gives even better results (see the sketch below).
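As a sketch of that last point (an illustration only; the exact numbers are not reported here), one could try an intermediate alpha while keeping the iteration budget high:

lasso0001 = Lasso(alpha=0.001, max_iter=1000000)
lasso0001.fit(X_train, y_train)
print("training score for alpha=0.001:", lasso0001.score(X_train, y_train))
print("test score for alpha=0.001:", lasso0001.score(X_test, y_test))
print("number of features used for alpha=0.001:", np.sum(lasso0001.coef_ != 0))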

That concludes our introduction to ridge and lasso regression.

Find the complete code on my GitHub: https://github.com/sotunboolamide/Ridge-and-Lasso-Regression

Hope the tutorial was helpful.
