Just Don’t Overfit

Overfitting is a major problem for most ML enthusiasts, and this article will walk you through some techniques to get rid of it.

Brij Patel
Analytics Vidhya
6 min read · Aug 20, 2019


I am back with an interesting dataset that I recently encountered on Kaggle. You can find the dataset here.

Before giving any hints about the dataset, just have a look at the proportion of the training and testing sets.

The testing data is 79 times the size of the training data.

Looking at the proportion of training to testing data, we can clearly foresee that our model is going to have an overfitting problem. This can be handled by adding some bias to our model, or rather by using LassoCV (don’t get all tensed up if you don’t know what that means, I will make it all a piece of cake!). So what are we waiting for? Let’s begin!

Import Relevant Libraries
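Something along these lines will do; beyond pandas and scikit-learn’s LassoCV and StandardScaler, the exact list is my assumption:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LassoCV        # the star of this article
    from sklearn.preprocessing import StandardScaler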

You guessed right, we will be using LassoCV to increase bias and reduce variance. Even if that means nothing to you yet, don’t worry; just trust me, and by the end of this article you will have a good insight into Lasso regression and how it helps in reducing overfitting.

Let’s load our guns and aim at the target!
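A minimal sketch of the loading step, assuming the competition files are saved locally as train.csv and test.csv:

    train = pd.read_csv('train.csv')   # assumed file names from the Kaggle page
    test = pd.read_csv('test.csv')

    print(train.shape, test.shape)     # the test set has ~79x the training rows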

There are 300 independent columns.

So, after splitting the data into independent and dependent variables, we get something like this:
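As a sketch, assuming the competition’s column names (‘id’ for the row index, ‘target’ for the label, plus 300 numbered feature columns):

    X = train.drop(['id', 'target'], axis=1)   # the 300 independent columns
    y = train['target']                        # the dependent variable
    X_test = test.drop('id', axis=1)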

We are all set for further processing, but what should we do next? For now, let’s start with the basics. What we can do is normalize the data. But hold on a moment: we first need to understand how to do normalization, and more importantly, why.

Need for Normalization / Standardization:

Normalization vs. Standardization

  • Standard Scaler: It transforms the data so that it has a mean of 0 and a standard deviation of 1. In short, it standardizes the data. Standardization handles data with negative values naturally. Note that it only re-centers and rescales the data; it does not make the distribution normal. It is often considered more useful in classification than in regression.
  • Normalizer (min-max scaling): It squeezes the data between 0 and 1. It performs normalization. Due to the decreased range and magnitude, the gradients in the training process do not explode and you do not get very high loss values. It is often considered more useful in regression than in classification. A small demo of both follows this list.
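Here is a quick demo of both on a toy ‘age’ column. In scikit-learn terms, the first bullet is StandardScaler and the second corresponds to MinMaxScaler:

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    data = np.array([[18.0], [20.0], [25.0]])   # a toy 'age' column

    print(StandardScaler().fit_transform(data).ravel())
    # -> roughly [-1.02, -0.34, 1.36]: mean 0, standard deviation 1

    print(MinMaxScaler().fit_transform(data).ravel())
    # -> [0.0, 0.286, 1.0]: squeezed between 0 and 1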

Why?

What happens in a real scenario is that different columns may have different units. For example, there can be a column ‘age’ with values like 20, 25, or 18.

Likewise, there can be another column ‘salary’ with values like 20,00,000 or 50,00,000 (hardworking guys!). If we try to relate these two columns directly, the huge difference in scale makes the problem very inefficient to solve; converting both columns to the same scale via standardization/normalization takes care of this.

How?

The formula for the standard scaler is:

    z = (x − μ) / σ

where μ is the mean of the column and σ is its standard deviation.

Back to Code:
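A sketch of the scaling step, reusing the X and X_test frames from above:

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)       # fit the statistics on training data only
    test_scaled = scaler.transform(X_test)   # reuse the same mean and std for the test set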

Now we have standardized our data with StandardScaler. Moving on to the next phase, which is to apply LassoCV (Least Absolute Shrinkage and Selection Operator with Cross-Validation).

LassoCV (Least Absolute Shrinkage and Selection Operator Cross-Validation)

What is this term LassoCV? To understand it, let’s split it into two parts: LASSO and CV.

Let’s understand the first part, LASSO (Least Absolute Shrinkage and Selection Operator).

Let’s say we have these two points in our training set and we need to fit a line with minimum residual. It is obvious that we would fit the line shown below:

Now if we check our model on the testing set, we get something like this:

We can clearly see from the results that the residual is much higher than it was on the training dataset.

This means we have highly overfitted the dataset, resulting in high variance and low bias. This sample scenario is very similar to our dataset, since the training data is very small compared to the testing data.

To solve this problem, LASSO regression is used, as its major functionality is to include some bias in the prediction, which is helpful for predictions in the long run.

LASSO regression adds a penalty term, represented by the Greek letter lambda (λ) multiplied by the absolute value of the slope of the line, so the quantity being minimized becomes:

    sum of squared residuals + λ × |slope|

This increases the bias a little but decreases the variance, so the residuals on unseen data end up smaller.

If you still have some doubts, here is a nice tutorial.

Now that we have understood LASSO, let’s understand what we mean by cross-validation.

Cross Validation

One important parameter of the Lasso function is the value of alpha. Keeping cross-validation aside, the math behind calculating alpha is pretty interesting, but practically, what you need to know is this: the higher the alpha, the more feature coefficients are shrunk to zero, and the lower the alpha, the more lasso regression behaves like plain linear regression.

Cross-validation is used in LASSO to evaluate the model for multiple values of alpha, via a function called LassoCV.

Back to Code:
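A sketch of that step. The exact alpha grid is an assumption on my part; only the 0.05 and 75 endpoints come from the run described below:

    # Hypothetical grid running from 0.05 up to 75.
    alphas = [0.05, 0.1, 0.5, 1, 5, 10, 25, 50, 75]

    model = LassoCV(alphas=alphas, cv=5)
    model.fit(X_scaled, y)

    print(model.alpha_)                       # the alpha chosen by cross-validation
    lasso_preds = model.predict(test_scaled)  # predictions from the best model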

As seen in the code, we take a list of alphas from 0.05 to 75, and the predictions from the best alpha are stored in lasso_preds.

Finally, we are at the last phase of our model, where we need to predict the output and submit it.

Loading the sample submission file (assuming the standard sample_submission.csv that Kaggle provides):
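    submission = pd.read_csv('sample_submission.csv')   # assumed file name
    print(submission.head())                            # columns: 'id' and 'target'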

Next we overwrite the target column with our predictions and write the file out:
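    submission['target'] = lasso_preds
    submission.to_csv('submission.csv', index=False)    # assumed output file name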

Fingers crossed! Let’s submit this CSV file and see our score.

0.843 (Not bad!)

Well, that’s all for now. Feel free to share any suggestions; I am always open to learning something new, and comments would be highly appreciated!
