Don’t overfit! Why we need to select factors for data science.

We went through multivariate regression previously here.

When deciding which features to include in a regression (or in any data science model), it is not a good idea to throw the kitchen sink at the problem by including every possible feature.

Adding more and more factors into your regression may make the model’s performance seem better and better on the data it is trained on.

In fact, you could potentially get close to perfect performance if you throw enough factors into a regression model. The model you get would simply trace each and every data point used to train it.

However, such an approach is generally a bad idea. Aside from the issues of collecting more data and making sure the sample size is adequate, the more important issue is overfitting.

Overfitting happens when a model fits the development or training data so well that it performs very badly once we give it any data (from the real world) that differs from the training set.

For this post, I use data from Kaggle. The dataset is a survey of young people on a broad range of interests and preferences, obtained from here.

It’s a fairly large dataset, with around 150 features/columns. I shall see if one can predict the respondent’s happiness in life (‘Happiness in life’) based on some other answers -

  • Whether he/she enjoys music (‘Music’)
  • Whether he/she likes to watch movies (‘Movies’)
  • Whether he/she likes socialising (‘Fun with friends’)
  • Whether he/she fears some things (‘Flying’, ‘Storm’, ‘Darkness’, ‘Heights’, ‘Spiders’, ‘Snakes’, ‘Rats’)
  • How aware he or she is of daily events (‘Daily events’)
  • Whether he or she prefers money or friends (‘Friends versus money’)

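Before extracting anything, the responses need to be loaded. The snippet below is a minimal sketch of how this might look (it assumes the Kaggle CSV is saved locally as 'responses.csv'; the notebook linked at the end has the exact steps). Rows with missing answers for the chosen columns are dropped.

import pandas as pd

# Load the survey responses (assuming the Kaggle CSV is saved locally as 'responses.csv')
young_responses = pd.read_csv('responses.csv')

# Keep only rows with answers for the columns used below
cols = ['Music', 'Movies', 'Fun with friends', 'Flying', 'Storm', 'Darkness',
        'Heights', 'Spiders', 'Snakes', 'Rats', 'Daily events',
        'Friends versus money', 'Happiness in life']
young_responses = young_responses.dropna(subset=cols)
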
We first extract these features or variables.

# Features used to predict happiness
independent_var = young_responses[['Music', 'Movies', 'Fun with friends',
'Flying', 'Storm', 'Darkness', 'Heights', 'Spiders', 'Snakes', 'Rats',
'Daily events', 'Friends versus money']]
# Target: self-reported happiness in life
dependent_var = young_responses[['Happiness in life']]

X = independent_var.values
y = dependent_var.values

Next we perform a linear regression.

from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=3)

lm = LinearRegression()
lm.fit(X_train,y_train)

print ('Train (cases, features) = %s' % str(X_train.shape))
print ('Test (cases, features) = %s' % str(X_test.shape))
print ('In-sample mean squared error %0.3f' % mean_squared_error(y_train,lm.predict(X_train)))
print ('Out-sample mean squared error %0.3f' % mean_squared_error(y_test,lm.predict(X_test)))
Out:
Train (cases, features) = (471, 12)
Test (cases, features) = (203, 12)
In-sample mean squared error 0.617
Out-sample mean squared error 0.631

To illustrate what we mentioned earlier about overfitting, let’s do the same with 2nd-order and 3rd-order polynomial features.

from sklearn.preprocessing import PolynomialFeatures
second_order=PolynomialFeatures(degree=2, interaction_only=False)
third_order=PolynomialFeatures(degree=3, interaction_only=True)

lm.fit(second_order.fit_transform(X_train), y_train)
print ('(cases, features) = %s' % str(second_order.fit_transform(X_train).shape))
print ('In-sample mean squared error %0.3f' % mean_squared_error(y_train, lm.predict(second_order.fit_transform(X_train))))
print ('Out-sample mean squared error %0.3f' % mean_squared_error(y_test, lm.predict(second_order.fit_transform(X_test))))
Out:
In-sample mean squared error 0.522
Out-sample mean squared error 0.724

lm.fit(third_order.fit_transform(X_train), y_train)
print ('(cases, features) = %s' % str(third_order.fit_transform(X_train).shape))
print ('In-sample mean squared error %0.3f' % mean_squared_error(y_train, lm.predict(third_order.fit_transform(X_train))))
print ('Out-sample mean squared error %0.3f' % mean_squared_error(y_test, lm.predict(third_order.fit_transform(X_test))))
Out:
In-sample mean squared error 0.217
Out-sample mean squared error 20.591

You can see that even as in-sample performance improves, out-of-sample performance drops dramatically.

The obvious way to solve the issue of overfitting is to start getting rid of extraneous variables. One quick way of doing this is to eliminate factors that are correlated with each other. We can simply examine a correlation matrix to do this.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline

# Reference http://stackoverflow.com/questions/14391959/heatmap-in-matplotlib-with-pcolor
# http://matplotlib.org/api/axes_api.html
def corr_matrix_plot(data, threshold=0):
    # Pearson correlation coefficients; rowvar=0 means the columns are the features
    R = np.corrcoef(data, rowvar=0)
    # Zero out correlations below the threshold so only the stronger ones show up
    R[np.where(np.abs(R) < threshold)] = 0.0
    heatmap = plt.pcolor(R, cmap=mpl.cm.coolwarm, alpha=0.8)
    # Do not draw the axes rectangle patch
    heatmap.axes.set_frame_on(False)
    plt.xticks(rotation=90)
    plt.tick_params(axis='both', which='both', bottom=False, top=False, left=False, right=False)
    plt.colorbar()
    plt.show()

corr_matrix_plot(X_train, threshold=0.0)
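
As a rough illustration (not in the original notebook), the same correlations can also be inspected numerically, flagging pairs of features whose absolute correlation exceeds a chosen cut-off so that one of each pair can be considered for removal. The 0.7 threshold below is just an example.

import pandas as pd

# Flag pairs of features with absolute correlation above the threshold
corr = pd.DataFrame(X_train, columns=independent_var.columns).corr().abs()
threshold = 0.7  # an illustrative cut-off, not a rule
for i in range(len(corr.columns)):
    for j in range(i + 1, len(corr.columns)):
        if corr.iloc[i, j] > threshold:
            print('%s and %s: %0.2f' % (corr.columns[i], corr.columns[j], corr.iloc[i, j]))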

Factor Selection

Other than eliminating variables with high correlation (and, by extension, high collinearity), the simplest approach is to step through each feature and use statistical tests to determine whether it should be included.

Scikit-learn offers three such tests:

  • The f_regression function, which computes an F-test (a statistical test for comparing different regression solutions) and a p-value (the probability of observing such a difference by chance) for each feature, revealing the best features for a regression (see the sketch after this list)
  • The f_classif function, an ANOVA F-test (a statistical test for comparing differences among classes), a related method that is useful for classification problems
  • The chi2 function, a chi-squared test (a statistical test on count data), a good choice when your problem is classification and your features are counts or binary values (in every case, non-negative numbers such as units sold or money earned)
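
As a quick illustration of the f_regression route (the post itself uses f_classif below), a sketch that scores each feature against the numeric target might look like this:

from sklearn.feature_selection import f_regression

# Score each feature against the target with an F-test and rank by F-statistic
F_scores, p_values = f_regression(X_train, y_train.ravel())
for name, F, p in sorted(zip(independent_var.columns, F_scores, p_values), key=lambda t: -t[1]):
    print('%s: F=%0.2f, p=%0.4f' % (name, F, p))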

It’s actually quite straightforward. We can use the f_classif test to score the features and keep the most statistically significant 50% of them.

# using the f_classif test to keep the top 50% of features
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile=50)
selector.fit(X_train, y_train.ravel())  # f_classif expects a 1-D target
variable_filter = selector.get_support()

We can plot a histogram to see the distribution of the scores (scores on the x axis; the y axis shows the number of features with those scores).
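The figure itself is not reproduced here, but a minimal sketch of that histogram would be:

import matplotlib.pyplot as plt

# Distribution of the f_classif scores across the 12 features
plt.hist(selector.scores_, bins=10)
plt.xlabel('f_classif score')
plt.ylabel('Number of features')
plt.show()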

We can further find the top ones by a simple filter.

# picking the top features
variable_filter = selector.scores_ > 5
print ("Number of filtered variables: %i" % np.sum(variable_filter))
from sklearn.preprocessing import PolynomialFeatures
interactions = PolynomialFeatures(degree=2, interaction_only=True)
Xs = interactions.fit_transform(X_train[:,variable_filter])
print ("Number of variables and interactions: %i" % Xs.shape[1])

The full set of code is in the notebook here.
