Machine Learning — Introduction to Feature Selection and Backward Elimination

Mayank Shah
7 min read · Dec 30, 2018


“Focus on things that matter the most.” — Paula Carlson

The above quote forms the basis of this tutorial, and by the end of it you will understand how it applies to Machine Learning as well.

In this tutorial, we’re going to learn the importance of feature selection in Machine Learning. We’re going to understand one of the most widely used feature selection methods — Backward Elimination. To keep it simple, we shall use Multiple Linear Regression and understand how we can optimise its performance with Backward Elimination.

Before we get started, let's first try to understand why feature selection is crucial. As a beginner, I always thought Machine Learning was all about the algorithms and the mathematics behind them. Sure, that is where the heart and soul of Machine Learning lies, but when it comes to building applications for real-world problems, it is just as important to focus on the data we feed the algorithm. Of course, no one wishes to eat what they don't like! Similarly, Machine Learning algorithms work on one simple rule: Garbage In, Garbage Out. 'Garbage in' refers to noisy, weakly correlated data, and 'garbage out' refers to the poor performance of the algorithm.

It's a misconception that the leaders of most Kaggle competitions achieve their results simply because of the computational power of their machines and their choice of algorithm. The real secret behind their victories is usually Feature Selection and Feature Creation. Carefully selecting (and creating) features that capture great insights requires a lot of patience and practice.

Often, real-world data is noisy and may contain features (variables) that do not have a strong correlation with the output (predicted/dependent) variable. The idea behind 'Feature Selection' is to study this relationship and select only the variables that show a strong correlation.
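As a rough illustration (a sketch only; it assumes your data is already loaded into a pandas DataFrame named data, as we do with the 50_Startups data later in this post), pandas gives a quick first look at these relationships:

# Pairwise correlation between the numeric columns, including the target
print(data.select_dtypes('number').corr())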

There are quite a few advantages to this:

  1. Faster training time
  2. Decrease in complexity
  3. Improved accuracy and performance
  4. And the most important one — reduction in ‘over-fitting’

There are many different kinds of Feature Selection methods: Forward Selection, Recursive Feature Elimination, Bidirectional Elimination and Backward Elimination. The simplest and most widely used one is Backward Elimination.

Before we dive into Backward Elimination, let's first understand two terms: statistical hypotheses and the P-value.

Let's assume you have a coin. You initially claim that the coin is unbiased, meaning the chance of getting heads or tails is 50%. On your first toss, you get heads. This does not change your assumption that the coin is unbiased. You toss the coin once again, and you get heads. You still do not have enough evidence to question your assumption. However, on each of the next 8 tosses you get heads as well. So, out of 10 tosses, all 10 were heads. Now you start getting sceptical: the coin may be biased. To check whether the coin is biased, you perform a hypothesis test.

A P-value helps determine whether a hypothesis should be accepted or rejected.

A similar idea can be applied in Backward Elimination: does a feature significantly impact the output or not?
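To make the coin example concrete, here is a rough back-of-the-envelope version of that test (a sketch only; the exact test statistic you use can vary):

# Null hypothesis: the coin is unbiased, i.e. P(heads) = 0.5
# Observation: 10 heads out of 10 tosses
p_all_heads = 0.5 ** 10    # probability of this outcome under the null, roughly 0.00098
p_value = 2 * p_all_heads  # two-sided P-value, roughly 0.002
print(p_value < 0.05)      # True, so at the 5% level we reject the claim that the coin is unbiased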

I will not be explaining the P-value and hypothesis testing in depth here, but you can read more about them at the following links:

  1. http://www.mathbootcamps.com/what-is-a-p-value/
  2. http://www.wikihow.com/Calculate-P-Value
  3. https://stattrek.com/hypothesis-test/hypothesis-testing.aspx

I highly recommend reading these, as understanding statistical hypotheses and the P-value is essential for fully understanding Backward Elimination.

Once we have a basic understanding of the above, we can jump straight into Backward Elimination. Typically, it can be performed in just 5 simple steps (a generic code sketch of this loop follows the list):

  1. Select a significance level, say 5% (0.05)
  2. Fit a model with all features (variables)
  3. Consider the feature with the highest P-Value. If its P-value is greater than significance level (P > SL), go to step 4. Else, your model is ready.
  4. Eliminate this feature (variable).
  5. Fit a model with the new set of features, and go to step 3.
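Expressed as code, the whole procedure is a short loop. The sketch below is my own generic version (the name backward_elimination is not from any library); it assumes the feature matrix X and target y are NumPy arrays, and relies on the .pvalues attribute that a fitted statsmodels OLS result exposes:

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    # Start with every column of X and drop the weakest one per iteration
    features = list(range(X.shape[1]))
    while features:
        model = sm.OLS(y, X[:, features]).fit()
        if model.pvalues.max() > significance_level:
            # Remove the feature whose coefficient has the highest P-value
            features.pop(int(np.argmax(model.pvalues)))
        else:
            break
    return features, model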

Now, to clearly understand this algorithm, let's look at an implementation. Before we begin, make sure you have the following Python libraries installed:

  1. pandas
  2. NumPy
  3. scikit-learn
  4. statsmodels
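If any of these are missing, they can usually be installed with pip (the exact command may differ on your setup):

pip install pandas numpy scikit-learn statsmodels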

The data set I am using can be found here — https://www.kaggle.com/karthickveerakumar/startup-logistic-regression#50_Startups.csv

Let’s begin!

First, let’s import our data:

import pandas as pd

data = pd.read_csv('50_Startups.csv')
print(data.head())

Our data has 4 features: 'R&D Spend', 'Administration', 'Marketing Spend' and 'State'. Given these, we have to predict 'Profit'.

Now, we store the dataframe in 2 numpy arrays — X and y:

X = data.iloc[:, :-1].values   # all columns except the last one
y = data.iloc[:, -1].values    # the last column, 'Profit'

The very first step for training any machine learning algorithm is data preprocessing.

To begin with, we One-hot encode ‘State’, as it is a categorical feature.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Convert the 'State' strings to integer labels, then one-hot encode them.
# The encoded columns are placed at the front of X, ahead of the numeric features.
labelencoder = LabelEncoder()
X[:, -1] = labelencoder.fit_transform(X[:, -1])
onehotencoder = OneHotEncoder(categorical_features = [-1])
X = onehotencoder.fit_transform(X).toarray()
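Note that the categorical_features argument only exists in older versions of scikit-learn; it was removed in later releases. If you are on a newer version, a roughly equivalent approach (my adaptation, not the original code) uses ColumnTransformer, and as before the encoded 'State' columns end up in front of the numeric ones:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 3 ('State') and pass the numeric columns through unchanged
ct = ColumnTransformer([('state', OneHotEncoder(), [3])],
                       remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X)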

Now that the categorical variable 'State' has been one-hot encoded, we scale 'R&D Spend', 'Administration' and 'Marketing Spend' with sklearn's StandardScaler. But before that, we split the data into training and testing sets. We fit the scaler on the training data and transform the testing data with the same scaler.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Now, we perform the scaling:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Columns 0-2 are the one-hot encoded 'State' dummies; columns 3 onwards are the numeric features
X_train[:,3:] = scaler.fit_transform(X_train[:,3:])
X_test[:,3:] = scaler.transform(X_test[:,3:])

Now that all our data has been preprocessed, let’s begin with Backward Elimination. Let’s build a simple regression model and check its score. Later we’ll see how we can improve it with Backward Elimination.

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,Y_train)
print(model.score(X_test,Y_test))

The score seems great. Before we begin with Backward Elimination, we need to append a column of 1s at the beginning of our data set. Now, why is this important?

The equation of our line (or rather, hyperplane) is y = b + m1·x1 + m2·x2 + m3·x3 + m4·x4.

When we build a linear model with sklearn, the bias term 'b' is calculated separately. However, for performing Backward Elimination we use the linear model provided by the statsmodels library, which does not add the bias term on its own. Hence, by adding a dummy feature whose value is always 1, our equation becomes y = b·x0 + m1·x1 + m2·x2 + m3·x3 + m4·x4, where x0 = 1.

import numpy as np

# Prepend a column of ones to act as the x0 term
X_train = np.append(arr=np.ones((X_train.shape[0], 1)).astype(int), values=X_train, axis=1)
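As an aside, and as an alternative to the np.append call above (not an extra step), statsmodels can add this column for you: sm.add_constant prepends a column of ones by default (this assumes statsmodels.api is imported as sm, as in the next snippet):

# Equivalent to manually prepending a column of ones
X_train = sm.add_constant(X_train)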

Now that our data has been prepared for the statsmodels library, we can begin analysing the P-value of each feature once a linear model is built. We keep track of the required features in a list of column indices, 'X_opt':

NOTE: Initially, we had 4 features. Now, we have 7 features – 3 numerical, 3 binary (after One-Hot encoding) and a dummy feature with value 1.

import statsmodels.api as sm

X_opt = [0,1,2,3,4,5,6]
regressor_OLS = sm.OLS(Y_train, X_train[:,X_opt]).fit()
print(regressor_OLS.summary())

OLS stands for 'Ordinary Least Squares', which essentially trains a linear model. From the summary, we observe that the highest P-value belongs to feature 5, which is way above our significance level of 0.05. Hence, we remove it.
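If you would rather not scan the whole summary table, the fitted results object also exposes the P-values directly through its .pvalues attribute:

# P-value of each column listed in X_opt, in the same order
print(regressor_OLS.pvalues)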

X_opt = [0,1,2,3,4,6]
regressor_OLS = sm.OLS(Y_train, X_train[:,X_opt]).fit()
print(regressor_OLS.summary())

The highest P-value is for the last feature, which is above our significance level of 0.05. Hence we remove that.

X_opt = [0,1,2,3,4]
regressor_OLS = sm.OLS(Y_train, X_train[:,X_opt]).fit()
print(regressor_OLS.summary())

Now we see that all remaining features are below our significance level, which means we can no longer eliminate any. The first 3 features x1, x2 and x3 are the binary variables created by one-hot encoding 'State'. Hence, we're essentially left with 2 of the original features: State and R&D Spend. With these features, we'll now create a linear model with sklearn and test its score:

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Keep only the selected columns: the 3 'State' dummies (0-2) and 'R&D Spend' (3)
X_train, X_test, Y_train, Y_test = train_test_split(X[:, [0,1,2,3]], y, test_size = 0.2, random_state = 0)

scaler = StandardScaler()
X_train[:,3:] = scaler.fit_transform(X_train[:,3:])
X_test[:,3:] = scaler.transform(X_test[:,3:])

model = LinearRegression()
model.fit(X_train,Y_train)
print('Model score: '+str(model.score(X_test,Y_test)))

Great! With this simple method, we tweaked our linear model’s performance.


Mayank Shah

Software Engineer working on Kubernetes, distributed systems and databases.