ML basics: Feature Selection (Part 2)

Abhinav Mahapatra
Jun 7, 2018


Welcome back.

The background story of what exactly we are doing here is covered in Part 1 of this series.

Now, moving forward.

Backward Elimination

We will be selecting features using the backward elimination method.

What exactly is backward elimination? It is a method that keeps only the features that are significant for the model, i.e. the features that make a considerable difference to the dependent variable.

Here is how the algorithm works:

  1. Select a significance level.
  2. Fit the model with all the features.
  3. Check the p-values of the features with the summary() function.
  4. Remove the feature with the highest p-value if it is above the significance level.
  5. Repeat steps 2 to 4 with the reduced feature set until only features with p-values ≤ the significance level remain (a minimal code sketch of this loop follows the figure below).
[Figure: the backward elimination steps in a nutshell. Courtesy: Udemy]
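To make these steps concrete, here is a minimal sketch of that loop using statsmodels. It is only a sketch under assumptions: X and y are placeholders for the matrix of features (already including the column of ones we add later) and the dependent variable, and the helper function name is mine, not from the course.

# A minimal sketch of the backward elimination loop described above
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    features = list(range(X.shape[1]))                        # start with all feature columns
    while True:
        model = sm.OLS(endog=y, exog=X[:, features]).fit()    # step 2: fit with the current features
        p_values = model.pvalues                               # step 3: one p-value per remaining feature
        worst = int(np.argmax(p_values))
        if p_values[worst] > significance_level:               # step 4: drop the least significant feature
            features.pop(worst)
        else:
            break                                              # step 5: every remaining p-value <= SL
    return features, model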

Significance level and p-value

So, the significance level is the threshold we set for deciding whether a feature's effect on the final output is significant enough to keep it, i.e. how strong the evidence has to be before we treat a feature as important. Generally, we take a significance level of 5% (0.05) by default.

The p-value is how we test whether a feature really matters.

Let's say you have a friend who claims that a feature is of absolutely no use (that claim is called the null hypothesis). The higher the p-value, the more the data agrees with your friend, and vice versa.

A p-value ranges from 0 to 1.

So say column 1 has a p-value of 0.994: the null hypothesis effectively holds, i.e. this column does not make any noticeable difference to the output and can be removed without consequences.

Now say column 2 has a p-value of 0.001: the null hypothesis is rejected, i.e. this column has a very significant effect on the output.

Where do we draw the line between significant and not significant? That is where the significance level comes in.

In our case, we won't keep any feature with a p-value over 0.05.
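Just to put numbers on that rule, here is a purely illustrative snippet (not part of the original code) applying the 0.05 threshold to the two example p-values above.

# Purely illustrative: applying the 0.05 significance level to the example p-values
significance_level = 0.05
for name, p in [("column 1", 0.994), ("column 2", 0.001)]:
    action = "drop" if p > significance_level else "keep"
    print(name, p, "->", action)    # column 1 -> drop, column 2 -> keep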

Code:

# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the Test set results
y_pred = regressor.predict(X_test)
# Adding a column of ones for b0*x0 (the intercept term); this dataset has 50 rows
import numpy as np
X = np.append(arr = np.ones((50, 1)).astype(int), values = X, axis = 1)
# Preparing for backward elimination so we end up with a proper model
import statsmodels.api as sm

Code explanation:

The above code block simply trains a linear regression model on the dataset. We added b0*x0 as the first feature column because the simple regression line

y = mx + c

generalises in multiple regression to

y = b0*x0 + b1*x1 + … + bn*xn, with x0 = 1

so the b0*x0 term is basically a column of ones (multiplied by the intercept b0). The next step is to import the statsmodels.api module as sm.
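As a side note (not part of the original code), statsmodels also ships a helper that does the same thing as the np.append line above, in case you prefer it:

# Equivalent way to prepend the intercept column of ones
import statsmodels.api as sm
X = sm.add_constant(X)    # prepends a column of 1.0 to X by default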

# Creating the matrix of features for backward elimination (starting with all columns)
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
[Screenshot: result of the first summary() call]
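Since the summary screenshot is not reproduced here, note (as a small aside, not from the original post) that the same p-values can also be read directly off the fitted model:

# P-values of the fitted OLS model, one per column of X_opt (same numbers as in summary())
print(regressor_OLS.pvalues)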
Second summary()

Removing the feature with the highest p-value (x2, the 2nd column) and rerunning the code.

X_opt = X[:, [0, 1, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Third summary()

Removing the feature with the highest p-value (x1, the 1st column) and rerunning the code.

X_opt = X[:, [0, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Fourth summary()

Removing the feature with the highest p-value (x3, the 4th column) and rerunning the code.

X_opt = X[:, [0, 3, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

Now we have a decision to make: x2 (the 5th column) has a p-value of 0.06, or 6%. That is quite close to the significance level of 5%, so we are not completely sure whether removing it is a good idea.

When in doubt, we compare the adjusted R-squared values of the models; the higher the adjusted R-squared, the better the model.

Current adjusted R-squared value: 94.8%
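As with the p-values, the adjusted R-squared can be pulled from the fitted model directly instead of being read off the summary screenshot:

# Adjusted R-squared of the model fitted on X[:, [0, 3, 5]]
print(regressor_OLS.rsquared_adj)    # about 0.948 for this model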

Fifth/final summary()

Removing the feature with the highest p-value (x2, the 5th column) and rerunning the code.

X_opt = X[:, [0, 3]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

New adjusted R-squared value: 94.5%

We can see that the adjusted R-squared decreased, so we keep the previous set of columns. The relevant columns in this dataset are therefore the 3rd and the 5th, and we can drop all the other feature columns without any real loss in model accuracy.

Well that is it for now.

Bye.
