Feature selection:

Michell Payano Perez
Jun 12, 2021


How should we perform it?


Before moving forward to see how we should do feature selection, we must first know what feature selection is. As stated in "Applied Predictive Modeling" (2013), feature selection is focused on removing non-informative or redundant predictors from the model. Why is this important? Because those redundant variables add complexity and can reduce the performance of the model.

We are going to see some of the techniques available in scikit-learn in action. However, it is important to note that feature selection goes hand in hand with the model used to make predictions, since some models are more sensitive than others to redundant variables.

The dataset used in this article is the Boston house prices dataset. For more details, the Jupyter Notebook can be found here.

First attempt: Correlation Matrix

By computing the correlation matrix, we can detect which predictors are highly correlated with the target value, but also which ones are highly correlated with other predictors. When several predictors are correlated with each other we face the problem of collinearity/multicollinearity, which means that we have redundant independent variables in our dataset. In this case, we should select a set of features that are independent from the other features (low correlation) but correlated with the dependent variable.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

boston_data = load_boston()
boston_x = pd.DataFrame(boston_data['data'], columns=boston_data['feature_names'])  # regressors
boston_y = boston_data['target']
X_train, X_test, y_train, y_test = train_test_split(boston_x, boston_y, test_size=0.2, random_state=0)
# Build the correlation matrix on the training set only (target appended as "Price")
correlation_matrix = pd.concat([X_train.reset_index(drop=True), pd.Series(y_train, name="Price")], axis=1).corr().round(2)
plt.figure(figsize=(10,8))
cmap = sns.cm.rocket_r
sns.heatmap(data=correlation_matrix, annot=True, cmap=cmap)
plt.title("Correlation matrix");

After visualizing the plot above, we would select the variables whose coefficient with respect to the target is close to -1 or 1, indicating a strong correlation. We would also like to select features that do not have a strong correlation with other features. For example, RM (average number of rooms per dwelling) has a strong correlation with the target, but it also has a strong negative correlation with LSTAT (% lower status of the population). In this case, we can drop one of them and repeat the procedure until we finally get a subset of variables to predict the house prices.
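
As a rough sketch of this filtering idea, we could keep the features most correlated with the price and drop any feature that is strongly correlated with one we already kept. The 0.5 and 0.7 cutoffs below are arbitrary choices for illustration, not values used in the rest of the article.

# Minimal correlation-filter sketch; the 0.5 and 0.7 cutoffs are arbitrary assumptions
corr_with_target = correlation_matrix["Price"].drop("Price").abs()
candidates = corr_with_target[corr_with_target > 0.5].index.tolist()

selected_feats = []
for feat in sorted(candidates, key=lambda f: -corr_with_target[f]):
    # keep the feature only if it is not strongly correlated with one already kept
    if all(abs(correlation_matrix.loc[feat, kept]) < 0.7 for kept in selected_feats):
        selected_feats.append(feat)
print(selected_feats)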

However, is there something missing? Are we performing feature selection correctly? According to the authors of "The Elements of Statistical Learning" (2013), we have to follow the steps below:

1. Divide the samples into K cross-validation folds (groups) at random.

2. For each fold k = 1, 2, …, K:

2.1 Find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels, using all of the samples except those in fold k.

2.2 Using just this subset of predictors, build a multivariate classifier, using all of the samples except those in fold k.

2.3 Use the classifier to predict the class labels for the samples in fold k.

Why should we do it this way? Because otherwise the features would be chosen based on all of the samples, and this can lead to biased (overly optimistic) estimates when we try to make predictions with a completely independent test set. This problem is also known as information or data leakage, as the authors of "Feature Engineering and Selection" (2020) describe, and it refers to the use of the test set (or validation) data during the training process.
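
To make the per-fold procedure above concrete, here is a minimal hand-rolled sketch. The choice of 6 features, SelectKBest with f_regression as the selector, and gradient boosting as the model are illustrative assumptions; the pipelines used in the rest of the article achieve the same leakage-free behaviour automatically.

# Hand-rolled sketch of per-fold feature selection (illustrative settings):
# the selector is fit only on the training part of each fold, never on the held-out fold.
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X_train):
    X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
    y_tr, y_val = y_train[train_idx], y_train[val_idx]
    selector = SelectKBest(f_regression, k=6).fit(X_tr, y_tr)                # step 2.1
    model = GradientBoostingRegressor().fit(selector.transform(X_tr), y_tr)  # step 2.2
    preds = model.predict(selector.transform(X_val))                         # step 2.3
    print(round(r2_score(y_val, preds), 3))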

With that being said, we are going to apply this procedure using some of the supervised feature selection methods available in scikit-learn. The algorithms presented below are run inside a pipeline, to organize into a sequence the steps of feature selection and the estimation of house prices with the final subset of variables. Finally, this pipeline is passed to a grid search to perform 5-fold cross-validation and to tune some of the hyperparameters.

Note: In the following code there is no scaling step included in the pipelines, because the models used are tree-based, which are not sensitive to the magnitude of the variables, as described in "Approaching (Almost) Any Machine Learning Problem" (2020).
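
Just for context, if a scale-sensitive estimator were used instead (say, a Ridge regression, which is not part of this article's experiments), a scaling step could simply be added as the first step of the pipeline:

# Hypothetical variant with a scale-sensitive model; not used in the experiments below
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression

scaled_pipe = Pipeline([
    ("scaler", StandardScaler()),              # needed here because Ridge is scale-sensitive
    ("selectBest", SelectKBest(f_regression, k=6)),
    ("regressor", Ridge()),
])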

Select features with the k highest scores

In this method we select the k features with the highest scores, which are measured by one of two functions, since the target variable is continuous: f_regression or mutual_info_regression.

  1. f_regression: this is a univariate linear regression test. In other words, it tests the individual effect of each independent variable by calculating the correlation between each feature and the target individually, and then computing the F statistic, which captures the linear dependency. After getting the corresponding F statistic for each variable we look up its respective p-value. For more information about the procedure, this is the source code.
  2. mutual_info_regression: mutual information can be described in this case as the amount of information shared between a regressor and the continuous target. In this package the procedure is based on entropy estimates from k-nearest neighbor distances. For more information about the procedure, this is the source code. A quick look at the raw scores each function assigns is sketched below.
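
Here is that standalone look at the scores, computed on the training data (purely illustrative):

# Quick look at the scores each function assigns (illustration only)
from sklearn.feature_selection import f_regression, mutual_info_regression
import pandas as pd

f_scores, p_values = f_regression(X_train, y_train)
mi_scores = mutual_info_regression(X_train, y_train, random_state=0)
print(pd.DataFrame({"F score": f_scores, "p value": p_values, "mutual info": mi_scores},
                   index=X_train.columns).sort_values("F score", ascending=False))

Now let's plug SelectKBest into the pipeline and grid search: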
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

regressor = GradientBoostingRegressor()
score_func = [f_regression, mutual_info_regression]
method = SelectKBest()
pipe = Pipeline([("selectBest", method), ("regressor", regressor)])
# selectBest__k is the number of top features to select; here we try whether 3, 6 or 9 features maximizes the score
result_1 = GridSearchCV(pipe, {"selectBest__k": [3, 6, 9], "selectBest__score_func": score_func}, cv=5, scoring="r2")
result_1.fit(X_train, y_train)
# Columns selected:
set_features_1 = result_1.best_estimator_.named_steps["selectBest"]
selected = set_features_1.get_support(indices=True)
X_train.columns[selected]

After computing the feature selection using SelectKBest, the best combination of hyperparameters in this exercise corresponds to 9 features, which are CRIM, INDUS, NOX, RM, AGE, RAD, TAX, PTRATIO and LSTAT, with f_regression as the score function.

# Predict new prices using the best hyperparameters found:
from sklearn.metrics import mean_squared_error
y_pred = result_1.predict(X_test)
RMSE_1 = mean_squared_error(y_test, y_pred)**0.5
RMSE_1

This procedure gives an RMSE of 4.63.

Select features with Recursive Feature Elimination

This method applies a machine learning model to select a specific number of features. RFE starts with all the features, computes a measure of importance for each variable, and then eliminates the least important feature (or the number of features given by the step parameter) at each iteration. This process of computing importances and pruning features is repeated on the subset resulting from the previous iteration, until it reaches the number of features we specified. For more information about the procedure, this is the source code.
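
To see the elimination order on its own, outside the pipeline used next, a single RFE fit exposes a ranking_ attribute; the random forest and the choice of 6 features here are just illustrative.

# Standalone RFE fit to inspect the elimination order (illustrative settings)
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

rfe_demo = RFE(RandomForestRegressor(random_state=0), n_features_to_select=6, step=1)
rfe_demo.fit(X_train, y_train)
# ranking_ is 1 for the kept features; higher values were eliminated earlier
print(pd.Series(rfe_demo.ranking_, index=X_train.columns).sort_values())

Now let's wrap RFE in the pipeline and grid search: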

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

selector = RFE(RandomForestRegressor(), step=1)
pipe = Pipeline([("RFE", selector), ("regressor", regressor)])
# RFE__n_features_to_select is the number of features to keep (3, 6 or 9)
result_2 = GridSearchCV(pipe, {"RFE__n_features_to_select": [3, 6, 9], "RFE__estimator__n_estimators": [300, 500, 700]}, cv=5, scoring="r2")
result_2.fit(X_train, y_train)
# Columns selected:
set_features_2 = result_2.best_estimator_.named_steps["RFE"]
selected = set_features_2.get_support(indices=True)
X_train.columns[selected]

After computing the feature selection using Recursive Feature Elimination, the best combination of hyperparameters in this exercise corresponds to 9 features, which are CRIM, NOX, RM, AGE, DIS, TAX, PTRATIO, B and LSTAT, with 700 trees (estimators).

#Predict new prices using the best hyperparameters found:
y_pred=result_2.predict(X_test)
RMSE_2=mean_squared_error(y_test,y_pred)**0.5
RMSE_2

This procedure gives an RMSE of 4.15.

Select features by feature importance with SelectFromModel

This method also uses a model to rank the variables according to their respective measure of importance. However, the main difference between this method and RFE is that it does not iterate to prune the least important variables. In other words, it fits a base estimator once and selects a set of features based on the attribute used to compute the scores for each variable (and on the threshold or maximum number of features specified). For more information about the procedure, this is the source code.
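
The importance scores that SelectFromModel relies on can also be inspected directly; the sketch below fits a random forest on its own and uses the default threshold (the mean importance), with settings chosen only for illustration.

# Inspecting the importances SelectFromModel works from (illustrative settings)
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

sfm_demo = SelectFromModel(RandomForestRegressor(random_state=0))  # default threshold: mean importance
sfm_demo.fit(X_train, y_train)
importances = pd.Series(sfm_demo.estimator_.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
print("Kept:", list(X_train.columns[sfm_demo.get_support()]))

Now let's use SelectFromModel inside the pipeline and grid search, this time selecting features only by max_features: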

import numpy as np
from sklearn.feature_selection import SelectFromModel

# The threshold is -np.inf in this case because we are going to select the features only based on max_features
selector = SelectFromModel(RandomForestRegressor(), threshold=-np.inf)
pipe = Pipeline([("SelectFromModel", selector), ("regressor", regressor)])
# SelectFromModel__max_features is the maximum number of features to select (3, 6 or 9)
result_3 = GridSearchCV(pipe, {"SelectFromModel__max_features": [3, 6, 9], "SelectFromModel__estimator__n_estimators": [300, 500, 700]}, cv=5, scoring="r2")
result_3.fit(X_train, y_train)
# Columns selected:
set_features_3 = result_3.best_estimator_.named_steps["SelectFromModel"]
selected = set_features_3.get_support(indices=True)
X_train.columns[selected]

After computing the feature selection using SelectFromModel, the best combination of hyperparameters in this exercise corresponds to 9 features, which are CRIM, NOX, RM, AGE, DIS, TAX, PTRATIO, B and LSTAT, with 300 trees (estimators).

#Predict new prices using the best hyperparameters found:
y_pred=result_3.predict(X_test)
RMSE_3=mean_squared_error(y_test,y_pred)**0.5
RMSE_3

This procedure gives an RMSE of 4.05.

Select features by Sequential Feature Selection

This method uses a model to select a subset of features by performing an iterative process in one of two ways:

  1. direction="forward": we start with an empty set of features and select the single feature that maximizes the cross-validation scoring parameter; after that we add another feature from the remaining ones, keep the best pair based on the CV scores, and so on until we reach the desired number of features.
  2. direction="backward": instead of initializing the algorithm with an empty set, we start with all the features and proceed to remove one variable at a time based on the scoring parameter, and so on until we have deleted enough features to reach the limit.

For more information about the procedure, this is the source code.
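
As a brief standalone illustration of the direction parameter (the 6 features, the random forest and the 5-fold CV below are arbitrary choices, not the grid searched later):

# Forward and backward search differ only in the direction argument (illustration)
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestRegressor

forward = SequentialFeatureSelector(RandomForestRegressor(random_state=0), n_features_to_select=6,
                                    direction="forward", scoring="r2", cv=5)
forward.fit(X_train, y_train)
print("Forward pick:", list(X_train.columns[forward.get_support()]))
# A backward search would be built the same way with direction="backward",
# starting from all 13 features and removing one at a time.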

  • Forward Sequential Feature Selection
# Direction = "forward" (the default)
from sklearn.feature_selection import SequentialFeatureSelector

sfs = SequentialFeatureSelector(RandomForestRegressor(), scoring="r2")
pipe = Pipeline([("SequentialForward", sfs), ("regressor", regressor)])
result_4 = GridSearchCV(pipe, {"SequentialForward__n_features_to_select": [3, 6, 9], "SequentialForward__estimator__n_estimators": [300, 500, 700]}, cv=5, scoring="r2")
result_4.fit(X_train, y_train)
# Columns selected:
set_features_4 = result_4.best_estimator_.named_steps["SequentialForward"]
selected = set_features_4.get_support(indices=True)
X_train.columns[selected]

After computing the feature selection using the forward Sequential Feature Selector, the best combination of hyperparameters in this exercise corresponds to 9 features, which are CRIM, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO and LSTAT, with 300 trees (estimators).

#Predict new prices using the best hyperparameters found:
y_pred=result_4.predict(X_test)
RMSE_4=mean_squared_error(y_test,y_pred)**0.5
RMSE_4

This procedure gives an RMSE of 4.02.

  • Backward Sequential Feature Selection
# Direction = "backward"
sbs = SequentialFeatureSelector(RandomForestRegressor(), direction="backward", scoring="r2")
pipe = Pipeline([("SequentialBackward", sbs), ("regressor", regressor)])
result_5 = GridSearchCV(pipe, {"SequentialBackward__n_features_to_select": [3, 6, 9], "SequentialBackward__estimator__n_estimators": [300, 500, 700]}, cv=5, scoring="r2")
result_5.fit(X_train, y_train)
# Columns selected:
set_features_5 = result_5.best_estimator_.named_steps["SequentialBackward"]
selected = set_features_5.get_support(indices=True)
X_train.columns[selected]

After computing the feature selection using the backward Sequential Feature Selector, the best combination of hyperparameters in this exercise corresponds to 6 features, which are NOX, RM, DIS, TAX, PTRATIO and LSTAT, with 700 trees (estimators).

# Predict new prices using the best hyperparameters found:
y_pred = result_5.predict(X_test)
RMSE_5 = mean_squared_error(y_test, y_pred)**0.5
RMSE_5

This procedure gives an RMSE of 4.07.

Conclusion

Feature selection is an important preprocessing step; however, there is no perfect or ideal method, because it is largely a matter of trial and error. Each feature selection method has its own pros and cons. For example, some methods are more computationally efficient than others, while others may keep correlated features, which can hurt the performance of models that are more sensitive to redundant or correlated variables.

I really hope that you enjoyed and learned new things by reading this article as much as I did while writing it.
