Which features to use in your model?

Halil Ertan
Feb 20, 2020 · 10 min read


After various preprocessing and feature extraction steps, you end up with a long list of candidate features. Then a question arises: which features should you use in your model? You can apply feature selection methodologies just before feeding the model. Feature selection is one of the crucial parts of the entire process, which begins with data collection and ends with modelling. If you are developing in Python, scikit-learn offers many built-in options for feature selection. These solutions can be grouped into three main categories: filter, wrapper, and embedded methods.


To make it easier to compare the results of the different approaches, I use the popular wine quality dataset from the UCI Machine Learning Repository. There are two separate datasets for red and white wine; I concatenate them and use a single wine quality dataset so that the results of the different feature selection methods can be observed on one dataset. This dataset is convenient for both classification and regression tasks. It consists of 11 independent variables and 1 dependent variable, the quality column, as shown in Figure 1.

Figure 1 — A quick look at wine quality dataset

Filter Methods

Feature selection methods from this category assign a score to each feature with the help of various statistical calculations and then filter out the least relevant ones.

Variance Threshold

It is one of the most straightforward approaches: it drops the features whose variance is below a threshold value that you specify. The underlying assumption is that the higher the variance of a feature, the more informative it is.

from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01) # Variance threshold
sel = selector.fit(df_wine_norm_features)
sel_index = sel.get_support()
df_wine_norm_vt = df_wine_norm_features.iloc[:, sel_index]
print(df_wine_norm_vt.columns)
#['fixed acidity', 'volatile acidity', 'total sulfur dioxide', 'pH', 'alcohol']

An important point to remember is that you may not get the same feature list for the same threshold value on normalized and non-normalized data, since normalization changes the variances of the features. For that reason, it can be better to apply a variance threshold before normalizing the dataset. For most of the other feature selection approaches there will be no difference. However, I generally prefer implementing feature selection methods after normalizing the dataset.

Mutual_info_classif

This method utilizes mutual information. It calculates the mutual information value of each independent variable with respect to the dependent variable and selects the ones with the highest information gain. In other words, it measures the dependency between each feature and the target; a higher score means a stronger dependency between the feature and the target.

from sklearn.feature_selection import mutual_info_classif
threshold = 5 # the number of most relevant features
high_score_features = []
feature_scores = mutual_info_classif(df_wine_norm_features, df_wine_target, random_state=0)
for score, f_name in sorted(zip(feature_scores, df_wine_norm_features.columns), reverse=True)[:threshold]:
    print(f_name, score)
    high_score_features.append(f_name)
df_wine_norm_mic = df_wine_norm_features[high_score_features]
print(df_wine_norm_mic.columns)
# density 0.158
# alcohol 0.148
# total sulfur dioxide 0.073
# residual sugar 0.065
# volatile acidity 0.062
#['density', 'alcohol', 'total sulfur dioxide', 'residual sugar', 'volatile acidity']

F_classif

It uses the ANOVA F-test for the features and captures only linear dependency, unlike mutual information-based feature selection, which can capture any kind of statistical dependency. Notice that the scores produced by different methods are on totally different scales. Don't get stuck on this point; each method ranks feature importance internally and returns the top features.

from sklearn.feature_selection import f_classif
threshold = 5 # the number of most relevant features
high_score_features = []
feature_scores = f_classif(df_wine_norm_features, df_wine_target)[0]
for score, f_name in sorted(zip(feature_scores, df_wine_norm_features.columns), reverse=True)[:threshold]:
    print(f_name, score)
    high_score_features.append(f_name)
df_wine_norm_fc = df_wine_norm_features[high_score_features]
print(df_wine_norm_fc.columns)
# alcohol 320.593
# density 136.951
# volatile acidity 96.674
# chlorides 50.849
# free sulfur dioxide 14.939
# ['alcohol', 'density', 'volatile acidity', 'chlorides', 'free sulfur dioxide']

SelectKBest

This class is actually a more general approach compared to the above-mentioned classes, since it takes an additional scoring function parameter that states which function to use in feature selection. So, you can think of it as a kind of wrapper. We can also use f_classif or mutual_info_classif inside this object. In practice, however, it is typically used with the chi2 function. This object also exposes the p-values of each feature according to the chosen scoring function.

The chi-square test measures the dependence between stochastic variables, so this function can be used to eliminate the features that are most likely to be independent of the target. It basically evaluates whether the difference between two separate groups of non-negative samples is due to random chance or not; note that chi2 in scikit-learn therefore expects non-negative feature values.

Like most other statistical tests, the chi-square test assumes a null hypothesis that the two variables are independent and an alternative hypothesis that they are dependent. Using the chi-square test, a p-value is calculated for each feature relative to the target. Roughly speaking, a small p-value means that the observed dependence would be very unlikely to occur by chance if the feature and the target were actually independent. Our aim is to find features that are dependent on the target, in other words to reject the null hypothesis. For this reason, we typically select features with a p-value smaller than 0.05. A threshold of 0.05 is just a common convention; you can set a smaller threshold such as 0.01 to be more confident that the two variables are dependent.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
threshold = 5 # the number of most relevant features
skb = SelectKBest(score_func=chi2, k=threshold)
sel_skb = skb.fit(df_wine_norm_features, df_wine_target)
sel_skb_index = sel_skb.get_support()
df_wine_norm_skb = df_wine_norm_features.iloc[:, sel_skb_index]
print('p_values', sel_skb.pvalues_)
print(df_wine_norm_skb.columns)
# ['volatile acidity', 'residual sugar', 'chlorides', 'density', 'alcohol']
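
If you prefer to filter by p-value instead of fixing the number of features, you can reuse the fitted sel_skb object from the snippet above. A minimal sketch, assuming the 0.05 threshold discussed earlier:

# hypothetical alternative: keep features whose chi2 p-value is below 0.05
p_threshold = 0.05
selected_cols = df_wine_norm_features.columns[sel_skb.pvalues_ < p_threshold]
df_wine_norm_chi2_p = df_wine_norm_features[selected_cols]
print(df_wine_norm_chi2_p.columns)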

Pearson Correlation

The Pearson correlation coefficient is estimated between each pair of features, including the target. These values range between -1 and 1. While values near -1 imply that two variables are inversely proportional, values near 1 imply that two variables are directly proportional. Values near 0 indicate that there is no significant correlation between the variables. Unlike the other methodologies, there is no built-in class in scikit-learn for implementing Pearson correlation for feature selection. The correlation between features is calculated with the corr method provided by the pandas library; the default value of its method argument is 'pearson'. Additionally, it can be visualized as a heatmap with the seaborn library.

Two separate correlation relations can be examined: the correlation of each feature with the target, and the correlation between features. Intuitively, features that are highly correlated with the target can be selected. In addition, when two features are highly correlated with each other, one of them can be removed, on the reasoning that highly correlated features behave similarly and there is no need to keep both. These assumptions give us two threshold values: eliminate features whose correlation with the target is lower than one threshold, and eliminate one feature of any pair whose mutual correlation is higher than another threshold. The correlation heatmap of the wine quality dataset is the following.

Figure 2 — Heatmap of feature correlation in wine quality dataset

As can be seen from the code snippet below, we set the mentioned threshold values to 0.05 and 0.65 respectively in our case. In the first step, we drop 4 features that are only weakly correlated with the "quality" column. In the second step, we eliminate one more column that is highly correlated with another feature.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
corr_with_label = 0.05 # correlation threshold for target
corr_between_features = 0.65 # correlation threshold between features
cor = df_wine_norm.corr()
# drop features that are only weakly correlated with the target
corr_target = abs(cor['quality'])
relevant_features = corr_target[corr_target > corr_with_label]
df_wine_norm_corr = df_wine_norm[list(relevant_features.index)]
print('df_wine_norm_corr', df_wine_norm_corr.columns)
# ['fixed acidity', 'volatile acidity', 'citric acid', 'chlorides', 'free sulfur dioxide', 'density', 'alcohol', 'quality']
cor = df_wine_norm_corr.corr().abs()
plt.figure(figsize=(12, 8))
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
# drop one feature of each highly correlated pair (upper triangle of the correlation matrix)
upper = cor.where(np.triu(np.ones(cor.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > corr_between_features)]
df_wine_norm_corr = df_wine_norm_corr.drop(to_drop, axis=1)
df_wine_norm_corr = df_wine_norm_corr[[col for col in df_wine_norm_corr.columns if col != 'quality']]
print('df_wine_norm_corr', df_wine_norm_corr.columns)
# ['fixed acidity', 'volatile acidity', 'citric acid', 'chlorides', 'free sulfur dioxide', 'density']

Wrapper Methods

Wrapper methods use an iterative search in order to narrow down the feature set until the desired number of features is reached. They are mainly used together with a model. Initially, the model is fed with all features and a feature importance score is assigned to each feature. Then the features with the lowest importance scores are pruned. This process continues iteratively, in a greedy fashion, until the desired number of features remains. This subset size is a tuning parameter for RFE, which is one of the most prominent wrapper methods. After evaluating the performance of the model with different subset sizes, you decide on the optimal size of the feature set. In the RFECV class, this is done with cross-validation. Moreover, RFE works better with some models than with others; random forest is one of them. There are also wrapper methods with non-greedy approaches, such as genetic algorithms and simulated annealing.

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
threshold = 5 # the number of most relevant features
model_rf = RandomForestClassifier(n_estimators=500, random_state=0, max_depth=3)
#model_lr = LogisticRegression(random_state=0, C=0.01)
selector = RFE(model_rf, n_features_to_select=threshold, step=1) # use model_lr as well
selector = selector.fit(df_wine_norm_features, df_wine_target)
selector_ind = selector.get_support()
df_wine_norm_rfe = df_wine_norm_features.iloc[:, selector_ind]
print(df_wine_norm_rfe.columns)
# LR: ['fixed acidity', 'volatile acidity', 'total sulfur dioxide', 'pH', 'alcohol']
# RF: ['volatile acidity', 'citric acid', 'chlorides', 'density', 'alcohol']
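
If you prefer to let cross-validation choose the subset size instead of fixing it, the RFECV class mentioned above can be used. A minimal sketch, assuming the same dataframes and random forest model as in the snippet above:

from sklearn.feature_selection import RFECV
# cross-validation decides how many features to keep
selector_cv = RFECV(model_rf, step=1, cv=5)
selector_cv = selector_cv.fit(df_wine_norm_features, df_wine_target)
print(selector_cv.n_features_) # number of features chosen by cross-validation
print(df_wine_norm_features.columns[selector_cv.support_])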

I also suggest taking a look at the Boruta package, which builds on the feature importance measure of Random Forest and is a kind of wrapper algorithm around Random Forest. Boruta follows an all-relevant feature selection approach: it tries to capture all features that are in some circumstances relevant to the outcome variable. It creates shadow features, which are shuffled copies of the existing features, and compares the importance of the original features with that of the randomized ones. Only features with higher importance than the shadow features are taken into consideration while modelling.
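
A minimal usage sketch of the BorutaPy implementation, assuming the boruta package is installed; note that it expects numpy arrays rather than pandas dataframes:

from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)
boruta_selector = BorutaPy(model_rf, n_estimators='auto', random_state=0)
# BorutaPy works on numpy arrays, so pass the underlying values
boruta_selector.fit(df_wine_norm_features.values, df_wine_target.values)
# keep only the features confirmed as more important than their shadow copies
print(df_wine_norm_features.columns[boruta_selector.support_])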

Embedded Methods

Embedded methods focus on which features contribute most to the accuracy of the model while the model is being created. These kinds of methods have built-in feature importance evaluation mechanisms and include feature selection inside the training process. The most common embedded feature selection methods are regularization methods such as Lasso and Ridge for regression; logistic regression, SVMs, or tree-based algorithms can be considered for classification purposes. Regularization, particularly the L1 kind used by Lasso, pushes the coefficients of unimportant variables towards 0, which means that such columns are effectively not used.

In scikit-learn, there is a class named SelectFromModel for implementing embedded feature selection. Unlike wrapper methods, you do not need to explicitly specify the size of the feature set in this approach. Features are removed if their coefficients or feature importance values fall below a threshold, which is either set explicitly or calculated using heuristics like the mean or median.

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
model_lr = LogisticRegression(random_state=0, C=0.01)
model_rf = RandomForestClassifier(n_estimators=500, random_state=0, max_depth=3)
model_lr.fit(df_wine_norm_features, df_wine_target)
#model_rf.fit(df_wine_norm_features, df_wine_target)
sel_sfm = SelectFromModel(model_lr, prefit=True)
sel_sfm_index = sel_sfm.get_support()
df_wine_norm_sfm = df_wine_norm_features.iloc[:, sel_sfm_index]
print(df_wine_norm_sfm.columns)
# LR: ['fixed acidity', 'volatile acidity', 'total sulfur dioxide', 'pH', 'alcohol']
# RF: ['volatile acidity', 'chlorides', 'density', 'alcohol']
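
The threshold argument of SelectFromModel also accepts heuristics such as 'mean' or 'median', as mentioned above. A minimal sketch with the random forest model, keeping only the features whose importance is above the median importance:

model_rf.fit(df_wine_norm_features, df_wine_target)
sel_sfm_med = SelectFromModel(model_rf, prefit=True, threshold='median')
df_wine_norm_sfm_med = df_wine_norm_features.iloc[:, sel_sfm_med.get_support()]
print(df_wine_norm_sfm_med.columns)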

Summary

Feature selection should be distinguished from feature extraction and dimension reduction. In feature selection, we decide, based on various considerations, which subset of already existing features to use in modelling. Features can be given ready to use, or they can be produced from the raw data as part of feature extraction. Feature extraction, such as simply extracting the hour from a datetime column, comes before feature selection in the flow. In dimension reduction, unlike feature selection, the original features are not kept. The feature set is transformed into a smaller one that represents the variation of the training data in a different space, as in PCA. For instance, after dimension reduction you may obtain 10 features from 20 original features, and they are totally different from the original ones.
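
For comparison, a minimal sketch of dimension reduction with PCA on the same feature set; the resulting components are new variables rather than a subset of the original columns:

from sklearn.decomposition import PCA
pca = PCA(n_components=5) # transform the 11 original features into 5 new components
wine_components = pca.fit_transform(df_wine_norm_features)
print(wine_components.shape) # (n_samples, 5)
print(pca.explained_variance_ratio_) # variance captured by each component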

The motivation behind feature selection is mainly to increase the accuracy of the model. Most of the time, some features do not contribute to the model, and sometimes they even behave as noise and decrease its accuracy. Moreover, the interpretability of the model increases as the number of features decreases. Removing features also significantly reduces training and prediction time. All in all, the main idea is to reduce the number of predictors as far as possible without compromising predictive performance, or even while improving it.

Comparison

In this post, I cover different feature selection approaches available in the scikit-learn library for Python. As can be seen from the final feature sets, we get different results for different approaches, so there is no single correct answer in feature selection. Filter methods are often univariate and consider the features independently, or only with regard to the dependent variable. However, that is not always a correct assumption. Sometimes a feature is not useful by itself, whereas it can improve the model in combination with other features; such situations are overlooked by filter methods. Furthermore, the statistical measures used for feature selection are not directly tied to the performance of the model, so they may not increase its accuracy. On the other hand, these approaches are simple and advantageous compared to the others with respect to training time.

Wrapper methods use iterative search procedures and are used together with a model. Unlike filter methods, they take feature combinations into consideration and return a feature subset. They can be implemented in greedy or non-greedy ways. Greedy approaches like RFE may run into problems such as getting trapped in local optima. They are generally slower than filter methods, since the iterative approach requires a large amount of computation.

Embedded methods are faster than wrapper methods, since the selection process is embedded within the model-fitting process. They also provide a direct connection between feature selection and model performance, so you may get more satisfying results with embedded methods. One drawback of this approach is that it is model-dependent: your data may be better fit by a model that is not suitable for embedded feature selection.

Tips

— Feature selection should be applied within the inner loop when you are using accuracy estimation methods such as cross-validation, in order to prevent over-fitting; in other words, the selector should be fit only on the training folds (see the pipeline sketch after these tips).

— Normalization is generally applied before feature selection, but this depends on the chosen feature selection method and the data, so it is not a strict rule.

— If your data is imbalanced, you can balance it before the feature selection stage. I encounter many workflows that use this order. However, it is not a general rule either and depends on the conditions.
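
One way to keep feature selection inside the cross-validation loop is to wrap it in a pipeline, so that the selector is re-fit on each training fold. A minimal sketch, assuming the same dataframes as above:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
pipe = Pipeline([
    ('select', SelectKBest(score_func=f_classif, k=5)), # fitted only on the training folds
    ('model', RandomForestClassifier(n_estimators=500, random_state=0, max_depth=3))
])
scores = cross_val_score(pipe, df_wine_norm_features, df_wine_target, cv=5)
print(scores.mean())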

Useful Links

https://machinelearningmastery.com/an-introduction-to-feature-selection/

http://www.feat.engineering/selection.html
