Common Methods for Feature Selection You Should Know

Chuan Xu
4 min read · Jun 18, 2018


Photo by Chuan Xu, all rights reserved

The goal of a feature selection algorithm is to find the optimal feature subset using an evaluation measure. The choice of evaluation measure distinguishes the three main strategies of feature selection algorithms: the wrapper strategy, the filter strategy, and the embedded strategy.

Wrapper Methods

Wrapper methods train a new model for each subset and use the error rate of the model on a hold-out set to score feature subsets. Wrapper methods are subdivided into exhaustive search, heuristic search, and random search.

Exhaustive search methods, such as BFS (Breadth First Search), enumerate all possible feature combinations. They are rarely used in practice since the time complexity is O(2^n).
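To make that cost concrete, here is a minimal sketch of exhaustive search in Python; `score` is a hypothetical user-supplied callback (for example, the cross-validated accuracy of a model trained on that subset):

```python
from itertools import combinations

def exhaustive_search(features, score):
    """Enumerate every non-empty feature subset and keep the best one.
    With n features the inner loop runs 2^n - 1 times, which is why
    this approach is rarely practical."""
    best_subset, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            s = score(subset)  # hypothetical evaluation callback
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score
```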

Non-exhaustive search methods are optimizations of exhaustive search. For example, Branch and Bound Search saves time by pruning branches that cannot contain a solution better than the best one found so far.

Heuristic search includes SFS (Sequential Forward Selection) and SBS (Sequential Backward Selection). SFS starts from the empty set and, in each round, adds the feature x to the current subset X that most improves the evaluation metric. SBS, on the other hand, starts from the full feature set and deletes one feature x per round. Both SFS and SBS are greedy algorithms and are likely to get stuck in a local optimum.
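As a rough sketch (not the exact procedure of any particular library), SFS can be written as a greedy loop over candidate features; `evaluate` is a hypothetical callback that scores a feature subset, e.g. by cross-validated accuracy:

```python
def sequential_forward_selection(features, evaluate):
    """Greedy SFS: start from the empty set and, in each round, add the one
    feature that most improves `evaluate`; stop when no addition helps."""
    selected, remaining = [], list(features)
    best_score = float("-inf")
    while remaining:
        score, best_f = max(
            ((evaluate(selected + [f]), f) for f in remaining),
            key=lambda t: t[0],
        )
        if score <= best_score:
            break  # adding any single feature no longer improves the score
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = score
    return selected, best_score
```

scikit-learn provides a similar routine, `sklearn.feature_selection.SequentialFeatureSelector`, which supports both the forward (SFS) and backward (SBS) directions.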

BDS (Bidirectional Search) runs SFS and SBS at the same time and stops searching when both arrive at the same feature subset.

LRS (Plus-L Minus-R Selection) combines the ideas of SFS and SBS and has two forms. One starts from the empty set, adding L features in each round and then removing R features so that the evaluation metric is optimal. The other starts from the full set, removing R features in each round and then adding back L features to optimize the evaluation metric. The choice of L and R is the key to this algorithm.

Random search methods first randomly generate a feature subset and then run another algorithm on it. For instance, RGSS (Random Generation plus Sequential Selection) performs SFS and SBS on a randomly selected subset of features to escape local optima. However, random search methods depend on random factors, so experimental results are difficult to reproduce.

Wrapper methods usually provide the best-performing feature set for a particular type of model. However, they are very computationally intensive, since a new model must be trained for each subset.

Filter Methods

Instead of the error rate, filter methods use evaluation criteria based on the intrinsic relationships between features to score a feature subset. Filter methods are independent of the type of predictive model, so the result of a filter is more general than that of a wrapper.

Filters are usually less computationally intensive than wrappers. Filter methods can also be used as a preprocessing step for wrapper methods, allowing a wrapper to be applied to larger problems.

Common measures fall into four types: distance metrics, correlation, mutual information, and consistency metrics.

A distance metric on a feature set is a function of two features, and there are many ways to define it depending on the problem. For feature selection, we could use interclass distance or intraclass distance.

A correlation coefficient indicates the dependency between features. The most common measure is Pearson's correlation coefficient, obtained by dividing the covariance of the two variables by the product of their standard deviations. Other correlation measures include Spearman's rank correlation coefficient and Kendall's rank correlation coefficient.
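As a small illustration (assuming the features are the columns of a NumPy array `X` and `y` is the target), features can be ranked by their absolute Pearson correlation with the target:

```python
import numpy as np
from scipy.stats import pearsonr

def rank_by_pearson(X, y):
    """Score each column of X by |Pearson r| with y and return the column
    indices ordered from most to least correlated."""
    scores = np.array([abs(pearsonr(X[:, j], y)[0]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores
```

For the rank-based variants, `scipy.stats` also offers `spearmanr` and `kendalltau`.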

Mutual information is more general than the correlation coefficient. It quantifies the "amount of information" obtained about one feature through the other feature. Mutual information measures how different the joint distribution p(X, Y) is from the product of the marginal distributions p(X)p(Y). It is the expected value of the point-wise mutual information.
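For example, scikit-learn can estimate the mutual information between each feature and a class label (shown here on the built-in iris data purely as an illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

data = load_iris()
# Estimated mutual information between each feature and the class label;
# larger values mean the feature carries more information about the target.
mi = mutual_info_classif(data.data, data.target, random_state=0)
for name, score in zip(data.feature_names, mi):
    print(f"{name}: {score:.3f}")
```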

A popular consistency metric is the chi-square statistic. To use it for feature selection, we compute the chi-square statistic between each feature and the target. The intuition is that features that are independent of the target are uninformative.
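A brief sketch with scikit-learn (chi-square requires non-negative feature values, which the iris measurements satisfy):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
# Keep the 2 features with the highest chi-square statistic w.r.t. the target.
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("chi-square scores:", selector.scores_)
print("selected feature indices:", selector.get_support(indices=True))
```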

Embedded Methods

In embedded techniques, the feature selection algorithm is integrated as part of the learning algorithm.

The most typical embedded technique is the decision tree algorithm. Decision tree algorithms select a feature at each recursive step of the tree-growing process and divide the sample set into smaller subsets. The more the samples in a child node belong to the same class, the more informative the chosen feature is. The process of decision tree generation is thus also a process of feature selection. ID3, C4.5, and CART are all common decision tree algorithms.
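One way to read the feature selection out of a fitted tree is through its impurity-based importances, e.g. with scikit-learn's CART implementation (again using iris only as an illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)
# feature_importances_ measures how much each feature reduces impurity
# across all the splits where it is used.
for name, imp in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")
```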

Other examples of this approach are LASSO with the L1 penalty and Ridge with the L2 penalty for constructing a linear model. LASSO shrinks many coefficients to exactly zero, effectively dropping those features, while Ridge shrinks coefficients toward, but not exactly to, zero.
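A short sketch with scikit-learn's Lasso; the regularization strength alpha=1.0 is an arbitrary illustrative choice, and the features are standardized first because L1 penalties are scale-sensitive:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=1.0).fit(X, y)
# Features whose coefficients were driven exactly to zero are dropped.
kept = np.flatnonzero(lasso.coef_)
print("selected feature indices:", kept)
print("their coefficients:", lasso.coef_[kept])
```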
