Machine Learning: Unsupervised Learning — Feature selection

Michele Cavaioni
Machine Learning bites
3 min read · Feb 6, 2017

Having a lot of features to deal with is not necessarily a good thing.

Instead, we should always aim to identify the subset of features that are more relevant than the others and that are responsible for the trends we observe in the data.

We want to select these features for two main reasons:

  • Knowledge discovery: finding the ones that matter.
  • Curse of dimensionality: the amount of data needed grows exponentially with the number of features, on the order of 2^N (where N is the number of features). With just 20 binary features, for instance, there are already 2^20, roughly a million, possible input combinations to cover. When dealing with exponential growth we need to be very careful about how quickly the problem size increases.

There are two ways to approach this problem:

  • Filtering.
  • Wrapping.

Filtering:

The filtering method is represented by a search algorithm that acts as a “feature selector” prior to the learning algorithm.

The advantage of this method is speed, since the learning algorithm already has fewer inputs to deal with.

On the other hand, there is no feedback from the learning algorithm, which has no input into the feature selection. The features are in fact filtered as isolated inputs: some of them might be valuable in combination with others, yet still be discarded by the search algorithm.

Decision trees and boosting algorithms use the filtering method. The first, in particular, takes advantage of information gain (looking at the class label) to select the relevant features.

Besides information gain, other filtering criteria that can be used are:

  • Variance, entropy.
  • Running a neural network and removing the features with low weights.
  • Selecting independent, non-redundant features.
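
To make the filtering idea concrete, here is a minimal sketch in Python with scikit-learn (the synthetic dataset and the choice of k = 5 are my own, purely for illustration). It scores each feature in isolation with mutual information, essentially the same quantity a decision tree uses as information gain, and keeps the top k:

```python
# Minimal filtering sketch: score each feature in isolation, before any learner sees the data.
# Assumes scikit-learn; the synthetic dataset and k=5 are arbitrary illustrative choices.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# 20 features, only 5 of which actually carry signal about the class label.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)

# Rank every feature by its mutual information with the label and keep the best 5.
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_reduced = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)  # (500, 5)
```

Note that the learning algorithm never sees the discarded columns, which is exactly why filtering is fast but can throw away features that only matter in combination with others.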

Wrapping:

This method combines the search and learning algorithms. The two communicate with each other, exchanging feedback on how well a feature performs and therefore whether it gets selected.

The advantage is precisely that feedback is passed from the learning algorithm to the search algorithm. Unlike the previous method, this one takes the model’s bias into account. Unfortunately, the process is slow.

There are several ways of performing the wrapping method, such as:

  • Hill climbing (i.e. gradient search).
  • Randomized optimization.
  • Forward search.
  • Backward search.
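
What all of these searches have in common is the evaluation step: a candidate subset of features is scored by actually training the learner on it. Here is a minimal sketch of that step, assuming scikit-learn (the logistic-regression learner, the synthetic dataset, and the 5-fold cross-validation are my own illustrative choices):

```python
# Wrapper evaluation step: the learning algorithm itself scores a candidate feature subset.
# Assumes scikit-learn; the learner, dataset and CV setup are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

def score_subset(subset):
    """Cross-validated accuracy of the learner using only the columns listed in `subset`."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, list(subset)], y, cv=5).mean()

print(score_subset([0, 3, 7]))    # feedback for one candidate subset
print(score_subset(range(20)))    # feedback for the full feature set, for comparison
```

Every candidate subset costs a full (cross-validated) training run, which is exactly where the slowness of wrapping comes from.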

A few words on how the last two methods are performed.

The forward search algorithm first looks at each feature independently and keeps the best one, evaluating the score given by the learning algorithm.

Once it selects that first feature (the one with the highest score), it adds a second one, chosen among all the remaining features as the one that performs best in conjunction with the first.

Then it continues with a third feature, performing the same type of selection, and finally stops when it no longer sees any substantial improvement over the previous step.
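
A minimal sketch of such a greedy forward selection, in Python with scikit-learn (the learner, the synthetic dataset, and the 0.01 improvement threshold are my own illustrative choices, not part of the lecture), could look like this:

```python
# Greedy forward selection: at each round, add the feature that helps the learner most.
# Assumes scikit-learn; learner, CV setup and the 0.01 stopping threshold are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

def score(subset):
    """Cross-validated accuracy of the learner restricted to the given feature columns."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, subset], y, cv=5).mean()

selected, best_score = [], 0.0
remaining = list(range(X.shape[1]))

while remaining:
    # Try each remaining feature in conjunction with the ones already selected.
    candidate_scores = {f: score(selected + [f]) for f in remaining}
    best_feature = max(candidate_scores, key=candidate_scores.get)
    if candidate_scores[best_feature] - best_score < 0.01:  # no substantial improvement
        break
    selected.append(best_feature)
    remaining.remove(best_feature)
    best_score = candidate_scores[best_feature]

print("selected features:", selected, "score:", round(best_score, 3))
```

Each round costs one cross-validated training run per remaining feature, which makes the price of wrapping very concrete.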

The backward search works basically the same way, but in reverse: it starts with all the features, sees which one can be eliminated, and continues until there is no substantial improvement.
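
If you want to try either direction without writing the loop yourself, scikit-learn (version 0.24 or later, as far as I know) ships a ready-made wrapper; the settings below are illustrative:

```python
# Off-the-shelf wrapper search in scikit-learn; estimator and n_features_to_select are illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5,
                                direction="backward")  # "forward" gives the forward search
X_reduced = sfs.fit_transform(X, y)
print("kept feature indices:", sfs.get_support(indices=True))
```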

Before I conclude this section, I would like to emphasize the definition of two words that are loosely used in feature selection: relevance and usefulness.

In machine learning these words are crucially important.

Before describing the characteristics of both, I would like to introduce a concept called the Bayes Optimal Classifier (B.O.C.), which is important for understanding “relevance”.

The Bayes Optimal Classifier (B.O.C.) takes the weighted average of all the hypotheses, each weighted by its probability of being the correct hypothesis given the data. It is simply the best we could possibly do, on average.
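
In symbols (this is the standard textbook formulation rather than anything specific to the lecture), given a hypothesis set H, training data D, and possible labels V, the B.O.C. predicts:

```latex
% Bayes Optimal Classification: weight each hypothesis by its posterior given the data,
% and pick the label with the largest total weighted vote.
v_{BOC} = \arg\max_{v \in V} \sum_{h \in H} P(v \mid h)\, P(h \mid D)
```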

Now that we have this notion we can express what “relevance” is in feature selection.

- A feature is strongly relevant if removing it degrades the B.O.C.

- A feature is instead weakly relevant if:

  • it is not strongly relevant.
  • a subset of features S exists, such that adding the feature to S improves the B.O.C.

- A feature is otherwise irrelevant.

Relevance measures the effect on the B.O.C.

Usefulness instead measures the effect on a particular predictor.

Relevance is about information, while usefulness is about minimizing the error given a model or a learner.

Ultimately, relevance and usefulness are criteria to be considered for feature selection.

This blog has been inspired by the lectures in Udacity's Machine Learning Nanodegree. (http://www.udacity.com)


Michele Cavaioni
Machine Learning bites

Passionate about AI, ML, DL, and Autonomous Vehicle tech. CEO of CritiqueMatch.com, a platform that helps writers and bloggers to connect and exchange feedback.