Feature Selection and Feature Extraction in Machine Learning: An Overview

Companies have more data than ever, so it’s crucial that your analytics team uncovers actionable insight rather than merely interesting data, i.e. knows the difference between Interesting Data and Useful Data. Among the important aspects of Machine Learning are “Feature Selection” and “Feature Extraction”.

A universal problem of intelligent (learning) agents is where to focus their attention. It is critical to understand which aspects of the problem at hand are important or necessary to solve it, i.e. to discriminate between the relevant and irrelevant parts of experience.

Feature Selection and Feature Extraction in Machine Learning

Feature selection is the problem of selecting some subset of a learning algorithm’s input variables upon which it should focus attention, while ignoring the rest. In other words, it is a form of Dimensionality Reduction. As humans, we do this constantly!

What is Feature selection (or Variable Selection)?

Mathematically speaking,

  • Given a set of features F = { f1, …, fi, …, fn }, the Feature Selection problem is to find a subset F’ ⊆ F that “maximizes the learner’s ability to classify patterns”
  • Formally, F’ should maximize some scoring function
  • This is a special case of the more general Feature Extraction problem, which maps the inputs to a new feature space: a feature selection algorithm also performs a mapping, but can only map to subsets of the input variables
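To make the subset-search view concrete, here is a minimal sketch in pure Python. The dataset and the scoring function are hypothetical stand-ins (not from the post): the score is simply the training accuracy of a majority-vote rule over the selected feature values, playing the role of “the learner’s ability to classify patterns”.

```python
# Feature selection as subset search: enumerate every subset F' of the
# features and score it with a (toy) scoring function.
from itertools import combinations
from collections import Counter, defaultdict

# Toy binary dataset: y depends only on f0; f1 is noise, f2 duplicates f0.
X = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1),
     (0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
y = [0, 0, 1, 1, 0, 0, 1, 1]

def score(subset):
    """Majority-vote training accuracy using only the features in `subset`."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[tuple(row[i] for i in subset)].append(label)
    # Each group predicts its majority label; count how many rows that gets right.
    correct = sum(Counter(labels).most_common(1)[0][1]
                  for labels in groups.values())
    return correct / len(y)

features = range(3)
best = max((s for r in range(1, 4) for s in combinations(features, r)),
           key=lambda s: (score(s), -len(s)))  # prefer smaller subsets on ties
print(best, score(best))  # (0,) alone already achieves accuracy 1.0
```

In a real pipeline the scoring function would be something like cross-validated accuracy of the actual learner, but the search structure is the same.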

There are two schools of thought on Feature Selection.

Feature Selection: The Two Schools of Thought

While the two schools of thought co-exist, we will focus here on the motivation for Feature Selection.

  • Especially when dealing with a large number of variables, there is a need for Dimensionality Reduction
  • Feature Selection can significantly improve a learning algorithm’s performance
The Curse of Dimensionality

The required number of samples (to achieve the same accuracy) grows exponentially with the number of variables.
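This exponential growth can be illustrated with a standard back-of-the-envelope argument (my illustration, not from the post): to keep a fixed sampling density of k bins per axis, a d-dimensional feature space needs k**d samples.

```python
# Samples needed to cover a d-dimensional space at a fixed per-axis density.
def samples_needed(bins_per_axis: int, n_features: int) -> int:
    return bins_per_axis ** n_features

for d in (1, 3, 10):
    print(d, samples_needed(10, d))
# With 10 bins per axis: 1 feature needs 10 samples, 3 features need 1,000,
# and 10 features already need 10,000,000,000.
```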

In practice: the number of training examples is fixed.

  • the classifier’s performance usually will degrade for a large number of features

In many cases the information that is lost by discarding variables is made up for by a more accurate mapping/sampling in the lower-dimensional space.

In theory, the goal is to find an optimal feature-subset (one that maximizes the scoring function).

In real world applications this is usually not possible.

  • For most problems it is computationally intractable to search the whole space of possible feature subsets
  • One usually has to settle for approximations of the optimal subset
  • Most of the research in this area is devoted to finding efficient search-heuristics
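One of the most common such search heuristics is greedy forward selection: instead of scoring all 2**n subsets, repeatedly add the single feature that improves the score most, stopping when no feature helps. The sketch below uses a made-up `score` function for demonstration; in practice it would be, e.g., cross-validated accuracy.

```python
# Greedy forward selection: an O(n^2)-evaluations approximation to the
# intractable exhaustive search over all 2**n feature subsets.
def forward_selection(n_features, score):
    selected = []
    best_score = score(selected)
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        f_best = max(candidates, key=lambda f: score(selected + [f]))
        new_score = score(selected + [f_best])
        if new_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(f_best)
        best_score = new_score
    return selected, best_score

# Toy score: subset {0, 2} is optimal; every other feature only adds noise.
toy = lambda s: len(set(s) & {0, 2}) - 0.1 * len(set(s) - {0, 2})
print(forward_selection(5, toy))  # -> ([0, 2], 2.0)
```

Like all greedy heuristics, this can miss subsets whose features are only useful jointly (e.g. XOR-like interactions), which is exactly why it is an approximation of the optimal subset rather than a guarantee.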
Feature Selection and Feature Extraction

Optimal Feature Subset:

  • Often, the optimal feature subset is defined in terms of the classifier’s performance
  • The best one can hope for theoretically is the Bayes error rate

There are several definitions of relevance in the literature.

  • Relevance of a single variable, relevance of a variable given other variables, relevance given a certain learning algorithm, …
  • Most definitions are problematic, because there are problems where all features would be declared to be irrelevant
  • This can be defined through two degrees of relevance: weak and strong relevance.
  • A feature is relevant iff it is weakly or strongly relevant, and irrelevant (redundant) otherwise.

Strong Relevance of a variable/feature:

Let Si = {f1, …, fi-1, fi+1, …, fn} be the set of all features except fi. Denote by si a value assignment to all features in Si.

A feature fi is strongly relevant iff there exist some xi, y and si with p(fi = xi, Si = si) > 0 such that

p(Y = y | fi = xi, Si = si) ≠ p(Y = y | Si = si)

This means that removal of fi alone will always result in a performance deterioration of an optimal Bayes classifier.
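This definition can be checked by hand on a small discrete distribution. The sketch below (a toy illustration, with a hypothetical `cond_dist` helper) estimates p(Y | fi, Si) and p(Y | Si) from a table of (f1, f2, y) rows and shows that conditioning on f1 changes the class distribution given the remaining feature.

```python
# Checking strong relevance on a tiny discrete dataset.
from collections import Counter

def cond_dist(rows, given):
    """Empirical p(y | given), where `given` maps column index -> value."""
    matching = [y for (*fs, y) in rows
                if all(fs[i] == v for i, v in given.items())]
    total = len(matching)
    return {y: c / total for y, c in Counter(matching).items()}

# y = f1 AND f2: both features are strongly relevant.
rows = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]

# Fix S_1 = {f2} at f2=1 and vary f1: the distribution of y changes,
# so f1 is strongly relevant.
print(cond_dist(rows, {0: 0, 1: 1}))  # {0: 1.0}
print(cond_dist(rows, {0: 1, 1: 1}))  # {1: 1.0}
print(cond_dist(rows, {1: 1}))        # {0: 0.5, 1: 0.5}
```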

Weak Relevance of a variable/feature:

A feature fi is weakly relevant iff it is not strongly relevant, and there exists a subset of features Si’ of Si for which there exist some xi, y and si’ with p(fi = xi, Si’ = si’) > 0 such that

p(Y = y | fi = xi, Si’ = si’) ≠ p(Y = y | Si’ = si’)

This means that there exists a subset of features Si’ such that the performance of an optimal Bayes classifier on Si’ is worse than on Si’ ∪ { fi }.
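The classic illustration of weak relevance is a duplicated feature: if f2 is an exact copy of f1, neither is strongly relevant (given its copy, removing the other changes nothing), yet each is weakly relevant via the subset Si’ = {}. The toy check below (self-contained, with the same kind of hypothetical conditional-distribution helper as above) demonstrates this.

```python
# Weak relevance via a duplicated feature.
from collections import Counter

def cond_dist(rows, given):
    """Empirical p(y | given), where `given` maps column index -> value."""
    matching = [y for (*fs, y) in rows
                if all(fs[i] == v for i, v in given.items())]
    total = len(matching)
    return {y: c / total for y, c in Counter(matching).items()}

# f2 duplicates f1 exactly; y = f1.
rows = [(0, 0, 0), (0, 0, 0), (1, 1, 1), (1, 1, 1)]

# Not strongly relevant: given f2, additionally knowing f1 changes nothing.
print(cond_dist(rows, {0: 1, 1: 1}) == cond_dist(rows, {1: 1}))  # True

# Weakly relevant: take S_1' = {} -- then f1 alone shifts the distribution.
print(cond_dist(rows, {0: 1}))  # {1: 1.0}
print(cond_dist(rows, {}))      # {0: 0.5, 1: 0.5}
```

This is also why weakly relevant features are often called redundant: either copy can be dropped, but not both.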

  • Feature selection can significantly increase the performance of a learning algorithm (both accuracy and computation time) — but it is not easy!
  • One can work on problems with very high-dimensional feature spaces
  • Relevance and optimality are distinct notions: a relevant feature is not necessarily part of the optimal feature subset, and vice versa

Thanks for reading through this blog post. Any suggestions for further improving it are welcome.

For your quick reference, link to the next Blog Post in this Series.

StrataHive: Analytics, Machine Learning, Big Data, Digital Marketing, User Experience
