Feature Selection and Feature Extraction in Machine Learning: An Overview

Mehul Ved
5 min read · Jul 19, 2018


Companies have more data than ever, so it is crucial to ensure that your analytics team is uncovering actionable rather than merely interesting data, i.e. that it knows the difference between interesting data and useful data. Among the important aspects of Machine Learning that help with this are “Feature Selection” and “Feature Extraction”.

A universal problem of intelligent (learning) agents is where to focus their attention. It is critical to understand which aspects of the problem at hand are important or necessary to solve it, i.e. to discriminate between the relevant and irrelevant parts of experience.


What is Feature Selection (or Variable Selection)?

It is the problem of selecting a subset of a learning algorithm’s input variables upon which it should focus attention, while ignoring the rest; in other words, a form of Dimensionality Reduction. As humans, we do this constantly!


Mathematically speaking,

  • Given a set of features F = {f1, …, fi, …, fn}, the Feature Selection problem is to find a subset F’ ⊆ F that “maximizes the learner’s ability to classify patterns” (a small search sketch follows this list)
  • Formally, F’ should maximize some scoring function
  • This makes Feature Selection a special case of Feature Extraction: a feature selection algorithm also performs a mapping of the inputs, but it can only map to subsets of the original variables rather than constructing new ones
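To make this concrete, here is a minimal sketch (not from the original post) that treats cross-validated accuracy as the scoring function and exhaustively scores every non-empty subset of the four Iris features. scikit-learn, the load_iris data and the LogisticRegression learner are illustrative assumptions, not part of the definition above.

```python
# Minimal sketch: exhaustive search for the feature subset F' that maximizes
# a scoring function, here taken to be mean cross-validated accuracy.
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

def score(cols):
    """Scoring function: mean CV accuracy of a learner on the chosen columns."""
    return cross_val_score(LogisticRegression(max_iter=1000), X[:, cols], y, cv=5).mean()

best_subset, best_score = None, -1.0
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        s = score(list(subset))
        if s > best_score:
            best_subset, best_score = subset, s

print("Best subset:", best_subset, "score:", round(best_score, 3))
```

With only four features, the 15 non-empty subsets can be scored directly; the sections below explain why this brute-force approach breaks down as the number of features grows.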

Feature Selection: The Two Schools of Thought

There are two schools of thought on Feature Selection.


While these two schools of thought co-exist, we will focus here on the motivation for Feature Selection.

  • Especially when dealing with a large number of variables, there is a need for Dimensionality Reduction
  • Feature Selection can significantly improve a learning algorithm’s performance

The Curse of Dimensionality


The required number of samples (to achieve the same accuracy) grows exponentially with the number of variables.

In practice, the number of training examples is fixed:

  • the classifier’s performance will usually degrade for a large number of features

In many cases, the information lost by discarding variables is made up for by a more accurate mapping/sampling in the lower-dimensional space.
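To see this with a fixed training-set size, here is a minimal sketch (an illustration, not from the original post) that keeps 200 training examples and 5 informative features while adding more and more irrelevant ones; scikit-learn’s make_classification and a k-nearest neighbors classifier are assumed purely for illustration.

```python
# Minimal sketch: with a fixed number of training examples, adding
# uninformative features tends to degrade a simple classifier's accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for n_noise in [0, 10, 50, 200]:
    X, y = make_classification(
        n_samples=200,           # fixed, small training set
        n_features=5 + n_noise,  # 5 informative + n_noise irrelevant features
        n_informative=5,
        n_redundant=0,
        random_state=0,
    )
    acc = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()
    print(f"{n_noise:>3} irrelevant features -> CV accuracy ~ {acc:.2f}")
```

Typically the cross-validated accuracy drops as irrelevant features are added, which is precisely the motivation for discarding them.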

Feature Selection — Optimality?

In theory, the goal is to find an optimal feature subset (one that maximizes the scoring function).

In real-world applications, this is usually not possible.

  • For most problems it is computationally intractable to search the whole space of possible feature subsets
  • One usually has to settle for approximations of the optimal subset
  • Most of the research in this area is devoted to finding efficient search heuristics (a greedy forward-selection sketch follows this list)
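One such heuristic is greedy forward selection: start from the empty set and repeatedly add the single feature that most improves the score, instead of searching all 2^n subsets. The sketch below is only an illustration; the wine dataset and decision-tree learner are arbitrary choices, and scikit-learn is assumed.

```python
# Minimal sketch of greedy forward selection: grow the subset one feature at a
# time, always adding the feature that most improves the cross-validated score.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

def score(cols):
    """Scoring function: mean CV accuracy on the chosen feature columns."""
    return cross_val_score(DecisionTreeClassifier(random_state=0), X[:, cols], y, cv=5).mean()

selected, remaining, best_so_far = [], list(range(X.shape[1])), 0.0
while remaining:
    candidate = max(remaining, key=lambda f: score(selected + [f]))
    candidate_score = score(selected + [candidate])
    if candidate_score <= best_so_far:   # stop when no remaining feature helps
        break
    selected.append(candidate)
    remaining.remove(candidate)
    best_so_far = candidate_score

print("Greedy subset:", selected, "CV accuracy:", round(best_so_far, 3))
```

Greedy search offers no optimality guarantee; it is simply one of the cheaper approximations mentioned above.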

Optimal Feature Subset:

  • Often, the optimal feature subset is defined in terms of the classifier’s performance
  • The best one can hope for theoretically is the Bayes error rate (see the formula below)
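For reference, the Bayes error rate is the error of the ideal classifier that always predicts the most probable class given the full feature vector X; in the notation used below it can be written as

E_Bayes = E_x [ 1 − max_y p(Y = y | X = x) ]

No feature subset can do better than this, which is why it serves as the theoretical lower bound here.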

Relevance of Features

There are several definitions of relevance in the literature.

  • Relevance of a single variable, relevance of a variable given other variables, relevance given a certain learning algorithm, etc.
  • Most definitions are problematic, because there are problems for which all features would be declared irrelevant
  • This can be resolved by defining two degrees of relevance: weak and strong relevance
  • A feature is relevant iff it is weakly or strongly relevant, and irrelevant (redundant) otherwise

Strong Relevance of a variable/feature:

Let Si = {f1, …, fi-1, fi+1, …, fn} be the set of all features except fi. Denote by si a value-assignment to all features in Si.

A feature fi is strongly relevant iff there exist some xi, y and si for which p(fi = xi, Si = si) > 0 such that

p(Y = y | fi = xi; Si = si) ≠ p(Y = y | Si = si)

This means that removal of fi alone will always result in a performance deterioration of an optimal Bayes classifier.
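As a toy illustration (a hypothetical example, not from the original post): let Y = f1 XOR f2 with f1 and f2 independent fair coins. Knowing f2 alone says nothing about Y, but dropping f1 changes the conditional distribution of Y for every assignment, so f1 is strongly relevant. The small enumeration below (Python is an arbitrary choice) checks this directly from the joint distribution:

```python
# Toy check of strong relevance for Y = f1 XOR f2 with independent fair coins.
from itertools import product

# Joint distribution p(f1, f2, Y): uniform over the four XOR-consistent tuples.
joint = {(f1, f2, f1 ^ f2): 0.25 for f1, f2 in product([0, 1], repeat=2)}

def p(y=None, f1=None, f2=None):
    """Probability of a partial assignment under the joint distribution."""
    return sum(pr for (a, b, c), pr in joint.items()
               if (f1 is None or a == f1)
               and (f2 is None or b == f2)
               and (y is None or c == y))

# Strong relevance of f1: p(Y | f1, f2) differs from p(Y | f2) for every assignment.
for x1, x2, y in product([0, 1], repeat=3):
    with_f1 = p(y=y, f1=x1, f2=x2) / p(f1=x1, f2=x2)   # p(Y=y | f1=x1, f2=x2)
    without_f1 = p(y=y, f2=x2) / p(f2=x2)               # p(Y=y | f2=x2)
    print(f"y={y}, f1={x1}, f2={x2}: {with_f1:.2f} vs {without_f1:.2f}")
```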

Weak Relevance of a variable/feature:

A feature fi is weakly relevant iff it is not strongly relevant, and there exists a subset of features Si' of Si for which there exist some xi, y and si' with p(fi = xi, Si' = si') > 0 such that

p(Y = y | fi = xi; Si’ = si’) ≠ p(Y = y | Si’ = si’)

This means that there exists a subset of features Si' such that the performance of an optimal Bayes classifier on Si' is worse than on Si' ∪ {fi}.
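A matching toy illustration of weak relevance (again hypothetical, not from the original post): let Y = f1 and let f2 be an exact copy of f1. Given f2, feature f1 adds nothing, so it is not strongly relevant; but relative to the empty subset Si' = {} it changes the distribution of Y, so it is weakly relevant.

```python
# Toy check of weak relevance: Y = f1, and f2 is an exact copy of f1.
# Joint distribution p(f1, f2, Y) with f1 a fair coin, f2 = f1 and Y = f1.
joint = {(f1, f1, f1): 0.5 for f1 in [0, 1]}

def p(y=None, f1=None, f2=None):
    """Probability of a partial assignment under the joint distribution."""
    return sum(pr for (a, b, c), pr in joint.items()
               if (f1 is None or a == f1)
               and (f2 is None or b == f2)
               and (y is None or c == y))

# Not strongly relevant: given f2, conditioning on f1 does not change p(Y).
print(p(y=1, f1=1, f2=1) / p(f1=1, f2=1), "equals", p(y=1, f2=1) / p(f2=1))
# Weakly relevant: on the empty subset Si' = {}, conditioning on f1 changes p(Y).
print(p(y=1, f1=1) / p(f1=1), "differs from", p(y=1))
```

Either copy of the feature is useful on its own, even though neither is strictly needed once the other is present.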

Summary

  • Feature selection can significantly increase the performance of a learning algorithm (both accuracy and computation time) — but it is not easy!
  • It enables one to work on problems with very high-dimensional feature spaces
  • Relevance and optimality are related but distinct notions: a relevant feature is not necessarily part of an optimal subset, and an optimal subset need not contain every relevant feature

Thanks for reading through this blog post. Any suggestions for further improving it are most welcome.

For your quick reference, here is a link to the next blog post in this series.
