# Feature Selection and Feature Extraction in Machine Learning: An Overview

Companies have more data than ever, so it’s crucial that your analytics team uncovers actionable insights rather than merely interesting ones — that it knows the difference between interesting data and useful data. Among the important aspects of Machine Learning are “**Feature Selection**” and “**Feature Extraction**”.

A universal problem of intelligent (learning) agents is where to focus their attention. It is critical to ask *“Which aspects of the problem at hand are important/necessary to solve it?”*, i.e. to discriminate between the relevant and irrelevant parts of experience.

**What is Feature selection (or Variable Selection)?**

Feature Selection is the problem of selecting some subset of a learning algorithm’s input variables upon which it should focus attention, while ignoring the rest — in other words, a form of **Dimensionality Reduction**. As humans, we do this constantly!

Mathematically speaking,

- Given a set of features *F = { f1, …, fi, …, fn }*, the Feature Selection problem is to find a subset *F’ ⊆ F* that “maximizes the learner’s ability to classify patterns”.
- Formally, *F’* should maximize some scoring function.
- This general definition subsumes **feature selection**: a **feature selection algorithm** also performs a mapping, but can only map to subsets of the input variables.
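The abstract definition above can be made concrete with a small sketch: a scoring function over feature subsets plus a greedy forward search that maximizes it. The data, the R²-based scorer, and the function names are illustrative assumptions, not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): features 0 and 1 drive the
# target y, while features 2-5 are pure noise.
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

def score(subset):
    """Scoring function: R^2 of a least-squares fit on the chosen columns."""
    if not subset:
        return 0.0
    A = X[:, sorted(subset)]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def forward_select(n_features, k):
    """Greedy forward search: repeatedly add the feature that most improves the score."""
    selected = set()
    for _ in range(k):
        best = max(set(range(n_features)) - selected,
                   key=lambda f: score(selected | {f}))
        selected.add(best)
    return selected

print(forward_select(X.shape[1], 2))  # picks the two informative features
```

Greedy forward search is only one of many possible heuristics, but it illustrates the pattern: the learner’s scoring function drives which subset of *F* survives.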

## Feature Selection : The Two Schools of Thoughts

There are two schools of thought on **Feature Selection**.

While the two schools of thought co-exist, we focus here on the motivation for **Feature Selection**:

- Especially when dealing with a large number of variables, there is a need for **Dimensionality Reduction**.
- **Feature Selection** can significantly improve a learning algorithm’s performance.

**The Curse of Dimensionality**

The required number of samples (to achieve the same accuracy) grows exponentially with the number of variables.

**In practice:** the number of training examples is fixed.

- The classifier’s performance will therefore usually degrade for a large number of features.

In many cases the information that is lost by discarding variables is made up for by a more accurate mapping/sampling in the lower-dimensional space.
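The exponential growth claimed above is easy to see with a small, assumed discretization argument: if each axis is split into a fixed number of bins, the number of cells to populate — and hence the number of samples needed to cover the space at the same density — grows as bins^d with dimension d.

```python
# Curse of dimensionality, illustrated: split each axis into 10 bins.
# Covering the space at one sample per cell needs 10**d samples.
bins = 10
for d in (1, 2, 3, 5, 10):
    print(f"d={d}: {bins ** d} cells to populate")
```

At d = 10 that is already ten billion cells, while real training sets stay fixed in size — which is why performance degrades as irrelevant features are added.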

**Feature Selection — Optimality?**

In theory, the goal is to find an optimal feature-subset (one that maximizes the scoring function).

In real world applications this is usually not possible.

- For most problems it is computationally intractable to search the whole space of possible feature subsets
- One usually has to settle for approximations of the optimal subset
- Most of the research in this area is devoted to finding efficient search-heuristics
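The intractability claim is simple counting: a set of n features has 2^n − 1 non-empty subsets, so exhaustively scoring every candidate subset is infeasible even for modest n. A minimal sketch:

```python
# Exhaustive search over feature subsets is intractable:
# n features yield 2**n - 1 non-empty candidate subsets.
def n_subsets(n):
    return 2 ** n - 1

for n in (10, 30, 100):
    print(f"n={n}: {n_subsets(n)} subsets")
```

Already at n = 30 there are over a billion subsets, which is why research concentrates on search heuristics rather than exhaustive evaluation.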

**Optimal Feature Subset:**

- Often, the optimal feature subset is defined in terms of the classifier’s performance
- The best one can hope for theoretically is the Bayes error rate

## Relevance of Features

There are several definitions of relevance in literature.

- Relevance of a single variable, relevance of a variable given other variables, relevance given a certain learning algorithm, …
- Most definitions are problematic, because there are problems where all features would be declared irrelevant
- This can be resolved by distinguishing two degrees of relevance: **weak and strong relevance**
- A feature is relevant *iff* it is weakly or strongly relevant, and irrelevant (redundant) otherwise

**Strong Relevance of a variable/feature:**

Let *Si = {f1, …, fi-1, fi+1, …, fn}* be the set of all features except *fi*. Denote by *si* a value-assignment to all features in *Si*.

A feature *fi* is strongly relevant iff there exist some *xi*, *y* and *si* with *p(fi = xi, Si = si) > 0* such that

*p(Y = y | fi = xi; Si = si) ≠ p(Y = y | Si = si)*

This means that removal of *fi* alone will always result in a performance deterioration of an optimal Bayes classifier.
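The strong-relevance condition can be checked directly on a tiny discrete distribution. Below is an assumed toy example (not from the post): *f1* and *f2* are uniform binary features and *Y = f1 XOR f2*, so conditioning on *f1* changes *p(Y | f2)* and *f1* is strongly relevant.

```python
from itertools import product

# Toy joint distribution (assumed for illustration):
# f1, f2 uniform on {0, 1}, Y = f1 XOR f2.
points = [(f1, f2, f1 ^ f2) for f1, f2 in product((0, 1), repeat=2)]

def p(y, f1=None, f2=None):
    """Conditional probability p(Y = y | the observed features)."""
    match = [pt for pt in points
             if (f1 is None or pt[0] == f1) and (f2 is None or pt[1] == f2)]
    return sum(pt[2] == y for pt in match) / len(match)

# Given f2 = 0 alone, Y is a coin flip; fixing f1 as well determines Y,
# so p(Y | f1, f2) != p(Y | f2): f1 is strongly relevant.
print(p(1, f2=0))        # 0.5
print(p(1, f1=1, f2=0))  # 1.0
```

Dropping *f1* here provably hurts the optimal Bayes classifier: without it, *Y* cannot be predicted better than chance.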

**Weak Relevance of a variable/feature:**

A feature *fi* is weakly relevant iff it is not strongly relevant, and there exists a subset of features *Si’ ⊆ Si* for which there exist some *xi*, *y* and *si’* with *p(fi = xi, Si’ = si’) > 0* such that

*p(Y = y | fi = xi; Si’ = si’) ≠ p(Y = y | Si’ = si’)*

This means that there exists a subset of features *Si’*, such that the performance of an optimal Bayes classifier on *Si’* is worse than on *Si’* ∪ { *fi* }.
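Weak relevance shows up with redundant features. In the assumed toy example below (not from the post), *f2* is an exact copy of *f1* and *Y = f1*: given *f2*, adding *f1* changes nothing, so *f1* is not strongly relevant — yet for the subset *Si’ = {}* it clearly matters, so it is weakly relevant.

```python
# Toy joint distribution (assumed for illustration):
# (f1, f2, Y) with f2 an exact duplicate of f1, and Y = f1.
points = [(0, 0, 0), (1, 1, 1)]

def p(y, f1=None, f2=None):
    """Conditional probability p(Y = y | the observed features)."""
    match = [pt for pt in points
             if (f1 is None or pt[0] == f1) and (f2 is None or pt[1] == f2)]
    return sum(pt[2] == y for pt in match) / len(match)

# Not strongly relevant: given f2, additionally observing f1 changes nothing.
print(p(1, f2=1), p(1, f1=1, f2=1))  # 1.0 1.0
# Weakly relevant: for the subset Si' = {} (f2 dropped), f1 does matter.
print(p(1), p(1, f1=1))              # 0.5 1.0
```

By symmetry *f2* is weakly relevant too — each copy is redundant given the other, which is exactly the situation the weak/strong distinction is designed to capture.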

**Summary**

- **Feature selection** can significantly increase the performance of a learning algorithm (both accuracy and computation time), but it is not easy!
- It makes it possible to work on problems with very high-dimensional feature spaces.
- Relevance ↔ Optimality: a relevant feature need not appear in the optimal subset, and vice versa.

Thanks for reading through this blog post. Any suggestions for improving it are warmly welcomed.
