Feature Selection in Data Preprocessing

Wojtek Fulmyk, Data Scientist
8 min read · Aug 3, 2023


Article level: Advanced

My clients often ask me about the specifics of certain data pre-processing methods, why they’re needed, and when to use them. I will discuss a few common (and not-so-common) preprocessing methods in a series of articles on the topic.

In this preprocessing series:

Data Standardization — A Brief Explanation — Beginner
Data Normalization — A Brief Explanation — Beginner
One-hot Encoding — A Brief Explanation — Beginner
Ordinal Encoding — A Brief Explanation — Beginner
Missing Values in Dataset Preprocessing — Intermediate
Text Tokenization and Vectorization in NLP — Intermediate
Outlier Detection in Dataset Preprocessing — Intermediate
Feature Selection in Data Preprocessing — Advanced

In this specific short writeup I will explain how to perform feature selection on your dataset. I generally give a glossary at the beginning of each article, but for advanced and expert articles I will assume that readers have a working knowledge of some aspects of ML, so I will do away with the glossary. Give it a go, and if you need more info, just ask in the comments section!

Feature Selection

Feature selection is an important step in machine learning pipelines. It refers to selecting the most relevant features to use for model training. Reducing the number of features can simplify models, shorten training times, improve accuracy, and prevent overfitting. There are many techniques for selecting subsets of features.

Filter Methods

Filter methods select features based on statistical properties of the data. Simple filters like a variance threshold remove low-variance features that don't provide much information. Statistical tests like ANOVA can identify features that are correlated with the target variable. Information-theoretic approaches like mutual information measure the predictive power of each feature. Filter methods are fast and scalable, but they ignore how features interact with the specific model algorithm.
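
Before turning to the statistical criteria below, here is a minimal sketch of the variance-threshold filter just mentioned, using scikit-learn's VarianceThreshold on a tiny made-up matrix (the values and the 0.01 cutoff are arbitrary choices for illustration):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# the second column is nearly constant, so it carries little information
X = np.array([[1.0, 0.0, 3.1],
              [2.0, 0.0, 2.9],
              [3.0, 0.1, 3.0],
              [4.0, 0.0, 3.2]])

# drop any column whose variance falls below the threshold
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
print("Kept columns:", selector.get_support())  # [ True False  True]
print(X_reduced)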

  • Pearson correlation measures the linear relationship between a feature x and target y. Values of r closer to +1 or -1 indicate a stronger linear relationship.

r = sum of (x - mean of x) * (y - mean of y) / sqrt( sum of (x - mean of x)^2 * sum of (y - mean of y)^2 )

Example: Given height and weight data for 100 people, the correlation coefficient r is 0.7. This positive r value indicates a strong correlation — as height increases, weight also tends to increase linearly. Since height has a high correlation to weight, it is selected as a useful feature to include in modeling weight predictions. Other features like hair color may have very low correlation to weight and would not be selected.

  • ANOVA F-test compares variance between group means to variance within groups. Higher F means group means are farther apart compared to variance within groups.

F = Variance between group means / Variance within groups

Example: A model predicts income level based on education features. An ANOVA F-test finds the income averages for high school, college, and graduate degrees are very different from each other compared to the variation of incomes within each education group. The education features with the highest F-values have the most spread-out income averages, indicating they strongly differentiate income levels. Those features would be selected, while education features that do not show large income average differences between their groups would not be selected.
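
As a concrete sketch of this kind of filter, here is how ANOVA F-values can be computed with scikit-learn's f_classif and SelectKBest. I use the Iris dataset rather than the income example above simply because it ships with scikit-learn, and the choice of k=2 is arbitrary:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

# score each feature by the F-statistic of its group means across the classes
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(X, y)

for name, f_value in zip(feature_names, selector.scores_):
    print(name, "F =", round(f_value, 1))
print("Kept:", [f for f, keep in zip(feature_names, selector.get_support()) if keep])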

  • Mutual information between x and y quantifies dependence. Higher MI means x and y are more closely dependent.

MI = sum over all values of x and y of p(x, y) * log( p(x, y) / (p(x) * p(y)) )

Example: If feature x, like weekly study hours, has very high mutual information with target y of student GPA, then x provides a lot of information about y and the two variables are highly dependent. Feature x would be selected for its high predictive power of GPA. Other features with lower mutual information, such as number of extracurricular activities, are less useful in predicting GPA.
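
Here is a minimal sketch of mutual-information scoring with scikit-learn's mutual_info_regression. The study-hours and extracurricular numbers below are synthetic dummy values invented for illustration, with GPA constructed to depend mainly on study hours:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
study_hours = rng.uniform(0, 40, size=200)
extracurriculars = rng.integers(0, 6, size=200).astype(float)
# synthetic GPA that depends mostly on study hours, plus noise
gpa = 2.0 + 0.05 * study_hours + rng.normal(0, 0.2, size=200)

X = np.column_stack([study_hours, extracurriculars])
mi = mutual_info_regression(X, gpa, random_state=0)
print("MI(study hours, GPA):", round(mi[0], 3))
print("MI(extracurriculars, GPA):", round(mi[1], 3))
# the feature with the clearly higher MI score would be kept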

Wrapper Methods

Wrapper methods select features by testing different combinations on a model. Forward selection starts with no features and adds them one by one, keeping changes that improve performance. Backward elimination starts with all features and removes them one by one. Exhaustive search evaluates all possible combinations. Wrapper methods find subsets tailored to the model, but they can be slow and prone to overfitting.

  • Exhaustive search tests all possible feature combinations.

Example: With 3 features A, B, and C, exhaustive search evaluates every non-empty combination: A alone, B alone, C alone, A and B, A and C, B and C, and A, B, and C together. This complete search guarantees the optimal subset but is infeasible for many features.

  • Forward selection greedily adds features that improve model accuracy.

Example: Starting with no features, add A alone and test whether it increases model accuracy. If so, keep A. Then add B to A and test whether accuracy increases further. If so, keep A and B. Continue adding each remaining feature one by one, keeping those that improve the model, until no single addition improves accuracy. This finds a good subset efficiently, but the search is greedy: it never re-evaluates earlier selections and does not guarantee the optimal subset.
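
A short sketch of greedy forward selection using scikit-learn's SequentialFeatureSelector (available in recent scikit-learn versions, 0.24+); the Iris dataset, the decision-tree estimator, and the choice to stop at two features are all arbitrary choices for illustration:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

# greedily add one feature at a time, keeping the addition that helps CV accuracy most
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,
    direction="forward",
    cv=5)
selector.fit(X, y)

selected = [f for f, keep in zip(feature_names, selector.get_support()) if keep]
print("Selected features:", selected)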

  • Genetic algorithms evolve populations of feature subsets toward an optimal solution.

Example: The algorithm starts with random feature subsets. The top performing subsets are selected and combined through crossover, mixing aspects of the top subsets, and mutation, making random changes. This breeding produces a new generation of feature subsets. Over many generations, subsets evolve toward higher performance. Genetic algorithms provide a heuristic search for the optimal feature subset.
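
To make the idea concrete, here is a minimal genetic-algorithm sketch over boolean feature masks on the Iris dataset. The population size, mutation rate, crossover scheme, and generation count are arbitrary illustrative choices, not tuned values:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

def fitness(mask):
    # cross-validated accuracy of a tree trained on the masked features
    if not mask.any():
        return 0.0
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, X[:, mask], y, cv=5).mean()

# random initial population of boolean feature masks
population = rng.integers(0, 2, size=(8, n_features)).astype(bool)

for generation in range(10):
    scores = np.array([fitness(mask) for mask in population])
    # selection: keep the top half as parents
    parents = population[np.argsort(scores)[-4:]]
    children = []
    while len(children) < len(population):
        p1, p2 = parents[rng.choice(len(parents), 2, replace=False)]
        # crossover: take each gene from one of the two parents at random
        child = np.where(rng.random(n_features) < 0.5, p1, p2)
        # mutation: flip each gene with small probability
        child ^= rng.random(n_features) < 0.1
        children.append(child)
    population = np.array(children)

best = max(population, key=fitness)
print("Best feature mask:", best, "accuracy:", round(fitness(best), 3))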

Embedded Methods

Embedded methods learn which features are most useful during model training. Regularization methods like LASSO and ridge regression add penalties on the model coefficients: LASSO can shrink the coefficients of weak features all the way to zero, effectively eliminating them, while ridge only shrinks them toward zero. Embedded methods are efficient since selection happens as part of model training.

  • LASSO regression adds up the absolute sizes of the coefficients as a penalty:

min (Error + alpha*sum of absolute values of coefficients)

Example: LASSO regression predicts house prices based on features like number of bedrooms, size, location etc. The alpha parameter controls the penalty strength. With higher alpha, more feature coefficients will be shrunk to zero and removed. For instance, with strong regularization, only number of bedrooms and size may remain, while other features get eliminated.

  • Ridge regression adds up the squared coefficients as a penalty:

min (Error + lambda*sum of coefficients squared)

Example: Ridge regression is used to predict exam scores based on study time, prep course, and student IQ. The lambda parameter controls the coefficient penalty strength. With higher lambda, coefficients for less useful features like prep course are shrunk closer to zero, while important coefficients like study time remain larger. Unlike LASSO, ridge does not set coefficients exactly to zero, so prep course is downweighted rather than removed outright, while study time and IQ keep most of their influence.
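
Here is a small sketch contrasting ridge shrinkage at two penalty strengths. The study-time, prep-course, and IQ values are synthetic dummy data generated for illustration, with the exam score constructed to depend mostly on study time and IQ; note that scikit-learn calls the penalty strength alpha rather than lambda:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n = 200
study_time = rng.uniform(0, 20, n)
prep_course = rng.integers(0, 2, n).astype(float)
iq = rng.normal(100, 15, n)
# synthetic exam score driven mostly by study time and IQ, barely by prep course
score = 20 + 2.5 * study_time + 0.2 * iq + 0.5 * prep_course + rng.normal(0, 5, n)

X = np.column_stack([study_time, prep_course, iq])
for lam in (1.0, 1000.0):
    ridge = Ridge(alpha=lam).fit(X, score)
    print("lambda =", lam, "coefficients:", np.round(ridge.coef_, 3))
# a larger penalty shrinks the coefficients toward zero,
# but unlike LASSO it does not set any of them exactly to zero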

Example Python Code

Here is some sample code showing how to use SciPy and scikit-learn for feature selection. First, I will show you how to use the Pearson correlation from the filter methods explained above, then the exhaustive search from the wrapper methods on the widely used Iris dataset, and finally the LASSO regression from the embedded methods. (Note that the code is for demonstrative purposes; it is not meant to win Kaggle competitions for efficiency, hehe)

Pearson correlation

I will use simple dummy data, compute the correlation, and then make a decision about whether to select the feature or not.

from scipy.stats import pearsonr

height = [1, 2, 3, 4, 5]
weight = [3, 4, 5, 9, 10]

# calculate Pearson correlation
r, p_value = pearsonr(height, weight)
print("Correlation:", r)

# feature selection thresholds
strong_threshold = 0.7
strong_inverse_threshold = -0.7

# feature selection decision
if r > strong_threshold:
    print("Strong positive correlation between height and weight.")
    print("Select both features.")
elif r < strong_inverse_threshold:
    print("Strong negative correlation between height and weight.")
    print("Select both features.")
else:
    print("Weak correlation between height and weight.")
    print("Only select height feature.")

This will output the following:

Correlation: 0.9645788568769382
Strong positive correlation between height and weight.
Select both features.

Exhaustive search

I will use the Iris dataset and the Decision Tree classifier from sklearn. As it's an exhaustive search across 4 features, I will print out the result for every combination, along with a decision about whether that feature combination should be used or not.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import itertools

# load iris dataset
X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

# generate feature combinations
combos = itertools.chain(
    itertools.combinations(feature_names, 4),
    itertools.combinations(feature_names, 3),
    itertools.combinations(feature_names, 2),
    itertools.combinations(feature_names, 1))

# evaluate feature subsets
for features in combos:

    X_temp = X[:, [i for i, f in enumerate(feature_names) if f in features]]

    model = DecisionTreeClassifier()
    scores = cross_val_score(model, X_temp, y, cv=5)

    print("Selected features:", features)
    if scores.mean() > 0.90:
        print("Accuracy:", scores.mean(), "High accuracy. Select all features.")
    else:
        print("Accuracy:", scores.mean(), "Low accuracy with features. Do not select.")

This will output the following (results will vary):

Selected features: ('sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)')
Accuracy: 0.9600000000000002 High accuracy. Select all features.
Selected features: ('sepal length (cm)', 'sepal width (cm)', 'petal length (cm)')
Accuracy: 0.9333333333333332 High accuracy. Select all features.
Selected features: ('sepal length (cm)', 'sepal width (cm)', 'petal width (cm)')
Accuracy: 0.9199999999999999 High accuracy. Select all features.
Selected features: ('sepal length (cm)', 'petal length (cm)', 'petal width (cm)')
Accuracy: 0.9600000000000002 High accuracy. Select all features.
Selected features: ('sepal width (cm)', 'petal length (cm)', 'petal width (cm)')
Accuracy: 0.9600000000000002 High accuracy. Select all features.
Selected features: ('sepal length (cm)', 'sepal width (cm)')
Accuracy: 0.7333333333333333 Low accuracy with features. Do not select.
Selected features: ('sepal length (cm)', 'petal length (cm)')
Accuracy: 0.9133333333333333 High accuracy. Select all features.
Selected features: ('sepal length (cm)', 'petal width (cm)')
Accuracy: 0.9266666666666665 High accuracy. Select all features.
Selected features: ('sepal width (cm)', 'petal length (cm)')
Accuracy: 0.8933333333333333 Low accuracy with features. Do not select.
Selected features: ('sepal width (cm)', 'petal width (cm)')
Accuracy: 0.9333333333333332 High accuracy. Select all features.
Selected features: ('petal length (cm)', 'petal width (cm)')
Accuracy: 0.9533333333333334 High accuracy. Select all features.
Selected features: ('sepal length (cm)',)
Accuracy: 0.6933333333333334 Low accuracy with features. Do not select.
Selected features: ('sepal width (cm)',)
Accuracy: 0.5066666666666666 Low accuracy with features. Do not select.
Selected features: ('petal length (cm)',)
Accuracy: 0.9200000000000002 High accuracy. Select all features.
Selected features: ('petal width (cm)',)
Accuracy: 0.9533333333333334 High accuracy. Select all features.

Lasso regression

I will again use the Iris dataset, whose features I've mapped to the shorter names "A", "B", "C", and "D", and the Lasso model from sklearn. I've adjusted the alpha parameter so that 2 useful features keep non-zero coefficients, with the other coefficients shrunk to 0.

from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
import numpy as np

# feature mapping dictionary
feature_mapping = {'A': 'sepal length',
                   'B': 'sepal width',
                   'C': 'petal length',
                   'D': 'petal width'}

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lasso = Lasso(alpha=0.04)
lasso.fit(X_train, y_train)

coefs = lasso.coef_
selected, = np.nonzero(coefs)

# get the selected letters
selected_letters = [list(feature_mapping.keys())[i] for i in selected]
print(feature_mapping)
print("Selected features:")
print(selected_letters)
print("Lasso accuracy:", lasso.score(X_test, y_test))

This will output the following (results will vary):

{'A': 'sepal length', 'B': 'sepal width', 'C': 'petal length', 'D': 'petal width'}
Selected features:
['C', 'D']
Lasso R^2 score: 0.9090708820329702

And that’s all! I will leave you with some “fun” trivia 😊

Trivia

  • One of the earliest works in automatic feature selection was published in 1953 by Hassler Whitney. He proposed mathematical criteria for selecting features based on their predictive power, and his method determined which parameters were most relevant for calculating the trajectories of physical objects. This helped to accurately determine the trajectories of astronomical objects like the comet Swift using a minimal set of input parameters.
  • In an influential 1994 paper, "Irrelevant Features and the Subset Selection Problem," John, Kohavi, and Pfleger introduced wrappers for feature selection. This was a major advance over previous filter methods that relied on simple statistical measures independent of any model.

