Efficient Feature Selection via CMA-ES (Covariance Matrix Adaptation Evolution Strategy)

Using evolutionary algorithms for fast feature selection with large datasets

Florin Andrei
Towards Data Science
11 min read · Jan 12, 2024


This is part 1 of a two-part series about feature selection. Read part 2 here.

For more examples of the applications of CMA-ES, check this paper by Nomura and Shibata; this article is mentioned (ref. [6]) in the paper as a noteworthy application of optimization via CMA-ES with Margin.

When you’re fitting a model to a dataset, you may need to perform feature selection: keeping only some subset of the features to fit the model, while discarding the rest. This can be necessary for a variety of reasons:

  • to keep the model explainable (having too many features makes interpretation harder)
  • to avoid the curse of dimensionality
  • to maximize/minimize some objective function related to the model (R-squared, AIC, etc.)
  • to avoid a poor fit, etc.

If the number of features N is small, an exhaustive search may be doable: you can literally try all possible combinations of features, and just keep the one that minimizes the cost / objective function. But if N is large, an exhaustive search becomes impossible. The total number of combinations to try is 2^N, which grows exponentially; once N is larger than a few dozen it is prohibitive (2^50 is already more than 10^15). In such cases you must use a heuristic method: explore the search space in an efficient way, looking for a combination of features that minimizes the objective function you use to conduct the search.

What you’re looking for is a vector [1, 0, 0, 1, 0, 1, 1, 1, 0, ...] of length N, with elements taking values from {0, 1}. Each element in the vector is assigned to a feature: if the element is 1, the feature is selected; if it is 0, the feature is discarded. You need to find the vector that minimizes the objective function. The search space has as many dimensions as there are features (N); the only possible values along any dimension are 0 and 1.
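As a minimal illustration of this encoding (the toy DataFrame and column names below are made up, not the dataset used later), here is how a 0/1 vector translates into a feature subset:

import numpy as np
import pandas as pd

# toy data: 5 observations, 4 candidate features
X = pd.DataFrame(np.random.rand(5, 4), columns=['f1', 'f2', 'f3', 'f4'])

# one candidate solution: keep f1 and f4, discard f2 and f3
mask = np.array([1, 0, 0, 1])

# turn the 0/1 vector into the corresponding feature subset
selected = [col for col, keep in zip(X.columns, mask) if keep == 1]
X_subset = X[selected]  # the model would be fit on this subset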

Finding a good heuristic algorithm is not trivial. The regsubsets() function in R has some options you can use. Also, scikit-learn offers several methods to perform a heuristic feature selection, provided your problem is a good fit for their techniques. But a good, general-purpose heuristic, one that works with an arbitrary objective function, is a hard problem. In this series of articles we will explore a few options that may work reasonably well even when N is large and the objective function can be literally anything you can compute in code, provided it's not too slow.

Dataset and full code

For all optimization techniques in this series of articles, I am using the very popular House Prices dataset on Kaggle (MIT license); after a few simple feature transformations, it ends up having 213 features (N=213) and 1453 observations. The model I'm using is linear regression, statsmodels.api.OLS(), and the objective function I am trying to minimize is BIC, the Bayesian Information Criterion, a measure of information loss: a lower BIC is better. It is similar to AIC, the Akaike Information Criterion, except BIC tends to produce more parsimonious models: it prefers models with fewer features. Minimizing either AIC or BIC tends to reduce overfitting. Other objective functions could be used instead, e.g. R-squared (the explained variance in the target) or adjusted R-squared; just keep in mind that a larger R-squared is better, so that becomes a maximization problem.

Ultimately, the choice of objective function is irrelevant here. What matters is that we have one objective function that we’re consistently trying to optimize using various techniques.

The full code used in this series of articles is contained in a single notebook in my feature selection repository — also linked at the end. I will provide code snippets within the text here, but please check the notebook for the full context.

We’re going to try to minimize BIC via feature selection, so here is the baseline value of BIC from statsmodels.api.OLS() before any feature selection is done, with all features enabled:

baseline BIC: 34570.166173470934
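For reference, this baseline can be computed directly from the full design matrix. A minimal sketch, assuming X already contains the 'const' intercept column and y is the target, as in the notebook:

import statsmodels.api as sm

# fit OLS on all features and report the Bayesian Information Criterion
baseline_model = sm.OLS(y, X, hasconst=True).fit()
print(f'baseline BIC: {baseline_model.bic}')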

And now let’s check a well-known, tried-and-true feature selection technique, which we will compare with the more complex techniques described later.

SFS (Sequential Feature Selection)

SFS, in its forward version, is fairly simple. It starts by choosing a single feature: the one that, on its own, minimizes the objective function. Once a feature is selected, it stays selected forever. Then it tries to add a second feature to it, again picking the one that minimizes the objective. At every step it increases the number of selected features by 1, each time looking for the best new feature to add to the existing collection. The search ends when all features have been added; whichever combination encountered along the way minimizes the objective wins.

SFS is a greedy algorithm: each choice is locally optimal, and it never goes back to correct its own mistakes. But it's reasonably fast even when N is quite large. The total number of combinations it tries is N(N+1)/2, a quadratic polynomial (whereas an exhaustive search needs an exponential number of trials).
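To make the greedy loop concrete, here is a minimal from-scratch sketch (not the mlxtend implementation used below; objective() stands for whatever cost is being minimized, e.g. BIC):

def sequential_forward_selection(all_features, objective):
    # greedy forward selection: grow the selected set one feature at a time
    selected = []
    remaining = list(all_features)
    best_overall_score = float('inf')
    best_overall_subset = []

    while remaining:
        # try adding each remaining feature; keep the one with the best (lowest) objective
        scores = [(objective(selected + [f]), f) for f in remaining]
        best_score, best_feature = min(scores, key=lambda s: s[0])
        selected.append(best_feature)
        remaining.remove(best_feature)
        if best_score < best_overall_score:
            best_overall_score = best_score
            best_overall_subset = list(selected)

    return best_overall_score, best_overall_subset

The inner loop calls objective() N + (N-1) + ... + 1 = N(N+1)/2 times in total, which matches the objective run count reported below.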

Let’s see what the SFS code may look like in Python, using the mlxtend library:

import time

import statsmodels.api as sm
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.base import BaseEstimator


class DummyEstimator(BaseEstimator):
    # mlxtend wants to use an sklearn estimator, which is not needed here
    # (statsmodels OLS is used instead)
    # create a dummy estimator to pacify mlxtend
    def fit(self, X, y=None, **kwargs):
        return self


def neg_bic(m, X, y):
    # objective function: negative BIC, since mlxtend maximizes the score
    lin_mod_res = sm.OLS(y, X, hasconst=True).fit()
    return -lin_mod_res.bic


seq_selector = SFS(
    DummyEstimator(),
    k_features=(1, X.shape[1]),
    forward=True,
    floating=False,
    scoring=neg_bic,
    cv=None,
    n_jobs=-1,
    verbose=0,
    # make sure the intercept is not dropped
    fixed_features=['const'],
)

n_features = X.shape[1] - 1
objective_runs_sfs = round(n_features * (n_features + 1) / 2)
t_start_seq = time.time()
# mlxtend will mess with your dataframes if you don't .copy()
seq_res = seq_selector.fit(X.copy(), y.copy())
t_end_seq = time.time()
run_time_seq = t_end_seq - t_start_seq
seq_metrics = seq_res.get_metric_dict()

It runs through the combinations quickly and these are the summary results:

best k:         36
best objective: 33708.98602877906
R2 @ best k: 0.9075677543649224
objective runs: 22791
total run time: 42.448 sec

The best number of features is 36 out of 213. The best BIC is 33708.986 (baseline value before feature selection is 34570.166), and it takes less than 1 minute to complete on my system. It invokes the objective function 22.8k times.

These are the best BIC and R-squared values, as functions of the number of features selected:

BIC and R-squared for SFS

For more information, such as the names of the features actually selected, check the notebook in the repository.

Now let’s try something more complex.

CMA-ES (Covariance Matrix Adaptation Evolution Strategy)

This is a numeric optimization algorithm. It’s in the same class as genetic algorithms (they’re both evolutionary), but CMA-ES is quite distinct from GA. The algorithm is stochastic, and it does not require you to compute a derivative of the objective function (unlike gradient descent, which relies on partial derivatives). It is computationally efficient, and it’s used in a variety of numeric optimization libraries such as Optuna. I will only attempt a brief sketch of CMA-ES here; for more detailed explanations, please refer to the literature in the links section at the end.

Consider the 2-dimensional Rastrigin function:

Rastrigin function: f(x, y) = 20 + x^2 + y^2 - 10 (cos(2πx) + cos(2πy))

The heatmap below shows the values of this function; brighter colors mean higher values. The function has its global minimum at the origin (0, 0), but it's peppered with many local extrema. We need to find the global minimum via CMA-ES.

the Rastrigin function heatmap

CMA-ES is based on the multivariate normal distribution. It generates test points in the search space from this distribution. You have to provide an initial guess for the mean of the distribution and for its standard deviation, but after that the algorithm iteratively updates all of these parameters, sweeping the distribution through the search space, looking for the best objective function values. Here's the distribution that the test points are drawn from:

the CMA-ES sampling distribution: x_i ~ m + σ · N(0, C), for i = 1, ..., λ

x_i are the test points that the algorithm generates at each step in the search space. λ (lambda) is the number of points generated: the population size. m, the mean of the distribution, is updated at every step and will hopefully converge to the true solution eventually. σ (sigma) is the standard deviation of the distribution: the spread of the test points. C is the covariance matrix: it defines the shape of the distribution. Depending on the values of C, the distribution may have a “round” shape or a more elongated, oval shape. Changes to C allow CMA-ES to “sneak” into certain areas of the search space, or to avoid other areas.
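In numpy terms, drawing one generation of test points amounts to sampling from this multivariate normal. A schematic illustration (the numbers are arbitrary; this is not the cmaes library's internal code):

import numpy as np

rng = np.random.default_rng(0)

mean = np.array([3.0, -3.0])   # current distribution mean (2D example)
sigma = 1.3                    # overall standard deviation / step size
C = np.eye(2)                  # covariance matrix (starts out as the identity)
lam = 6                        # population size (number of test points)

# x_i ~ mean + sigma * N(0, C)
points = rng.multivariate_normal(mean, sigma**2 * C, size=lam)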

first generation of test points

The image above shows a population of 6 points; 6 is the default population size picked by the optimizer for this problem. This is the first step. After this, the algorithm needs to:

  • compute the objective function (Rastrigin) for each point
  • update the mean, the standard deviation, and the covariance matrix, effectively creating a new multivariate normal distribution, based on what it has learned from the objective function
  • generate a new set of test points from the new distribution
  • repeat until some stopping criterion is fulfilled (the mean converges, the maximum number of steps is exceeded, etc.); see the code sketch after this list
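In code, this ask / evaluate / tell loop is quite compact. Here is a minimal sketch using the cmaes library on the 2D Rastrigin function (the starting mean and sigma are arbitrary guesses):

import numpy as np
from cmaes import CMA

def rastrigin(x):
    # 2D Rastrigin; the global minimum is 0, at the origin
    return 10 * len(x) + sum(xi**2 - 10 * np.cos(2 * np.pi * xi) for xi in x)

optimizer = CMA(mean=np.array([3.0, -3.0]), sigma=1.3, seed=0)

for generation in range(50):
    solutions = []
    for _ in range(optimizer.population_size):
        x = optimizer.ask()                    # draw a test point from the current distribution
        solutions.append((x, rastrigin(x)))    # evaluate the objective at that point
    optimizer.tell(solutions)                  # update the mean, sigma and covariance matrix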

I will not show the updates for all distribution parameters here, or else this article would become very long; check the links at the end for a complete explanation. But updating just the mean of the distribution is quite simple, and it works like this: after computing the objective function for each test point, weights are assigned to the points, with larger weights given to the points with better objective values, and a weighted sum of their positions becomes the new mean. Effectively, CMA-ES moves the distribution mean towards the points with better objective values:

updating the CMA-ES distribution mean
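In schematic numpy form, just the mean update could look like this. The log-rank weights follow the convention in Hansen's tutorial; this is an illustration, not the exact update performed by the cmaes library:

import numpy as np

def update_mean(points, objective_values, mu=None):
    # move the distribution mean towards the better-scoring test points
    points = np.asarray(points)
    lam = len(points)
    if mu is None:
        mu = lam // 2                        # only the best half of the points get a weight

    order = np.argsort(objective_values)     # best (lowest) objective values first
    best_points = points[order[:mu]]

    # log-rank weights: the best point gets the largest weight
    weights = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    weights /= weights.sum()

    return weights @ best_points             # weighted sum of positions becomes the new mean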

If the algorithm converges to the true solution, then the mean of the distribution will converge to that solution, and the standard deviation will converge to 0. The covariance matrix will change the shape of the distribution (round or oval) depending on the landscape of the objective function, extending into promising areas and shying away from bad ones.

Here’s an animated GIF showing the evolution in time of the test points while CMA-ES is solving the Rastrigin problem:

animation showing the convergence of CMA-ES

CMA-ES for feature selection

The 2D Rastrigin function was relatively simple, since it only has 2 dimensions. For our feature selection problem, we have N=213 dimensions. Moreover, the space is not continuous: each test point is an N-dimensional vector with component values from {0, 1}. In other words, each test point is a binary vector that looks like this: [1, 0, 0, 1, 1, 1, 0, ...]. But other than that, the problem is the same: we need to find the points (vectors) that minimize the objective function, the BIC of an OLS model.

Here’s a simple version of the CMA-ES code for feature selection, using the cmaes library:

import numpy as np
from cmaes import CMAwM
from tqdm import tqdm


def cma_objective(fs):
    # objective function: BIC of the OLS model fit on the selected features
    features_use = ['const'] + [
        f for i, f in enumerate(features_select) if fs[i] == 1
    ]
    lin_mod = sm.OLS(y_cmaes, X_cmaes[features_use], hasconst=True).fit()
    return lin_mod.bic


X_cmaes = X.copy()
y_cmaes = y.copy()
features_select = [f for f in X_cmaes.columns if f != 'const']

dim = len(features_select)
bounds = np.tile([0, 1], (dim, 1))
steps = np.ones(dim)
optimizer = CMAwM(
    mean=np.full(dim, 0.5),
    sigma=1 / 6,
    bounds=bounds,
    steps=steps,
    n_max_resampling=10 * dim,
    seed=0,
)

max_gen = 100
best_objective_cmaes_small = np.inf
best_sol_raw_cmaes_small = None
for gen in tqdm(range(max_gen)):
    solutions = []
    for _ in range(optimizer.population_size):
        x_for_eval, x_for_tell = optimizer.ask()
        value = cma_objective(x_for_eval)
        solutions.append((x_for_tell, value))
        if value < best_objective_cmaes_small:
            best_objective_cmaes_small = value
            best_sol_raw_cmaes_small = x_for_eval
    optimizer.tell(solutions)

best_features_cmaes_small = [
    features_select[i]
    for i, val in enumerate(best_sol_raw_cmaes_small.tolist())
    if val == 1.0
]
print(f'best objective: {best_objective_cmaes_small}')
print(f'best features: {best_features_cmaes_small}')
The CMA-ES optimizer is defined with some initial guesses for the mean value and for sigma (the standard deviation). It then loops through many generations, creating test points x_for_eval, evaluating them with the objective, modifying the distribution (mean, sigma, covariance matrix), etc. Each x_for_eval point is a binary vector [1, 1, 1, 0, 0, 1, ...] used to select the features from the dataset.

Please note that the CMAwM() optimizer (CMA with Margin) is used instead of the default CMA(). The default optimizer works well for regular, continuous problems, but this search space is high-dimensional and only two discrete values (0 and 1) are allowed along each dimension. The default optimizer gets stuck in this space. CMAwM() slightly enlarges the search space (while the solutions it returns are still binary vectors), and that seems to be enough to unblock it.

This simple code definitely works, but it’s far from optimal. In the companion notebook, I have a more complex, optimized version of it, which is able to find better solutions, faster. But the code is big, so I will not show it here — check the notebook.

The image below shows the history of the complex, optimized CMA-ES code searching for the best solution. The heatmap shows the prevalence / popularity of each feature at every generation (brighter = more popular). You can see how some features are always very popular, while others fall out of fashion quickly, and yet others are “discovered” later. The population size picked by the optimizer, given the parameters of this problem, is 20 points (individuals), so feature popularity is averaged across these 20 points.

CMA-ES optimization history
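The popularity values in such a heatmap are easy to accumulate while the optimizer runs. A hypothetical helper for this bookkeeping (not the author's exact notebook code):

import numpy as np

def feature_popularity(generations):
    # generations: a list of (population_size, n_features) arrays of binary solutions,
    # one array per generation; returns an (n_generations, n_features) array holding
    # the fraction of individuals that selected each feature in each generation
    return np.array([np.asarray(pop).mean(axis=0) for pop in generations])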

These are the main stats of the optimized CMA-ES code:

best objective:  33703.070530508514
best generation: 921
objective runs: 20000
time to best: 48.326 sec

It was able to find a better (smaller) objective value than SFS, it invoked the objective function fewer times (20k), and it took about the same time to do it. This is a net gain over SFS by all metrics.

Again, the baseline BIC before any feature selection is:

baseline BIC: 34570.166173470934

As a side note: after studying traditional optimization algorithms (genetic algorithms, simulated annealing, etc.), CMA-ES was a pleasant surprise. It has almost no hyperparameters, it's computationally lightweight, it only needs a small population of individuals (points) to explore the search space, and yet it performs quite well. It's worth keeping in your toolbox if you need to solve optimization problems.

This is part 1 of a two-part series about feature selection. Read part 2 here.

Notes and links

Thank you, cmaes team, for unblocking me. Your explanations really helped!

All images were created by the author.

Repository with all code: https://github.com/FlorinAndrei/fast_feature_selection

The House Prices dataset (MIT license): https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

The mlxtend library: https://github.com/rasbt/mlxtend

The cmaes library: https://github.com/CyberAgentAILab/cmaes

cmaes : A Simple yet Practical Python Library for CMA-ES — a paper by Nomura M. and Shibata M. (2024) describing practical applications of CMA-ES as an optimization algorithm: https://arxiv.org/abs/2402.01373

The CMA Evolution Strategy: A Tutorial — a paper by Hansen N. (2016) describing CMA-ES in detail: https://arxiv.org/abs/1604.00772

The Wikipedia entry on CMA-ES: https://en.wikipedia.org/wiki/CMA-ES
