“Exploring Wrapper Methods for Optimal Feature Selection in Machine Learning”

ajaymehta
10 min read · May 24, 2023


Wrapper Methods

In feature selection, a wrapper method is a technique that selects features by evaluating a model’s performance on different candidate subsets of features.

The basic idea behind the wrapper method is to use a machine learning model as a black-box function to evaluate subsets of features. The wrapper method generates a set of candidate feature subsets and then uses a model to train and evaluate each subset. Based on the model’s performance, the wrapper method selects the best subset of features.

The wrapper method is computationally expensive and can be prone to overfitting. However, it is a powerful method that can identify complex relationships between features and provide accurate predictions. The wrapper method is often used in applications such as image recognition, speech recognition, and text classification.

Here’s how wrapper methods work in general:

  1. Subset Generation: First, a subset of features is generated. This can be done in a variety of ways. For example, you might start with one feature and gradually add more, or start with all features and gradually remove them, or generate subsets of features randomly. The subset generation method depends on the specific type of wrapper method being used.
  2. Subset Evaluation: After a subset of features has been generated, a model is trained on this subset, and the model’s performance is evaluated, usually through cross-validation. The performance of the model gives an estimate of the quality of the features in the subset.
  3. Stopping Criterion: This process is repeated, generating and evaluating different subsets of features, until some stopping criterion is met. This could be a certain number of subsets evaluated, a certain amount of time elapsed, or no improvement in model performance after a certain number of iterations. A minimal sketch of this loop is shown below.
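
To make these three steps concrete, here is a minimal hand-rolled sketch of the wrapper loop. The generated dataset, the logistic-regression estimator, the random subset generator, and the 20-iterations-without-improvement stopping rule are all illustrative assumptions; any model, subset-generation strategy, and stopping criterion can be plugged in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data standing in for a real dataset
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
rng = np.random.default_rng(0)

best_score, best_mask, stale = -np.inf, None, 0
while stale < 20:  # 3. Stopping criterion: 20 iterations without improvement
    # 1. Subset generation: draw a random non-empty subset of the columns
    mask = rng.random(X.shape[1]) < 0.5
    if not mask.any():
        continue
    # 2. Subset evaluation: cross-validated accuracy of the model on this subset
    score = cross_val_score(LogisticRegression(max_iter=1000), X[:, mask], y, cv=5).mean()
    if score > best_score:
        best_score, best_mask, stale = score, mask, 0
    else:
        stale += 1

print("best subset:", np.flatnonzero(best_mask), "CV accuracy:", round(best_score, 3))
```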

1. Exhaustive Feature Selection/Best Subset Selection

Here is the notebook where I applied all of the wrapper-method feature selection techniques covered in this post:

wrapper-methods.ipynb — Colaboratory (google.com)

Exhaustive feature selection, also known as best subset selection, is a method used to select the best combination of features from a given set of features in a machine learning problem. The goal is to find the subset of features that maximizes the performance of the model.

In this method, all possible combinations of features are evaluated, and the best subset is selected based on a performance metric, such as accuracy or mean squared error. The number of possible combinations grows exponentially with the number of features, following the formula 2^n - 1, where n is the number of features.

Let’s consider an example to understand this better. Suppose we have three features: f1, f2, and f3, and we want to predict a target variable y. The exhaustive feature selection process would involve evaluating all possible combinations of these three features.

With n = 3, that gives 2³ - 1 = 7 subsets to evaluate.

The possible combinations (subsets) of features are:

  1. f1
  2. f2
  3. f3
  4. f1, f2
  5. f1, f3
  6. f2, f3
  7. f1, f2, f3

To apply best subset selection, we would train and evaluate a model using each of these combinations. For example, let’s say we are using a linear regression model and our performance metric is mean squared error (MSE). We would train the model using the features in each combination and calculate the MSE for each combination.

After evaluating all possible combinations, we select the subset that gives the best performance. In this case, let’s assume that the subset f1, f2, f3 yields the lowest MSE. Thus, the best subset selected is f1, f2, f3, indicating that these three features provide the most predictive power for our model.

The advantage of exhaustive feature selection is that it guarantees finding the best subset of features in terms of performance. However, it can become computationally expensive as the number of features increases since the number of combinations grows exponentially. Therefore, it may not be practical for datasets with a large number of features.

Alternative methods, such as forward selection and backward elimination, can be used to reduce the computational complexity while still achieving a reasonably good subset of features. These methods incrementally add or remove features based on their individual performance rather than evaluating all possible combinations.
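
As a concrete sketch of exhaustive selection, the code below enumerates all 2³ - 1 = 7 subsets of a three-feature toy regression problem (standing in for f1, f2, f3) and scores each with cross-validated mean squared error; the generated dataset and the linear-regression estimator are placeholders.

```python
from itertools import combinations

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy regression data with three features standing in for f1, f2, f3
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

results = {}
for size in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), size):  # all 2^3 - 1 = 7 subsets
        cols = list(subset)
        # Cross-validated MSE of a linear regression trained on this subset only
        mse = -cross_val_score(LinearRegression(), X[:, cols], y,
                               scoring="neg_mean_squared_error", cv=5).mean()
        results[subset] = mse

best = min(results, key=results.get)
print("best subset:", best, "CV MSE:", round(results[best], 2))
```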

Disadvantages

  1. Computational Complexity: The main drawback of exhaustive feature selection is its computational cost. As the number of features increases, the number of combinations to check grows exponentially (2^n), making the method computationally expensive and time-consuming. This can be addressed by using alternative feature selection techniques that are computationally more efficient, such as forward selection or backward elimination. These methods incrementally add or remove features based on their individual performance, rather than evaluating all possible combinations.
  2. Risk of Overfitting: When evaluating all possible feature combinations, there is a risk of overfitting the model to the training data. Overfitting occurs when the model becomes too complex and captures noise or random fluctuations in the training data, leading to poor generalization on unseen data.

To mitigate the risk of overfitting, it is crucial to use proper techniques such as cross-validation. Cross-validation splits the data into multiple folds, training the model on all but one fold and validating on the held-out fold, rotating so that each fold serves as the validation set once. This estimates the model’s performance on unseen data and guards against selecting a feature subset that only looks good on the training data.

Additionally, regularization techniques like L1 (Lasso) or L2 (Ridge) regularization can be employed. Regularization adds a penalty term to the model’s objective function, discouraging the inclusion of unnecessary features or limiting the magnitude of their coefficients. This helps to prevent overfitting by promoting simpler models.
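
As a small illustration of both safeguards, the sketch below cross-validates a Lasso-regularized regression on a toy dataset; the dataset and the regularization strength alpha are arbitrary placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, n_informative=4, noise=5.0, random_state=0)

# L1 (Lasso) regularization shrinks the coefficients of unhelpful features toward zero,
# while 5-fold cross-validation estimates how well the model generalizes to unseen data.
lasso = Lasso(alpha=1.0)
print("mean CV R²:", cross_val_score(lasso, X, y, cv=5, scoring="r2").mean())

lasso.fit(X, y)
print("features with non-zero coefficients:", (lasso.coef_ != 0).sum(), "out of", X.shape[1])
```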

3. Requires a Good Evaluation Metric: The effectiveness of exhaustive feature selection depends on the quality of the evaluation metric used to assess the goodness of a feature subset. If a poor metric is used, the feature selection may not yield optimal results.

One commonly used evaluation metric is the R-squared (R²) score, which measures the proportion of the variance in the target variable explained by the model. However, R² has a limitation: on the training data it never decreases as more features are added, even if those features contribute nothing to the model’s predictive power. This can lead to the selection of irrelevant features.

To address this, an alternative metric called adjusted R-squared (adjusted R²) can be used. Adjusted R² takes into account the number of features and the sample size, penalizing the addition of unnecessary features. It provides a more accurate measure of the model’s goodness of fit and helps prevent overfitting.
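
For reference, adjusted R² is computed from R², the sample size n, and the number of features k using the standard formula 1 - (1 - R²)(n - 1) / (n - k - 1); the tiny helper below shows how the same R² is penalized more heavily as k grows.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# The same R² of 0.85 looks worse once many features are involved
print(adjusted_r2(0.85, n_samples=100, n_features=3))   # ≈ 0.845
print(adjusted_r2(0.85, n_samples=100, n_features=30))  # ≈ 0.785
```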

By incorporating techniques such as cross-validation and using adjusted R-squared as the evaluation metric, the risk of overfitting and the selection of irrelevant features can be mitigated, making the exhaustive feature selection process more reliable and effective.


2. Sequential Backward Selection/Elimination


Backward elimination is a wrapper method in feature selection that starts with all features included in a model and then removes features one by one until a subset of features that maximizes the model’s performance is obtained. The backward elimination method is useful when the number of features is relatively small, and the goal is to identify the most important features.

Here’s how the backward elimination method works in detail with an example:

  1. Start with all features included in the model.
  2. Train a model using all the features and evaluate its performance using a performance metric such as accuracy, precision, recall, or F1 score.
  3. Remove one feature at a time and train the model again using the remaining features.
  4. Evaluate the performance of the model without that feature and compare it to the performance of the previous model that used all the features.
  5. If the model’s performance improves (or stays the same) when the feature is removed, leave that feature out. If performance degrades, keep the feature in the model.
  6. Repeat steps 3–5 for all remaining features until no further improvement in the model’s performance is observed.

Let’s consider an example to illustrate the backward elimination method. Suppose we have a dataset of customer information for a company that sells products online. The dataset contains the following features:

  • Age
  • Gender
  • Income
  • Education level
  • Time spent on the website
  • Number of products purchased
  • Customer satisfaction rating

The goal is to predict whether a customer will purchase a product or not based on these features.

To use the backward elimination method, we first start with all features included in the model. We train a machine learning model, such as logistic regression or decision tree, on the entire dataset and evaluate its performance using a performance metric such as accuracy.

Suppose the accuracy of the model is 85%. We then remove one feature at a time and train the model again using the remaining features. We evaluate the performance of the model with the removed feature and compare it to the performance of the previous model that used all the features.

Suppose we remove the “Time spent on the website” feature and train the model again. We evaluate the performance of the model and find that its accuracy drops to 80%. Since the model’s performance has worsened with the removal of this feature, we keep the feature included in the model.

Next, we remove the “Education level” feature and train the model again. We evaluate the performance of the model and find that its accuracy remains at 85%. Since removing this feature does not hurt the model’s performance, it adds little or no predictive value, so we leave it out of the model.

We repeat these steps for each remaining feature until no further improvement in the model’s performance is observed.

In this example, the backward elimination method identified the most important features for predicting whether a customer will purchase a product or not. The final subset of features may be smaller than the original set, making the model simpler and more interpretable.
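
If you prefer not to hand-roll this loop, scikit-learn’s SequentialFeatureSelector can run it in backward mode. The sketch below uses a generated stand-in for the customer dataset; the logistic-regression estimator and the target of 4 retained features are arbitrary choices, and note that this implementation keeps removing features until the requested number remains rather than stopping when accuracy plateaus, so it approximates the procedure described above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Stand-in for the customer dataset: 7 features, binary "purchased or not" target
X, y = make_classification(n_samples=500, n_features=7, n_informative=4, random_state=0)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,   # arbitrary; choose based on your own stopping rule
    direction="backward",     # start from all features and drop one at a time
    scoring="accuracy",
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support())      # boolean mask of the features that were kept
```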

3. Sequential Forward Selection

Sequential Forward Selection (SFS) is a feature selection technique that starts with an empty feature set and incrementally adds one feature at a time based on a predefined criterion. It aims to find the best subset of features by iteratively evaluating the performance of different feature combinations. Let’s understand SFS with an example using features f1, f2, f3, and f4.

  1. Initialization: Start with an empty feature set.
  2. Evaluation of Single Features: Evaluate the performance of each individual feature when added to the empty set. In this case, evaluate the performance of f1, f2, f3, and f4 separately.
  • Add f1 to the empty set and evaluate the performance of the model.
  • Add f2 to the empty set and evaluate the performance of the model.
  • Add f3 to the empty set and evaluate the performance of the model.
  • Add f4 to the empty set and evaluate the performance of the model.

3. Based on the predefined criterion, select the feature that performs the best when added individually. Let’s assume that f2 performs the best.

Iteration 1: Add the best performing feature from step 2 (f2) to the feature set.

  • Add f2 to the feature set.
  • Evaluate the performance of the model with f2.
  • Calculate the performance metric (e.g., accuracy, mean squared error, etc.).

4. Evaluation of Feature Pairs: Evaluate the performance of all possible pairs of the current feature set and the remaining features (f1, f3, and f4). In this case, evaluate the performance of (f2, f1), (f2, f3), and (f2, f4).

  • Add (f2, f1) to the feature set and evaluate the performance of the model.
  • Add (f2, f3) to the feature set and evaluate the performance of the model.
  • Add (f2, f4) to the feature set and evaluate the performance of the model.

5. Select the pair that gives the best performance according to the predefined criterion. Let’s assume that (f2, f4) performs the best.

Iteration 2: Add the best performing feature pair from step 4, (f2, f4), to the feature set.

  • Add (f2, f4) to the feature set.
  • Evaluate the performance of the model with (f2, f4).
  • Calculate the performance metric.

Repeat Steps 4 and 5: Continue the process by evaluating all possible combinations of the current feature set and the remaining features, selecting the best performing feature combination, and adding it to the feature set.

  • Evaluate the performance of (f2, f4, f1) and (f2, f4, f3).
  • Select the best performing feature combination.
  • Add it to the feature set.

Termination: The process continues until a predefined stopping criterion is met. This criterion can be the desired number of features or a threshold on the performance metric.

  • Evaluate the performance of (f2, f4, f1, f3).
  • Determine if the stopping criterion is met.
  • If not, add it to the feature set and continue the process.

The final result of the SFS algorithm will be the selected subset of features that provides the best performance according to the predefined criterion. In this example, it could be (f2, f4, f1, f3), indicating that these four features together provide the most predictive power for the model.
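
The whole procedure can be written as a short loop. The sketch below hand-rolls forward selection on a four-feature toy dataset standing in for f1 through f4, with an arbitrary logistic-regression estimator; for simplicity it keeps adding features until none remain, whereas in practice you would stop at a desired number of features or when the cross-validated score stops improving.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data with four features standing in for f1, f2, f3, f4
X, y = make_classification(n_samples=300, n_features=4, n_informative=3, random_state=0)

def cv_score(cols):
    """Cross-validated accuracy of the model using only the given columns."""
    return cross_val_score(LogisticRegression(max_iter=1000), X[:, cols], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    # Try adding each remaining feature to the current set and keep the best addition
    scores = {f: cv_score(selected + [f]) for f in remaining}
    best_f = max(scores, key=scores.get)
    selected.append(best_f)
    remaining.remove(best_f)
    print("added feature", best_f, "-> CV accuracy", round(scores[best_f], 3))

# 'selected' now lists the features in the order forward selection added them
```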

Advantages and Disadvantages

Advantages

  1. Accuracy: Wrapper methods usually provide the best performing feature subset for a given machine learning algorithm because they use the predictive power of the algorithm itself for feature selection.
  2. Interaction of Features: They consider the interaction of features. While filter methods consider each feature independently, wrapper methods evaluate subsets of features together. This means that they can find groups of features that together improve the performance of the model, even if individually these features are not strong predictors.

Disadvantages

  1. Computational Complexity: The main downside of wrapper methods is their computational cost. As they work by generating and evaluating many different subsets of features, they can be very time-consuming, especially for datasets with a large number of features.
  2. Risk of Overfitting: Because wrapper methods optimize the feature subset to maximize the performance of a specific machine learning model, they might select a feature subset that performs well on the training data but not as well on unseen data, leading to overfitting.
  3. Model Specific: The selected feature subset is tailored to maximize the performance of the specific model used in the feature selection process. Therefore, this subset might not perform as well with a different type of model.
