Unlocking the Power of SHAP Analysis: A Comprehensive Guide to Feature Selection

SRI VENKATA SATYA AKHIL MALLADI
7 min read · Oct 1, 2023


Feature selection is a critical step in the machine learning pipeline that involves choosing the most relevant and informative features (variables) from the dataset to build a predictive model.

By keeping only the relevant features, it reduces overfitting and enhances model generalization, making it a fundamental step in the model-building process.

SHAP (SHapley Additive exPlanations) is a powerful tool for unveiling the influence of individual features on model predictions. It assigns a value to each feature, showing how much it contributes to the model’s output.

This method has gained significant attention in the machine learning community due to its interpretability, enabling users to understand complex model behaviors and make informed decisions.

Furthermore, SHAP analysis is effective for feature selection because it identifies the most impactful variables, which helps in building more efficient and transparent models. In this article, I will explain how we can use SHAP for feature selection in our models.

Calculation of SHAP Values: The Shapley value of a feature tells us how much that feature contributed to the final result (think of it as a prize or payout being shared) when you consider all the different ways the features could have come together.

It’s like figuring out how much credit each feature should get for the outcome, taking into account all the different ways they could have worked together. This helps us understand which features were most important in producing the final result (output).
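To make this concrete, here is a small illustrative sketch (the toy model, feature names, and weights are invented purely for illustration) that computes exact Shapley values by brute force, averaging each feature’s marginal contribution over every possible ordering of the features:

from itertools import permutations
from math import factorial

# Toy "model": the prediction is a weighted sum of whichever features are present.
# Purely illustrative; real SHAP implementations approximate this far more efficiently.
weights = {"length": 2.0, "shell_weight": 5.0, "sex": 1.0}
x = {"length": 0.5, "shell_weight": 0.3, "sex": 1.0}

def model_output(present):
    # Prediction using only the 'present' subset of features (absent ones contribute 0).
    return sum(weights[f] * x[f] for f in present)

features = list(x)
n_orderings = factorial(len(features))
shapley = {f: 0.0 for f in features}

# Average each feature's marginal contribution over every possible ordering.
for order in permutations(features):
    included = set()
    for f in order:
        before = model_output(included)
        included.add(f)
        shapley[f] += (model_output(included) - before) / n_orderings

print(shapley)  # how much credit each feature gets for this prediction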

Analyzing the Worked Example: Below, I will explain how I used SHAP for feature selection in my model, and also when and when not to use it.

Above is the dataset we will be working with.

The dataset contains essential attributes of abalones, including their sex and physical measurements (length, diameter, height, whole weight, shucked weight, viscera weight, shell weight), which will be our input columns, along with an age proxy represented by the “Rings” column, which will be our output column.

To leverage SHAP for feature selection in this dataset, we will begin by treating all the provided input columns, representing various abalone attributes, as our potential features. These features will be preprocessed through techniques such as standardization and normalization to ensure uniform scaling.
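As a rough sketch of that preprocessing step (the file name abalone.csv and the train/test split settings are assumptions, not taken from the original workflow), the categorical “Sex” column can be encoded and the numeric features standardized like this:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv("abalone.csv")  # hypothetical path to the Abalone dataset

X = df.drop(columns=["Rings"])  # input columns (sex and physical measurements)
y = df["Rings"]                 # output column (age proxy)

# Encode the categorical 'Sex' column (M / F / I) as integers.
X["Sex"] = LabelEncoder().fit_transform(X["Sex"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize the features so they are all on a uniform scale.
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)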

Next, we’ll select an appropriate machine learning model and fine-tune its hyperparameters to create a robust predictive model. Once the model is established, we can conduct SHAP analysis to identify which features significantly influence the model’s predictions. Features with larger (absolute) SHAP values have a greater impact on the model’s output, and this analysis will help us prioritize and select the most informative features for our age-prediction task.

In summary, the process involves data preprocessing, model selection, and SHAP analysis to pinpoint the essential features that contribute positively to the prediction of abalone age, streamlining feature selection for our model.

Model Selection:

In the feature selection process, several machine learning models were tested, and their performance was evaluated using two key metrics: Mean Squared Error (MSE) and R-squared (R²).

Based on this evaluation, LightGBM exhibited the best performance, with the lowest Mean Squared Error (MSE) and the highest R-squared (R²). (MSE measures the model’s prediction error, with lower values indicating better performance; R-squared measures how well the model explains the variance in the data, with higher values indicating better explanatory power.)
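A minimal sketch of that comparison (reusing the X_train/X_test split from the preprocessing step above; the candidate models shown are examples, not the full list that was tested):

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from lightgbm import LGBMRegressor

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "LightGBM": LGBMRegressor(random_state=42),
}

# Fit each candidate and compare MSE (lower is better) and R^2 (higher is better).
for name, model in candidates.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: MSE={mean_squared_error(y_test, preds):.3f}, R2={r2_score(y_test, preds):.3f}")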

SHAP:

First, let’s install the SHAP library with pip install shap and import it (import shap) into your Python environment.

Next, let’s fit our chosen LightGBM model to the dataset and create a TreeExplainer object to interpret it by running explainer = shap.TreeExplainer(model).
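Putting those two steps together, a sketch (reusing the training data from above; the hyperparameters are left at their defaults here):

# pip install shap
import shap
from lightgbm import LGBMRegressor

# Fit the chosen LightGBM model on the training data.
model = LGBMRegressor(random_state=42)
model.fit(X_train, y_train)

# TreeExplainer is the fast, tree-specific explainer; it supports LightGBM,
# XGBoost, CatBoost, and scikit-learn tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)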

Then let’s use SHAP’s visualization and analysis tools to gain insight into our model’s feature importances and prediction explanations, which will aid in feature selection and model refinement.
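As a sketch (reusing the explainer and shap_values computed above), the two standard global views look like this:

# Beeswarm plot: the distribution of each feature's SHAP values across the test set.
shap.summary_plot(shap_values, X_test)

# Bar plot: features ranked by mean absolute SHAP value (global importance).
shap.summary_plot(shap_values, X_test, plot_type="bar")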

SHAP Values Analysis: The SHAP value plot highlights that “Shell weight,” “Shucked weight,” “Whole weight,” “Sex,” and “Viscera weight” are the most significant features for predicting the output. These features contribute significantly to the model’s predictions compared to others.

Single-Point Force Graph: When examining the single-point force graph below, it becomes evident that these significant features contribute positively to predicting a specific output value; that is, an increase in these feature values generally leads to a higher predicted outcome.

Feature Impact on Prediction: Among these influential features, “Shell weight,” “Viscera weight,” and “Shucked weight” exhibit the most substantial impact on predicting this particular value. These features play a predominant role in determining the output for this specific data point, emphasizing their importance in understanding the model’s behavior for individual predictions.
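A force graph for a single prediction like the one discussed here can be generated as follows (a sketch; the row index 0 is arbitrary):

shap.initjs()  # enables the interactive JavaScript plots in a notebook

# Explain one prediction: which features push it above or below the expected value.
i = 0
shap.force_plot(
    explainer.expected_value,
    shap_values[i, :],
    X_test.iloc[i, :],
)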

Identifying Top Features: Based on the mean SHAP values in our model, “Shell weight,” “Shucked weight,” and “Whole weight” emerge as the top three most significant features. “Sex” and “Viscera weight” also exhibit notable importance.

Threshold Selection: A commonly used threshold for feature importance is 0.5, indicating a strong impact on predictions. However, in our case, applying this threshold removed 5 of the 8 features and resulted in underfitting, which can occur when too few features remain even though each of the discarded ones still contributed meaningfully to the outcome.

Adjusting the Threshold: To address the underfitting issue, I lowered the threshold to 0.25, which retains 5 features. The adjusted threshold aims to strike a balance between feature selection and model complexity: keeping enough features to avoid underfitting while still pruning the least informative ones.
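One way to implement that thresholding step (a sketch; the importance score used here is the mean absolute SHAP value per feature, the same quantity shown in the bar plot):

import numpy as np
import pandas as pd

# Global importance: mean absolute SHAP value for each feature.
mean_abs_shap = pd.Series(np.abs(shap_values).mean(axis=0), index=X_test.columns)
print(mean_abs_shap.sort_values(ascending=False))

# Keep only the features whose importance clears the chosen threshold.
threshold = 0.25
selected_features = mean_abs_shap[mean_abs_shap >= threshold].index.tolist()
print(selected_features)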

After conducting SHAP analysis and adjusting the feature importance threshold, the final set of selected features for our model consists of ‘Sex,’ ‘Whole weight,’ ‘Shucked weight,’ ‘Viscera weight,’ and ‘Shell weight.’

These features were identified as the most influential in predicting the target variable, and their inclusion in the model aims to strike a balance between interpretability and predictive accuracy, addressing the previous issue of underfitting. This selection represents a carefully considered feature subset that optimizes the model’s performance.

Final Outcome:

The outcomes of fitting the model with the selected features show a small but meaningful improvement: the Mean Squared Error (MSE) decreased by 0.1, and the R-squared (R²) increased by 0.01.
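That final comparison can be reproduced along these lines (a sketch reusing selected_features from the thresholding step above):

from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Refit LightGBM using only the selected feature subset and re-evaluate.
model_selected = LGBMRegressor(random_state=42)
model_selected.fit(X_train[selected_features], y_train)
preds = model_selected.predict(X_test[selected_features])

print("MSE:", mean_squared_error(y_test, preds))
print("R2 :", r2_score(y_test, preds))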

Although these changes may appear modest, they are noteworthy in this context because every column in this dataset was already fairly important, so even a slight positive change is significant.

This highlights the potential of using the SHAP analysis method for feature selection in various problem domains. By identifying and focusing on the most influential features, it becomes possible to enhance model performance and obtain better results.

This approach can be particularly valuable when dealing with datasets where feature importance varies widely, as it allows you to pinpoint and leverage the key drivers of predictions, ultimately leading to improved model outcomes.

Cons of SHAP Analysis for Feature Selection:

Computational Complexity: SHAP analysis can be computationally expensive, especially for complex models or large datasets. Calculating SHAP values for every data point and feature combination may require significant computational resources.

When attempting to apply SHAP analysis to a problem involving a larger number of columns without prior feature selection, I encountered significant computational challenges. The process took considerably more time, and in some cases, it failed due to limitations in computational power and available RAM.

This experience highlights the resource-intensive nature of SHAP analysis, particularly when dealing with complex models and high-dimensional datasets. It underscores the importance of optimizing computational resources, employing feature selection strategies, or resorting to parallel computing techniques when working with large datasets to ensure the successful execution of SHAP analysis.

Model Dependency: SHAP analysis is model-agnostic in theory, but in practice the efficient implementations are model-specific (TreeExplainer, for example, only works with tree-based models), while the fully model-agnostic KernelExplainer is much slower. Some models therefore lack a fast, straightforward SHAP implementation, which limits its applicability.

Feature Interactions: Common SHAP approximations assume feature independence, but in reality features often interact with each other. The resulting SHAP values may not fully capture complex interactions, which can lead to misinterpretation of feature importance.

Threshold Selection: Determining the appropriate threshold for feature importance can be subjective and may require experimentation. Setting a threshold too high or too low can lead to suboptimal feature selection.

Resource Intensive for High-Dimensional Data: In high-dimensional datasets, SHAP analysis can become impractical due to the large number of potential feature combinations.

Code:

