Clearing the Black Box: Feature Importance with SHAP


The Black Box

Machine and Deep Learning are incredible tools for making predictions. However, as models become increasingly complex, we become less and less certain of what exactly is going on inside them. This is where the term “black box” comes in: the idea of throwing a bunch of input into a box, not knowing what happens inside, and having it toss out an answer.

While it is great to receive an answer, how the model arrived at it is far less clear. For example, how did each of the chosen features affect the prediction?

One library that dives into the black box to retrieve answers is SHAP.

Feature Effect with Pipeline and GridSearchCV

The dataset we’ll be using for this example is a Kaggle dataset on speed dating. We’ll be looking at user information collected before the dates to see how those factors contribute to an individual’s decision to date or not.

After cleaning the data, we’ll set up a Pipeline and GridSearchCV to test out a few Logistic Regression hyperparameters and return the best model.
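The original code was shown as an image; below is a minimal sketch of that setup. The file name, the target column (“dec”, the yes/no decision), and the hyperparameter grid are illustrative assumptions, not the article’s exact code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical cleaned dataset; "dec" is assumed to be the yes/no decision target.
df = pd.read_csv("speed_dating_clean.csv")
X = df.drop(columns=["dec"])
y = df["dec"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale the features, then fit a Logistic Regression.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Illustrative grid; the "logreg__" prefix routes each parameter to that pipeline step.
param_grid = {"logreg__C": [0.01, 0.1, 1, 10]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)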

Great! We run this to scale our data and find the best parameters for the Logistic Regression model. Now, maybe we’re looking to make inferences about how each feature affects the target. What can we pull out of a Logistic Regression model?

Coefficients! Logistic Regression assigns a coefficient to each feature representing its effect within the model. Here, we pull out those values:

Retrieving coefficients
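A sketch of that step, continuing from the setup above (same assumed names):

# Pull the fitted Logistic Regression out of the best pipeline found by the grid search.
best_pipe = grid.best_estimator_
logreg = best_pipe.named_steps["logreg"]

# Pair each coefficient with its feature name and sort by value.
coefs = pd.Series(logreg.coef_[0], index=X_train.columns).sort_values()
print(coefs)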

Okay, great, we have the coefficients. But what do they mean? Well, here’s the problem: those coefficients denote a feature’s effect within the model relative to every other feature. This means that adding or removing features will change the coefficients, potentially flipping them from positive to negative. So what can we see?

The only things we can glean from them are the magnitude of a feature’s effect (the size of its coefficient) and its direction (negative or positive). However, this is all relative to the other features in the model. In the end, there is not much we can glean about the features themselves.

This is where SHAP comes in.

Feature Explanations with SHAP

An explanation of what exactly SHAP values are can be found here. As a brief summary, SHAP computes a feature’s effect on the target by looking at the difference the feature makes when it is combined with different sets of other features.

In the case of our dataset, that would be a feature’s average contribution to whether an individual said yes or no to dating.
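For reference, the Shapley value underlying this intuition (standard game-theoretic background, not something specific to this article’s code) averages feature i’s marginal contribution over every subset S of the other features F:

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right]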

First, we retrieve the SHAP values.

Retrieving the SHAP values
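The original screenshot isn’t reproduced here; a minimal sketch might look like the following. It assumes best_model and the scaled matrices are extracted from the fitted pipeline as shown in the next step, and it uses shap.LinearExplainer since the model is a Logistic Regression:

import shap

# Explain the fitted linear model; the background data is used to estimate
# the expected model output.
explainer = shap.LinearExplainer(best_model, X_train_scaled)
shap_values = explainer.shap_values(X_test_scaled)

# Summary plot: features ranked by overall impact, colored by feature value.
shap.summary_plot(shap_values, X_test_scaled, feature_names=X_train.columns)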

Note: the explainer’s first parameter is your model. Because we used GridSearchCV and a Pipeline, the model first has to be extracted. The steps below show how:

Retrieving the best model from GridSearchCV
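A sketch of the extraction, assuming the pipeline step names from the setup above:

# best_estimator_ is the whole fitted Pipeline; named_steps exposes each stage.
best_pipe = grid.best_estimator_
scaler = best_pipe.named_steps["scaler"]
best_model = best_pipe.named_steps["logreg"]

# The model was fit on scaled data, so apply the same scaling before computing SHAP values.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)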

Now having retrieved the SHAP values, we can take a look at the features.

One of the first things to note is that, in addition to the SHAP value (a measure of feature impact), the plot also shows each feature’s own value. As such, we can see how the value of the feature itself affected the model’s output.

In addition, features are listed in order of impact, so for this model, attr2_1 had the greatest impact on the Logistic Regression model.

How about we take a look at the first three?

First, all of the features begin with the string “attr.” For these surveys, all three columns had to do with physical attraction.

Attr2_1 represents how individuals thought the opposite gender viewed physical attraction. A higher rating indicated higher importance.

With this, we can see that the more importance individuals believed physical attraction had, the more likely they were to say yes to dating an individual.

On the other hand, attr1_1 and attr4_1 represent how important an individual rated physical attraction and what an individual believed most others looked for, respectively.

We can see that the lower the rating, the more it pushes the model toward predicting that the individual will say no to a date.

In fact, returning to the image:

We can see that among the top nine features (the tenth row aggregates all remaining features), the “2_1” suffix comes up quite a bit. Even though the model also contains attribute ratings for what people themselves looked for, it is more strongly influenced by the ratings of what people thought the opposite gender looked for, as indicated by the data’s key.

Insights

Compared to looking at a list of coefficients, we can get far more insight from just one aspect of SHAP.

SHAP’s power is not limited to classical machine learning algorithms. It can be utilized within deep learning as well, including image classification with Keras.

SHAP with Image Identification from SHAP repository

Here, we can see the different regions of the images that the Keras model used to distinguish between these two animals.
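To get a flavor of what that looks like in code, here is a small MNIST sketch modeled loosely on the SHAP repository’s DeepExplainer example (the tiny network, sample sizes, and one-epoch training are purely illustrative):

import numpy as np
import shap
import tensorflow as tf

# Train a tiny CNN on a slice of MNIST just to have a model to explain.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0
x_test = x_test[..., None] / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train[:2000], y_train[:2000], epochs=1, verbose=0)

# A small background sample estimates the expected model output.
background = x_train[np.random.choice(2000, 100, replace=False)]
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(x_test[:4])

# Red pixels push the prediction toward a class; blue pixels push away from it.
shap.image_plot(shap_values, x_test[:4])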

This highlights the versatility of SHAP. Being able to see which features are most important, even in images, allows the user to better interpret and explain their results.

Here are the resources used, which can be viewed for a deeper dive into SHAP:
