Be careful what you SHAP for…

Paul dos Santos
4 min read · May 22, 2020

…you just might get it.

Mixing continuous and binary variables can lead to false ‘insights’ about feature importance, particularly with decision tree-based methods.

Experiment 1 — Mixed Variable Inputs

Let’s create some random data. 10 continuous variables and 10 binary variables with a binary outcome.

import numpy as np
import xgboost
import shap

samples = 2500
cont_var = 10
bin_var = 10

X = []
#make random continuous features
for v in np.arange(cont_var):
    v_data = []
    scale_var = np.random.randint(1, 50)
    for i in np.arange(samples):
        v_data.append(scale_var * np.random.random())
    X.append(v_data)

#make random binary features
for v in np.arange(bin_var):
    v_data = []
    for i in np.arange(samples):
        v_data.append(np.random.randint(0, 2))
    X.append(v_data)

#make random target variable
y = [np.random.randint(0, 2) for i in np.arange(samples)]
#make X the correct shape: (samples, features)
X = np.stack(X).T

The first 10 features (0 to 9) are continuous while the remaining 10 (10 to 19) are binary.

Keeping in mind that this data is completely random and that there is no real relationship between the variables and the target at all, we will now fit an XGBoost classifier to it. Because there is no signal to find, any pattern in the feature importances exposes underlying biases in the modelling pipeline rather than anything real.

model = xgboost.XGBClassifier()
model.fit(X,y)
ypred = model.predict(X)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X, y=y)

Now let’s see which features were most important…

# summarize the effects of all the features
shap.summary_plot(shap_values, X, max_display=X[0].shape[0])
SHAP summary plot: Continuous + Binary

As you can see, the continuous features come out as the most important; in fact, every continuous variable ranks above every binary one. And it’s not wrong! The model really is relying more heavily on those features to make its prediction, so they are more ‘important’ to its decision making, but in the process we’ve discounted the role the binary variables can play and may end up drawing false conclusions.

This makes intuitive sense if you consider how a decision tree handles these types of variables differently.

Decision tree-based algorithms like XGBoost have an inherent bias towards continuous variables. A continuous variable offers many potential split points for reducing the impurity of a branch, so it can be split on again and again at different levels of a tree; a binary variable, once split, has nothing more to give and typically appears only once per tree. The implication is that when we apply feature importance techniques to trees with mixed inputs, the continuous variables often come out as more “important” simply because the model can keep fine-tuning on them, giving them a higher weighting overall.
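You can also see this bias directly in the fitted model, without SHAP, by counting how often each feature is actually used as a split. A quick check, assuming the `model` fitted in Experiment 1 (XGBoost names the columns ‘f0’, ‘f1’, … when trained on a plain NumPy array):

#count how many splits across the whole ensemble use each feature
split_counts = model.get_booster().get_score(importance_type='weight')
cont_splits = sum(v for k, v in split_counts.items() if int(k[1:]) < 10)
bin_splits = sum(v for k, v in split_counts.items() if int(k[1:]) >= 10)
print('Splits on continuous features: {}'.format(cont_splits))
print('Splits on binary features: {}'.format(bin_splits))

On random data like this, the continuous features typically account for the large majority of the splits.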

Experiment 2 — Use the same encoding scheme for all variables

Let’s see what happens when we bin and one-hot encode our continuous variables.

We will create some new random data for this experiment.

X = []
#make random continuous features THEN bin & dummy encode
for v in np.arange(cont_var):
    print('Continuous Feature: {}'.format(v))
    v_data = []
    scale_var = np.random.randint(1, 50)
    for i in np.arange(samples):
        v_data.append(scale_var * np.random.random())
    #bin variable into roughly 3 buckets (floor the step at 1 to avoid a zero step for small scales)
    bins = np.arange(min(v_data), max(v_data), max(int(max(v_data)/3), 1))
    print('Bin Split Points: {}'.format(bins.round(2)))
    v_data = np.digitize(v_data, bins)
    #one hot encode
    one_hot_v_data = np.zeros((v_data.size, v_data.max()+1))
    one_hot_v_data[np.arange(v_data.size), v_data] = 1
    #keep k-1 columns
    one_hot_v_data = one_hot_v_data[:,1:]
    X.append(one_hot_v_data.tolist())

#make random binary features
for v in np.arange(bin_var):
    v_data = []
    for i in np.arange(samples):
        v_data.append(np.random.randint(0, 2))
    X.append(v_data)

#make random target variable
y = [np.random.randint(0, 2) for i in np.arange(samples)]

#flatten the per-feature lists into one row per sample
X_combined = []
for sample in np.arange(samples):
    X_tmp = []
    for variable in X:
        if isinstance(variable[sample], list):
            for i in variable[sample]:
                X_tmp.append(i)
        else:
            X_tmp.append(variable[sample])
    X_combined.append(np.array(X_tmp))
#make X the correct shape
X = np.stack(X_combined)
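Because the number of bins each continuous feature ends up with depends on its random scale, the exact number of one-hot columns can vary a little from run to run, so it is worth confirming the final shape before modelling:

print(X.shape) #(2500, 50) in the run described below: 40 one-hot columns, then the 10 original binary features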

All continuous features are now one-hot encoded (variables 0 to 39) while the last 10 binary variables remain (40 to 49). Let’s see what the top 20 ‘important’ features are now.

model = xgboost.XGBClassifier()
model.fit(X,y)
ypred = model.predict(X)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X, y=y)
# summarize the effects of all the features
shap.summary_plot(shap_values, X, max_display=20)
SHAP summary plot: Binary Only

Quite a difference, right?! Since the model was forced to work with all the inputs in the same way, the SHAP values for the top features fall within a smaller, comparable range without any bias towards the original continuous variables.

At the end of the day, all the additive explanations are trying to do is provide a localized view of how the fitted model made its decision, so it’s prudent to understand the mechanics behind what the model is actually doing.

If the point of your analysis is to explicitly explore the relationship between your variables and a target variable, then consider spending the extra time to make your inputs behave the same way within your modelling framework of choice, so that every input gets a fair shot at a spot at the top. One compact way to handle the binning-and-encoding step is sketched below.
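For instance, if you’re working in Python with scikit-learn available, something like the following replaces the manual binning loop from Experiment 2. `X_cont` and `X_bin` are illustrative placeholders for the continuous and binary columns; treat this as a sketch rather than a drop-in replacement for the code above.

from sklearn.preprocessing import KBinsDiscretizer

#bin each continuous column into 3 equal-width buckets and one-hot encode the result
discretizer = KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='uniform')
X_cont_encoded = discretizer.fit_transform(X_cont)

#every input is now a 0/1 column, so no feature gets extra split points to fine-tune on
X_uniform = np.hstack([X_cont_encoded, X_bin])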
