Machine Learning Interpretability (MLI) with XGBoost and Additive Tools (SHAP)

Arash Nicoomanesh
10 min read · Apr 30, 2023


Image from Kirell Benzi, storytelling by art and data

Modern artificial intelligence (AI) and machine learning (ML) methods can build sophisticated models that achieve excellent prediction or classification accuracy in a wide range of challenging domains. However, they typically have a complex, black-box nature that does not, by itself, help us understand the data or the task any better. Explainable AI and interpretable ML are about making our models more transparent and interpretable, helping us answer important questions such as: What is important to the model? Why does the model behave the way it does?

Some authors make a distinction between models that are directly interpretable and those that require explanations. Others consider the methods for constructing directly interpretable models to be a subset of techniques under the umbrella concept of XAI, while the authors of popular software for generating model explanations refer to them as interpretability tools. Patrick Hall and Navdeep Gill define these terms as follows:

Explainable machine learning

Getting even more specific, explainable machine learning, or explainable artificial intelligence (XAI), typically refers to post hoc analysis and techniques used to understand a previously trained model or its predictions. Examples of common techniques include:

  • Reason code generating techniques: in particular, local interpretable model-agnostic explanations (LIME) and Shapley values.
  • Local and global visualizations of model predictions: accumulated local effects (ALE) plots, one- and two-dimensional partial dependence plots, individual conditional expectation (ICE) plots, and decision tree surrogate models.
  • XAI is also associated with a group of DARPA researchers who seem primarily interested in increasing explainability in the sophisticated pattern recognition models needed for military and security applications.

Interpretable or white-box models
Over the past few years, more researchers have been designing new machine learning algorithms that are nonlinear and highly accurate, but also directly interpretable, and the term interpretable has become more closely associated with these new models.

Performance vs. Explainability

Boosted or bagged decision tree methods such as XGBoost or Random Forest are built from ensembles of many, often hundreds of, decision trees. For these methods, it becomes infeasible to follow the model's reasoning directly. Without the tools discussed in the next sections, it is difficult even to understand how the predictors will affect a prediction.

Figure 1. Model Complexity vs Interpretability

However, we can still get a sense of which features are important to the model overall by observing which features are used most often in the ensemble of trees, and how much information is gained by using them.

XGBoost Feature Importance

Let's start with real-world data and follow the ML pipeline of prediction with XGBoost. For this post I used the Airbnb Listing Price dataset. The problem is a regression task: predict the prices of Airbnb listings, with the following data.info():

Airbnb listing data.info()

The dataset required some preprocessing. I encoded the categorical features with sklearn.preprocessing.LabelEncoder and extracted ‘day’, ‘month’ and ‘year’ from the last_review temporal feature. For simplicity, I also dropped the following features (a preprocessing sketch follows the list):

[‘name’, ‘host_name’, ‘neighbourhood’, ‘Location’]
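Here is a minimal sketch of those steps. The listings file name and the categorical column list are assumptions and may differ from the actual dataset:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical file name: adjust to the actual Airbnb listings file.
train = pd.read_csv('airbnb_listings.csv')

# Extract day, month and year from the last_review temporal feature.
train['last_review'] = pd.to_datetime(train['last_review'])
train['day'] = train['last_review'].dt.day
train['month'] = train['last_review'].dt.month
train['year'] = train['last_review'].dt.year
train = train.drop(columns=['last_review'])

# Label-encode the categorical columns (the column list here is illustrative).
for col in ['room_type', 'neighbourhood_group']:
    le = LabelEncoder()
    train[col] = le.fit_transform(train[col].astype(str))
    print(col, 'mapping:', dict(zip(le.classes_, le.transform(le.classes_))))

# Drop the features excluded for simplicity.
train = train.drop(columns=['name', 'host_name', 'neighbourhood', 'Location'],
                   errors='ignore')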

In the next step I split the data into train and test sets so that I can measure model performance on the test set with RMSE. The following code runs 5-fold cross-validated training and prediction of price with XGBoost:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error

# Hold out 15% of the data as a test set.
train, test = train_test_split(train, test_size=.15, random_state=42)
print(train.shape, test.shape)

target = train.pop('price')
target_tst = test.pop('price')

# Train on log1p(price) to reduce the skew of the target.
target = np.log1p(target)

xgb_params = {
    'objective': 'reg:squarederror',
    'eta': .06,
    'max_depth': 7,
    'booster': 'gbtree',
    'eval_metric': 'rmse',
    'subsample': .7,
    'max_leaves': 3,
    'colsample_bytree': .7,
    'colsample_bynode': .5,
    'min_child_weight': 1,
    'lambda': 5,
    'alpha': 3,
    'gamma': 0,
}

xgb_scores = []
oof_xgb = np.zeros(len(train))
pred_xgb = np.zeros(len(test))

importances_gain = pd.DataFrame()
importances_weight = pd.DataFrame()
importances_cover = pd.DataFrame()
importances_total_gain = pd.DataFrame()

folds = KFold(n_splits=5, shuffle=True, random_state=42)

for fold_, (trn_ind, val_ind) in enumerate(folds.split(train, target)):

    trn_data = xgb.DMatrix(data=train.iloc[trn_ind], label=target.iloc[trn_ind])
    val_data = xgb.DMatrix(data=train.iloc[val_ind], label=target.iloc[val_ind])

    xgb_model = xgb.train(xgb_params,
                          trn_data,
                          num_boost_round=3000,
                          evals=[(trn_data, 'train'), (val_data, 'validation')],
                          verbose_eval=100,
                          early_stopping_rounds=100)

    # Out-of-fold predictions, back-transformed so RMSE is on the original price scale.
    oof_xgb[val_ind] = xgb_model.predict(xgb.DMatrix(train.iloc[val_ind]),
                                         ntree_limit=xgb_model.best_ntree_limit)
    xgb_scores.append(np.sqrt(mean_squared_error(np.expm1(target.iloc[val_ind]),
                                                 np.expm1(oof_xgb[val_ind]))))

    # Average the test predictions over the folds.
    pred_xgb += xgb_model.predict(xgb.DMatrix(test),
                                  ntree_limit=xgb_model.best_ntree_limit) / folds.n_splits

    # Collect per-fold feature importances for each importance type.
    importance_score = xgb_model.get_score(importance_type='gain')
    importance_frame = pd.DataFrame({'Importance': list(importance_score.values()),
                                     'Feature': list(importance_score.keys())})
    importance_frame['fold'] = fold_ + 1
    importances_gain = pd.concat([importances_gain, importance_frame], axis=0, sort=False)

    importance_score = xgb_model.get_score(importance_type='weight')
    importance_frame = pd.DataFrame({'Importance': list(importance_score.values()),
                                     'Feature': list(importance_score.keys())})
    importance_frame['fold'] = fold_ + 1
    importances_weight = pd.concat([importances_weight, importance_frame], axis=0, sort=False)

    importance_score = xgb_model.get_score(importance_type='cover')
    importance_frame = pd.DataFrame({'Importance': list(importance_score.values()),
                                     'Feature': list(importance_score.keys())})
    importance_frame['fold'] = fold_ + 1
    importances_cover = pd.concat([importances_cover, importance_frame], axis=0, sort=False)

    importance_score = xgb_model.get_score(importance_type='total_gain')
    importance_frame = pd.DataFrame({'Importance': list(importance_score.values()),
                                     'Feature': list(importance_score.keys())})
    importance_frame['fold'] = fold_ + 1
    importances_total_gain = pd.concat([importances_total_gain, importance_frame], axis=0, sort=False)

print('mean rmse = ', np.mean(xgb_scores))

Output:

((33250, 14), (5868, 14))
mean rmse = 129.9455925851426
Now plot the mean total_gain importance per feature:

import matplotlib.pyplot as plt
import seaborn as sns

# Average total_gain importance per feature across the CV folds.
mean_total_gain = importances_total_gain[['Importance', 'Feature']].groupby('Feature').mean()
mean_total_gain = mean_total_gain.reset_index()

plt.figure(figsize=(17, 8))
sns.barplot(x='Importance', y='Feature',
            data=mean_total_gain.sort_values('Importance', ascending=False),
            palette='gray')
Figure 2. XGBoost total_gain feature importance

The XGBoost Booster class has a method that lets you choose which feature importance type to compute:

get_score(fmap='', importance_type='weight')

get_score returns the importance score of each feature. For tree models, the importance type can be one of the following (a short snippet that collects all of these types side by side follows the list):

  • ‘weight’: the number of times a feature is used to split the data across all trees.
  • ‘gain’: the average gain across all splits the feature is used in. Gain is the improvement in the training objective brought by a feature to the branches it is on: before adding a split on feature X, some observations on the branch are poorly predicted; after adding the split, each of the two new branches is more accurate.
  • ‘cover’: the average coverage across all splits the feature is used in.
  • ‘total_gain’: the total gain across all splits the feature is used in.
  • ‘total_cover’: the total coverage across all splits the feature is used in.
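Here is that sketch, reusing xgb_model from the last CV fold above; missing values simply mean a feature was never chosen for a split:

import pandas as pd

# Collect each importance type returned by Booster.get_score into one table.
types = ['weight', 'gain', 'cover', 'total_gain', 'total_cover']
importance_table = pd.DataFrame(
    {t: pd.Series(xgb_model.get_score(importance_type=t)) for t in types})
print(importance_table.sort_values('total_gain', ascending=False))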

So with total_gain importance we can conclude that “room_type” and the geographic location features are the most important for pricing; “noise” also has a significant impact on model performance.

Let's see how these features affect the model output by applying SHAP.

SHAP

SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions (see papers for details and citations).

Let's use our XGBoost model to see how the features affect the output:

SHAP Bar Plot

The simplest starting point for global interpretation with SHAP is to examine the mean absolute SHAP value for each feature across all of the data. This quantifies, on average, the magnitude (positive or negative) of each feature's contribution towards the predicted listing prices. Features with higher mean absolute SHAP values are more influential. Mean absolute SHAP values are essentially a drop-in replacement for more traditional feature importance measures, but have two key advantages:

  • Mean absolute SHAP values are more theoretically rigorous, and relate to which features impact predictions most (which is usually what we’re interested in). Conventional feature importances are measured in more abstract and algorithm-specific ways, and are determined by how much each feature improves the model’s predictive performance.
  • Mean absolute SHAP values have intuitive units: they are expressed on the scale of the model output (here the log-transformed price), whereas conventional feature importances are often expressed in counterintuitive units based on complex concepts such as tree algorithm node impurities.
import shap

# Build a TreeExplainer for the trained booster and plot the mean |SHAP| value per feature.
explainer = shap.TreeExplainer(xgb_model)
shap.plots.bar(explainer(test), max_display=len(test.columns))
Figure 3. SHAP Bar Plot

Similar to the XGBoost ‘total_gain’ importance, we can see that “room_type” and the geographic location have a far larger impact on price than features like ‘floor’ or ‘number_of_reviews’.

But how does ‘room_type’ (Private room, Entire home/apt, Shared room) affect the price prediction? We will analyse this in the Force Plot section.

SHAP Summary Plot

The summary plot is one of the most useful SHAP plots. Summary plots are a more complex and information-rich display of SHAP values that reveal not just the relative importance of features but also their actual relationships with the predicted outcome. The following code produces the plot:

import shap

explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(test)
shap.summary_plot(shap_values, test,
                  plot_type="layered_violin",
                  color='coolwarm')

Output summary plot:

Figure 4. SHAP Summary Plot

The distribution of points can also be informative. For minimum_nights, we see a dense cluster of low minimum-nights instances (blue points) with small but positive SHAP values. Instances of higher minimum nights (red points) extend further towards the left, suggesting high minimum nights has a stronger negative impact on price than the positive impact of low minimum nights.

SHAP Force Plot

We want to answer the earlier question: how does ‘room_type’ (Private room, Entire home/apt, Shared room) affect the price prediction?

‘room_type’ is encoded as below:

room_type mapping: {'Entire home/apt': 0, 'Private room': 1, 'Shared room': 2}

We also want to analyze the behavior of the ‘minimum_nights’ feature and its effect on price. Looking at the minimum_nights boxplot we can identify significant outliers, so we remove them with a simple univariate outlier filter.

Figure 5. ‘minimum_nights’ Boxplot

The result is a dataset with minimum_nights less than 600. Splitting this feature by ‘room_type’, we can see that ‘Shared room’ (2) is rented less than ‘Entire home/apt’ (0) and ‘Private room’ (1); a sketch of this step is shown below.
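This is a minimal sketch of the filtering and the grouped boxplot, assuming it is applied to the listings frame before modelling and that matplotlib/seaborn are available:

import matplotlib.pyplot as plt
import seaborn as sns

# Univariate outlier filter: keep listings with a plausible minimum-night requirement.
train = train[train['minimum_nights'] < 600]

# Figure 6: distribution of minimum_nights for each encoded room_type.
plt.figure(figsize=(10, 5))
sns.boxplot(x='room_type', y='minimum_nights', data=train)
plt.show()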

Figure 6. ‘room_type’ vs ‘minimum_nights’ boxplot

We can apply the SHAP force plot to see how each feature, e.g. ‘room_type’, affects the XGBoost model output and the price prediction. Force plots are useful for examining explanations for multiple instances of the data at once, as their compact construction allows the outputs to be stacked for ease of comparison.
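The force plots in Figures 7 and 8 can be reproduced along these lines; this is a sketch that reuses explainer and shap_values from the summary-plot snippet above, and the 1000-row slice matches Figure 8:

import shap

shap.initjs()  # enables the interactive JavaScript rendering in a notebook

# Single-instance explanation (Figure 7): how each feature pushes this one
# prediction away from the expected (base) value.
shap.force_plot(explainer.expected_value, shap_values[0, :], test.iloc[0, :])

# Stacked force plot for the first 1000 rows (Figure 8): the individual
# explanations are rotated and stacked side by side for comparison.
shap.force_plot(explainer.expected_value, shap_values[:1000, :], test.iloc[:1000, :])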

Figure 7. SHAP Force Plot

For the first 1000 rows we have:

Figure 8. For the first 1000 rows, ‘room_type’ = 1 (‘Private room’) lowers the predicted price, while ‘room_type’ = 0 (‘Entire home/apt’) pushes the prediction higher

SHAP Dependence Plot

Summary (beeswarm) plots are information-dense and provide a broad overview of SHAP values for many features at once. However, to truly understand the relationship between a feature's values and the model's predicted outcomes, it is necessary to examine dependence plots.

The figure below shows dependence plots for the top features and reveals that the relationship between SHAP values and feature values is quite different for each of them.
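A sketch of how Figure 9 can be produced with shap.dependence_plot; the latitude/longitude column names here are assumptions and should match the dataset's actual columns:

import shap

# SHAP dependence plots for the geographic features (Figure 9). By default,
# SHAP colors each point by the feature it interacts with most strongly.
for feature in ['latitude', 'longitude']:
    shap.dependence_plot(feature, shap_values, test)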

Figure 9. SHAP Dependence Plot for lat and long features

Understanding the Differences

SHAP and XGBoost’s total gain can sometimes give different feature importance scores. Here’s a breakdown of the differences and which one to trust in different scenarios:

XGBoost Total Gain:

  • Focuses on tree structure and splits. It measures the improvement in the objective function (loss reduction) brought by splits on a feature, summed over all the nodes where the feature is used. It tells you which features are used for splitting and contribute most to the overall model's fit.
  • This is a good general indicator of feature importance in tree-based models.
  • Can be biased towards features that appear earlier in the trees or are good at separating data at high levels.

SHAP (SHapley Additive exPlanations):

  • Focuses on marginal contributions to individual predictions. It estimates how much a feature’s presence or absence changes a model’s prediction for a specific observation, considering all possible feature combinations.
  • Provides a more nuanced view of feature importance, especially when features interact.

Choosing Between Them:

  • General Feature Importance: If you need a broad understanding of feature importance in a tree-based model, XGBoost’s total gain is a good starting point.
  • Detailed Feature Interpretation: If you need to understand how individual features contribute to specific predictions, especially when there might be feature interactions, then SHAP is a better choice. It provides a more in-depth explanation of how features influence the model’s output for each data point.

When They Might Diverge:

  • XGBoost total gain can be biased towards features used earlier in the trees or for high-level splits.
  • SHAP can be more sensitive to feature interactions. It considers all possible feature combinations, which can lead to different importance scores compared to XGBoost, which looks at splits in isolation.
  • Datasets with correlations between features can also influence the differences.

In Summary:

  • Start with XGBoost’s total gain for a general understanding.
  • Use SHAP when you need a deeper understanding of how features contribute to specific predictions, especially for complex models or cases where feature interactions are suspected.
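As a quick illustration (not part of the original analysis), the two rankings can be put side by side for our Airbnb model, reusing mean_total_gain and shap_values from the snippets above:

import numpy as np
import pandas as pd

# Mean |SHAP| per feature from the TreeExplainer output.
mean_abs_shap = pd.Series(np.abs(shap_values).mean(axis=0),
                          index=test.columns, name='mean_abs_shap')

# Mean total_gain per feature, averaged over the CV folds (computed earlier).
total_gain = mean_total_gain.set_index('Feature')['Importance'].rename('total_gain')

# Rank the features under both measures and compare the orderings.
comparison = pd.concat([total_gain, mean_abs_shap], axis=1)
comparison['rank_gain'] = comparison['total_gain'].rank(ascending=False)
comparison['rank_shap'] = comparison['mean_abs_shap'].rank(ascending=False)
print(comparison.sort_values('rank_shap'))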

Note

Shapley values enjoy mathematically satisfying theoretical properties as a solution to game theory problems. However, applying a game theoretic framework does not automatically solve the problem of feature importance, and our work shows that in fact this framework is ill-suited as a general solution to the problem of quantifying feature importance. Rather than relying on notions of mathematical correctness, this paper suggests that we need more focused approaches that stem from specific use cases and models, developed with human accessibility in mind [2]. (In Part II of this series I will dive into the shortcomings of SHAP feature importance.)

References:

  1. Xuanxiang Huang, Joao Marques-Silva, "The Inadequacy of Shapley Values for Explainability"
  2. Elizabeth Kumar, Suresh Venkatasubramanian, Carlos Scheidegger, Sorelle A. Friedler, "Problems with Shapley-value-based explanations as feature importance measures"
  3. Tianqi Chen, Carlos Guestrin, "XGBoost: A Scalable Tree Boosting System"
  4. Angeline Yasodhara, Azin Asgarian, Diego Huang, Parinaz Sobhani, "On the Trustworthiness of Tree Ensemble Explainability Methods"
