Model Explainability — How to choose the right tool?

Mateusz Garbacz
Published in ING Blog
11 min read · May 6, 2021

At ING we put a lot of importance on making sure that the Machine Learning (ML) models we build are well tested and safe to use. A crucial part of that is explaining and understanding the model. However, there are many explainability tools available, and each of them uses a different methodology. One technique might lead to significantly different results than another. Which one should you use in your project?

In this post, we present how we have answered that question for our team at ING, in which we mainly use tree-based models. We discuss which aspects of explainability tools were crucial while making the decision, and compare the different approaches.

Photo by Brendan Church on Unsplash

Global vs. local explainability

Model Explainability is a broad concept of analyzing and understanding the results provided by ML models. It is most often used in the context of “black-box” models, for which it is difficult to demonstrate how the model arrived at a specific decision.

Many different tools allow us to analyze black-box models, and interestingly, each of them looks at the model from a slightly different angle. Some of these techniques focus on a global explanation: a holistic view of the main factors that influence the predictions of the model. An example of a global explanation is an overview of the Feature Importance (FI) of the model. Other tools focus on generating a local explanation, which means focusing on a specific prediction made by the model.

Examples of global and local explanations.

What is essential to understand is that each model explainability tool is different. Using two distinct tools to compute global explanations leads to two different outcomes, and thus, a different understanding of the model. More importantly, using certain explanation techniques in some cases might lead to strong bias or wrong conclusions.

Having said that, we decided to research which method best suits our needs.

Criteria for choosing the tool

One of the most crucial aspects of choosing the right tool for model explainability is the context in which you operate. Think about these questions:

  • Do you work with tabular, text, or image data?
  • Do you want to get a global or local understanding of the model?
  • Which aspects of the data, the model, and the business problem need to be considered?

Addressing these questions allows you to arrive at an initial set of tools to consider, as well as the aspects that might be relevant while selecting the one that suits your needs.

I work in the Risk and Pricing Advanced Analytics team at ING, and the majority of the models that we build are binary classification models. A flagship example is a model that aims at detecting defaults, meaning whether the customer will or will not manage to pay back the loan.

There are many common characteristics of these problems:

  • Business and regulators impose the need to thoroughly analyse the model, both globally and locally. In case a customer challenges an automatic loan rejection in court, we need to explain the model’s decision,
  • The explainability tool needs to be safe to use: it should not be misleading under almost any circumstances,
  • Large tabular datasets are used to develop the model, sometimes with millions of samples and thousands of features,
  • Features have varying properties: categorical and numeric features, often with missing or correlated values,
  • Global explainability tools are used for other purposes as well, e.g. feature selection,
  • The best performing models are often tree-based approaches.

Taking these into consideration, we need to find a flexible, fast, and trustworthy solution that works well for tabular data and tree-based models. Let’s find the candidate model explainability techniques for our target solution.

Analyzed Methods

Firstly, we discuss the most commonly used model explainability tools that work with tabular data and tree-based models.

Impurity-based Feature Importance

This is one of the most commonly used methods to get a global understanding of a tree-based model. You might not recognise the name, but it is applied, for example, to compute the feature_importances_ attribute of all the tree-based models in scikit-learn.
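As a minimal sketch, assuming a scikit-learn Random Forest and a toy synthetic dataset (neither is from our actual projects), the impurity-based importances are available directly on the fitted model:

```python
# Minimal sketch: impurity-based feature importance in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
model.fit(X, y)

# Impurity FI is derived from the fitted tree structure itself,
# i.e. from the training data only.
for i, importance in enumerate(model.feature_importances_):
    print(f"feature_{i}: {importance:.3f}")
```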

Permutation Feature Importance

Another popular technique applied to get an overall understanding of the model is Permutation FI. It is computed by randomly permuting the values of a given feature and calculating the loss of the model’s performance caused by this distortion.

As an example, let’s assume a model with a ROC AUC of 90% on the validation set. If you shuffle the requested_loan_amount feature and the validation score drops to 80%, the importance of this feature is 10%.

To put it simply, the technique presents how much the performance of the model relies on a given feature and its quality.
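A minimal sketch of computing Permutation FI on held-out data with scikit-learn’s permutation_importance; the dataset, split, and scoring choice here are illustrative:

```python
# Minimal sketch: Permutation Feature Importance on a validation set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
model.fit(X_train, y_train)

# Each feature is shuffled several times; its importance is the mean drop
# in the validation score caused by the shuffling.
result = permutation_importance(
    model, X_valid, y_valid, scoring="roc_auc", n_repeats=10, random_state=0
)
for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature_{i}: {mean:.3f} +/- {std:.3f}")
```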

Leave One Feature Out (LOFO)

This method calculates the feature importance for any model and dataset by iteratively removing each feature from the set, retraining the model, and computing its performance.

Coming back to the example of the model with a validation ROC AUC of 90%: if you remove the requested_loan_amount feature, retrain the model without it, and the score drops to 80%, the importance of this feature is 10%.

In simple words, it represents how much performance would be lost if a given feature were not available.

The lofo-importance package provides a Python implementation of the technique.
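To illustrate the underlying idea, here is a simplified, single-split sketch written by hand; it is not the lofo-importance API, and the dataset, model, and column names are made up:

```python
# Simplified sketch of Leave One Feature Out: retrain without each feature
# and measure the drop in validation ROC AUC.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=0)

def auc_with_features(features):
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
    model.fit(X_train[features], y_train)
    return roc_auc_score(y_valid, model.predict_proba(X_valid[features])[:, 1])

baseline = auc_with_features(list(X.columns))
for feature in X.columns:
    remaining = [f for f in X.columns if f != feature]
    # Importance = performance lost when the feature is left out.
    print(f"{feature}: {baseline - auc_with_features(remaining):.3f}")
```

The lofo-importance package adds cross-validation and other conveniences on top of this basic idea.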

SHapley Additive exPlanations (SHAP)

SHAP is a package that, as described in its documentation,

“is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions….”

In a nutshell, SHAP computes Shapley values, which represent the contribution of a given feature towards the prediction of the model for a given sample. The explanation presents how strongly, and in which direction, a given feature affects the prediction, both locally and globally.
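A minimal sketch of computing SHAP values for a tree-based binary classifier with the TreeExplainer; the model and data are illustrative, and the exact output format can vary between SHAP versions:

```python
# Minimal sketch: SHAP values for a tree-based binary classifier.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# TreeExplainer is fast for tree ensembles; KernelExplainer works for any
# model but is much slower.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

# Global view: distribution of feature contributions over the validation set.
shap.summary_plot(shap_values, X_valid)

# Local view: per-feature contributions for a single prediction.
print(shap_values[0])
```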

Local Interpretable Model-agnostic Explanations (LIME)

LIME is a package that focuses on explaining a model’s prediction locally. The explanation is computed by first generating random points around the explained sample, computing the model’s output for these points, and then training a surrogate model on top of this output.

The surrogate models are simple and explainable (e.g. linear) ML models that are trained to approximate the predictions of the underlying black-box model. By analyzing the surrogate model, you can get insights into the explained model.

For example, let’s assume you want to explain the prediction of a complex tree-based model using a Logistic Regression surrogate model. If the Beta coefficient for the feature requested_loan_amount is above 0, higher values of that feature result in higher confidence of the positive class.

LIME is especially effective for text and image-based models; however, it is also applicable to tabular data.
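A minimal sketch of a local LIME explanation for one validation sample; the dataset, model, and class names are made up for illustration:

```python
# Minimal sketch: a local LIME explanation for one validation sample.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
model.fit(X_train, y_train)

# LIME perturbs points around the explained sample and fits a simple
# surrogate model (linear by default) on the black-box predictions.
explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=["no_default", "default"],
    mode="classification",
)
explanation = explainer.explain_instance(X_valid[0], model.predict_proba, num_features=5)
print(explanation.as_list())
```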

Analysis

In this section, we compare the selected model explainability tools in terms of the most relevant criteria. The table below presents an overview of the results of our analysis. The colors indicate the strengths (green) and weaknesses (red) of the compared tools. We will go over the different comparison dimensions and discuss each of them separately.

Comparison of selected model explainability tools.

Criteria 1: Explanation type

One of the main aspects to consider when selecting a model explainability tool is whether it allows you to get local or global explanations. Out of the compared tools, SHAP has the widest range of applications, because it allows computing various plots at the global, as well as the local, level.

Criteria 2: Data used

Another crucial aspect is the data used to compute the explanation. In general, it is preferable to compute the model explanation on data held out from model training. Thanks to this, one can assess how the model acts on unseen samples. Using only training data may have a severe and misleading effect on the explanation, especially if the model overfits it.

Of the compared methods, only the Impurity FI is computed based on the model structure, which is built from the training data. The other tools make use of the data given by the user, or Cross-Validation (CV). SHAP and LIME provide the highest flexibility, since the explanation can be computed on any data point.

Criteria 3: Model types supported

In many cases, it is crucial to ensure that the model explainability tool works for any model. In case the tree-based classifiers are outperformed by other models, the target solution should be able to explain them as well.

While Impurity FI can only be calculated for tree-based models, the Permutation FI, LOFO, and LIME are model-agnostic tools, which means that they work for any classifier.

SHAP uses various explainers, each of which focuses on analyzing specific types of models. For instance, the TreeExplainer can be used for tree-based models and the KernelExplainer for any model. However, the properties of these explainers differ significantly; for instance, the KernelExplainer is much slower than the TreeExplainer. Therefore, SHAP works especially well for tree-based models; it can be used for other models, but this needs to be done with caution.

Criteria 4: Explanation potential

Explanation potential indicates how complex a pattern a given technique can explain.

The first three of the compared tools compute the explanations for each feature separately. Therefore, they provide a limited understanding of the non-linear patterns that the model has learned.

When it comes to SHAP, the Shapley values are computed based on the individual contribution of a given feature, as well as its interactions with other features. The different plots provided by the package allow you to get different levels of insight into the model.

Finally, the complexity of a LIME explanation depends on the type of surrogate model used. For linear surrogate models, each feature is analyzed separately, while, for instance, simple tree-based surrogate models can explain more complex feature interactions.

Criteria 5: Speed

Another crucial aspect to consider is how long it takes to compute the explanation. This may affect your project, especially if you work with a relatively large dataset.

To test the speed of the tools, we set up a simple experiment that measures their mean runtime. We constructed two simple numerical datasets:

  • 2000 samples with 10 features, split into 50/50 train/validation sets,
  • 20000 samples with 100 features, split into 50/50 train/validation sets,

We then fitted a Random Forest classifier (100 estimators, max depth of 5) on each train set, and used the validation splits to compute global and local explanations. Thus, the global explanations were computed on 1000x10 and 10000x100 sets of samples, and the local explanations on 10x10 and 10x100 samples. The mean and standard deviation of the runtime were measured over multiple runs. The table below presents the runtime results:

Speed comparison of the selected tools.

Before we dive into the results, let’s keep in mind that the runtime might differ depending on the machine used, model complexity, parallelization applied, and the tool’s settings (e.g. the iterations parameter in Permutation FI).
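For reference, a minimal sketch of how such a timing comparison can be set up; the dataset size, model, and the two tools measured here are illustrative, and the absolute numbers will differ from ours:

```python
# Minimal sketch: timing two global explanation methods on held-out data.
import time

import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
model.fit(X_train, y_train)

start = time.perf_counter()
shap.TreeExplainer(model).shap_values(X_valid)
print(f"SHAP TreeExplainer: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
permutation_importance(model, X_valid, y_valid, scoring="roc_auc", n_repeats=10)
print(f"Permutation FI:     {time.perf_counter() - start:.2f}s")
```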

The experiment shows that Impurity FI and SHAP have the best performance in terms of time. For SHAP, this only applies to the TreeExplainer; if a non-tree-based model were used, the runtime would significantly increase.

Using Permutation FI and LOFO is computationally expensive, and may significantly slow down your project in some cases.

Finally, the runtime of LIME is higher than that of SHAP for local explanations of tree-based models. However, this should not affect the project negatively, because such explanations are typically computed for only a handful of samples.

Criteria 6: Correlated Features

In almost any dataset, there are pairs of highly correlated features. Such pairs of features act in a very similar way, and one does not bring much new information over the other. An excellent example would be requested_loan_amount_in_usd and requested_loan_amount_in_euro, with a Pearson correlation of 1.

Most ML models can efficiently deal with correlated features during training. However, many of the explanation tools do not. Let’s assume a situation in which the model uses only one of the two correlated features mentioned above, and all explainability tools agree that it is the most important feature.

If the model also uses requested_loan_amount_in_usd (i.e. both correlated features), Impurity FI, LOFO, SHAP, and LIME will distribute the importance over these two features for some models, possibly causing neither of them to appear in the top 5 most important features. Therefore, if you do not remove highly correlated features, these explanations might be negatively affected by them.

This issue has the most severe effect on LOFO, since its FI is computed by iteratively removing features. If, at one iteration, it removes one of the two correlated features, the model still has the other feature to train on. In such a case, for instance, a Decision Tree loses no performance at all. Therefore, in LOFO, the feature importance of correlated pairs of features might be driven to 0.

The table below presents a simplified effect of a pair of highly correlated features on the explanations:

Comparison of behaviour of selected tools in case of highly correlated features.

Overall, correlated features may have a strong effect on the quality of the model explanation. Therefore, it is best to prevent the issue by removing correlated features, or at least to be aware of them while explaining models.
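To see the splitting effect in practice, here is a minimal sketch that duplicates one feature of a toy dataset and compares the impurity importances before and after; the exact numbers will vary, but the importance of the duplicated feature is typically shared between the two copies:

```python
# Minimal sketch: a duplicated (perfectly correlated) feature splits importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0).fit(X, y)
print("original importances:        ", model.feature_importances_.round(3))

# Add an exact copy of feature 0; its importance is now shared between the copies.
X_dup = np.hstack([X, X[:, [0]]])
model_dup = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0).fit(X_dup, y)
print("with duplicate of feature 0: ", model_dup.feature_importances_.round(3))
```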

Criteria 7: High Cardinality Features

High cardinality features are variables that have many distinct values. A categorical feature loan_type has only a couple of possible values, thus it has low cardinality. In contrast, the numerical feature requested_loan_amount might take thousands of different values in your dataset, thus we call it a high cardinality feature.

When you train a tree-based model, high-cardinality features have a significantly higher chance of being selected higher in the tree, simply because they offer many more candidate split points in a tree node. As a result, the Impurity FI calculation might be significantly affected by this. This post presents an experiment in which a completely random feature scores a high Impurity FI value.

Thus, the tool is not trustworthy for numerical datasets and should be avoided there. The other analyzed tools are safe to use with high cardinality features in the dataset.
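A minimal sketch of that kind of experiment; the exact values depend on the data and model, but a purely random, high-cardinality feature typically receives a non-negligible impurity importance while its permutation importance stays close to zero:

```python
# Minimal sketch: a random, high-cardinality feature can still receive a
# noticeable impurity-based importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
X = np.hstack([X, rng.normal(size=(X.shape[0], 1))])  # append a random continuous feature

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("impurity FI of random feature:   ", model.feature_importances_[-1].round(3))
result = permutation_importance(model, X_valid, y_valid, scoring="roc_auc",
                                n_repeats=10, random_state=0)
print("permutation FI of random feature:", result.importances_mean[-1].round(3))
```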

Criteria 8: Unrealistic Data Points

The final dimension that we have considered is whether the method uses unrealistic data points to compute the model explanation.

Let’s assume we have a sample with two features: has_credit_card = True and credit_card_saldo = 1000. If you apply Permutation FI or LIME, these methods may perturb the original sample into the state has_credit_card = False and credit_card_saldo = 1000, which is contradictory. Using such a sample to draw any conclusions might lead to bias. Thus, in Permutation FI, if permuting has_credit_card leads to a dramatic loss of performance, this might mean either that the feature is important or that the model is sensitive to unrealistic situations caused by low data quality.
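As an illustration of how permuting a single column creates contradictory rows, here is a minimal sketch; the column names and data are made up for this example:

```python
# Minimal sketch: shuffling one column breaks the logical link between columns.
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
has_card = rng.binomial(1, 0.5, size=1000).astype(bool)
df = pd.DataFrame({
    "has_credit_card": has_card,
    # Customers without a card have a zero balance by construction.
    "credit_card_saldo": np.where(has_card, rng.uniform(0, 5000, size=1000), 0.0),
})

# Permutation FI shuffles a single column, ignoring its relation to other columns.
df_permuted = df.copy()
df_permuted["has_credit_card"] = rng.permutation(df_permuted["has_credit_card"].values)

contradictory = (~df_permuted["has_credit_card"]) & (df_permuted["credit_card_saldo"] > 0)
print(f"contradictory rows after permutation: {contradictory.sum()} of {len(df)}")
```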

In some domains and for some datasets, the issue described above has low severity. However, when building credit risk-related models, it is better to stick to methods that use the original dataset (SHAP and LOFO) and analyze situations that may materialize.

The Verdict

Having considered the context in which our team explains models, as well as the above analysis, we have decided to use SHAP as the best practice in our projects. It is characterized by high flexibility, speed, and safety of use.

We use SHAP in various ways and have built tools on top of it in probatus, an open-source package developed by ING, where you can find examples of how we use SHAP in practice.
