Agnostic explainable artificial intelligence (XAI): An introduction with simple code examples

Olivier Caelen
The Modern Scientist
12 min read · Nov 25, 2022

Introduction

The field of Explainable Artificial Intelligence (XAI) studies techniques that allow humans to understand the predictions made by machine learning models or, more generally, the decisions made by AI systems. The field is not new and dates back to the origins of AI. However, since 2015 there has been a resurgence of interest in XAI research (see the figure below showing the evolution of Google searches for the term ‘explainable ai’), driven by the increasing adoption in society of complex AI solutions such as deep learning models.

Image from Google Trends for the term ‘explainable ai’

The increasing complexity, coupled with the increased degree of interaction of AI decision-making systems with users who are not necessarily data experts, makes explainability a major concern in the design of AI solutions.

Explainability methods can be divided into four categories along two axes. The first axis defines whether the explainability is local or global. When the explanation is produced for an individual prediction, it is said to be local: what we are looking for is to understand which input variables contributed most to that prediction. In the global approach, we instead try to understand which variables are the most useful to the model as a whole. The second axis defines whether the explainability method is specific to a type of machine learning algorithm or is agnostic to the algorithm used. The advantage of agnostic methods is that you can easily compare results from completely different types of models. It is, for example, possible to compare, with the same XAI method, the results of a very complex deep learning model with those of a simpler k-nearest neighbors algorithm. This comparison is much harder if each explainability method is specific to the machine learning algorithm used.

The following table gives some examples of explainability methods in each category.

Image by author

In this blog, we will illustrate the use of agnostic explainability methods on a simple example, always using the same predictive model. In the next section, we present the data and the model used throughout the rest of the blog.

The model for experiments

As mentioned above, in the following examples we will always use the same prediction model, trained on the well-known California housing dataset. This dataset contains 20640 rows and 8 input variables. The target variable is the median house value in each California district, expressed in units of $100,000. You can find more information on this dataset here. After loading the dataset, we randomly split it into a training set and a testing set.

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load the California housing dataset
housing = fetch_california_housing()

X = pd.DataFrame(data=housing['data'], columns=housing['feature_names'])
y = housing['target']

# Keep 20% of the data aside as a testing set
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.2, random_state=42)

To build the prediction model, we use the training dataset X_tr, y_tr with a GradientBoosting regressor. Note that since we will use agnostic methods, the type of model does not really matter. As an indication, we also give the mean squared error on the training and testing sets. As mentioned here, it is always important to evaluate a model before explaining it:

Warning: Features that are deemed of low importance for a bad model (low cross-validation score) could be very important for a good model. Therefore it is always important to evaluate the predictive power of a model using a held-out set (or better with cross-validation) prior to computing importances.

from sklearn import ensemble
from sklearn.metrics import mean_squared_error

params = {
"n_estimators": 500,
"max_depth": 4,
"min_samples_split": 5,
"learning_rate": 0.01,
"loss": "squared_error",
}

# Train a gradient boosting regressor on the training set
reg = ensemble.GradientBoostingRegressor(**params, random_state=42)
reg.fit(X_tr, y_tr)

mse_tr = mean_squared_error(y_tr, reg.predict(X_tr))
print("The mean squared error (MSE) on training set: {:.4f}".format(mse_tr))

mse_ts = mean_squared_error(y_ts, reg.predict(X_ts))
print("The mean squared error (MSE) on testing set: {:.4f}".format(mse_ts))

Permutation Importance

This technique attempts to identify the input variables that your model considers important. Permutation importance is an agnostic and global (i.e., model-wide) technique that evaluates input features in terms of their usefulness in predicting the target variable across the model’s prediction space. It is defined as the decrease in the model's score when a single feature's values are randomly shuffled in the testing set. The drop in score indicates how much the model depends on that feature.

If we randomly shuffle a single column of the testing dataset (e.g., Population), leaving the target and all other columns in place, how would that affect the accuracy of predictions made on this now shuffled data?

The reasoning behind this method is that by randomly shuffling a variable, we remove all useful information contained in this variable. If the variable was very important to the prediction model, then we would expect the performance of the model to be strongly negatively impacted. This method has the advantage of being intuitive and easy to understand.

The process of Permutation Importance is as follows:

  1. Get a trained predictive model and calculate its accuracy on the testing set.
  2. Shuffle the values in one column of the testing set and calculate the model's accuracy on the resulting dataset. Measure how much the accuracy has suffered from this shuffling: this deterioration in performance is a measure of the importance of the variable just shuffled.
  3. Put the data back in its original order (undo the shuffle from step 2). Then repeat step 2 with the next column of the dataset, until the importance of each column has been calculated.

Steps 2 and 3 can be repeated several times to get more information about each variable.
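
To make these steps concrete, here is a minimal hand-rolled sketch of the procedure (reusing the reg, X_ts and y_ts objects defined above; the number of repetitions and the use of the MSE are illustrative choices). In practice, we would rely on the scikit-learn helper shown below rather than this loop.

import numpy as np
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)

# Step 1: baseline error of the trained model on the untouched testing set
baseline_mse = mean_squared_error(y_ts, reg.predict(X_ts))

importances = {}
for col in X_ts.columns:
    errors = []
    for _ in range(10):                       # repeat steps 2 and 3 several times
        X_shuffled = X_ts.copy()              # step 3: start again from the original data
        X_shuffled[col] = rng.permutation(X_shuffled[col].values)  # step 2: shuffle one column
        errors.append(mean_squared_error(y_ts, reg.predict(X_shuffled)))
    # Importance = average increase in error caused by shuffling this column
    importances[col] = np.mean(errors) - baseline_mse

for col, imp in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{col:12s} {imp:.4f}")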

Scikit-learn provides an easy way to calculate the permutation importance. We have to pass to this function: the predictive model reg, the testing set X_ts and y_ts, the number of times that steps 2 and 3 are repeated (n_repeats=10), and the scoring function used to estimate the prediction accuracy (scoring='neg_mean_squared_error').

from sklearn.inspection import permutation_importance


res_pi = permutation_importance(reg, X_ts, y_ts, n_repeats=10, n_jobs=2,
                                scoring='neg_mean_squared_error', random_state=42)
res_pi

If we look at the output of the permutation_importance function (res_pi):

  • importances_mean: the average loss of accuracy caused by shuffling each column, computed over the 10 repetitions.
  • importances_std: the corresponding standard deviations.
  • importances: the raw values used to calculate these means and standard deviations.

It is easier to interpret the results visually.

import matplotlib.pyplot as plt

sorted_idx = res_pi.importances_mean.argsort()

fig, ax = plt.subplots()
ax.boxplot(
    res_pi.importances[sorted_idx].T, vert=False, labels=X_ts.columns[sorted_idx]
)
ax.set_title("Permutation Importances (test set)")
fig.tight_layout()
plt.show()

The variables at the top are the most important, and those at the bottom are the least important. The values on the x-axis indicate how much the model performance decreased when the corresponding column was randomly shuffled (here measured with “neg_mean_squared_error”). In this figure, the variable MedInc (i.e., median income in the block group) appears to be the most important for our model.
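
If you prefer a textual summary to the box plot, the mean importances can also simply be paired with the feature names (a small sketch reusing the res_pi result from above):

# Mean importance per feature, from most to least important
pi_summary = pd.Series(res_pi.importances_mean, index=X_ts.columns)
print(pi_summary.sort_values(ascending=False))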

Partial Dependence Plots (PDP)

This technique attempts to identify how each feature affects the model's predictions. The easiest way to understand a PDP is perhaps to start with an example. A PDP produces the following type of plot, where we can see the average evolution of the model output as a function of one of the variables.

Here we can see that house prices tend to decrease as latitude increases, with a plateau between 35 and 37. This is useful for answering questions such as: controlling for all other housing characteristics, what is the impact of latitude on house prices? Partial dependence plots can show whether the relationship between the target and a feature is linear, monotonic or more complex.

Let’s see how it works.

  • Let X = (X_1, X_2, …, X_n) be a random vector of size n representing the n input variables of the predictive model h.
  • Let X_S be the set of input variables of interest and let X_C be its complement. Usually, there are only one or two features in the set X_S. In our example above, X_S contains the variable Latitude and X_C contains the other 7 variables.
  • Let us define the following function, which measures the average evolution of the model output when the value of X_S is set to x_s: f_S(x_s) = E_{X_C}[ h(x_s, X_C) ].
  • The partial dependence is then defined as pd(x_s) = E_{X_C}[ h(x_s, X_C) ] − E_X[ h(X) ], where the second term of the equation allows the Y axis of the above figure to be centered on zero.
  • To obtain the above figure, the partial dependence function pd() is calculated for all latitude values between 33 and 39.
  • Since these expectations cannot be calculated directly (the distribution of X would have to be known), the partial dependence function must be estimated. Let x^(i) be the observed value of observation i in the testing dataset, let x_C^(i) be its components for the variables in X_C, and let N be the number of samples in the testing set. The partial dependence is estimated as: pd(x_s) ≈ (1/N) Σ_{i=1..N} h(x_s, x_C^(i)) − (1/N) Σ_{i=1..N} h(x^(i)).

The following figure helps to understand what the first term of this formula does.

Image by author

The idea is to apply the predictive model h to the testing dataset in which the values of the column whose effect we want to analyze have been forced to be equal to x_s; the other columns remain unchanged. A prediction is made for each row, and the average of these predictions is computed at the end.
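
As an illustration, here is a minimal sketch of this estimation for a single feature, reusing the reg and X_ts objects defined above and ignoring the centering term for simplicity (the grid of 50 values is an arbitrary choice):

import numpy as np
import matplotlib.pyplot as plt

feature = 'Latitude'
grid = np.linspace(X_ts[feature].min(), X_ts[feature].max(), 50)

pd_values = []
for x_s in grid:
    X_forced = X_ts.copy()
    X_forced[feature] = x_s                           # force the column of interest to x_s
    pd_values.append(reg.predict(X_forced).mean())    # average prediction over all rows

plt.plot(grid, pd_values)
plt.xlabel(feature)
plt.ylabel('Average model prediction')
plt.show()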

Scikit-learn provides an easy-to-use function to compute partial dependence plots (PDP). We have to pass to this function: the predictive model reg and the inputs of the testing set X_ts. In the example below, we also provide sorted_idx (already calculated earlier) so that the variables appear in the same order as in the permutation importance plot, and ax to draw all the plots in the same figure.

from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(14, 14))
PartialDependenceDisplay.from_estimator(reg, X_ts, sorted_idx, ax=ax)
plt.show()

This figure shows the average evolution of the model predictions as each variable varies. If we compare this result with the permutation importance, we see that the Population variable indeed does not seem to have much impact on house prices in California districts. On the other hand, the variable MedInc has a lot of impact, and we can now also see that this impact is roughly proportional.

It is also easy to consider two variables at the same time with partial dependence plots, which makes it possible to take into account interactions between the variables.

PartialDependenceDisplay.from_estimator(reg, X_ts, [('Longitude', 'Latitude')])

SHapley Additive exPlanations (SHAP) Value

This technique attempts to explain individual predictions. The SHAP method is based on Shapley values, a well-known concept in economics and game theory for evaluating the contribution of different participants in a cooperative game.

In a cooperative game, players can form coalitions to achieve a common goal. One difficulty in the theory of cooperative games is the distribution of the gains among the players. Suppose a coalition of three players has generated a profit of 100 credits: how do you distribute this profit “fairly”? The Shapley value provides a way to do so. For a mathematician, “distribute fairly” means nothing by itself; we need axioms that express precisely what “fair” means, and for that four axioms have been defined mathematically.

  • Efficiency: all revenues (or costs) of the coalition are redistributed among the players (no more, no less).
  • Symmetry: players who make the same contribution receive the same share.
  • Additivity: when two games are combined, a player's payout in the combined game is the sum of their payouts in the two separate games.
  • Dummy player: a player who contributes nothing receives nothing.

Lloyd Shapley proved that, in a cooperative game, the Shapley value is the unique payout scheme that satisfies these four properties.
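
As a small hypothetical illustration (not taken from the original example): suppose player 1 alone earns 10, player 2 alone earns 20, and together they earn 40. The Shapley value averages each player's marginal contribution over the two possible orders of arrival: player 1 gets (10 + (40 − 20)) / 2 = 15 and player 2 gets (20 + (40 − 10)) / 2 = 25. As required by the efficiency axiom, 15 + 25 = 40, the full payout of the coalition.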

In the XAI context, SHAP is a local and agnostic method. We treat each input variable as a player in a cooperative game, where the game is the model that combines the contributions of each player to make a prediction. The payout of this game is the prediction, and we use the Shapley value to divide this payout fairly between the players (i.e., the input variables).

In the example below, a black-box machine learning model uses four input variables. Each variable makes a contribution to the cooperative game (i.e., the model), and the output of the game is the prediction.

Image by author
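
To connect this game-theoretic definition to a predictive model, here is a minimal brute-force sketch of the Shapley value of one feature, in which a feature that is “absent” from a coalition is replaced by values from a background sample (an illustrative convention; the computation is exponential in the number of features, and the shap package used below does this far more efficiently):

import itertools
import math

def coalition_value(model, x, background, coalition):
    # v(S): average prediction when the features in `coalition` are fixed to the
    # values of x and the remaining features keep their background values
    X_tmp = background.copy()
    for col in coalition:
        X_tmp[col] = x[col]
    return model.predict(X_tmp).mean()

def shapley_value(model, x, background, feature):
    # Exact Shapley value of `feature` for observation x: weighted average of the
    # marginal contributions of `feature` over all coalitions of the other features
    others = [c for c in background.columns if c != feature]
    n = len(background.columns)
    phi = 0.0
    for size in range(len(others) + 1):
        for subset in itertools.combinations(others, size):
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            with_f = coalition_value(model, x, background, list(subset) + [feature])
            without_f = coalition_value(model, x, background, list(subset))
            phi += weight * (with_f - without_f)
    return phi

# Example (slow): Shapley value of MedInc for one observation,
# using 100 training rows as the background sample
# x = X_ts.iloc[6]
# print(shapley_value(reg, x, X_tr.sample(100, random_state=0), 'MedInc'))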

The Python shap package makes it easy to obtain the contribution of each variable to a prediction. Since SHAP is a local explanation technique, we start by choosing an observation in the testing set and use the model to make a prediction on it (i.e., 2.3390358700186686). We will now try to understand why the model produces this prediction for this observation.

# Select a single observation from the testing set
row_to_show = [6]
data_for_prediction = X_ts.iloc[row_to_show]

print('The data for prediction:')
print(data_for_prediction)

print('\n\nThe prediction:', reg.predict(data_for_prediction)[0])

The exact calculation of Shapley values is computationally very expensive. Although SHAP is an agnostic XAI method, there are several ways to compute the Shapley values:

  • Kernel SHAP: an agnostic method that works with all types of models, but tends to be slower and only approximates the Shapley values.
  • Tree SHAP: faster and more accurate than Kernel SHAP, but only works with tree-based models.
  • Deep SHAP: faster and more accurate than Kernel SHAP, but only works with deep learning models.

Since our model reg is a GradientBoosting regressor, we use Tree SHAP. Note that we could also have used Kernel SHAP, because that method works with all types of models.

import shap

explainer = shap.TreeExplainer(reg)
shap_values = explainer.shap_values(data_for_prediction)
shap_values

These values are the 8 Shapley values of the 8 input variables for the prediction above (i.e., 2.3390358700186686). Some Shapley values are positive and others negative. Positive values push the model output higher, and negative Shapley values push it lower. For example, the Shapley value of MedInc is 0.30874621 and the Shapley value of AveOccup is -0.2226369.
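
To make this raw array easier to read, the Shapley values can simply be paired with the feature names (a small sketch, assuming as above that shap_values contains one row per observation):

# Shapley values of the chosen observation, labeled by feature name
shap_series = pd.Series(shap_values[0], index=data_for_prediction.columns)
print(shap_series.sort_values())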

The shap module provides a way to visualize the results in a more user-friendly format.

shap.initjs()
shap.force_plot(explainer.expected_value, shap_values, data_for_prediction.round(2), matplotlib=True)

Let’s take a moment to interpret this figure. The variables in red have a positive Shapley value, and the variables in blue contribute negatively to the prediction (i.e., they have negative Shapley values). For example, with a Shapley value of -0.2226369, the variable AveOccup appears in blue in the figure. Note that the number 3.03 shown next to AveOccup in the figure is not the Shapley value but the value taken by the variable (see above). The length of each segment corresponds to the Shapley value: for instance, the Shapley value of MedInc is 0.30874621, which corresponds to the length of the MedInc segment in the figure.

In the figure, you can see a grey mark that indicates the base value. It is a reference value, and the Shapley values explain how we go from this reference value to the prediction made by the model (i.e., 2.3390358700186686; in the figure the prediction is rounded to 2.34). The base value corresponds, in a way, to what we would predict if we had no information on the input variables. With no information, a natural value to predict is simply the average of the predictions on the training set. The following code shows that this is indeed the case: the two values are exactly the same.

print(reg.predict(X_tr).mean())
print(explainer.expected_value[0]) # Base value

In the figure, if we add all the positive contributions in red and subtract all the negative contributions, then the Shapley values explain how we get from the base value to the prediction.

shap_values.sum() # 0.26708893263979216

If we add up all the Shapley values, we get 0.26708893263979216 which is exactly the distance between the base value (i.e., 2.071946937378876) and the prediction (i.e., 2.339035870018668).

We can also see this by adding the base value with the sum of the Shapley values and see that it is equal to the prediction.

print(explainer.expected_value[0] + shap_values.sum())
print(reg.predict(data_for_prediction)[0])
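
For completeness, here is a sketch of how a similar explanation could be obtained with Kernel SHAP instead of Tree SHAP; it is slower and only approximates the Shapley values, and the background sample size and nsamples below are illustrative choices:

# Kernel SHAP: model-agnostic, only needs the model's predict function
kernel_explainer = shap.KernelExplainer(reg.predict, shap.sample(X_tr, 100, random_state=42))
kernel_shap_values = kernel_explainer.shap_values(data_for_prediction, nsamples=500)
print(kernel_shap_values)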

Conclusion

In this blog, we illustrated different local and global explainability techniques on the same example. The three techniques presented are agnostic and can be used with any type of machine learning algorithm.

Thanks for reading!

Please consider following me if you wish to stay up to date with my latest publications and increase the visibility of this blog.

And are you familiar with the Medium clap button 👏? Did you try clicking it more than once to see what happens 😎?
