Explainable and Interpretable Machine Learning — by Ryan Glassman and Michelle Maria Roy

Michelle Maria Roy
Michelle and Ryan Explain ML
Nov 30, 2020 · 18 min read

The explainability problem is well known in machine learning and particularly deep learning. The fundamental question behind it is: how is a machine learning model reaching a particular endpoint? What are the key factors, or input features, contributing to a prediction? For ensemble models and compositional models (such as deep neural networks), this question is notoriously difficult to answer. Such models are difficult to ‘break open’ — it is challenging to decompose them into interpretable components. In some application spaces, such as criminal justice and banking, this lack of interpretability is not just undesirable, but unacceptable. In such spaces, the ‘why’ is just as important as, if not more important than, the prediction itself. Being able to adequately explain the ‘inner workings’ of a model, while preserving predictive performance, is a critical goal.

Much work has been done to address the explainability problem. In this post, we’ll examine examples of two macro-approaches: explaining existing model architectures, and developing new model architectures in which explainability is a fundamental design trait.

Explaining Existing Architectures

The goal of model explainability is to enhance the interpretability of existing model architectures given some or no access to their inner structure. These approaches can broadly be broken into two camps:

“Black box” approaches assume that one cannot access a model’s parameters: one can only observe the inputs to and outputs from the model. Such methods can function independently of the model architecture and typically explain the predictions of any classifier by learning an interpretable model locally around the prediction.

“White box” approaches, by contrast, require access to the parameters learned by the model. They typically leverage the architectural specifics of a model to provide a more granular interpretation.

Here we will examine two model explainability approaches. The first, SHAP (short for SHapley Additive exPlanation), is more of a ‘black box’ method, though there are model-specific variants of the approach that require more knowledge of the model. The second, integrated gradients, is a ‘white box’ method designed for deep networks; we will illustrate it on a convolutional neural network.

SHAP (SHapley Additive exPlanation)

The SHAP method was proposed in a 2017 paper by Scott M. Lundberg and Su-In Lee of the University of Washington. Lundberg and Lee propose that SHAP constitutes “…a unified framework for interpreting predictions” in machine learning models. The authors begin by defining the class of additive feature attribution methods, which approximate the original model by taking a weighted linear combination of simplified input features. The weights attribute an effect to each feature, and their sum approximates the output f(x) of the original model.
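For reference, the paper writes an explanation model g in this class as a linear function of binary simplified features (the formula below is the paper’s definition, reproduced here since the original figures are not shown):

$$ g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i, \qquad z' \in \{0,1\}^M,\ \phi_i \in \mathbb{R} $$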

The authors observe that the set of additive feature attribution methods includes existing methods like LIME and DeepLIFT.

From there, the authors define three desirable properties of feature attribution methods:

  1. Local accuracy: when approximating a model f for an input x, the explanation model must match the original output f(x) at the corresponding simplified input x’.
  2. Missingness: if the simplified inputs represent feature presence (i.e., 1 indicates the presence of a feature and 0 indicates its absence), then a feature i that is missing in the original input must have no impact on the explanation model’s output (i.e., φ_i = 0).
  3. Consistency: if a model changes so that the marginal contribution of a feature increases or stays the same regardless of the other features, then that feature’s attribution must not decrease (i.e., it also increases or stays the same).

At this point it’s important to explain Shapley regression values, a concept borrowed from game theory. Shapley values assign an importance value to each input feature that represents the effect on the model prediction of including that feature. Shapley values are a very robust, but highly computationally expensive, way of computing feature attributions.

To compute the Shapley value for a feature i, the model is retrained on subsets S of the full feature set F: once with feature i included (i.e., on S ∪ {i}) and once with it excluded (on S alone). The two predictions on a given input x are then compared. This retraining must be done for all subsets: “Since the effect of withholding a feature depends on other features in the model, the preceding differences are computed for all possible subsets S ⊆ F \ {i}” (Lundberg & Lee 3).

The Shapley value for a feature, then, is a weighted average of all the possible differences over the different settings of S:
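The original figure is not reproduced here, but the classical Shapley value formula it refers to has the form:

$$ \phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\big(|F| - |S| - 1\big)!}{|F|!} \Big[ f_{S \cup \{i\}}\big(x_{S \cup \{i\}}\big) - f_S\big(x_S\big) \Big] $$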

The authors then bring this all together by positing that the Shapley value method is the only additive feature attribution method that necessarily satisfies all three of the desirable properties mentioned above. Following from this, “…methods not based on Shapley values violate local accuracy and/or consistency” (Lundberg & Lee 4).

Shapley values, then, are ideal, but as we saw, often prohibitively expensive to compute. What we need is a more efficient way to approximate them. This brings us to SHAP values.

SHAP values are the Shapley values of a conditional expectation function of the original model. The authors write: “SHAP values provide the unique additive feature importance measure that adheres to Properties 1–3 and uses conditional expectations to define simplified inputs” (Lundberg & Lee 5). By approximating the model outputs on feature subsets with expectations, we avoid retraining the model many times over. In the paper’s definition of SHAP values (equation 8), f_x(z’) is approximated as follows:
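The referenced approximation (reconstructed here, since the original equation image is not shown) is:

$$ f_x(z') = f\big(h_x(z')\big) = E\big[f(z) \mid z_S\big] $$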

where S is the set of non-zero indices in z’. Since most models cannot handle arbitrary patterns of missing values, f(z_S) is approximated with this expectation.

Even computing the SHAP values directly is challenging. “However,” the authors write, “by combining insights from current additive feature attribution methods, we can approximate them” (Lundberg & Lee 5). Methods for approximating the SHAP values split into two camps: model-agnostic and model-specific.

Model-agnostic approaches include:

  • Shapley sampling: Shapley sampling values apply sampling approximations to the Shapley value equation, and they approximate the effect of removing a variable from the model by integrating over samples from the training dataset, thereby eliminating the need to retrain the model. At most 2^|F| differences need to be computed (where |F| is the number of features in the dataset).
  • Kernel SHAP: Kernel SHAP is an adaptation of the LIME approximation algorithm. Parameters for LIME are generally chosen heuristically; with Kernel SHAP, they are set so that the coefficients of the local linear model recover the Shapley values, thereby upholding the aforementioned three desired properties.

Model-specific approaches leverage some knowledge about the class of the model being approximated. Such approaches include linear SHAP, low-order SHAP, and max SHAP, all of which are fairly straightforward adaptations of the SHAP values. Another is Deep SHAP, which adapts DeepLIFT “…to become a compositional approximation of the SHAP values” (Lundberg & Lee 7). Deep SHAP, unlike the other SHAP approximation methods, is a white box method. Like DeepLIFT, it requires access to the activations in the neurons of a deep neural network.
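To make this concrete, here is a minimal sketch of how Kernel SHAP is typically used via the open-source shap package (an illustration rather than code from the paper; the dataset and model are placeholders, and exact API details may vary across shap versions):

```python
# Minimal Kernel SHAP sketch using the open-source `shap` package.
# The dataset and model below are placeholders for illustration.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Kernel SHAP is model-agnostic: it only needs a prediction function and a
# background sample that defines the expectation over "missing" features.
background = shap.sample(X, 50)
explainer = shap.KernelExplainer(model.predict_proba, background)

# Attributions for the first 10 examples; `nsamples` controls how many
# feature coalitions are sampled per explanation.
shap_values = explainer.shap_values(X[:10], nsamples=200)
print(shap_values[1].shape)  # per-class attribution format may vary by shap version
```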

The authors show that SHAP achieves feature importance attributions very close to the true Shapley values, with low variance:

They also show that SHAP is consistent with human feature attributions of simple models. They write: “Our testing assumes that good model explanations should be consistent with explanations from humans who understand that model” (Lundberg & Lee 8). To collect data on human attributions, the authors explained the inputs and outputs of simple models to Amazon Mechanical Turk workers and had them assign credit for the output.

The authors also compared Deep SHAP to both DeepLIFT and vanilla SHAP:

To generate the ‘masked’ column in figure (A) above, the authors masked 20% of the pixels in the original image; these pixels were chosen to switch the predicted class from 8 to 3 according to the feature attribution given by each method. So, based on how much the ‘masked’ output looks like the number 3, we can get a sense for how well each method is attributing features.

Integrated Gradients

Using integrated gradients to attribute and visualize what a deep model has learned was first proposed by Google researchers Mukund Sundararajan, Ankur Taly, and Qiqi Yan in the paper Axiomatic Attribution for Deep Networks. The authors note that even when the mapping from input to output is fully deterministic, attributing a prediction to specific input features is far from obvious, and it becomes especially difficult in complex models such as deep networks. The authors aim to investigate causal inference via attribution in deep networks. We will explore their findings in this section.

In the case of linear models (e.g., linear or logistic regression), given an input dataset, we attribute the importance of each feature learned by the model to the magnitude of the coefficient, or weight, associated with it.

Weights associated with input features in a linear model

Consider a deep network using the Inception architecture trained on the ImageNet dataset. It takes an image as input and assigns scores to the different ImageNet categories. A possible approach to explaining the results would be to identify which pixels activated the weights corresponding to a certain class. We cannot simply examine the coefficients of the model as we do with linear models: deep networks have multiple layers of coefficients combined through nonlinear activation functions. Let us instead try an approach similar to feature attribution in linear models. If we use the gradients of the output with respect to the input, we obtain a local linear approximation of the (nonlinear) deep network.

Feature importance can be visualized using the gradients as a (soft) window over the image. Let us consider a correctly classified picture of a fireboat as an example:

Fireboat image along with classes predicted by the Inception Model
Feature Importance of Fireboat Image using only Gradients as a soft window over the image

It looks like the local linear approximation does a poor job of indicating what the network thinks is important, even though the image is classified correctly. The authors observe that this is because the prediction function flattens in the vicinity of the input, and consequently the gradient of the prediction function with respect to the input is tiny near the input vector. This phenomenon is not specific to a single model or input.

Prediction Function Score vs Input Intensity plot

The same plot also tells us how to fix the issue. Notice the large jump in the prediction score at low intensities? The authors observed that at different intensities, or scaling factors, for the input, the model still classifies our fireboat correctly. The following diagram illustrates that at lower intensities, the pixels constituting the fireboat and the spout of water are most important, but as the intensity increases, regions around the fireboat (rather than the fireboat itself) gain relative importance. Combining the important features picked up at different intensities or scaling factors gives us a better picture of what is happening.

Gradients at different intensities combined to give us a better idea of important regions in an image

This is the essence of feature attribution with integrated gradients.
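For reference, the integrated gradients attribution for the i-th input feature, relative to a baseline x’, is defined in the paper as:

$$ \mathrm{IntegratedGrads}_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i}\, d\alpha $$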

A brief overview of the algorithm prescribed by the above equation is as follows:

  • We need a function F representing the model, an input x, and a baseline input x’. The baseline is an uninformative input, such as the black image in the previous example.
  • An interpolation path is then constructed between the baseline x’ and the input x; it describes the sequence of intermediate inputs the image passes through on the way from the baseline to the given input.
  • The gradients of F’s output with respect to the input are computed, using the standard gradient operator, at set intervals along the interpolation path between x’ and x.
  • Integration, in a simple sense, is the addition of slices to form a whole. Accumulating the gradients computed at the different points along the interpolation path yields the integrated gradients.
  • Multiplying the difference between the input and the baseline by the integrated gradients tells us which features were activated and contributed to the classification. A minimal sketch of this procedure appears below.
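Here is a minimal NumPy sketch of the procedure, under the assumption that a gradient function for the model is available (the helper grad_fn is hypothetical; in practice you would obtain gradients via automatic differentiation in TensorFlow or PyTorch):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    # grad_fn(p) is assumed to return dF/dx evaluated at input p.
    # Straight-line interpolation path from the baseline x' to the input x.
    alphas = np.linspace(0.0, 1.0, steps + 1)
    path = [baseline + a * (x - baseline) for a in alphas]

    # Average the gradients collected along the path (a Riemann-sum
    # approximation of the path integral).
    grads = np.stack([grad_fn(p) for p in path], axis=0)
    avg_grads = grads.mean(axis=0)

    # Scale by the (input - baseline) difference to obtain the attributions.
    return (x - baseline) * avg_grads
```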

The integrated gradients method satisfies a number of important properties that make it a suitable explainer for any type of neural network. The authors elaborate on the following properties in the paper:

Completeness: The attributions from integrated gradients sum to the difference between the prediction scores of the input and the baseline. This property is desirable because we can be sure that the prediction is entirely accounted for.
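In symbols, completeness states that

$$ \sum_i \mathrm{IntegratedGrads}_i(x) = F(x) - F(x') $$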

Linearity preservation: If a network F is a linear combination a∗F1+b∗F2 of two networks F1 and F2, then a linear combination of the attributions for F1 and F2, with weights a and b respectively, is the attribution for the network F. The property is desirable because the attribution method preserves any linear logic present within a network that is further built upon.

Symmetry preservation: Integrated gradients preserve symmetry. That is, if the network behaves symmetrically with respect to two input features, then the attributions are symmetric as well.

Sensitivity: Integrated Gradients is sensitive in two aspects.

(A) If the baseline and the input differ only in one feature, but have different predictions, then this feature gets non-zero attribution.

(B) If a feature does not play any role in the network, it receives no attributions.

Overlaying the feature attributions on the actual images is an intuitive way to visualize the important regions identified by a model using integrated gradients. The features that have been picked up can be observed in the figure below.

As mentioned before, this method is agnostic to the network architecture and input modality. The authors also describe examples extending this methodology to textual and structured data in their original paper. The authors’ accompanying (unofficial) blog post, as well as this TensorFlow implementation of the method, can help you implement this technique in your model pipeline.

Other

There are several other black box techniques that you can also start with. LIME was one of the first techniques to receive some traction, followed by Partial Dependence plots and Anchors. Other white box / grey box approaches we recommend taking a look at include Grad-CAM (specific to CNNs) and DeepLIFT. (Note that as discussed above, SHAP adapts both LIME and DeepLIFT as some of its approximation methods.)

Interpretable Model Architectures

The goal of interpretable model architectures is to create neural network architectures that are highly accurate yet interpretable. Given the latest trends in machine learning, accuracy and interpretability seem like conflicting goals: we tend to adopt deeper, more opaque networks to achieve high accuracy on huge datasets. Explainable Neural Networks (xNNs) are a recent class of models designed to provide explainable insights into what the model has learned. We will also explore improvements to the naïve implementation of xNNs that allow them to go toe-to-toe with cutting-edge deep learning models while still maintaining their high explainability.

Explainable Neural Networks — xNN

xNN was devised in 2018 by a team of researchers at Wells Fargo led by Joel Vaughan. The team’s starting point for the model was the class of additive index models, whose output can be expressed as a sum of K smooth functions g_i(•):
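The equation figure is not reproduced here, but an additive index model has (roughly) the form:

$$ f(x) = g_1\big(\beta_1^{T} x\big) + g_2\big(\beta_2^{T} x\big) + \cdots + g_K\big(\beta_K^{T} x\big) $$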

The idea behind xNN is to reformulate the additive index model as a structured neural network:
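Again reconstructing the missing equation from the surrounding description, the structured-network form is roughly:

$$ f(x) = \gamma_1 h_1\big(\beta_1^{T} x\big) + \gamma_2 h_2\big(\beta_2^{T} x\big) + \cdots + \gamma_K h_K\big(\beta_K^{T} x\big) $$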

where h_i are subnetworks within the larger network (more on this shortly), and 𝛾_i are the weights connecting each subnetwork to the output layer, which are learned during training.

Below is the overall structure of the xNN:

The subnetworks (outlined in blue) learn the ridge functions h_i(•). Each subnetwork has univariate input and output, which allows each ridge function to be plotted easily in two dimensions. There are no connections between subnetworks, which allows them to be analyzed completely separately from one another. The combination layer (the final sigma on the right-hand side of the figure) is where the 𝛾_i parameters are learned. A linear activation function on this layer ensures that the output of the network as a whole is a simple linear combination of the ridge functions.
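To make the structure concrete, here is a rough Keras sketch of an xNN (an illustration of the idea under TensorFlow 2.x, not the authors’ reference implementation; the layer sizes are arbitrary):

```python
import tensorflow as tf

def build_xnn(n_features, n_subnetworks, hidden_units=16):
    inputs = tf.keras.Input(shape=(n_features,))

    # Projection layer: subnetwork i receives the single learned projection
    # beta_i^T x, so each subnetwork has univariate input.
    projections = tf.keras.layers.Dense(
        n_subnetworks, use_bias=False, name="projections")(inputs)

    ridge_outputs = []
    for i in range(n_subnetworks):
        # Slice out the i-th projection; subnetworks share no connections.
        z_i = tf.keras.layers.Lambda(lambda t, i=i: t[:, i:i + 1])(projections)
        h = tf.keras.layers.Dense(hidden_units, activation="relu")(z_i)
        h = tf.keras.layers.Dense(hidden_units, activation="relu")(h)
        ridge_outputs.append(tf.keras.layers.Dense(1)(h))  # univariate output h_i

    # Combination layer with a linear activation: the network output is a
    # linear combination sum_i gamma_i * h_i of the ridge functions.
    combined = tf.keras.layers.Concatenate()(ridge_outputs)
    outputs = tf.keras.layers.Dense(1, activation="linear", name="gamma")(combined)
    return tf.keras.Model(inputs, outputs)

model = build_xnn(n_features=5, n_subnetworks=5)
model.compile(optimizer="adam", loss="mse")
```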

“In practice,” the authors write, “fitted xNN models exist on a spectrum of model recoverability while retaining a high degree of explainability” (Vaughan et al 6). In other words, the underlying function may or may not be totally or partially recoverable — and in real-world scenarios, it’s impossible to know whether the original function has been recovered. But whether xNNs learn an underlying function or just a good approximation of it, they maintain their explainability.

The authors provide the following example, defining a function as a summation of the first three Legendre polynomials:
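The defining figure is not reproduced here, but based on the description and the discussion that follows, the test function has the form

$$ f(x) = p_1(x_1) + p_2(x_2) + p_3(x_3), $$

where p_1(t) = t, p_2(t) = (3t² − 1)/2, and p_3(t) = (5t³ − 3t)/2 are the first three (non-constant) Legendre polynomials, and the remaining features x_4 and x_5 do not appear.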

They show that an xNN with five subnetworks exhibits high function recoverability and high explainability — the subnetworks properly learn the individual polynomial functions:

The first subnetwork roughly learns the first Legendre polynomial and properly learns a high coefficient for x_1, and basically zero coefficients for the other input features. Likewise for subnetworks two and three. Subnetworks four and five correctly discern that x_4 and x_5 have no effect on the output, and so they learn constant functions.

For underlying functions that are not additive, the xNN still exhibits high explainability, even though recoverability is low.

The authors note that xNNs may also be used as surrogate models; they can be trained on the inputs and outputs of a parent network to attempt to explain the decision making of that network.

The xNN is perhaps limited in its applicability: the requirement that each subnetwork have univariate input and output makes it a poor candidate for adaptation to fields like computer vision. However, in fields where relatively low-dimensional modeling is often sufficient, such as finance, it shows promise.

Enhancing Explainability of xNNs

A few key improvements to the naïve xNN architecture were proposed in the paper Enhancing Explainability of Neural Networks through Architecture Constraints by Yang et al. To replace currently popular deep neural network architectures, xNN architectures would have to achieve comparably high accuracy while guaranteeing interpretability even on noisy real-world data. The authors point out that the original naïve implementation leaves room for improvement: nothing in it enforces sparsity, orthogonality of the projection indexes, or smoothness of the ridge functions, and the fitted model is not guaranteed to be identifiable.

The authors suggest the following architectural changes to help improve explainability:

xNN.enhanced architecture

The model is trained using the SOS-BP algorithm. This algorithm is based on modern neural network training techniques, including backpropagation for calculating derivatives, the Cayley transform for preserving the orthogonality of the projections, mini-batch gradient descent, batch normalization, and the Adam optimizer. The proposed model is both identifiable and explainable and can be written as follows:

Here T1 ≥ 1 and T2, T3 ≥ 0 are the regularization thresholds, and the constraints (2a-2e) impose interpretability from the sparse, orthogonal, and smooth perspectives:

  • Constraints 2a and 2b (sparse additive subnetworks) use two l1-norm constraints, with appropriate choices of T1 and T2, to induce sparsity in the ridge functions and the projection weights; some of the β and W values are driven toward zero.
  • Constraint 2c (smooth function approximation) uses a functional roughness penalty, formulated as the integrated squared second-order derivative of each ridge function over the range of the projected data (written as Omega), to enforce smoothness.
  • Constraint 2d (orthogonal projection pursuit) keeps the projection indexes mutually orthogonal (i.e., on the Stiefel manifold).
  • Constraint 2e imposes the zero-mean and unit-norm requirements on each ridge function.

A model is not identifiable if it admits more than one representation, and such non-uniqueness is problematic. It turns out that the interpretability constraints imposed on the xNN model also make it identifiable (i.e., its representation is unique), which further enhances interpretability.

By the method of Lagrange multipliers, both the l1 penalty and the l2 roughness penalty can be formulated as soft regularizers, while the orthogonality, zero-mean and unit-norm requirements are hard constraints. This leads to the following constrained optimization problem:

The authors have proposed the following SOS Back Propagation algorithm to solve the optimization problem:

SOS-BP Algorithm

The proposed SOS-BP algorithm adopts a mini-batch gradient descent strategy and utilizes several recent developments in neural network training, which makes it capable of handling very large datasets. The algorithm’s steps are as follows:

a.) For initialization, the projection matrix W is generated subject to the orthogonality constraint.

b.) Each subnetwork, modeled by a feedforward neural network, is parametrized by H.

c.) The roughness penalty Omega for each ridge function is evaluated empirically on each mini-batch of data.

d.) The three soft regularization terms are handled automatically by gradient descent, while the orthogonality constraint is maintained via the Cayley transform. To deal with the zero-mean and unit-norm constraints on each ridge function, a normalization procedure is required; the popular batch normalization strategy is adopted for this, with momentum set to zero.

As a rule of thumb, we can rank all the subnetworks according to their importance ratios, which are derived from the learnt β values. This gives us a good idea of the important features learnt by the model.

Several benchmark models were considered for comparison, including the xNN.naive model, SVM, random forest, LASSO, and logistic regression. The authors consider several scenarios to test the effectiveness of the improved xNN model. The additive model scenario satisfies all assumptions made by the enhanced xNN model: it consists of four additive ridge function components with mutually orthogonal projection indexes. They also consider a non-orthogonal additive model, in which the model still takes the additive form but the projection indexes are not mutually orthogonal. The final, worst-case scenario violates all of the additive index model (AIM) assumptions. The results for all scenarios are presented in the paper, with the best results (lowest validation loss) highlighted in bold. Let us go over a few of them:

Validation Loss of various models compared to xNN.enhanced

We see that even in worst-case scenarios, the xNN.enhanced model achieved the best or nearly the best performance in most cases. This supports the claim that xNN is competitive with respect to prediction accuracy. The visualized model fits further illustrate that the xNN.enhanced is closer to ground truth compared to the naïve model in terms of learning the importance of input data features.

Visual Representation of Model Fits with respect to features compared to the ground truth

In each sub-figure, the left panel shows the ridge function and the right panel presents the corresponding projection index. The ridge functions and projection indexes are sorted in descending order of importance ratio (IR). Note that the xNN.enhanced model learns feature importances closer to the ground truth. Although we have only considered the model fits from one scenario in this post, the paper illustrates the feature importances in all scenarios in greater detail, and we encourage readers to review them. In those experiments, xNN.enhanced consistently outperformed the other models in explainability when compared to the ground truth. Taken together, the experiments show that this architecture provides an effective balance between predictive accuracy and model interpretability, and it is reasonable to consider xNN.enhanced a promising first step toward highly accurate, interpretable machine learning architectures.

Notably, the original paper was written in collaboration with the Corporate Model Risk team at Wells Fargo, and the authors also demonstrate the effectiveness of the model on low- to moderate-risk samples from the LendingClub dataset. Peer-to-peer (P2P) lending is a method of lending money through online services that match individual lenders and borrowers. It is a prominent FinTech application, and the xNN architecture has shown promising results on it.

Concluding Thoughts

This post covers some of the techniques that can be used in creating more transparent intelligent systems. Explainable intelligent systems are essential in high-risk fields like medicine and finance. Employing techniques such as the ones covered above not only enables us to build more accountability into ML systems; it also allows us to better debug those systems and make strategic improvements to models, which helps us better manage ML system lifecycles in production environments.

In an ideal scenario, the intelligent systems field would move toward highly accurate and inherently interpretable architectures wherever possible. While these initially seemed like conflicting objectives, architectures like xNN and its recent improvements now represent a promising step toward unifying them. We hope to see more such advancements in the near future that enable the development of responsible and transparent technology that can be applied with minimum risk in a wide range of real-world scenarios.

References:

A Unified Approach to Interpreting Model Predictions, by Scott M. Lundberg and Su-In Lee

Axiomatic Attribution for Deep Networks, by Mukund Sundararajan, Ankur Taly, and Qiqi Yan

Explainable Neural Networks based on Additive Index Models, by Joel Vaughan et al.

Enhancing Explainability of Neural Networks through Architecture Constraints, by Zebin Yang, Aijun Zhang, and Agus Sudjianto
