TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Explainable Neural Networks: Recent Advancements, Part 3

Looking back a decade (2010–2020), a four part series

G Roshan Lal
Published in TDS Archive · 7 min read · Feb 7, 2021


Where are we?

This blog focuses on developments in the explainability of neural networks. We divide our presentation into a four-part blog series:

  • Part 1 talks about the effectiveness of Visualizing Gradients of the image pixels for explaining the pre-softmax class score of CNNs.
  • Part 2 talks about some more advanced/modified gradient based methods like DeConvolution, Guided Back Propagation for explaining CNNs.
  • Part 3 talks about some shortcomings of gradient-based approaches and discusses alternative axiomatic approaches like Layer-wise Relevance Propagation, Taylor Decomposition, and DeepLiFT.
  • Part 4 talks about some recent developments like Integrated Gradients (continuing from part 3) and recent novelties in CNN architecture like Class Activation Maps developed to make the feature maps more interpretable.

Axiomatic Approaches

Up until now, we discussed gradient-based methods for understanding decisions made by a neural network. But this approach has a serious drawback. Due to the presence of units like ReLU and MaxPooling, the score function is often locally “flat” for some input pixels, in other words, it has zero gradient there. Gradient-based methods therefore often attribute zero contribution to pixels that saturate the ReLU or MaxPool. This is counter-intuitive. To address this problem, we need:

  • A formal notion of what we mean by explainability or relevance (beyond vanilla gradients). What properties do we want the “relevance” to satisfy? It would be desirable for the relevance to behave like vanilla gradients at linear layers, since gradients are good at explaining linear functions.
  • Candidates that satisfy our axioms of “relevance” and are also easy to compute; ideally, we want to compute them in a single backward pass.

Taylor Decomposition and Layer-wise Relevance Propagation (2015)

Axiomatic relevance was first explored by Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller and Wojciech Samek. They introduced the notion of Layer-wise Relevance Propagation in their work “On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation (PLOS 2015)”.

The authors propose the following axiom that relevance must follow:

  • Sum of relevance of all pixels must equal the class score of the model. We call this axiom “conservation of total relevance” from now on. This has been a popular axiom followed by other authors too.
Conservation of total relevance, Source: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130140
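In symbols (a restatement of the figure above, using the paper's notation where R_d is the relevance assigned to input dimension d):

$$ f(x) \approx \sum_{d} R_d $$

The approximation becomes exact once any relevance absorbed by bias terms is also redistributed to the inputs.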

The authors propose two different ways to distribute the total relevance to individual pixels.

1. Taylor Decomposition

In this approach, the authors propose choosing a reference image X₀ which is to be interpreted as a “Baseline Image” against which the pixels of image X are explained. It is desirable for the class score of this baseline image to be as small as possible.

Comparing the input image against a baseline image to highlight the important pixels is a recurring theme in many axiomatic-relevance works. Some good examples of baseline images are:

  • Blurred input images: Works well in colored images
  • Blank (dark) image: Works well in grey-scale/black and white images

Given the baseline image X₀, we perform Taylor Decomposition of the class score function to obtain the relevance of individual pixels.

Taylor Decomposition, Source: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130140
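To make the recipe concrete, here is a minimal sketch of first-order Taylor decomposition relevance using automatic differentiation. The names here (`score_fn`, `model`, `cls`, `blurred_x`) are illustrative assumptions, not the authors' code: `score_fn` is any differentiable class-score function, `x` the image being explained, and `x0` the baseline.

```python
import torch

def taylor_relevance(score_fn, x, x0):
    """First-order Taylor decomposition: R_d = (x_d - x0_d) * d(score)/d(x_d), at x0.

    score_fn : differentiable function mapping an image tensor to a scalar class score
    x        : input image to be explained
    x0       : baseline ("root point") image whose class score is close to zero
    """
    x0 = x0.clone().detach().requires_grad_(True)
    score = score_fn(x0)                 # class score at the baseline
    score.backward()                     # gradient of the score w.r.t. baseline pixels
    return (x - x0).detach() * x0.grad   # per-pixel relevance

# Hypothetical usage with a CNN `model` and target class index `cls`:
# relevance = taylor_relevance(lambda im: model(im.unsqueeze(0))[0, cls], x, blurred_x)
```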

An extended version of Taylor Decomposition for neural networks was suggested by the authors in another work of theirs called Deep Taylor Decomposition. Deep Taylor decomposition forms the theoretical basis of Layer-wise Relevance propagation described next.

2. Layer-wise Relevance Propagation

Taylor Decomposition is a general method that works for any class score function. For neural networks, we can design a simpler method called Layer-wise Relevance Propagation.

For a neural network, the authors propose passing the relevance down from the output layer to the contributing neurons.

  • Every time relevance is passed down from a neuron to the contributing neurons in the layer below, the total relevance received by the contributing neurons equals the relevance of the neuron it was passed down from. Hence, in LRP, the total relevance is conserved at every layer.
Conservation of Relevance in Layer-wise Relevance Propagation, Source: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130140
  • All incoming relevances to a neuron from the layer above are collected and summed up before being passed down further. As we do this recursively from one layer to the layer below, we ultimately reach the input image, giving us the relevance of each pixel.
Summing incoming relevance at a neuron, Source: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130140

It remains to define how we distribute the relevance of a neuron among its contributing inputs or input neurons. This can be achieved via multiple schemes. Here is one simple scheme given by the authors:

Layer Relevance Propagation at a single neuron, Source: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130140

We note that the above scheme only approximates the conservation of total relevance axiom. To conserve the sum exactly, we would have to redistribute the bias terms back to the inputs/input neurons in some way.
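As an illustration, here is a minimal sketch of one LRP step through a fully connected layer, using the simple z-rule above with a small epsilon stabilizer for near-zero denominators. The function name and shapes are assumptions made for this sketch, not the authors' reference implementation.

```python
import numpy as np

def lrp_dense(a, W, b, R_out, eps=1e-6):
    """Redistribute the relevance R_out of a dense layer's outputs onto its inputs.

    a     : (n_in,)                  activations entering the layer
    W, b  : (n_in, n_out), (n_out,)  layer weights and biases
    R_out : (n_out,)                 relevance of each output neuron
    """
    z = a[:, None] * W                                      # z_ij: contribution of input i to output j
    denom = z.sum(axis=0) + b                               # pre-activation of each output neuron
    denom = denom + eps * np.where(denom >= 0, 1.0, -1.0)   # stabilize near-zero denominators
    # R_i = sum_j (z_ij / z_j) * R_j; the share of each z_j taken by the bias b_j
    # is what makes the conservation only approximate, as noted above.
    return (z * (R_out / denom)[None, :]).sum(axis=1)
```

Applying such a step layer by layer, from the output back to the input, yields the per-pixel relevance map.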

Here are some results of LRP on the ImageNet dataset:

Source: https://arxiv.org/pdf/1604.00825.pdf

DeepLiFT (2017)

Following the work of Sebastian Bach et al. on LRP/Taylor decomposition, Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje proposed the DeepLiFT method in their work Learning Important Features Through Propagating Activation Differences (ICML 2017). DeepLiFT (Deep Learning Important FeaTures) uses a reference image along with an input image to explain the input pixels (similar to LRP). While LRP followed the conservation axiom, it gave no clear prescription for how to distribute the net relevance among the pixels. DeepLiFT fixes this problem by enforcing an additional axiom on how to propagate the relevance down.

The two axioms followed by DeepLiFT are:

Axiom 1. Conservation of Total Relevance: The sum of the relevance of all inputs must equal the difference between the score of the input image and that of the baseline image, at every neuron. This axiom is the same as the one in LRP.

Conservation of Total Relevance, Source: https://arxiv.org/pdf/1704.02685.pdf

Axiom 2. Back Propagation/Chain Rule: The relevance per input follows the chain rule, like gradients do. This is enough to let us back-propagate the gradient-like relevance per input. This axiom makes DeepLiFT closer to “vanilla” gradient back propagation.

Back Propagation/Chain Rule, Source: https://arxiv.org/pdf/1704.02685.pdf
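Stated in the paper's notation (with Δt = t − t⁰ the difference-from-reference of the target neuron t, Δx_i the input differences, C the contribution scores, and m the multipliers), the two axioms read:

$$ \sum_i C_{\Delta x_i \Delta t} = \Delta t, \qquad m_{\Delta x \Delta t} = \frac{C_{\Delta x \Delta t}}{\Delta x}, \qquad m_{\Delta x_i \Delta t} = \sum_j m_{\Delta x_i \Delta y_j}\, m_{\Delta y_j \Delta t} $$

The first equation is the conservation axiom (called “summation-to-delta” in the paper), the second defines the multiplier m, and the third is the chain rule through an intermediate layer y.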

The authors prove that the two axioms stated above are consistent with one another.

Given these axioms, what are some good candidate solutions for DeepLiFT? The authors suggest splitting relevance into positive and negative parts:

Positive and Negative contributions, Source: https://arxiv.org/pdf/1704.02685.pdf

Depending on the function at hand, the authors suggest the following candidate solutions for C() and m():

  • Linear Rule for linear functions: This is exactly the same as using the gradients for m(). LRP would do the same as well.
Linear Rule, Source: https://arxiv.org/pdf/1704.02685.pdf
  • Rescale Rule for non-linear functions like ReLU, Sigmoid: This is exactly the same as LRP.
Rescale Rule, Source: https://arxiv.org/pdf/1704.02685.pdf

Linear and Rescale rules follow LRP pretty closely.
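For a single unit, the two rules boil down to the following sketch (scalar inputs and hypothetical function names, chosen for clarity; the paper applies the rules element-wise across a layer):

```python
def linear_rule_multiplier(w):
    """Linear rule: the multiplier of a linear unit is simply its weight,
    i.e. exactly what the gradient would give."""
    return w

def rescale_rule_multiplier(x, x_ref, f, eps=1e-7):
    """Rescale rule for a nonlinearity f (ReLU, sigmoid, ...):
    multiplier = delta-output / delta-input, falling back to a numerical
    gradient when the input barely differs from its reference."""
    dx = x - x_ref
    if abs(dx) < eps:
        h = 1e-4
        return (f(x + h) - f(x - h)) / (2 * h)  # avoid dividing by ~0
    return (f(x) - f(x_ref)) / dx
```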

  • RevealCancel (Shapley) Rule for non-linear functions like MaxPool: Using the Rescale rule (with a reference input of 0s) for MaxPool would attribute all the relevance to the largest input; changes along the other inputs would make no difference to the output. The RevealCancel rule fixes this counter-intuitive conclusion using the idea of Shapley values.
Shapley Rule, Source: https://arxiv.org/pdf/1704.02685.pdf

Shapley values have long been used in game theory for calculating attributions of input variables. A number of recent works on explainable AI (like SHAP) use ideas inspired by Shapley values.
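To see why the Shapley idea helps for MaxPool, here is a tiny worked example; it illustrates the underlying idea rather than the paper's exact RevealCancel formulas. We average each input's marginal contribution to max(x1, x2) over both orders of revealing the inputs on top of a zero baseline.

```python
def shapley_for_max(x1, x2, ref=0.0):
    """Shapley attributions for f(x1, x2) = max(x1, x2) with reference (ref, ref)."""
    f = lambda a, b: max(a, b)
    base = f(ref, ref)
    # order 1: reveal x1 first, then x2
    c1_first, c2_second = f(x1, ref) - base, f(x1, x2) - f(x1, ref)
    # order 2: reveal x2 first, then x1
    c2_first, c1_second = f(ref, x2) - base, f(x1, x2) - f(ref, x2)
    return (c1_first + c1_second) / 2, (c2_first + c2_second) / 2

# shapley_for_max(3, 1) -> (2.5, 0.5): the smaller input still receives some credit,
# whereas a gradient (or Rescale with a zero reference) would assign it 0.
```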

The authors show the results of using DeepLiFT on a CNN trained on the MNIST dataset.

DeepLiFT vs Gradient Methods, Source: https://arxiv.org/pdf/1704.02685.pdf

What's Next?

Continuing the axiomatic approach to relevance, researchers have developed a fully gradient-based approach called Integrated Gradients, which satisfies many desirable axioms. In recent times, researchers have also explored modifying CNN architectures to make it easier to peek into them. Some of the novelty here involves Global Average Pooling and Class Activation Maps. We discuss these cool techniques in the next part.

To read more about such exciting works on explainability of neural networks, you can catch the next part here: Link to Part 4
