Deep Dive into Neural Network Explanations with Integrated Gradients

A Practitioner’s Guide

Deep neural networks are widely used models that have shown great success in domains such as images, natural language processing, and time series. While their efficacy in these specialized domains is unrivaled, neural networks are often thought of as “black-box” models due to their opacity.

Given this, how can we peer into and understand neural networks? Model-agnostic explainability methods such as LIME, SHAP, and QII [2] do exist, and they operate under the black-box assumption that only inputs and outputs are known. However, these methods can be impractically expensive to compute for neural networks, where the number of features tends to be an order of magnitude higher than in other domains such as tabular data. Images have thousands of pixels with multiple channels, time-series data adds a time dimension, and NLP models often use high-dimensional embedding spaces to encode language. Luckily, neural network training processes and frameworks usually rely on gradient descent methods. The availability of network gradients offers an alternative way of creating explanations: the gradients are additional information that both speeds up the explanation method and gives the explanations axiomatic benefits.

This blog will explain the Integrated Gradients (IG) method in detail, including the mathematical foundations, how it compares to other methods, and how you can use it yourself. The examples shown are most easily illustrated in the image domain, but IG can be used for any deep learning task.

Original Image by StockSnap on Pixabay and edited images by author | Left: Original image | Middle: Integrated Gradients explanation of beagle | Right: Integrated Gradients explanation of mountain bike

What is the Integrated Gradients method?

Integrated Gradients was proposed by M. Sundararajan, A. Taly, and Q. Yan in Axiomatic Attribution for Deep Networks [1]. The equation to compute the Integrated Gradients attribution for an input record x and a baseline record x’ is as follows:
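
In the notation of [1], with F the function being explained, i indexing the input features, and α in [0, 1] the interpolation variable, the attribution can be written as below. The second line is the Riemann-sum approximation with m steps given in the same paper.

```latex
\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1}
    \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\, d\alpha

\mathrm{IG}_i^{\text{approx}}(x) = (x_i - x'_i) \sum_{k=1}^{m}
    \frac{\partial F\bigl(x' + \tfrac{k}{m}\,(x - x')\bigr)}{\partial x_i} \cdot \frac{1}{m}
```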

Let’s start by breaking down the components of this equation and the motivations behind it. We’ll start by contextualizing Integrated Gradients within our explanation taxonomy referenced in our blog post: Picking an Explanation Technique. In our taxonomy, we must define for an explanation technique: Scope & Outputs, Inputs, Access, and Stage.

The first component is the scope and output we are trying to explain. Integrated Gradients is flexible enough to explain the output of any differentiable function of the input x, the most straightforward being the scalar output of a neural network classifier. In the above equation, this is the function F applied to x. The scope of the IG method is both local and global; how it achieves both is best discussed after expanding on the axioms in the next section.

The input being explained is the vector x that the neural network operates on. In the simplest case this is the top-level input of the network, though that is not always so. For example, in NLP use cases the explanations have to be computed on the embedding space rather than on the sentence inputs, because gradients cannot be propagated through the embedding dictionary.
K. Leino, S. Sen, A. Datta, M. Fredrikson and L. Li, in Influence-Directed Explanations for Deep Convolutional Networks [3], also propose a method for explaining internal layers of a network for further analysis.

Lastly, the stage and access of this explanation are post-training, with access to network gradients. We need to query the neural network and take derivatives through it, because the model makes up part of the function F.


IG satisfies a handful of desirable axioms that are outlined in the paper [1]. We will highlight two that are particularly important for an explanation method:

  1. The completeness axiom
  2. The sensitivity axiom

The main motivation to use IG is the completeness axiom. That is, “given x and a baseline x’, the attributions of x add up to the difference between the output of F at the input x and the baseline x’” [1]. This property is desirable because it ties the attributions directly to contributions towards the value of F, and gives each input component of a local explanation a weight proportional to its share of the output scalar. What makes this possible is that integration and differentiation are inverse operations: integrating the derivative of F along a path recovers the change in F between the endpoints of that path.
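
As a quick check of completeness, the following minimal sketch approximates the IG integral with a midpoint Riemann sum and compares the summed attributions with F(x) - F(x’). The logistic toy model, its weights, and the finite-difference gradient are illustrative assumptions standing in for a real network and its autodiff gradients.

```python
import math

def F(x):
    """Toy differentiable 'model': a logistic unit over two features (assumed weights)."""
    weights, bias = [2.0, -1.0], 0.5
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def grad_F(x, eps=1e-6):
    """Central finite-difference gradient of F; a stand-in for autodiff."""
    grads = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grads.append((F(hi) - F(lo)) / (2 * eps))
    return grads

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of IG along the straight-line path."""
    mean_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, g in enumerate(grad_F(point)):
            mean_grad[i] += g / steps
    # Multiply the averaged gradient by (x - x') for each feature.
    return [(xi - bi) * g for xi, bi, g in zip(x, baseline, mean_grad)]

x, baseline = [1.0, 3.0], [0.0, 0.0]
attributions = integrated_gradients(x, baseline)
gap = abs(sum(attributions) - (F(x) - F(baseline)))
print("attributions:", attributions)
print("completeness gap:", gap)
```

The gap is only approximation error from the Riemann sum and the finite differences; it shrinks as the number of steps grows.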

This is also what gives IG explanations a global scope in the explanation taxonomy. Satisfying the completeness axiom while using a constant baseline means that the explanation of one record is comparable with the explanation of any other record. All attributions are on the same scale, namely the scale of the output of the F function. This could be [0, 1] in probability space, or (-inf, inf) on the logit scale.

The second desirable axiom of the IG method is the sensitivity axiom. This axiom is described in two parts.
1. “For every input and baseline that differ in one feature but have different predictions then the differing feature should be given a non-zero attribution.” [1]
2. “If the function implemented by the deep network does not depend (mathematically) on some variable, then the attribution to that variable is always zero.” [1]

Generally, plain gradients violate this axiom. The last component of the IG equation, α, which parameterizes the path from the baseline to the x value, helps fix this issue. The derivative with respect to α creates a multiplicative factor (x - x’), which means that if the baseline and x value are the same for some xᵢ, no attribution is assigned to that input. In other words, the path integral satisfies the completeness and sensitivity axioms simultaneously because gradients are only collected along those of the k dimensions that change from the baseline.
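
The difference between plain gradients and IG can be seen on the one-variable function F(x) = 1 - ReLU(1 - x), the example used in [1] with a baseline of 0 and an input of 2. The sketch below is only an illustration; the step count is an arbitrary choice.

```python
def F(x):
    """One-variable, one-ReLU function: rises until x = 1, then stays flat."""
    return 1.0 - max(0.0, 1.0 - x)

def dF(x):
    """Derivative of F: 1 on the rising part, 0 once the ReLU has saturated."""
    return 1.0 if x < 1.0 else 0.0

x, baseline, steps = 2.0, 0.0, 100

# Plain gradient at the input: the function is flat there.
plain_gradient = dF(x)

# IG: average the derivative along the path (midpoint rule), then scale by (x - x').
mean_gradient = sum(
    dF(baseline + (k + 0.5) / steps * (x - baseline)) for k in range(steps)
) / steps
ig_attribution = (x - baseline) * mean_gradient

print("F(x) - F(baseline):", F(x) - F(baseline))
print("plain gradient:", plain_gradient)
print("IG attribution:", ig_attribution)
```

The output changes from 0 to 1 between the baseline and the input, yet the gradient at the input is zero. IG collects the gradient from the part of the path where the function was still changing, so the feature receives the full difference.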

Integrated Gradients proposes using the straight-line path between x’ and x, primarily to satisfy the symmetry axiom: “Two input variables are symmetric w.r.t. a function if swapping them does not change the function. For instance, x and y are symmetric w.r.t. F if and only if F(x, y) = F(y, x) for all values of x and y” [1]. While this does hold true, it should be noted that further research into IG-based attributions may not require this axiom to be satisfied. The straight-line path has limitations, in that the interpolations may cross into out-of-distribution points in the network’s decision manifold. So long as F is continuously differentiable, integrating gradients along any path within the k dimensions will satisfy the completeness and sensitivity axioms.

Method comparisons

It is worth comparing and considering the fundamental differences between different explanation methods and the pros and cons of using IG versus other common explainability methods.

Integrated Gradients and Saliency Maps

Let’s first compare IG to another common gradient explainer: saliency maps. In the simplest sense, saliency maps are the gradients of a final output of a neural network with respect to its input features. This explanation highlights the features that are the most reactive and most likely to change the output quickly, but it only makes sense for small deviations away from the original input. The completeness axiom of IG gives a stronger (and more complete) representation of what went into the output. This is because the gradients used for saliency maps generally come from a saturated region of the model, where they can be close to zero even for features that were important in reaching the output.

Image by Author, inspired by Ankur Taly’s Explaining Machine Learning Models — Stanford CS Theory

Similarly, the relationship between saliency maps and IG is analogous to the relationship between LIME and Shapley values.

Integrated Gradients & The Shapley Value

IG and Shapley-based methods like QII [2] are worth comparing because both give explanations that can be used locally and globally, as they satisfy analogous axioms. The completeness and sensitivity axioms of IG are analogous to the efficiency and dummy axioms referenced in our blog: The Shapley Value for ML Models.

Integrated Gradients and Shapley-value-based strategies take two different approaches to approximating the Shapley value. The Shapley value has a generalization to the infinite and continuous domain called the Aumann-Shapley value. Both concepts are rooted in coalitional game theory, which aims to average influences over all contexts. For the Shapley value, the space of contexts grows factorially, so a sampling method is used. For the Aumann-Shapley value, the space is infinite, but the Integrated Gradients straight-line path can be thought of as a unique sample that preserves desirable axioms of the Aumann-Shapley value.
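
To make the two approaches concrete, here is a minimal sketch that computes exact Shapley values (averaging over every feature ordering, with absent features set to the baseline) and a straight-line IG estimate for the same toy function. The function F(x) = x1 * x2, the input, and the baseline are illustrative assumptions, and the finite-difference gradient stands in for autodiff.

```python
from itertools import permutations

def F(x):
    """Toy two-feature function with a pure interaction term."""
    return x[0] * x[1]

def shapley_values(f, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(x)
    orderings = list(permutations(range(n)))
    phi = [0.0] * n
    for order in orderings:
        current = list(baseline)
        for i in order:
            before = f(current)
            current[i] = x[i]          # switch feature i from baseline to input
            phi[i] += (f(current) - before) / len(orderings)
    return phi

def integrated_gradients(f, x, baseline, steps=50, eps=1e-6):
    """Midpoint Riemann-sum IG along the straight line from baseline to x."""
    attr = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i in range(len(x)):
            hi, lo = list(point), list(point)
            hi[i] += eps
            lo[i] -= eps
            grad_i = (f(hi) - f(lo)) / (2 * eps)
            attr[i] += (x[i] - baseline[i]) * grad_i / steps
    return attr

x, baseline = [3.0, 4.0], [0.0, 0.0]
shap = shapley_values(F, x, baseline)
ig = integrated_gradients(F, x, baseline)
print("Shapley:", shap)
print("IG:     ", ig)
```

For this particular function both methods split the output of 12 evenly between the two features. That agreement is a property of the toy function, not a general result: for F(x) = x1² * x2, for example, the two methods assign different shares. The loop over orderings is also what becomes intractable as the number of features grows.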

In this sense, the theoretical differences show up in the sampling. QII constructs a set of hypothetical records near x, which makes the explanation more robust by testing whether a feature is consistently a driver of the output, or whether changes in other features may give that feature’s value less influence. You may even think of these as similar to gradients, just at a larger step size. IG, on the other hand, picks only one path but takes many samples along that path. By doing this, IG loses some robustness because it does not evaluate as many hypothetical paths. However, because neural networks are continuous and differentiable, the sample can arguably have stronger distributional faithfulness than discrete Shapley-value methods.

No strong case can be made for either method being better in the theoretical sense, so methods are generally chosen for practical reasons. QII does not require any access to the model internals, and is thus useful for a wider range of model architectures. Neural network domains have higher dimensionality, where discrete Shapley-value sampling methods are usually intractable: the space of records near x is much too large to yield statistically significant results for models with thousands of inputs. The availability of gradients allows a manageable number of calculations to be made. The estimate is computed by taking gradients at discrete partitions of the straight line between the baseline x’ and x, averaging them, and multiplying by the activation multiplier (x - x’). If the gradient manifold is sufficiently smooth, 10–20 discrete partitions are often sufficient. For highly volatile manifolds, more interpolation points can be taken, but in practice more than 100 points are rarely needed for a close estimate.
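
The effect of the number of partitions can be sketched by measuring the completeness gap of the estimate at different step counts. The logistic toy model and its weights are illustrative assumptions, and the left Riemann sum is one simple choice of discretization; real networks may converge faster or slower.

```python
import math

WEIGHTS = [2.0, 1.0]  # assumed toy weights

def F(x):
    """Toy logistic model."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))

def grad_F(x):
    """Analytic gradient of the logistic model."""
    s = F(x)
    return [s * (1.0 - s) * w for w in WEIGHTS]

def ig_estimate(x, baseline, steps):
    """Left Riemann sum over `steps` partitions of the straight-line path."""
    attr = [0.0] * len(x)
    for k in range(steps):
        alpha = k / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grads = grad_F(point)
        for i in range(len(x)):
            attr[i] += (x[i] - baseline[i]) * grads[i] / steps
    return attr

x, baseline = [2.0, 2.0], [0.0, 0.0]
target = F(x) - F(baseline)

# Completeness gap for each step count: how far the attributions are from F(x) - F(x').
gaps = {
    steps: abs(sum(ig_estimate(x, baseline, steps)) - target)
    for steps in (10, 20, 100)
}
for steps, gap in gaps.items():
    print(f"{steps:>3} steps: completeness gap = {gap:.4f}")
```

Because completeness gives a known target, the gap is a cheap diagnostic for whether enough interpolation points were used on a given record.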

There is further research that can bridge the context gap between IG and Shapley-value-based methods, such as finding representative neutral baselines, using multiple baselines, or using semantically meaningful baselines such as nearby decision boundaries. One such method is the boundary attributions variation [4], which sets the closest class boundary as the baseline.

Integrated Gradients in Practice

The rest of this blog will highlight the choices to make when using IG in a practical setting. There are quite a few of them, and flexibility is essential. The examples use an open-source gradient explanation library called TruLens to showcase how IG can be used in practice. The code examples will reference Distribution of Interest (DoI), Quantity of Interest (QoI), and a custom InternalInfluence method, which are the building blocks of TruLens outlined in: A Hands-on Introduction to Explaining Neural Networks with TruLens.

Translating Integrated Gradients to Code

In the previous sections IG was defined as a function in the continuous space, but we also highlighted that an estimation can be done by discrete partitioning of the straight line interpolation.

TruLens has a ready to use implementation of IG that is faithful to the original method.

from trulens.nn.attribution import IntegratedGradients
from trulens.visualizations import MaskVisualizer
# Create the attribution measure.
ig_computer = IntegratedGradients(model, resolution=10)
# Calculate the input attributions.
input_attributions = ig_computer.attributions(beagle_bike_input)
# Visualize the attributions as a mask on the original image.
visualizer = MaskVisualizer(blur=10, threshold=0.95)
visualization = visualizer(input_attributions, beagle_bike_input)
Original Image by StockSnap on Pixabay and edited by author | Visualized Beagle Class Explanation: Integrated Gradients

Implementing and customizing IG is quite simple as well, using the TruLens Distribution of Interest (DoI). The DoI represents a set of records to average over; for IG we want to use the linear interpolation from a baseline. The TruLens open-source implementation defines this linear interpolation DoI, and that alone is sufficient to start creating customizations on top of IG.

A TruLens DoI averages all gradients in the distribution. To apply IG’s sensitivity, the LinearDoi also defines get_activation_multiplier, which supplies the (x - x’) factor.

You can inherit from the TruLens DoI class to implement your own DoI. Once you have defined a DoI, you can construct your custom attribution using the InternalInfluence method.

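from trulens.nn.attribution import InternalInfluence
from trulens.nn.distributions import DoI
from trulens.nn.quantities import MaxClassQoI
from trulens.nn.slices import InputCut, OutputCut, Slice

class MyCustomDoI(DoI):
    def __call__(self, z):
        # Return the points to average gradients over for the input z.
        ...

    def get_activation_multiplier(self, activation):
        # Return the factor applied to the averaged gradients, e.g. (x - x').
        ...

# Define a custom influence measure (QoI and DoI arguments assumed from the imports above)
infl = InternalInfluence(model,
    Slice(InputCut(), OutputCut()),
    MaxClassQoI(), MyCustomDoI())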

If one were to experiment with different path strategies, the DoI would be the best place to do so.

Customization of Output Function

Another customization area of Integrated Gradients is the output F to be explained. You may choose to explain either the logit layer or the probability layer. Which one to explain depends on the final use case. Model probabilities give each feature’s exact contribution to the score, whereas logits may give better comparative scores between records whose probabilities lie near 1 or 0, where differences are squashed by the sigmoid function. The InternalInfluence method in TruLens lets you choose any output layer via Slice and Cut objects.

from trulens.nn.attribution import InternalInfluence
from trulens.nn.quantities import MaxClassQoI
from trulens.nn.distributions import LinearDoi
from trulens.nn.slices import InputCut, Cut, Slice

# Get the layer name to explain from the model summary
model.summary()  # you may also be able to see it from print(model)
layer_to_explain = 'logits'

# Define a custom influence measure (QoI and DoI arguments assumed from the imports above)
infl = InternalInfluence(model,
    Slice(InputCut(), Cut(layer_to_explain)),
    MaxClassQoI(), LinearDoi())
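
Independently of TruLens, the effect of the output choice can be sketched in plain Python. The example below attributes both the logit and the sigmoid probability of a toy model for two records that both sit near probability 1. The unit weights, the two records, and the finite-difference gradient are illustrative assumptions.

```python
import math

def logit(x):
    """Toy pre-sigmoid score; the unit weights are an arbitrary assumption."""
    return x[0] + x[1]

def probability(x):
    """Sigmoid of the toy logit."""
    return 1.0 / (1.0 + math.exp(-logit(x)))

def integrated_gradients(f, x, baseline, steps=400, eps=1e-5):
    """Midpoint Riemann-sum IG with finite-difference gradients."""
    attr = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i in range(len(x)):
            hi, lo = list(point), list(point)
            hi[i] += eps
            lo[i] -= eps
            grad_i = (f(hi) - f(lo)) / (2 * eps)
            attr[i] += (x[i] - baseline[i]) * grad_i / steps
    return attr

baseline = [0.0, 0.0]
record_a, record_b = [4.0, 4.0], [8.0, 8.0]

logit_a = integrated_gradients(logit, record_a, baseline)
logit_b = integrated_gradients(logit, record_b, baseline)
prob_a = integrated_gradients(probability, record_a, baseline)
prob_b = integrated_gradients(probability, record_b, baseline)

print("logit attributions:      ", logit_a, logit_b)
print("probability attributions:", prob_a, prob_b)
```

On the probability output, the two records receive almost identical attributions because both scores are squashed towards 1. On the logit output, the second record's attributions are twice as large, which keeps the records distinguishable.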

Another use case is defining F as a comparison between class outputs. For example, on a vehicle dataset, if we query a specific class, IG may focus on the wheel as the reason a car is chosen as the main classification.

Image from Influence-Directed Explanations for Deep Convolutional Networks [3], reposted with permission | Internal attribution features of a car

Suppose we instead change the F function to F = (Class A output) - (Class B output). This asks why the model might think the classification should be convertible rather than car, or vice versa; the IG explanation would then focus on the roof of the car.

Image from Influence-Directed Explanations for Deep Convolutional Networks [3], reposted with permission | Internal attribution features of a car vs convertible
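
The same shift in focus can be reproduced on a toy model. In the sketch below, the two features ("wheels visible" and "roof open"), the class weights, and the finite-difference gradient are all hypothetical; the point is only how the attributions move when F changes from a single class output to a difference of two class outputs.

```python
def car_logit(x):
    """Hypothetical class score: wheels help, an open roof hurts."""
    return 3.0 * x[0] - 2.0 * x[1]

def convertible_logit(x):
    """Hypothetical class score: wheels help, an open roof helps."""
    return 3.0 * x[0] + 2.0 * x[1]

def convertible_vs_car(x):
    """Comparative query: F = (convertible output) - (car output)."""
    return convertible_logit(x) - car_logit(x)

def integrated_gradients(f, x, baseline, steps=20, eps=1e-5):
    """Midpoint Riemann-sum IG with finite-difference gradients."""
    attr = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i in range(len(x)):
            hi, lo = list(point), list(point)
            hi[i] += eps
            lo[i] -= eps
            grad_i = (f(hi) - f(lo)) / (2 * eps)
            attr[i] += (x[i] - baseline[i]) * grad_i / steps
    return attr

# Features: [wheels_visible, roof_open]; the record is a convertible.
x, baseline = [1.0, 1.0], [0.0, 0.0]
single_class = integrated_gradients(convertible_logit, x, baseline)
comparison = integrated_gradients(convertible_vs_car, x, baseline)

print("convertible output:   ", single_class)
print("convertible minus car:", comparison)
```

For the single class output the wheel feature dominates, since wheels support any vehicle class. For the comparative query the shared wheel evidence cancels and only the roof feature keeps a non-zero attribution.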

This F query function can easily be changed in TruLens by defining a custom QoI, and supplying this to the InternalInfluence instantiation. The QoI’s __call__ function should return the scalar value to take gradients from.

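from trulens.nn.attribution import InternalInfluence
from trulens.nn.distributions import LinearDoi
from trulens.nn.quantities import QoI
from trulens.nn.slices import InputCut, OutputCut, Slice

class MyCustomQoI(QoI):
    def __init__(self):
        # Add and store whatever arguments the query needs (elided in the original).
        ...

    def __call__(self, y):
        # Return the scalar value to take gradients from, given the output y.
        ...

# Define a custom influence measure (QoI and DoI arguments assumed from the imports above)
infl = InternalInfluence(model,
    Slice(InputCut(), OutputCut()),
    MyCustomQoI(), LinearDoi())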

Customization of Baseline

The last topic of this blog, but probably the most important consideration in IG, is the choice of baseline. In the image domain, the most prevalent baseline in the literature is the empty image. Semantically, it is a very intuitive baseline: the final attribution scores account for the difference between the score of the image itself and the score of a presumably informationless baseline.

Unfortunately, quite a few counterexample images show that this can be a problematic baseline. One example pertains to the sensitivity axiom: if a pixel does not differ from the baseline, that pixel gets zero attribution. In reality an empty image is not informationless; it actually encodes the color black. So if one were to run Integrated Gradients on an image in which the color black is important to the classification, these attributions would be lost. Examples might be an image of a penguin, a panda, or a black car.
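
A minimal sketch of the black-pixel problem, using a made-up four-pixel "image" and a linear score in which darker pixels raise the output. The weights, pixel values, and the gray alternative baseline are illustrative assumptions; because the model is linear, the path integral has a closed form and no numerical integration is needed.

```python
WEIGHTS = [2.0, 2.0, 0.5, 0.5]  # assumed importance of each pixel being dark

def score(pixels):
    """Toy score that rises as pixels get darker (pixel values in [0, 1])."""
    return sum(w * (1.0 - p) for w, p in zip(WEIGHTS, pixels))

def ig_linear(pixels, baseline):
    """IG for the linear score above.

    The gradient with respect to pixel i is the constant -w_i, so the path
    integral collapses to -w_i * (p_i - b_i) = w_i * (b_i - p_i).
    """
    return [w * (b - p) for w, p, b in zip(WEIGHTS, pixels, baseline)]

image = [0.0, 0.0, 0.9, 0.8]        # the first two pixels are black and important
black_baseline = [0.0, 0.0, 0.0, 0.0]
gray_baseline = [0.5, 0.5, 0.5, 0.5]

black_attr = ig_linear(image, black_baseline)
gray_attr = ig_linear(image, gray_baseline)

print("black baseline:", black_attr)
print("gray baseline: ", gray_attr)
```

With the black baseline, the two black pixels receive no attribution even though they carry the largest weights, because they do not differ from the baseline. With the gray baseline they receive the largest attributions. Both sets of attributions still satisfy completeness with respect to their own baseline.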

Another counterexample of undesired behavior due to this baseline is the presence of a watermark. Let’s assume a training set where 99% of images share the same watermark. The watermark would then be present in records of every class. Because the watermark is guaranteed to differ from the baseline, and gradients do not implicitly satisfy sensitivity, the watermark will end up with randomized attributions, since each record would be using it to explain a different outcome. The desired behavior, however, would be no attribution towards the watermark. For this case, a distribution-mode baseline would make more sense.

This highlights a need to understand the dataset distributions when selecting a baseline. There is some research that highlights that a baseline choice that is very far from the class clusters can result in randomized attributions, and that you may want to choose baselines that result in stronger localization of attributions [4]. One might want to utilize a distribution average or mode when selecting a baseline. One recommendation is to select a baseline that has intrinsic meaning, so that the score minus the baseline also retains meaning. In this way, the attributions will be more human-interpretable.

This could mean finding a neutral baseline with 0 score, where the explanation will always explain the model score. Or in the case of boundary attributions [4], the baseline denotes the closest misclassification. This will create an explanation that shows the regions that are most likely to cause the misclassification, or in other words, it highlights the most confusing inputs. This is an open and active area of gradient explanation research.


[1] M. Sundararajan, A. Taly, Q. Yan, Axiomatic Attribution for Deep Networks (2017), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research

[2] A. Datta, S. Sen and Y. Zick, Algorithmic Transparency via Quantitative Input Influence (2016), Proceedings of 37th IEEE Symposium on Security and Privacy

[3] K. Leino, S. Sen, A. Datta, M. Fredrikson and L. Li, Influence-Directed Explanations for Deep Convolutional Networks (2018), Proceedings of the IEEE International Test Conference

[4] Z. Wang, M. Fredrikson and A. Datta, Robust Models Are More Interpretable Because Attributions Look Normal (2021), Preprint.

This article was originally published in Towards Data Science.



Rick Shih

A leader in developing solutions connecting machine learning with production data science teams