Interpretable Machine learning : Part I

Machine learning algorithms have impressed the world with their predictive power, but they often cannot explain their predictions.

Interpretable machine learning aims to answer the following questions:

  1. Why was a particular prediction made, as opposed to others?
  2. When does the model succeed? When should it be trusted?
  3. When does it fail & why? When should the model output be disregarded?

Most existing models are black boxes from an interpretability point of view. But first, why do we need interpretable models?

Need for ML Interpretability

The adoption of ML techniques in newer domains often meets reluctance due to a lack of trust in the models. Be it the data scientists developing the model, the users of the model, or the stakeholders affected by its predictions, all need to understand the broad model behavior and have faith in the model.
The prime motivation behind model interpretability is building trust in models.

Transparency is all the more important in applications involving critical decision making. In fields like medicine, defense, the judiciary and education, models must be strictly answerable for their predictions.
Model interpretability aims to provide accountability to model predictions.

Good evaluation metrics often don’t guarantee good real-world performance. The model might be learning something incorrect due to data leakage or biased training data. For instance, when predicting the performance of articles on a news website, we may choose the click-through rate (CTR) of article links on the home page as the performance indicator / target variable. In this case, the model would learn to predict higher CTRs for clickbait articles, which might be undesirable.
Explanations help in detecting bias, debugging & improving the model, by revealing such faulty model behavior.

Interpretable models help bridge the gap between data scientists and domain experts and enable an effective exchange of data insights & knowledge. Also, when models produce insights instead of plain numbers, they can better contribute to the advancement of science through knowledge transfer across models.

As machine learning is being experimented with, applied & adopted in newer domains, interpretability of ML models has become a hot topic, both in academia & industry.
With the adoption of GDPR by the EU, the concept of a ‘right to explanation’ for machine learning has gained attention throughout the world. Also, the recent fatal accidents involving self-driving cars have raised concerns about AI’s trustworthiness. Studies of adversarial examples capable of fooling models also underscore the need for models to be more robust and reasonable.

Status Quo

If we take a look at the existing models, some are interpretable, like decision trees and linear regression, but they often suffer from lower accuracy. More accurate models, like deep neural networks, tend to be harder to interpret.

Interpretability-Accuracy Tradeoff

We can get the important features for a problem space using feature selection techniques like filter methods (correlation analysis), wrapper methods (searching over feature subset space to optimize model performance) etc. Such techniques do shed light on the real-world data trends, but not on the trends the model has learned.
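As a rough illustration of a filter method, the sketch below ranks features by the absolute Pearson correlation between each feature and the target. The feature names and values are made up for the example, and nothing here inspects a trained model, which is exactly the limitation noted above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def rank_features_by_correlation(features, target):
    """Return (name, |r|) pairs sorted from most to least correlated."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data: three candidate features and a target.
features = {
    "word_count": [100, 200, 300, 400, 500],
    "has_image": [0, 1, 0, 1, 0],
    "constant": [1, 1, 1, 1, 1],
}
target = [1.0, 2.1, 2.9, 4.2, 5.0]

ranking = rank_features_by_correlation(features, target)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The ranking describes the data only. A model trained on the same data may rely on quite different features, for example through interactions that a per-feature correlation cannot see.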

Path to Interpretability

An explanation system can provide explanations mainly in two forms:

  • Relevant features affecting the predictions.
  • Minimal set of relevant training instances critical for the predictions.

Furthermore, the explanations can be either at

  • Global level, providing a high-level view of the model, or at
  • Local level, providing justification for a single prediction.

ML-interpretability approaches can be broadly divided into 3 categories:

  1. White-box explanations
    The white-box approach aims to develop model-specific explanations, leveraging the model’s internal structure & insights.
  2. Black-box explanations
    The black-box approach focuses on model-agnostic explanations. Such explanations apply to all existing models as well as future models.
  3. Out-of-box Interpretable models
    Developing new models which are inherently interpretable.

White-box explanations

Extensive research is being done towards understanding the existing models better, with the interpretability of deep networks in the limelight. The white-box approach to ML interpretability does this by extracting model-specific explanations, leveraging the model’s internal structure & insights.

The rest of this post covers some of the research efforts towards understanding convolutional neural networks (CNNs) the white-box way.
Most approaches to interpreting CNNs are based on the idea of highlighting the relevant image aspects which contributed to a prediction. This is achieved by propagating the output signal back through the network, to identify which parts of the input image the output depends on.

Saliency maps

This technique is based on computing the gradient of the output score with respect to the input image. The generated gradient map (called a saliency map) highlights the relevant input pixels, to which the output score is highly sensitive.
For a CNN which takes an input image I and outputs a score S_c(I) for class c, the saliency map is built from the gradient ∂S_c/∂I evaluated at that image.

Input images & their Saliency maps. Source
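To make the gradient idea concrete, here is a minimal sketch in plain Python. It is not a CNN: `toy_class_score` is a hypothetical stand-in for the class score, the gradient is approximated with finite differences instead of backpropagation, and taking the absolute value of the gradient is a choice made for this example.

```python
def saliency_map(score_fn, image, eps=1e-4):
    """Approximate |dS_c/dI| for every pixel with central finite differences."""
    saliency = []
    for r, row in enumerate(image):
        saliency_row = []
        for c, _ in enumerate(row):
            # Copy the image twice and nudge a single pixel up and down.
            plus = [list(rw) for rw in image]
            minus = [list(rw) for rw in image]
            plus[r][c] += eps
            minus[r][c] -= eps
            grad = (score_fn(plus) - score_fn(minus)) / (2 * eps)
            saliency_row.append(abs(grad))
        saliency.append(saliency_row)
    return saliency

def toy_class_score(image):
    """Hypothetical class score that only looks at the top-left 2x2 patch."""
    patch = image[0][0] + image[0][1] + image[1][0] + image[1][1]
    return patch ** 2

# Hypothetical 3x3 single-channel "image".
image = [
    [0.5, 0.2, 0.9],
    [0.1, 0.4, 0.7],
    [0.3, 0.8, 0.6],
]

sal = saliency_map(toy_class_score, image)
for row in sal:
    print([round(v, 3) for v in row])
```

Only the four pixels the toy score depends on receive non-zero saliency. With a real network, an automatic differentiation framework would compute the same gradient in a single backward pass rather than one pixel at a time.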

Class activation maps

The fully connected (FC) layers in a CNN stand as black boxes between the convolutional layers and the classifier, leading to a loss of the image’s spatial information. This approach replaces the FC layers with a ‘global average pooling (GAP) layer’. GAP averages each of the ‘n’ feature maps of the last convolutional layer, producing an n-sized vector. This GAP output vector is then connected to a single fully connected layer to produce the desired output (scores in the case of classification).


Taking a weighted sum of each spatial unit across the feature maps of the last convolutional layer, with the weights being the fully connected layer weights for a particular class, gives the class activation map (CAM) for that class. This map, when scaled to the input image dimensions, highlights the discriminating image regions for that class.
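The computation can be sketched in a few lines of plain Python. The feature maps and class weights below are made-up numbers, the final layer is assumed to have no bias term, and the upscaling step to the input image size is left out.

```python
def global_average_pool(feature_maps):
    """Average each H x W feature map down to a single number."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]

def class_score(feature_maps, class_weights):
    """Class score: GAP output passed through one linear layer (no bias)."""
    pooled = global_average_pool(feature_maps)
    return sum(w * p for w, p in zip(class_weights, pooled))

def class_activation_map(feature_maps, class_weights):
    """Weighted sum of the feature maps at every spatial location."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            sum(w * fmap[y][x] for w, fmap in zip(class_weights, feature_maps))
            for x in range(width)
        ]
        for y in range(height)
    ]

# Hypothetical output of the last convolutional layer: n = 3 maps of size 2x2.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
# Hypothetical final-layer weights for one class, one weight per feature map.
class_weights = [3.0, -1.0, 0.5]

cam = class_activation_map(feature_maps, class_weights)
score = class_score(feature_maps, class_weights)
print(cam)
print(score)
```

Because averaging and weighting are both linear, the class score equals the mean of the CAM over all spatial locations. This is why the map can be read as a per-location breakdown of the score.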


Text-based explanation

Some researchers focus their efforts on text-based justification of predictions for image classification. Such explanations are required when the nuances of the classes cannot be captured visually.

For example, in image classification of bird species, the ideal explanation should comprise a combination of discriminative features like bird size, color pattern of body parts, and beak shape & size. Such features cannot be conveyed by highlighting image regions alone.

A model that provides a prediction as well as a text explanation, similar to visual-description models, was proposed in the paper Generating Visual Explanations. An image explanation should not only describe the image instance accurately, but also focus on class-discriminative aspects. This is achieved by minimizing both an image-description loss & a class-discriminative loss.


The proposed model is trained on a dataset containing images with category and description for each image. It consists of 3 components:

  1. Fine-grained Image classifier: Convolutional neural network pre-trained to predict category of the image. It is also used to extract strong image features.
  2. Sentence generator: Recurrent network which is trained to generate image explanation, based on the extracted deep features and predicted image category.
  3. Sentence classifier: LSTM-based classifier pre-trained to predict category based on image-description.

The sentence-generator is trained to minimize a novel combined loss, consisting of:

  1. Description loss: Based on likelihood of generating the ground-truth description.
  2. Discriminative loss: Based on likelihood of predicting the correct class from the generated sentence using sentence classifier.
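The two terms above can be sketched as a single objective. This is only an illustration of the idea: the additive form, the `weight` parameter, and the use of negative log-likelihood for both terms are assumptions made here, and the probabilities are hand-picked numbers in place of real model outputs.

```python
import math

def description_loss(token_probs):
    """Mean negative log-likelihood of the ground-truth description tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def discriminative_loss(class_prob_from_sentence):
    """Negative log-probability that the sentence classifier assigns to the
    correct class when it reads the generated sentence."""
    return -math.log(class_prob_from_sentence)

def combined_loss(token_probs, class_prob_from_sentence, weight=1.0):
    """Description term plus a weighted discriminative term (assumed form)."""
    return (description_loss(token_probs)
            + weight * discriminative_loss(class_prob_from_sentence))

# Hypothetical values: probabilities the generator gave to each ground-truth
# token, and the sentence classifier's probability for the correct class.
token_probs = [0.5, 0.25]
generic_sentence = combined_loss(token_probs, class_prob_from_sentence=0.5)
specific_sentence = combined_loss(token_probs, class_prob_from_sentence=0.9)

print(generic_sentence, specific_sentence)
```

With the description term held fixed, a sentence that lets the classifier recover the correct class more confidently gets a lower combined loss, which is what pushes the generator towards class-discriminative wording.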

Apart from the above, several other methods have been proposed to demystify neural networks, like layer-wise relevance propagation, guided backpropagation and Grad-CAM. The white-box approach also covers efforts to explain other models, like tree ensembles and gradient boosting models. This field is evolving rapidly, owing to the growing need to build trust and confidence in models.

The next post Interpretable Machine learning : Part II covers the black-box explanation and out-of-box explanation approaches in detail.


