Explainability techniques for the Cloud — Looking inside deep learning models produced on Azure’s Custom Vision

Sindre Eik de Lange
Published in Compendium · Jul 2, 2020

Introduction

This blog post covers a specific use case that my colleagues and I worked on for Elvia AS, an electric utility company located in Oslo, Norway. It entails exporting a trained convolutional neural network (CNN) model from Azure’s Custom Vision platform and using various explainability techniques to better evaluate the quality of our black-box model.

We will focus on the exported model and on different tools for deriving explanations of its predictions. However, the lessons learned are relevant for anybody who has a trained (frozen) TensorFlow graph (.pb) or an ONNX model on their hands and would like to run some explainability experiments, because these are essentially the formats Custom Vision offers for exporting trained models.

For those interested in learning more about the different techniques, please see the References section; there is truly a myriad of sources on this fascinating topic.

Use Case

Computas was hired by Elvia to create machine learning software that will eventually become a plugin in a larger data handling system, with the intent of investigating whether the data stored in their database can be enriched with additional metadata. Furthermore, the software should be able to identify deviations in the data, i.e. flag images showing various kinds of damage. The initial use case focused on classifying utility poles by their material: tree (wood), concrete, or steel. Here we can see some examples of such poles:

From left to right: tree, concrete, and steel poles.

To solve this problem, we used the Custom Vision service on Microsoft’s Azure platform because of its user-friendly GUI with drag-and-drop training, testing, and classification, and its relatively easy-to-use SDK offering the same functionality.

The training set contained about 300 images of each pole type, with variations in how much of the pole was present in the image, the background, the size, etc. On these images we applied data augmentation, flipping them horizontally and vertically and resizing them, which expanded the dataset to 798, 903, and 918 images of concrete, steel, and tree poles, respectively, each 256×256 pixels.
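For readers who want to reproduce a similar setup, here is a minimal sketch of that kind of augmentation using Pillow. The directory layout, file extension, and the exact pipeline we used are assumptions made for illustration.

```python
# Hedged sketch: horizontal/vertical flips plus resizing to 256x256 with Pillow.
from pathlib import Path
from PIL import Image, ImageOps

SIZE = (256, 256)

def augment(src_dir: str, dst_dir: str) -> None:
    """Write a resized original, a horizontal flip, and a vertical flip per image."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.jpg"):
        img = Image.open(path).convert("RGB").resize(SIZE)
        variants = {
            "orig": img,
            "hflip": ImageOps.mirror(img),  # flip left-right
            "vflip": ImageOps.flip(img),    # flip top-bottom
        }
        for name, variant in variants.items():
            variant.save(dst / f"{path.stem}_{name}.jpg")

# Example (hypothetical paths): augment("raw/concrete", "augmented/concrete")
```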

After uploading the images and training a model on the Custom Vision platform using the “Quick Training” option, we got the following results:

Results from training an ML model on Azure’s Custom Vision service using 800–900 images of 3 different types of utility poles.

The problem

The results seen above give the impression that the model performs relatively well, and we even created a separate test set comprising about 50 images per label (also augmented and resized), which backed up the results.

However, in today’s machine learning community we know that a model scoring well on various metrics is not necessarily a “good” model. One example of this is the classic wolf vs. dog experiment, where the model performed well on the original dataset, but closer examination made it evident that the model used the background, not the wolves or dogs, to decide on the classification. More specifically, if the background contained snow it classified the image as “wolf”, and if it did not contain snow, it classified the image as “dog”.

With this and other experiments showing the need to verify trained neural network models, we were curious to see what our model based its predictions on: was it looking at the poles, or had it instead picked up some correlation between the backgrounds and the labels? Or maybe something in between?

The solution

When you want to “look inside” a black-box model, there are mainly two ways to go about it, each with its own set of techniques:

Internal: here we depend on access to the model’s internal weights, more specifically the weights and activations around the last convolutional layer, to calculate the importance of each region during inference for a specific image. As far as we know, the most used techniques here are “heatmaps” or Class Activation Maps (CAM), with many variations such as Grad-CAM, Guided Grad-CAM, etc. The aim is to turn these importance calculations into a grid of pixel importance that is overlaid on the classified image, resulting in a “heatmap” as shown in this image:

Example of heatmaps or class activation maps from “Grad-CAM implementation in Pytorch”.

Perturbations: these techniques depend on multiple, often hundreds or thousands of, passes through the model’s inference function, compared to the single pass needed by the internal techniques. The idea is to feed the image to be classified through the model many times with small perturbations, so that we can identify which parts of the image matter most for the model assigning a specific label. The most popular techniques here are LIME and Anchoring, which were developed by essentially the same team in 2016 and 2018, respectively.

Great visualization of the LIME technique from “Knowing What and Why? — Explaining Image Classifier Predictions”.

Generally, techniques utilizing a model’s internal structure during inference give a better impression of which regions the model used to predict the label it did. However, given a frozen TensorFlow graph (.pb) or an ONNX model, which is essentially what the Custom Vision service offers, this proved difficult to implement because we lacked easy access to the model’s internal values. This is why we will talk about both groups but focus on the latter, perturbations, because this is what we were able to use with the model produced by the Custom Vision service.
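Before going through the individual techniques, here is a minimal sketch of how an exported model can be wrapped as a batch prediction function, which is all the perturbation techniques below need. It assumes the ONNX export and onnxruntime; the input name, layout, and any Custom Vision-specific preprocessing are assumptions that must be checked against your own export (some exports also only accept a batch size of one, in which case you would loop over the images).

```python
# Hedged sketch: wrap an exported ONNX model as a batch prediction function.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")   # path to the exported model
input_name = session.get_inputs()[0].name

def predict_fn(images: np.ndarray) -> np.ndarray:
    """Map a batch of RGB images (N, 256, 256, 3) to class probabilities."""
    batch = images.astype(np.float32)
    # Many exported vision models expect channels-first input (N, 3, H, W);
    # adjust or remove this transpose depending on your export.
    batch = batch.transpose(0, 3, 1, 2)
    outputs = session.run(None, {input_name: batch})
    return np.asarray(outputs[0])
```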

Class Activation Maps (Heatmaps)

As mentioned, CAM is a technique that depends on access to a model’s internal weights and the values calculated during inference; the image is run through the model a single time, and the heatmap is created from that pass.

This post will not go into the details of the math behind the technique. A simplified explanation is that inference runs as normal up to the last convolutional layer. There, instead of the standard global average pooling reducing each feature map to a single value that is multiplied with its corresponding class weight, the feature maps themselves are weighted by those class weights and summed. The result is a grid of pixel importance, the heatmap, which can be “laid over” the classified image.
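For completeness, here is a rough sketch of that calculation, assuming you do have internal access, for example to a Keras model whose last convolutional layer is followed directly by global average pooling and a single dense softmax layer (the layer name and input handling are assumptions).

```python
# Hedged CAM sketch for a Keras model with conv -> GlobalAveragePooling2D -> Dense.
import numpy as np
import tensorflow as tf

def class_activation_map(model, image, class_idx, last_conv_name="last_conv"):
    # Sub-model returning both the last conv feature maps and the prediction.
    last_conv = model.get_layer(last_conv_name)
    cam_model = tf.keras.Model(model.inputs, [last_conv.output, model.output])

    feature_maps, preds = cam_model(image[np.newaxis, ...].astype(np.float32))
    feature_maps = feature_maps[0].numpy()              # (h, w, channels)

    # Kernel of the final dense layer: shape (channels, n_classes).
    class_weights = model.layers[-1].get_weights()[0]

    # Weighted sum of the feature maps for the chosen class -> coarse heatmap.
    cam = feature_maps @ class_weights[:, class_idx]    # (h, w)
    cam = np.maximum(cam, 0)
    cam /= cam.max() + 1e-8                             # normalise to [0, 1]

    # Upsample to the input resolution so it can be overlaid on the image.
    cam = tf.image.resize(cam[..., np.newaxis], image.shape[:2]).numpy()[..., 0]
    return cam, preds.numpy()[0]
```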

LIME

LIME stands for “Local Interpretable Model-agnostic Explanations” and is a technique presented in the 2016 paper “Why Should I Trust You?”. As the name suggests, the technique is model-agnostic, meaning it works on different types of models and hence on different types of data, such as tabular data, text, and images.

Here we can see an example of the technique, produced with an existing Python library, applied to an image from our dataset:

From left to right: Input image comprising a steel pole, positive pixels for predicting the label “steel pole” highlighted while the rest is turned off, positive pixels with a green overlay and negative pixels with a red overlay.

The technique works as follows: given an input image and a specific label for which to calculate pixel importance, the program splits the image into regions of similar pixels, called “superpixels”, and runs the image through many inference iterations with different superpixels switched on and off in each iteration. From this, the program finds the most important superpixels for that one specific label. It is important to note that the label does not necessarily need to be the label the model would assign to the image, which differs from the heatmap technique.
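A minimal sketch of how this looks with the lime package (assuming the predict_fn wrapper sketched earlier, with parameter values chosen for illustration):

```python
# Hedged sketch: explain a single image with LIME.
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image.astype(np.double),  # the 256x256 RGB image to explain
    predict_fn,               # batch prediction function (see the ONNX wrapper above)
    top_labels=3,             # keep explanations for the three pole classes
    hide_color=0,             # colour used for "switched off" superpixels
    num_samples=1000,         # number of perturbed images run through the model
)

# Show the superpixels that contributed positively to the top label.
label = explanation.top_labels[0]
temp, mask = explanation.get_image_and_mask(
    label, positive_only=True, num_features=5, hide_rest=True
)
highlighted = mark_boundaries(temp / 255.0, mask)
```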

More specifically, LIME fits a local linear model over the superpixels: it observes how switching superpixels on and off changes the model’s prediction, weights each perturbed sample by how close it is to the original image, and trains the linear model to match the black-box model’s output on those samples. These local models are called “surrogate models”, and their coefficients tell us how much each superpixel contributes to the chosen label. Determining the optimal splitting into superpixels is a hard problem that can potentially result in explanations that do not give a good representation of what is actually happening.
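To make the idea concrete, here is a conceptual sketch of that surrogate fitting, not the library’s actual implementation and with a simplified proximity measure, assuming `segments` is a superpixel map (e.g. from skimage) and `predict_fn` is the wrapper above:

```python
# Conceptual LIME-style surrogate: sample superpixel masks, query the model,
# weight samples by proximity, and fit a weighted linear model.
import numpy as np
from sklearn.linear_model import Ridge

def lime_surrogate(image, segments, predict_fn, class_idx,
                   num_samples=1000, kernel_width=0.25, hide_color=0):
    n_superpixels = segments.max() + 1
    rng = np.random.default_rng(0)

    # 1. Random on/off masks over the superpixels (row 0 keeps everything on).
    masks = rng.integers(0, 2, size=(num_samples, n_superpixels))
    masks[0, :] = 1

    # 2. Build the perturbed images and query the black-box model.
    perturbed = []
    for mask in masks:
        img = image.copy()
        img[~mask[segments].astype(bool)] = hide_color   # switch superpixels off
        perturbed.append(img)
    preds = predict_fn(np.stack(perturbed))[:, class_idx]

    # 3. Weight each sample by similarity to the original (simplified measure).
    distances = 1.0 - masks.mean(axis=1)                 # fraction of superpixels removed
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)

    # 4. Fit the weighted linear surrogate; coefficients = superpixel importances.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(masks, preds, sample_weight=weights)
    return surrogate.coef_
```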

This leads to hyperparameter tuning for each specific use case, which can cause the explainability function to be overfitted to the problem and to lack stability and reproducibility. One technique we found to mitigate this was to run LIME multiple times with different values for the random seed. An example of this can be seen under “Experiments”.

This is not a thorough explanation of the intricacies behind the technique; for that, please see the deep dive linked in the References section.

We should also mention that the LIME technique does not hold up well against adversarial attacks: noise added to an image specifically to make a machine learning model misbehave, e.g. assign the wrong label to an image or wrongly claim that a certain group of superpixels is important for some label, while being impossible for humans to notice.

Anchoring

Like LIME, Anchoring is a model-agnostic perturbation technique: it depends on multiple iterations of inference and can be used with various types of models, and therefore with different types of data.

As mentioned above, Anchoring was developed by the same people that developed LIME and can be viewed as an improved version of it. Anchoring is based on larger regions of the image than LIME, and instead of running a fixed number of inference iterations with small variations in the input, it uses reinforcement learning, more specifically a multi-armed bandit formulation combined with graph search algorithms, to find the optimal local regions, thereby reducing the number of iterations needed. However, this comes at the price of even more hyperparameter tuning than LIME.

Here we can see an example of the technique, using an existing Python library, applied to the same image as in the LIME example. It shows that the two techniques mostly agree on which region is important for predicting the same label (though Anchoring uses a larger region).

An example of the explainability technique called Anchoring applied to an image of a steel pole.
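As a hedged sketch of what such a run can look like, here is one possible setup using the alibi library’s AnchorImage explainer (this is not necessarily the library used for the figure above, and the segmentation settings are illustrative assumptions):

```python
# Hedged Anchoring sketch using alibi's AnchorImage.
from alibi.explainers import AnchorImage

explainer = AnchorImage(
    predict_fn,                     # same batch prediction function as before
    image_shape=(256, 256, 3),
    segmentation_fn="slic",         # superpixel segmentation algorithm
    segmentation_kwargs={"n_segments": 15, "compactness": 20},
)

explanation = explainer.explain(
    image,            # the image to explain
    threshold=0.95,   # minimum precision the anchor must reach
    p_sample=0.5,     # probability of perturbing a superpixel in each sample
)

anchor_region = explanation.anchor   # image showing only the anchoring superpixels
```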

Experiments

Now we would like to show some interesting findings from our experiments with the LIME technique.

Noise

An example of the explainability technique called LIME applied to an image of a concrete pole, with the model’s confidence values for each of the labels. Please note the translation: BETONG = concrete, STAAL = steel, and TRE = tree.

In this image of a concrete pole, we can see that the model is fairly certain that it is indeed a concrete pole. Furthermore, we can see that a large portion of the pole, though not all of it, has a green overlay marking the pixels that count positively towards predicting “concrete pole”.

An example of the explainability technique called LIME applied to an image of a concrete pole with assumed noise removed.

If we remove the “noise”, that is, the regions with a red overlay marking the pixels with a negative impact on predicting “concrete pole”, a larger portion of the pole ends up having a positive impact on the “concrete pole” classification.

So it seems like the pole itself was important from the start, but there was some noise in the image that the model was unable to disregard.
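For reference, one way to approximate this “noise removal” step programmatically is sketched below, reusing the LIME explanation and predict_fn from earlier; the grey fill value is an arbitrary assumption.

```python
# Hedged sketch: blank out the negative superpixels and classify again.
import numpy as np

label = explanation.top_labels[0]
segments = explanation.segments                 # (H, W) map of superpixel ids
weights = dict(explanation.local_exp[label])    # superpixel id -> LIME weight

# Pixels belonging to superpixels with a negative contribution to the label.
negative_ids = [sp for sp, w in weights.items() if w < 0]
negative_mask = np.isin(segments, negative_ids)

# "Remove the noise" by greying out those regions, then run inference again.
cleaned = image.copy()
cleaned[negative_mask] = 127
new_probs = predict_fn(cleaned[np.newaxis, ...])[0]
```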

Increased prediction certainty

An example of the explainability technique called LIME applied to the previously seen image of a steel pole with its confidence values for the different labels.

If we input the steel pole we saw earlier, notice the confidence values: only 73% certainty that there is a steel pole (STAAL) in the image and 26% that it is a tree pole (TRE). If we then do a similar experiment and remove the majority of the red pixels, the model becomes significantly more certain that there is indeed a steel (STAAL) pole in the image, and a larger portion of the pole is deemed positive towards predicting “steel pole”. This indicates that the LIME explainability function works and that its surrogate models can emulate the neural network model.

An example of the explainability technique called LIME applied to an image of a steel pole where the model’s certainty is increased after removal of assumed noise.

Stability

As mentioned before, there is a question about the technique’s stability because of the number of hyperparameters and because LIME is mainly based on small pixel regions. So how does the LIME technique perform with different random seed values?

A utility pole made of concrete from the dataset.

We ran the same image through the technique 9 times with 9 different random seeds, starting at 1 and increasing the value by 10 for each iteration. We could have introduced more variation, but we agreed that the experiment showcased the technique’s stability.

An image of a concrete pole ran through the LIME technique using different random seeds for testing the technique’s stability.
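A sketch of how such a stability check can be scripted with the lime package (seed values matching the experiment above; the other parameters are illustrative):

```python
# Hedged sketch: run LIME with different random seeds and compare the results.
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

results = []
for seed in range(1, 90, 10):                       # seeds 1, 11, 21, ..., 81
    explainer = lime_image.LimeImageExplainer(random_state=seed)
    explanation = explainer.explain_instance(
        image.astype(np.double),
        predict_fn,
        top_labels=3,
        hide_color=0,
        num_samples=1000,
        random_seed=seed,                           # also seeds the superpixel segmentation
    )
    label = explanation.top_labels[0]
    temp, mask = explanation.get_image_and_mask(
        label, positive_only=True, num_features=5, hide_rest=False
    )
    results.append(mark_boundaries(temp / 255.0, mask))
```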

Conclusion

To conclude this blog post we would like to repeat some of the most important points:

  • We showed that it is possible to use explainability techniques such as LIME and Anchoring to get insight into the inner workings of a (black box) CNN model trained on Azure’s Custom Vision platform.
  • These two techniques are based on multiple iterations of inference with small changes to the input, which is different from techniques such as class activation maps, which depend on access to the model’s internal parameters.
  • LIME and Anchoring give somewhat similar results, which we deemed stable based on experiments with different random seeds. These results make us fairly confident that our model is not basing its predictions on the object alone but also takes the background into account, which brings the wolf vs. dog problem to mind.
  • It is important to note that these techniques do not necessarily identify a solution to the problems they uncover, such as too much importance being assigned to the background. We therefore still depend on domain experts to come up with possible solutions based on the output of the techniques.

Lastly, we have been in contact with Microsoft and searched online for similar solutions, with no luck. Hopefully, this blog post can help people and companies in the same position we were in, and put them in a position to succeed when they need to explain what their CNN models base their predictions on.

References

Wolf vs. Dog: https://www.researchgate.net/figure/A-husky-on-the-left-is-confused-with-a-wolf-because-the-pixels-on-the-right_fig1_329277474

Class Activation Maps: https://arxiv.org/abs/1512.04150

Grad CAM: https://medium.com/analytics-vidhya/visualizing-activation-heatmaps-using-tensorflow-5bdba018f759

Guided Grad CAM: https://medium.com/@ninads79shukla/gradcam-73a752d368be

PyTorch Grad CAM: https://github.com/jacobgil/pytorch-grad-cam

Deep dive into explanations techniques such as LIME: https://towardsdatascience.com/knowing-what-and-why-explaining-image-classifier-predictions-680a15043bad

“Why Should I Trust You?”: https://arxiv.org/pdf/1602.04938.pdf

LIME library for Python: https://lime-ml.readthedocs.io/en/latest/lime.html#module-lime.lime_image

Adversarial attacks against LIME and SHAP: https://arxiv.org/abs/1911.02508

Anchoring library for Python: https://lime-ml.readthedocs.io/en/latest/lime.html#module-lime.lime_image
