Image Detection with AI Explainability Project

Shakshi Sharma
May 24, 2022



Team Members: Muruganantham Jaisankar and Shakshi Sharma

Supervisor: Marharyta Domnich

Motivation

The goal of this study is to investigate different explainability techniques for black-box neural network models on two image datasets. The first dataset is used for image classification, and the second for object detection. The two tasks were chosen to compare how explainability behaves in each scenario.

Introduction

To follow along, keep three steps in mind: first, decide on the dataset; second, apply deep learning models to classify the images; and third, apply various AI explainability approaches to the best-trained model. In addition, we perform explainability on the detection task using the COCO dataset. Finally, we compare the explainability of the image classification and object detection tasks.

First, we choose two different datasets:

  1. Fake News (Misinformation) Indian dataset.
  2. COCO dataset with YOLOv5 and SHAP.

Let's discuss the datasets and their methodology one by one.

  1. Fake news dataset: The images were collected from WhatsApp public groups and annotated in this paper. In all, the Indian dataset contains 740 instances of fake news (misinformation) and 771 instances of true news (non-misinformation).

An example of a fake image in the dataset: an image of a 2 Rs. note circulated on social media with the claim, written in Hindi (one of the Indian languages), that the note is new and is being introduced for the first time, exclusively in this group. In reality, this form of two-rupee note never appeared on the market.

Figure 1: Image from the Fake news dataset.

The problem we would like to address is to train or fine-tune deep learning-based models to determine whether a particular image is fake or not. Before that, preprocessing of the dataset is essential for training such models, which we shall cover next.

The code is available in the GitHub repo.

Dataset preprocessing

Preparing data for machine learning and artificial intelligence approaches is a crucial step. We apply simple preprocessing steps to the image dataset; a short sketch follows the list below. To learn more about image processing, look here. The following preprocessing steps were performed on the fake news data:

  1. Rescale: images are rescaled to 32 × 32 × 3.
  2. Normalize: the mean image is subtracted from each image.
  3. Data split: the data is split into train, validation, and test sets in an 80:10:10 ratio.
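A minimal sketch of these three steps, assuming the images are given as file paths with matching labels (names and shapes are illustrative, not the exact code from our repo):

```python
import numpy as np
from PIL import Image
from sklearn.model_selection import train_test_split

def preprocess(image_paths, labels, size=(32, 32)):
    # 1. Rescale: resize every image to 32 x 32 x 3
    images = np.stack([
        np.asarray(Image.open(p).convert("RGB").resize(size), dtype=np.float32)
        for p in image_paths
    ])

    # 2. Normalize: subtract the mean image computed over the whole dataset
    mean_image = images.mean(axis=0)
    images -= mean_image

    # 3. Split: 80% train, 10% validation, 10% test
    x_train, x_rest, y_train, y_rest = train_test_split(
        images, labels, test_size=0.2, random_state=42, stratify=labels)
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```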

As the images contain text, we attempted to extract the text from them using the Python library pytesseract. However, since this data is from India, the majority of the images are in regional or national languages. As a result, translating the raw text into English and then converting it into the required feature vectors did not appear to be a sensible idea, so we dropped this approach.
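For reference, the extraction attempt looked roughly like the following sketch, assuming Tesseract and the pytesseract wrapper are installed with the Hindi language data available (the file name is illustrative):

```python
from PIL import Image
import pytesseract

# Extract text from an image; lang="hin+eng" assumes the Hindi traineddata is installed
text = pytesseract.image_to_string(Image.open("fake_note.jpg"), lang="hin+eng")
print(text)
```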

Model Training

Next, we use deep learning models, in particular Convolutional Neural Networks (CNNs), for the classification task. We then apply explainability approaches to the best-performing model.

Let us consider the following three models:

  1. Two-layer neural network (Baseline model)
  2. CNN — pre-trained on the CIFAR dataset
  3. CNN — trained on our dataset

Later, we also try different flavors of the CNN model by tweaking the hyperparameters. Let's start discussing the models one by one.

  1. Two-layer neural network: We use two fully connected layers. The input images are converted to grayscale (that is, one channel) and fed to the model. The output layer consists of two neurons for classifying fake and true images.
  2. Pre-trained CNN: This CNN model is already trained on the CIFAR dataset. It has two convolutional layers and one max-pooling layer, followed by three fully connected layers. The input to the model is a color image (that is, each image has three channels). The output layer has two neurons.
  3. CNN: We also train CNN models from scratch on our own dataset. We use two variants, CNN_v1 and CNN_v2, which differ mainly in model size: CNN_v1 has one convolutional layer and CNN_v2 has two. CNN_v2 also uses regularization techniques such as dropout and batch normalization (a rough sketch of CNN_v2 follows Figure 2).

The detailed model layers are in Table 1.

Figure 2: Deep Learning architecture (We compiled this picture).
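As a rough illustration, CNN_v2 could be written in PyTorch along the following lines; the filter counts, kernel sizes, and dropout rate here are placeholders, and the actual configuration is the one reported in Figure 2 and Table 1:

```python
import torch.nn as nn

class CNNv2(nn.Module):
    """Two convolutional blocks with batch normalization and dropout,
    followed by fully connected layers for the fake/true classification."""
    def __init__(self, num_classes=2, dropout=0.25):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # 32x32x3 -> 32x32x32
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 16x16x32
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # -> 16x16x64
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 8x8x64
            nn.Dropout(dropout),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),                  # fake vs. true
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```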

Hyperparameter Tuning

To improve the models’ accuracy, we perform hyperparameter tuning. The details are shown in Table 1 below. The best-performing model on the fake news data is CNN_v2, with 60% accuracy. One likely explanation for the model’s failure to exceed 60% accuracy is that the majority of the images contain text, which may be overlooked or misunderstood by the model. In addition, these texts are in different regional languages.

Table 1: Models’ Hyperparameter tuning. * indicates the best hyperparameter value. (We compiled this table).
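For reference, the tuning amounts to a simple search over candidate values with selection on the validation split. A minimal sketch, where `train_model` and `evaluate` are hypothetical helpers and the grid values are placeholders (the actual candidates are the ones listed in Table 1):

```python
from itertools import product

# Illustrative grid; the real candidate values are those in Table 1
grid = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "dropout": [0.25, 0.5],
    "batch_size": [32, 64],
}

best_acc, best_config = 0.0, None
for lr, dropout, batch_size in product(*grid.values()):
    # train_model and evaluate are hypothetical helpers for this sketch
    model = train_model(x_train, y_train, lr=lr, dropout=dropout, batch_size=batch_size)
    acc = evaluate(model, x_val, y_val)
    if acc > best_acc:
        best_acc, best_config = acc, (lr, dropout, batch_size)
```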

Explainability of image classification

Neural network models are inherently black-box in nature. As a result, it is essential to understand how a model makes its decisions so that automated models can be trusted. Various explainability approaches are used to interpret or explain such models. In basic terms, explainability shows the pixels (or words) that are important for a prediction: it explains each prediction by crediting each piece of input information according to how it affects the prediction (positively or negatively), producing importance scores for the inputs. To learn more, click here.

To explain the black-box model, we use the Captum library and apply the explainability approaches to the best-performing model. Integrated Gradients, DeepLift, and Saliency Maps are the three primary attribution methods we evaluate. What these approaches have in common is that they use gradients to calculate the importance of the input features.

Integrated Gradients uses a baseline image (usually a “black image” that represents the absence of features in the input image). It accumulates the gradients over images interpolated between the baseline and the current input to determine the importance scores for the input. Second, DeepLift is another baseline-based approach that produces importance scores by propagating scores back from the model’s output to the input. It employs a multiplier-based technique to assign responsibility for output differences to specific neurons. Finally, Saliency Maps is the oldest and simplest of the three: it calculates the gradient of the loss function for the class of interest with respect to the input pixels.
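A minimal Captum sketch of the three methods, assuming a trained PyTorch classifier `model`, a preprocessed input tensor `input_img` of shape (1, 3, 32, 32), and the predicted class index `pred_class` (these names are illustrative, not the exact code from our repo):

```python
import torch
from captum.attr import IntegratedGradients, DeepLift, Saliency

model.eval()
input_img.requires_grad_()

# Integrated Gradients: accumulate gradients along the path from a black baseline to the input
ig = IntegratedGradients(model)
attr_ig = ig.attribute(input_img, baselines=torch.zeros_like(input_img), target=pred_class)

# DeepLift: propagate importance scores back from the output to the input against the same baseline
dl = DeepLift(model)
attr_dl = dl.attribute(input_img, baselines=torch.zeros_like(input_img), target=pred_class)

# Saliency: gradient of the target class score with respect to the input pixels
sal = Saliency(model)
attr_sal = sal.attribute(input_img, target=pred_class)
```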

Now, let’s look at a fake news image and the results of the different explainability approaches applied to our CNN_v2 model.

The image in Figure 3 has been altered, as can be seen in the original image (top left): the child’s face has been replaced with the face of a politician, implying that the image is fake. The gradient-based explanations in the figure reflect the features the trained model considers important in detecting this image as fake. The gradient images vividly highlight the child’s face while also emphasizing other parts of the image, indicating that the model has been properly trained. Notably, all the explainability approaches assign importance to the same parts of the image, implying that all of them are a good fit for our problem.

Figure 3: Different Explainability Methods on Fake Image (We compiled this picture).

Limitations/Future work

  1. This work could be extended using other deep learning architectures, such as RNNs, or by fine-tuning transformer models.
  2. The extracted texts from the images could be used to classify the images.

Next, we discuss how explainability can be applied to object detection tasks.

Object Detection

About COCO

The COCO dataset is the most popular publicly available dataset for object detection, segmentation, and captioning, and it has several features such as superpixel stuff segmentation and 80 object classes. COCO stands for Common Objects in Context; it was created to advance computer vision by providing data to train, test, and refine object detection models.

Figure 4: Keypoint detection (a feature of the COCO dataset)

About YOLO

You Only Look Once (YOLO) is an algorithm that uses convolutional neural networks to detect objects. It is popular in many real-time applications because of its speed and accuracy.

YOLO workflow

  1. Residual blocks
  2. Bounding box regression
  3. Intersection over Union

Residual blocks

YOLO divides the image into grid cells, as shown in Figure 5, forming an S × S grid.

Figure 5: Residual blocks

If any object appears in a grid cell, then the grid cell is responsible for detecting the object.

Bounding box regression

YOLO uses a bounding box to detect an object; the box is described by its center point (bx, by) and its width and height (bw, bh), and the output also represents the object class that matches the bounding box. Figure 6 shows how object detection works with the bounding box. In short, the bounding box comes with a probability of an object’s presence, together with probabilities over the object classes (person, car, etc.).

Figure 6: Bounding box
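To make the representation concrete, the prediction for one grid cell can be thought of as a vector like the following simplified illustration with only three classes (the numbers are made up):

```python
# [pc, bx, by, bw, bh, p_person, p_car, p_dog]
#  pc        : confidence that the cell contains an object
#  bx, by    : center of the box, relative to the grid cell
#  bw, bh    : width and height of the box, relative to the image
#  p_<class> : class probabilities given that an object is present
prediction = [0.92, 0.48, 0.51, 0.30, 0.65, 0.88, 0.07, 0.05]
```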

Intersection over Union (IoU)

Intersection over union is a metric used to evaluate the accuracy of an object detector on a particular dataset. A high IoU value shows that the model has detected the objects correctly.

To apply IoU to an object detection model, we need:

  1. Ground-truth bounding box: a hand-labeled box from the test dataset. Training an object detection model requires a dataset split into training and test sets, and each image has its own hand-labeled bounding box for an object.
  2. Predicted bounding box: the bounding box predicted by the model.

Figure 7 shows the two bounding boxes: the white box is the ground-truth bounding box (it covers only the person class), and the green box is the predicted bounding box (which includes more than the person object).

Figure 7: IoU

Due to the varying parameters of a model (feature extraction, sliding window size, etc.), getting a perfect match between the predicted bounding box and the ground-truth bounding box is simply not an easy task for object detection models.

Figure 8 shows example matches and their goodness. If IoU > 0.5, the predicted bounding box is considered good; if IoU = 1, it is perfect. The YOLO model tries to ensure that the IoU is good.

Figure 8: IoU goodness
Figure 9: Intersection Over Union (Area overlap divided by Area of Union)
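The metric itself is easy to compute. A minimal sketch for two axis-aligned boxes given as (x1, y1, x2, y2) corners:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Example: a prediction shifted slightly from the ground truth
print(iou((10, 10, 50, 50), (15, 12, 55, 52)))  # ~0.71, a "good" match (> 0.5)
```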

YOLO compares each predicted bounding box and its confidence score with the other predicted bounding boxes. If two bounding boxes overlap, the model keeps only one of them by discarding the box with the lower confidence score. If the two bounding boxes do not overlap, the model treats them as two different objects.

In addition, YOLO discards all bounding boxes that have a low probability for an object class (this is called non-maximum suppression) by using Intersection over Union.

Steps involved in Non-maximum Suppression

  1. Select the bounding box that has the highest confidence score for an object (check against the probability map).
  2. Check the overlap (Intersection over Union) of this box with the other bounding boxes.
  3. Reject/suppress the bounding boxes with IoU > 0.5.
  4. Move on to the next bounding box with the highest confidence score.
  5. Repeat steps 2 to 4.

Non-maximum suppression is used to detect objects only once.
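A minimal sketch of these steps, reusing the `iou` helper sketched above; each detection is assumed to be a ((x1, y1, x2, y2), confidence) pair for a single class, and the 0.5 threshold follows the steps listed:

```python
def non_max_suppression(detections, iou_threshold=0.5):
    """detections: list of ((x1, y1, x2, y2), confidence) pairs for one class."""
    # Step 1: consider boxes in order of decreasing confidence
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while detections:
        best = detections.pop(0)          # highest-confidence remaining box
        kept.append(best)
        # Steps 2-3: suppress boxes that overlap the kept box too much
        detections = [d for d in detections if iou(best[0], d[0]) <= iou_threshold]
        # Steps 4-5: the loop continues with the next highest-confidence box
    return kept
```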

To make it clear, see the below figures.

Figure 10: Yolo model detection as a regression problem

As Figure 10 shows, the image is first divided into grid cells, then bounding boxes are predicted, and each object is detected using IoU with non-maximum suppression.

Note: Have you noticed why YOLO is named “You Only Look Once”? If you follow the steps above, you can see that the model looks at the image only once and detects all the objects without revisiting it.

The architecture of the YOLO model

Figure 11: YOLO Unified, Real-time object detection [6]

YOLO consists of 24 convolutional layers followed by two fully connected layers, with the layers grouped by functionality [6]. The first 20 convolutional layers, followed by an average pooling layer and a fully connected layer, are pretrained on the ImageNet 1000-class classification dataset at half the detection resolution (224 x 224 x 3). The last four convolutional layers, followed by two fully connected layers, are then added to train the network for object detection.

For object detection, the image resolution is increased to 448 x 448 before the images are fed into the network. The final layer predicts class probabilities and bounding box coordinates; it uses a linear activation, whereas the other layers use ReLU activations. In the end, the output is the bounding boxes together with the predicted object class probabilities.


YOLO Version 3

YOLOv3 uses Darknet-53 as its backbone, within a 106-layer detection network; feature maps are extracted at layers 82, 94, and 106 for the predictions.

YOLOv3 predicts three bounding boxes per cell, and it makes predictions at three different scales, at those three different layers.

YOLO Version 5

YOLOv5 is a family of object detection models and detection methods based on the YOLO architecture and pretrained on the COCO dataset. It is an open-source project maintained by Ultralytics.
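A pretrained YOLOv5 model can be loaded straight from the Ultralytics repository through PyTorch Hub; a minimal sketch (the image path is illustrative):

```python
import torch

# Load the small YOLOv5 variant pretrained on COCO
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# Run inference on an image and inspect boxes, confidences, and class labels
results = model("example.jpg")
results.print()
print(results.xyxy[0])  # tensor of [x1, y1, x2, y2, confidence, class] rows
```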

About SHAP Explainability

SHAP (SHapley Additive exPlanations) is an approach to explaining the output of machine learning models. Using the shap library, Shapley values are calculated; a Shapley value captures the marginal contribution of a feature to a machine learning model’s prediction.

To make this clear, let’s take a non-technical example. When a student graduates from a university, we only know that they have graduated; we don’t know which subjects the student is good or bad at. To get those details, we can look at the transcript of records.

Similarly, in image object detection, we can use SHAP values to find out which pixel, grid cell, or feature contributes most to detecting an object. In YOLO, we can compute SHAP values for each grid cell in the image, so we know which cells contributed more or less to the detection. Higher SHAP values indicate a higher contribution, and lower SHAP values a lower one.

Following the Steadforce how-to [3], SHAP values for YOLO have been implemented.

Here we examine SHAP values for different YOLOv5 models and a YOLOv3 model. These models are already pretrained on the COCO dataset; a YOLO model returns the bounding box coordinates, the probability that an object is present at those coordinates, and a probability for each object class in the COCO dataset.

Figure 12: Image from the COCO dataset

Superpixels

To determine which pixels contribute most to the object detection, a superpixel model is created; each superpixel forms a separate grid cell whose contribution to the detection can be measured.

For example, for YOLOv5s:

Figure 13: Superpixel

In the picture above, the superpixel grid cells that contribute more to detecting the object are grayed out.

The YOLO models’ explainability for objects

SHAP values for YOLOv5s

A KernelExplainer is fit on the superpixel model and gives SHAP values for each grid cell.
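A rough sketch of this setup in the spirit of the Steadforce how-to [3]: superpixels are toggled on or off, the masked image is passed through the YOLOv5 model, and the confidence of the target class is returned to `shap.KernelExplainer`. The grid size, the `target_score` helper, and the preloaded `image` and `model` are assumptions for illustration:

```python
import numpy as np
import shap

GRID = 16                      # 16 x 16 superpixel grid (an assumed size)
H, W = image.shape[:2]         # `image`: the input image as a NumPy array (H, W, 3)
ch, cw = H // GRID, W // GRID  # height and width of one superpixel cell

def mask_image(masks, image, background=128):
    """Gray out the superpixels whose mask entry is 0."""
    out = np.repeat(image[np.newaxis].astype(np.uint8), masks.shape[0], axis=0)
    for i, mask in enumerate(masks):
        for cell, keep in enumerate(mask):
            if not keep:
                r, c = divmod(cell, GRID)
                out[i, r * ch:(r + 1) * ch, c * cw:(c + 1) * cw] = background
    return out

def f(masks):
    """Target-class confidence for each masked image; target_score is an assumed helper
    that pulls the confidence of the chosen class out of the YOLOv5 results."""
    return np.array([target_score(model(img)) for img in mask_image(masks, image)])

# Background: all superpixels grayed out; explained instance: all superpixels visible
explainer = shap.KernelExplainer(f, np.zeros((1, GRID * GRID)))
shap_values = explainer.shap_values(np.ones((1, GRID * GRID)), nsamples=1000)
```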

Figure 14: Superpixel contribution to the target (person) prediction on YOLOv5s

The superpixel with a high score for the person class (the target) is marked with a box in the picture. Looking at the box, the dark red grid cells contribute more to the person detection than the blue cells; that is, the dark red cells have higher SHAP values than the blue cells.

Similarly, for YOLOv5x:

Figure 15: Superpixel contribution to the target (dog) prediction on YOLOv5x

Here, the target class is dog. Looking at the box, the model focuses on the dog’s legs for detecting the dog class rather than on the other grid cells. The other cells contribute little to the detection, which is expected since the model is trained on the COCO dog class.

YOLOv3

Figure 16: Superpixel contribution to the target (dog) prediction on YOLOv3

In YOLOv3, the dog’s torso receives more focus than in the YOLOv5x model, while the dog’s legs are focused on similarly in both models.

YOLOv5L

Figure 17: Superpixel contribution to the target (skateboard) prediction on YOLOv5L

In this picture, the target superpixel is the skateboard. The model focuses on the person’s legs and the parts of the skateboard, but it does not focus on the other person’s skateboard in the background. From this, we can conclude that YOLO models are not well suited for tiny object detection.

Comparing the YOLO models’ explainability for the “Person” class

Comparison of the YOLO models with respect to explainability for the person class (we compiled this table).

The higher the contribution to the correct object class, the higher the prediction confidence.

Image Classification vs. Object Detection Explainability

We observed that classifying an image takes a wider view of the image, which in the fake news task means looking at its unusual parts, whereas object detection focuses on specific objects in the image, whether small or large. According to our observations, the position of the object the model looks at matters a lot in object detection tasks but much less in fake news classification.

Conclusion

We conclude that the explainability methods performed well on the tasks we explored. We delved into training CNN models for the fake news detection task and applied gradient-based explainability methods. Furthermore, we investigated how the SHAP explainability tool can be applied to pretrained YOLO models (trained on the COCO dataset) and gained insights from their explanations.

References

[1] https://shap.readthedocs.io/en/latest/#

[2] https://christophm.github.io/interpretable-ml-book/neural-networks.html

[3] https://www.steadforce.com/how-tos/explainable-object-detection

[4] https://colab.research.google.com/drive/1RVCpvRdtAzbcda5M2NZzEnm1jyp4CQc2?usp=sharing

[5] https://captum.ai/tutorials/CIFAR_TorchVision_Interpret

[6] https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf
