Metrics Mystery: Exploring SCG and CCG for Image Model Evaluation

Shuvam Das
12 min read · Apr 12, 2024


Well, we have all worked with image-oriented models, like DenseNet architectures and so on. However, a problem we all face with such models is finding a proper working metric to evaluate the model’s performance.

In this article, I will discuss a pair of evaluation metrics. I have been researching metrics for a few days and came across some pretty fascinating ones. I will be discussing a few of them in my upcoming articles, starting today with... drum roll... Simple Confidence Gain (SCG) and Concise Confidence Gain (CCG).

Introduction

First, a little bit of formality: an introduction based on the paper I am using to explain the metrics.

In the ever-evolving landscape of deep neural networks (DNNs), particularly in the realm of image recognition facilitated by deep convolutional neural networks (CNNs), there’s been a remarkable surge in performance. Yet, amidst this progress lies a quest to unravel the intricate workings of these models. The complexity inherent in these networks has spurred a keen interest in comprehending and elucidating their inner mechanisms. Too much literature? Well, let me shorten it a bit.

Various visualization techniques have emerged, aiming to shed light on which pixel features wield the most influence in CNNs’ classification predictions for a given image. From leveraging gradients to systematically occluding images, these methods strive to unveil the discriminative pixels driving classification decisions. However, amidst the diversity of algorithms lies a critical question: How do these techniques stack up against each other in identifying crucial image regions? Traditionally, the evaluation of visualizations has leaned towards qualitative assessments and human studies, gauging which regions humans deem most discriminative. Yet, the subjectivity inherent in such measures raises concerns about aligning human perception with CNN’s actual decision-making process. Moreover, the limited sample size in these studies poses challenges to reproducibility.

The paper I am drawing on advocates an objective measurement of feature importance relative to the CNN making the predictions. It introduces two metrics: Simple Confidence Gain (SCG) and Concise Confidence Gain (CCG). SCG measures the increase in accuracy achieved by adding important pixels to an uninformative baseline image, while CCG extends this by assessing the conciseness of the pixel region required for correct classification. Applying these metrics, the authors compared three distinct algorithms on two datasets, including their own collection of building-floor images. The results demonstrate internal consistency and a strong correlation with previous subjective evaluations. However, the authors acknowledge the potential disparity between human perceptions and actual pixel importance.

Importance Functions

Before proposing metrics for evaluating them, the paper formalizes the definition of an importance function. It is assumed that a CNN classifier outputs the probability of an image belonging to a certain class given the trained weights. For clarity, the ith pixel in the image is referred to as I[i]. An importance function takes the image and the classifier as input and outputs a heat map H containing a measure of the relevance of each pixel to the class.

High-confidence regions are those pixels that have high values in the heat map H. In the paper's occlusion-based definition of the heat map, I_{o,j} is the jth occluded image and J is the number of images generated by systematic occlusion.

Imagine you have a computer program that can look at a picture and tell you what’s in it, like whether it’s a cat or a dog. But how does the program decide? That’s where Importance Functions come in. These are tools that help us understand which parts of the picture the program pays the most attention to when making its decision.

We’re going to talk about three ways we can study these Importance Functions:

  1. Occluding Patches: This method involves covering parts of the picture with a gray square. If covering up certain areas makes the program less confident about its answer, then those areas are probably important. We calculate a heat map to show which parts are crucial for the program’s decision.
  2. Gradients: Here, we look at how much the program’s confidence changes when we tweak each pixel in the picture. If changing a pixel a little bit affects the program’s decision a lot, then that pixel is important. This method gives us a map showing which pixels matter most.
  3. Contrastive Marginal Winning Probability (C-MWP): This technique uses a fancy way of figuring out which parts of the picture are most important. It looks at the neurons in the program’s brain (yes, it’s like a tiny digital brain) to see which ones are firing the most when it’s making a decision. By understanding which neurons are active, we can pinpoint the crucial parts of the picture.
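To make the first method concrete, here is a minimal sketch of an occlusion heat map. It is my own simplification, not the paper's exact formula: it assumes non-overlapping patches, a 2D grayscale image, and a hypothetical `predict_proba` callable that returns the classifier's confidence for the class of interest. The `toy_classifier` below is only a stand-in for a real CNN.

```python
import numpy as np

def occlusion_heatmap(image, predict_proba, patch=4, fill=0.5):
    """Assign each patch the confidence drop seen when it is grayed out."""
    h, w = image.shape[:2]
    base_conf = predict_proba(image)
    heat = np.zeros((h, w), dtype=float)
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            occluded = image.copy()
            # Cover one patch with a gray square
            occluded[top:top + patch, left:left + patch] = fill
            drop = base_conf - predict_proba(occluded)
            heat[top:top + patch, left:left + patch] = drop
    return heat

# Hypothetical stand-in for a CNN: confidence = mean brightness of the centre
def toy_classifier(img):
    return float(img[4:8, 4:8].mean())

image = np.zeros((12, 12))
image[4:8, 4:8] = 1.0  # the only "informative" region
heat = occlusion_heatmap(image, toy_classifier, patch=4, fill=0.5)
```

In this toy setup only the centre patch gets a non-zero value, because covering any other patch leaves the classifier's confidence unchanged.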

By using these methods, we can get a better idea of how the program works and what it’s paying attention to when it decides what’s in a picture. This helps us improve the program and understand why it makes the decisions it does.

In trying to figure out how computers decide what’s in a picture, we need to know which parts of the picture are most important for their decision. Different methods might point out different key parts, so we need a fair way to compare them. In the past, people were asked to rate these methods by looking at the pictures, but everyone’s opinion can vary. So now, there’s a new idea: instead of relying on what people think, they’re suggesting a way to measure importance that relies more on how the computer itself sees the picture.

This new approach uses the computer program’s own responses to understand which parts of the picture really matter for its decision. It involves creating different versions of the picture: a base image produced by a filter (for example, a blur), and hybrid images in which the pixels highlighted by an importance function are pasted back onto that base. By comparing how the program responds to these different versions, researchers hope to find a more reliable way to figure out what’s important to the computer when it’s looking at images.

Figure 1: (a) an original image, (b) base image obtained using a Gaussian kernel Gk, (c) heat map obtained using C-MWP, where red and blue represent the most and the least important pixels, (d) binary mask obtained after thresholding the heat map for the top ρ=5% pixels, (e) mask obtained after growing the regions of the mask in (d). (f) and (g) are the hybrid images created using the masks in (d) and (e) respectively with base image (b). (h) is the hybrid image obtained using the mask in (d) and a base image obtained using the zeros kernel Zk.

Problem Formulation

Instead of relying solely on heat maps to visually assess whether the importance function has identified significant areas within an image, we’re taking a more practical approach. We’re harnessing these heat maps to pinpoint a specific set of pixels. This set, when integrated into a baseline image, results in a classification accuracy comparable to the original image.

To accomplish this, we generate a binary mask, essentially a map indicating which pixels constitute the important region. Each pixel is assigned a value of either 1 (if it’s considered important) or 0 (if it’s not). We determine this mask by selecting the top percentage of pixels with the highest values from the heat map.
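The thresholding step can be sketched in a few lines of NumPy. This is an illustrative version under my own assumptions: the function name `top_rho_mask` is hypothetical, `rho` is given as a fraction (0.05 for 5%), and ties in the heat map are broken arbitrarily.

```python
import numpy as np

def top_rho_mask(heat, rho):
    """Binary mask marking the top `rho` fraction of pixels by heat value."""
    k = max(1, int(round(rho * heat.size)))
    # Indices of the k largest heat values (order among them is arbitrary)
    flat_idx = np.argpartition(heat.ravel(), -k)[-k:]
    mask = np.zeros(heat.size, dtype=np.uint8)
    mask[flat_idx] = 1
    return mask.reshape(heat.shape)

# Toy heat map whose values increase from the top-left to the bottom-right
heat = np.arange(100, dtype=float).reshape(10, 10)
mask = top_rho_mask(heat, rho=0.05)  # keeps the 5 hottest pixels
```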

Our objective is to enhance a base image with these important pixels to gauge their impact on classification accuracy compared to the original image. We employ different types of base images for this purpose, such as one blurred using a Gaussian kernel and another composed entirely of zeros. These variations in base images allow us to explore the significance of the added pixels in achieving accurate classifications.
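Here is a rough sketch of how such a hybrid image could be assembled for a 2D grayscale image. The function names are hypothetical, and `box_blur_base` is a plain mean filter that I use as a dependency-free stand-in for the Gaussian kernel; it is not the kernel used in the paper.

```python
import numpy as np

def zero_base(image):
    """Base image made entirely of zeros (the 'zeros kernel')."""
    return np.zeros_like(image)

def box_blur_base(image, radius=2):
    """Crude blur: mean over a (2r+1)x(2r+1) window (stand-in for a Gaussian)."""
    padded = np.pad(image, radius, mode="edge")
    size = 2 * radius + 1
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + h, dx:dx + w]
    return out / size**2

def hybrid_image(image, mask, base):
    """Original pixels where mask == 1, base pixels everywhere else."""
    return np.where(mask.astype(bool), image, base)

image = np.arange(36, dtype=float).reshape(6, 6)
mask = np.zeros((6, 6), dtype=np.uint8)
mask[2:4, 2:4] = 1  # pretend these four pixels are the important ones

hybrid_zero = hybrid_image(image, mask, zero_base(image))
hybrid_blur = hybrid_image(image, mask, box_blur_base(image))
```

Both hybrids keep the four masked pixels untouched; they differ only in what surrounds them, which is exactly the variation the two base images are meant to probe.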

Furthermore, we introduce two metrics to quantitatively measure the confidence gain attributed to the important region relative to the original image. These metrics provide insights into the extent to which the identified important pixels contribute to the classifier’s confidence in its predictions. By adopting this comprehensive approach, we aim to gain a deeper understanding of the role of specific pixels in influencing classification outcomes.

Metric: Simple Confidence Gain (SCG)

The Simple Confidence Gain (SCG) provides a straightforward way to evaluate the impact of important features on classification accuracy, measured here as the classifier’s predicted probability for the class. It compares the improvement achieved by adding only the important pixels to a base image with the improvement achieved by going from the base image all the way back to the original image.

Here’s how it works: SCG calculates the ratio of the accuracy improvement from the base image to a hybrid image containing only important features, compared to the improvement from the base image to the original image. This comparison is crucial because it helps us understand how much the identified important pixels contribute to the overall accuracy.

It’s important to note that SCG assumes a predefined kernel, which remains the same for all masks being compared. This ensures consistency in the evaluation process. Additionally, SCG calibrates the classification probabilities to measure only the relative increase in accuracy attributed to the important features, excluding any influence from non-important regions that may have been altered by the kernel.

SCG generates values ranging from 0 to 1. A value close to 1 indicates that the masked pixels significantly enhance classifier accuracy, while values closer to 0 suggest minimal contribution from the mask. This metric provides a clear indication of how effective the identified important features are at improving classification accuracy, and it facilitates informed decision-making in model evaluation and refinement.
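Reading the description above as a ratio of two confidence gains, SCG can be sketched as follows. This is my interpretation of the prose, not code from the paper: `p_original`, `p_hybrid`, and `p_base` are hypothetical names for the classifier's probability of the class on the original, hybrid, and base images, and the error for a non-positive denominator reflects the assumption that the original image is more informative than the base.

```python
def simple_confidence_gain(p_original, p_hybrid, p_base):
    """Gain from base to hybrid, relative to the gain from base to original."""
    gain_available = p_original - p_base
    if gain_available <= 0:
        # The base image is already as confident as the original,
        # so the metric's assumption does not hold for this image.
        raise ValueError("original image must beat the base image")
    return (p_hybrid - p_base) / gain_available

# Hypothetical probabilities for one image
scg = simple_confidence_gain(p_original=0.9, p_hybrid=0.7, p_base=0.1)
```

With these made-up numbers the hybrid recovers about three quarters of the confidence that the full image has over the base.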

Metric: Concise Confidence Gain (CCG)

The Concise Confidence Gain (CCG) metric enhances the evaluation provided by SCG in two key aspects, adding depth and precision to the assessment of important regions in image classification. Firstly, CCG emphasizes the necessity for the identified important region to not only exist but also contribute significantly to accurate classification. This ensures that the evaluation focuses on regions that are truly informative for the classifier’s decision-making process. Secondly, CCG delves into the spatial characteristics of the important region, measuring its compactness relative to the overall image size.

Here’s a closer look at how CCG operates: It aims to expand the initially identified important region under the mask (M) to create a new, accurate mask (AM) that sufficiently encompasses the features necessary for correct classification. This expanded mask, when integrated into the hybrid image (I_{AM,K}), ensures that the classifier accurately predicts the class. Various methods can be employed to expand the mask, including adjusting the threshold of the heat map or utilizing operations like dilation to enlarge the boundary regions of the important pixels.
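One way to picture the dilation route is the loop below. It is a sketch under my own assumptions: `dilate3x3` is a plain NumPy binary dilation with a 3x3 square, `is_correct` is a hypothetical callable that would build the hybrid image and ask the classifier whether it predicts the right class, and `max_steps` is an arbitrary cap I added so the loop always terminates.

```python
import numpy as np

def dilate3x3(mask):
    """Binary dilation with a 3x3 square structuring element."""
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), 1, mode="constant")
    out = np.zeros((h, w), dtype=bool)
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out.astype(np.uint8)

def grow_until_correct(mask, is_correct, max_steps=50):
    """Dilate `mask` until is_correct(mask) holds.

    Returns (accurate_mask, steps), or (None, max_steps) if it never holds.
    """
    current = mask.copy()
    for step in range(max_steps + 1):
        if is_correct(current):
            return current, step
        current = dilate3x3(current)
    return None, max_steps

# Toy check standing in for "the classifier gets the hybrid image right":
# here the mask is 'accurate' once it covers at least 25 pixels.
seed = np.zeros((9, 9), dtype=np.uint8)
seed[4, 4] = 1
accurate_mask, steps = grow_until_correct(seed, lambda m: m.sum() >= 25)
```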

Once the new hybrid image is formed, CCG is calculated by dividing the relative confidence by the ratio of the area masked by AM to the total image size (N). Unlike SCG, which provides a general assessment of the importance of identified features, CCG offers a more nuanced understanding by considering both the accuracy of the hybrid image and the compactness of the mask.
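Put as code, the calculation described above might look like this. It is a sketch of my reading of the text: the relative confidence is computed the same way as in SCG but with the accurate mask AM, and all names and numbers are hypothetical.

```python
import numpy as np

def concise_confidence_gain(p_original, p_hybrid_am, p_base, accurate_mask):
    """Relative confidence of the AM hybrid, divided by the fraction of pixels in AM."""
    relative_confidence = (p_hybrid_am - p_base) / (p_original - p_base)
    area_fraction = accurate_mask.sum() / accurate_mask.size  # |AM| / N
    return relative_confidence / area_fraction

# Toy accurate mask covering 25 of 100 pixels
am = np.zeros((10, 10), dtype=np.uint8)
am[:5, :5] = 1
ccg = concise_confidence_gain(p_original=0.9, p_hybrid_am=0.7, p_base=0.1,
                              accurate_mask=am)
```

Because the denominator is a fraction of the image, CCG can exceed 1: a mask that recovers most of the confidence while covering a quarter of the image scores higher than one that needs half the image to do the same.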

In essence, while SCG measures the total information contained in a set of features, CCG quantifies the density of information within a region necessary to determine the class accurately. This comprehensive approach allows for effective comparison of features of different sizes and provides valuable insights into the importance and spatial characteristics of identified regions in image classification. Through the integration of CCG alongside SCG, researchers gain a more holistic understanding of the factors influencing classification outcomes, enabling informed decision-making in model evaluation and refinement.

Experiments and Results

Experiments were conducted on two datasets with three importance functions. The paper outlines the datasets and the CNNs used, details the experimental procedure, and then presents results analyzing each function with Simple Confidence Gain (SCG) and Concise Confidence Gain (CCG). The authors also compared the accuracy of individual importance masks, aiming to provide insights into their reliability for image classification tasks.

The three importance functions used in the experiments required specific parameters to be set for optimal performance. For the “occluding patches” (occ) method, the size of the patches was varied to observe its impact on the results. They experimented with patch sizes of 10, 50, and 100 pixels to determine which size yielded the most effective outcomes. This variation allowed them to assess how different levels of occlusion affected the identification of important features within the images.

Similarly, for the “gradients” (grad) technique, they applied a method to smooth out the initial heat map, which tends to have high entropy. By dilating the raw heat map with a 3x3 kernel multiple times (0, 2, and 5), they aimed to enhance the continuity of important regions. This process helped to clarify and refine the boundaries of significant features within the images, making them easier to interpret and analyze.
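For a real-valued heat map, dilation with a 3x3 kernel amounts to replacing each pixel with the maximum of its 3x3 neighbourhood. The sketch below shows that idea in plain NumPy; the function name and the edge-padding choice are my own assumptions, not details from the paper.

```python
import numpy as np

def dilate_heatmap(heat, iterations):
    """Grayscale dilation: each pass takes the max over every 3x3 neighbourhood."""
    out = heat.astype(float)
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="edge")
        # Nine shifted views of the map, one per position in the 3x3 window
        shifted = [padded[dy:dy + h, dx:dx + w]
                   for dy in range(3) for dx in range(3)]
        out = np.max(shifted, axis=0)
    return out

# A single 'spiky' gradient response in the middle of a 7x7 map
heat = np.zeros((7, 7))
heat[3, 3] = 1.0
smoothed = {n: dilate_heatmap(heat, n) for n in (0, 2, 5)}
```

Each pass spreads a peak by one pixel in every direction, so isolated spikes turn into connected blobs.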

Figure 2: (a) The image masks for the (top) occ importance function and (bottom) grad importance function generated with the parameters ρ=25% and with dilation = {0, 2, 5} respectively for one image from the Building-Floor dataset (left three images) and one image from the Places365 dataset belonging to the class amusement station (right three images). (b) A side by side comparison of occ (patch size = 10), grad (dilation = 5), and C-MWP respectively on an image from the Building-Floor and the Places365 dataset (ρ=25%).

Figure 3: A side by side comparison of the three pairs of importance functions (grad+occ, C-MWP+grad, and C-MWP+occ respectively) on an image from the Building-Floor dataset and the Places365 dataset (ρ=25%).

In the case of the “Contrastive Marginal Winning Probability” (C-MWP) method, they used a specific layer of the neural network to generate the heat map. To keep the implementation consistent with the original method, they relied on the source code provided by its authors.

Once the heat maps were obtained from each importance function, they proceeded to convert them into binary masks using a simple thresholding approach. This involved selecting the top 5% and 25% of features consistently across tests to create the binary masks. By standardizing the thresholding process, they aimed to ensure consistency and comparability in the evaluation of the Importance Functions.

Furthermore, the creation of base images involved employing two distinct techniques: one that utilized a Gaussian kernel to blur the images and another that replaced unimportant pixels with black using a zero kernel. These variations in base image generation allowed them to explore different representations of the images and assess the impact of these representations on the accuracy of the subsequent hybrid images.

To obtain hybrid images that the classifier labels correctly, they expanded the regions of the binary masks using a 3x3 dilate operation. The grown mask is the accurate mask (AM) that CCG requires, so this step ties the features identified by the original mask to a region that is sufficient for correct classification.

During the experiments, certain criteria were applied to filter out images that did not meet specific conditions. For instance, images where the changes made by the importance functions contradicted the assumptions of the experiment were excluded from the analysis. Additionally, images whose masks had to be enlarged excessively were omitted from certain calculations, since such growth would distort the original mask and undermine the reliability of the results.

Conclusion

In total, 38 images from the Building-Floor dataset and 180 images from the Places365 dataset were used for testing, providing a comprehensive evaluation of the individual importance functions’ masks. The quantitative results showcased the effectiveness of each function, revealing insights into their performance across different parameters. For “occluding patches” (occ), experiments showed that smaller patch sizes tended to perform better on average, capturing more concise and relevant features within the images. Similarly, for “gradients” (grad), dilating the heat map improved performance by enhancing the continuity of important regions.

Comparative analysis among the different importance functions highlighted the superiority of the “Contrastive Marginal Winning Probability” (C-MWP) method, followed by “gradients,” and then “occluding patches.” This trend was consistent across various metrics and visual assessments, underscoring the robustness and reliability of the proposed evaluation metrics.

The impact of varying parameters, such as the size of the mask, was also explored. Findings indicated that smaller masks tended to yield higher Concise Confidence Gain (CCG) scores, reflecting the importance of concise and compact feature representations in achieving accurate classifications. Moreover, experiments revealed that the choice of kernel parameters did not significantly influence the performance of the Importance Functions.

An analysis of the agreement between different Importance Functions unveiled that common features resulted in more concise and discriminative regions, which could be advantageous for visualization and classification tasks. Notably, combinations of importance functions, such as C-MWP+grad, demonstrated superior performance compared to individual methods, further emphasizing the value of integrating multiple approaches for feature identification.
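If "common features" is read as the pixels that two importance functions both mark as important, a combined mask is simply a pixel-wise intersection. That reading is my assumption, and the masks below are made-up toy examples rather than real C-MWP or gradient outputs.

```python
import numpy as np

def common_mask(mask_a, mask_b):
    """Pixels that both binary masks flag as important."""
    return np.logical_and(mask_a, mask_b).astype(np.uint8)

# Hypothetical masks from two importance functions
mask_cmwp = np.zeros((6, 6), dtype=np.uint8)
mask_cmwp[1:5, 1:5] = 1   # 16 pixels
mask_grad = np.zeros((6, 6), dtype=np.uint8)
mask_grad[3:6, 3:6] = 1   # 9 pixels

combined = common_mask(mask_cmwp, mask_grad)  # overlap of the two regions
```

The intersection can never be larger than either input, which is consistent with the observation that agreement produces more concise regions.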

Comparing the objective metrics to prior subjective evaluations highlighted consistent trends, with C-MWP consistently outperforming other methods. However, it was noted that subjective assessments may not always align with objective metrics, emphasizing the importance of considering the features actually utilized by the classifier.

In conclusion, this study contributes objective metrics, SCG and CCG, for evaluating importance functions in image classification tasks. By leveraging these metrics, researchers can objectively assess the quality and effectiveness of different feature importance measures, offering valuable insights into the inner workings of convolutional neural networks (CNNs). Ultimately, these metrics serve as valuable tools for advancing our understanding of CNN classification and facilitating more informed decision-making in algorithm development and model interpretation.

So yes, this is a pretty good working metric that can be used to evaluate feature-importance methods for image classifiers. All the images that I have used are from the original research paper. You can check it out here: Classifier-Based Evaluation of Image Feature Importance.

I will be posting more metric-related work soon. Keep an eye out if you want to know more about new working metrics.
