Top explanation methods for medical images

Weronika Hryniewska
Published in ResponsibleML · 6 min read · Jul 19, 2021

In this article, we give a quick introduction to explanation methods (XAI) and then review the methods used to explain models that classify images of COVID-19 patients, focusing on the most popular approaches and the most common mistakes.

When designing predictive models for healthcare or any other high-stakes decisions, the explainability of the model is a key part of the solution. The empirical performance of the model is very important, but there can be no responsible modeling if the issue of explainability is not addressed properly for each stakeholder of the system.

For physicians, the lack of explainability drastically reduces confidence in the system. For model developers, it makes it difficult to detect flaws in model behavior and obstructs debugging (Biecek, 2021; Holzinger, 2017).

For predictive models, there are two general approaches to explainability: using interpretable-by-design models or applying post-hoc explanations. Despite the obvious advantages of interpretable-by-design models, their construction requires more domain knowledge to design interpretable features. The advantage of post-hoc explanations is that they are constructed after the model has been trained. Thus, the developer can focus on model performance by pouring large volumes of data into a neural network and then deal with model explanations afterwards.

Figure from paper (Hryniewska, 2021)

Based on their mode of operation, post-hoc explanation methods can also be divided into two groups. The first group consists of input perturbation methods such as Local Interpretable Model-agnostic Explanations (LIME) or Occlusion Sensitivity. These methods analyze how the model response changes after obscuring, removing, or perturbing some part of the image. Their advantage is that they are insensitive to the internal structure of the model; such so-called model-agnostic approaches assume nothing about it. By analyzing how a series of input perturbations affects the final prediction, they determine which parts of the input are important.
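To make the perturbation idea concrete, here is a minimal occlusion-sensitivity sketch in PyTorch. It assumes a trained classifier `model` and a preprocessed image tensor `image` of shape [1, C, H, W]; the patch size, stride, and fill value are illustrative choices, not values from the reviewed studies.

```python
import torch

def occlusion_sensitivity(model, image, target_class, patch=32, stride=16, fill=0.0):
    """Slide an occluding patch over the image and record the probability drop."""
    model.eval()
    with torch.no_grad():
        base = torch.softmax(model(image), dim=1)[0, target_class].item()
    _, _, h, w = image.shape
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heatmap = torch.zeros(rows, cols)
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.clone()
            occluded[:, :, y:y + patch, x:x + patch] = fill  # hide one region
            with torch.no_grad():
                prob = torch.softmax(model(occluded), dim=1)[0, target_class].item()
            heatmap[i, j] = base - prob  # large drop => the hidden region was important
    return heatmap
```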

The second group consists of methods based on the analysis of signal propagation through the network, i.e. model-specific methods. This group uses detailed information about the network architecture and the design of its layers to determine the key regions of the input for the final prediction. The advantage of such approaches is that usually one pass through the network is sufficient to generate explanations. Model-specific methods for explaining CNNs can be organized into a spectrum of solutions, from gradient-based methods to activation map-based methods.

For gradient-based methods, the gradient dy/dx of the model output y for a given class with respect to the input image x is used to calculate saliency maps, which show the parts of the image that contribute the most to the neural network’s decision. For large networks, the gradient information is very noisy, so many modifications of this method reduce noise by smoothing, thresholding, or rescaling. This class of methods includes Guided Backpropagation, Layer-wise Relevance Propagation (LRP), and SmoothGrad.
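Below is a minimal sketch of a gradient saliency map, plus a SmoothGrad-style variant, assuming a trained PyTorch `model` and an input tensor `image` of shape [1, C, H, W]; the noise level and number of samples are illustrative choices.

```python
import torch

def gradient_saliency(model, image, target_class):
    """Per-pixel |dy/dx| for the logit of the explained class."""
    model.eval()
    x = image.clone().requires_grad_(True)
    score = model(x)[0, target_class]      # logit of the explained class
    score.backward()                       # fills x.grad with dy/dx
    return x.grad.abs().max(dim=1)[0]      # max over channels -> [1, H, W]

def smoothgrad_saliency(model, image, target_class, n=25, sigma=0.1):
    """Average saliency over noisy copies of the input to reduce gradient noise."""
    maps = [gradient_saliency(model, image + sigma * torch.randn_like(image), target_class)
            for _ in range(n)]
    return torch.stack(maps).mean(dim=0)
```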

Methods based on activation maps, such as Class Activation Mapping (CAM) or DeepLIFT, focus on visualizing the relationship between the layer with the feature map (in most cases the penultimate layer of the network) and the model output. Assuming that the feature map stores information about the spatial relevance of features, one can explore which elements of the feature map are most relevant for the final prediction. Such methods often make assumptions about the structure of the network, such as global average pooling before the softmax layer.
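As an illustration of the CAM idea, here is a minimal sketch assuming exactly the architecture the paragraph describes: a convolutional backbone, global average pooling, then a single linear classifier. The tensor `features` (the last convolutional feature map) and the matrix `fc_weights` (the classifier weights) are assumed inputs, not part of any specific library API.

```python
import torch

def class_activation_map(features, fc_weights, target_class):
    """features: [1, K, Hf, Wf] conv feature map; fc_weights: [num_classes, K]."""
    weights = fc_weights[target_class]                      # class-specific weights [K]
    cam = torch.einsum('k,kij->ij', weights, features[0])   # weighted sum of feature maps
    cam = torch.relu(cam)
    return cam / (cam.max() + 1e-8)                         # normalize to [0, 1]
```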

In our analysis, the most popular solution for COVID-19 classification turned out to be the one that combines both approaches mentioned above: tracing the gradient between the model prediction and the feature map, and then analyzing the spatial information in the relevant parts of the feature map. This group includes the most popular explanation method, Grad-CAM, and its modifications such as Guided Grad-CAM (which combines Grad-CAM with Guided Backpropagation) and Grad-CAM++. Tracking gradients between the feature map and the network output is also more flexible with respect to the network architecture, as it does not enforce global pooling.
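To make this concrete, here is a minimal Grad-CAM sketch in PyTorch using forward and backward hooks; the `target_layer` argument (e.g., the last convolutional block of a ResNet) and the tensor shapes are assumptions, not the exact implementation used in the reviewed studies.

```python
import torch

def grad_cam(model, image, target_class, target_layer):
    """Grad-CAM for `target_class`, hooked on `target_layer` (a conv layer)."""
    activations, gradients = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: activations.update(value=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: gradients.update(value=go[0]))
    model.eval()
    score = model(image)[0, target_class]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    acts, grads = activations['value'], gradients['value']    # [1, K, Hf, Wf]
    weights = grads.mean(dim=(2, 3), keepdim=True)             # pool gradients per channel
    cam = torch.relu((weights * acts).sum(dim=1))[0]           # weighted sum of feature maps
    return cam / (cam.max() + 1e-8)                            # normalize to [0, 1]
```

The resulting low-resolution map is typically upsampled to the input size and overlaid on the X-ray.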

Table from paper (Hryniewska, 2021)

In most of the reviewed studies, the application of XAI comes down to a series of colorful images without any assessment of how valid these explanations are. Colored explanations obscure the original image, which makes it even more difficult to assess their correctness. In images with XAI heat maps, it is often hard or impossible to see the pathologies and to judge whether the model works well. Raw lung images should be placed next to the explanations.
Also, the explanations should be interpreted or validated by radiologists; otherwise, they are redundant and contribute nothing to the trustworthiness of the model.

Together with the radiologists, we analyzed the explanations from the reviewed works. In the following paragraphs, we discuss the most common mistakes or inappropriate explanations.

Figure from paper (Hryniewska, 2021)

In the first example, in Figure 2a), the model focuses on clavicles, scapulas, and soft tissues, which are outside the lungs. Very likely, the model predicts illness based on an improper part of the image. The areas marked by the explanation should lie inside the chest, on the lung tissue, because COVID-19 lesions are not located on, e.g., lymph nodes. Moreover, some elements cannot be considered decision factors, such as imaging artifacts (cables, breathing tubes, image compression) or embedded markup symbols. To prevent the model from focusing on irrelevant features, in some studies the lungs were segmented and the background was removed. However, this may not help when imaging artifacts are present within the lung area.
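The segmentation step mentioned above can be as simple as zeroing everything outside a binary lung mask before training; a minimal sketch follows, where the mask itself is assumed to come from any lung segmentation model.

```python
import numpy as np

def mask_background(image: np.ndarray, lung_mask: np.ndarray) -> np.ndarray:
    """image: [H, W] grayscale X-ray; lung_mask: [H, W], nonzero inside the lungs."""
    return image * (lung_mask > 0)  # zero out everything outside the lungs
```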

The second example, in Figure 2b), shows that the model does not take the lesions into account: it treats parts of the lungs other than those marked by the radiologist as relevant for the prediction. Explanations that “roughly indicate the infection location” are not acceptable for a robust model; the marked regions should match, with pixel-level accuracy, the areas that radiologists consider relevant.

In the third example, in Figure 2c), the visualization is not clear. The study describes a different XAI method than the one shown in the image. Moreover, the visualization highlights the whole image, so it is impossible to tell which features contributed to the prediction. It is worth pointing out that some explanation methods give clearer results for a specific type of DNN and for a specific domain.

The last example, in Figure 2d), is blurred. The image of the lungs was improperly acquired, and the acquisition should be repeated; the current image is useless for an accurate diagnosis. Such images should be removed during data verification, before model training.

Evaluation of explanation methods is crucial for confirming the trustworthiness of a model. First of all, radiologists should validate a specific model with the help of XAI. They should assess the location, size, and shape of the regions marked by explanation methods, and their interpretations should contain clear references to structures and lesions in the lungs.

Secondly, quantitative and qualitative evaluation of XAI methods is needed to assess the trustworthiness of a model. The model may give more weight to some image features than it should; to uncover this kind of mistake, several XAI methods ought to be applied so that their results can be compared.

In addition, a color scale should be placed next to the heat map so that there is no doubt what a particular color means.
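A minimal matplotlib sketch of such a presentation: the raw image next to the explanation overlay, with a color bar; the array names `xray` and `heatmap` are assumptions.

```python
import matplotlib.pyplot as plt

def show_explanation(xray, heatmap):
    """Show the raw X-ray next to the explanation overlay, with a color bar."""
    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    axes[0].imshow(xray, cmap='gray')
    axes[0].set_title('Raw image')
    axes[1].imshow(xray, cmap='gray')
    im = axes[1].imshow(heatmap, cmap='jet', alpha=0.4)  # semi-transparent overlay
    axes[1].set_title('Explanation')
    fig.colorbar(im, ax=axes[1], label='relevance')      # make the color scale explicit
    plt.show()
```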

The content of this article is taken from the work (Hryniewska, 2021). To read more about XAI for medical lung images, see this paper. If you use any part of this article, please cite:

Hryniewska, W., Bombiński, P., Szatkowski, P., Tomaszewska, P., Przelaskowski, A., & Biecek, P. (2021). Checklist for responsible deep learning modeling of medical images based on COVID-19 detection studies. Pattern Recognition, 118, 108035. https://doi.org/10.1016/j.patcog.2021.108035

Weronika Hryniewska (ResponsibleML)
PhD Student at the Warsaw University of Technology. Interested in XAI and DL in medicine.