Preparing a responsible model for COVID-19 patients’ lung images

Weronika Hryniewska
Published in ResponsibleML · 11 min read · Jul 12, 2021

A systematic analysis of many deep neural network models for COVID-19 classification with modules for explainability reveals numerous mistakes made at different stages of data acquisition, model development, and explanation construction. Here, we would like to focus on the model development process. (The previous part, on data acquisition, is linked at the bottom of this page.)

How do we prepare a responsible model? To answer this question, we should look more closely at the domain of our research: medicine, and radiology in particular. Better knowledge of the domain helps us prepare a model that classifies lung images correctly based on the proper features, that is, lesions.

To begin, decide which kind of problem you want to solve. The reviewed studies took many approaches; please take a look at the table below.

Please note that references in square brackets, e.g. [1], are included to stress how many papers used the mentioned method. To find out which work a reference points to, please see (Hryniewska, 2020).

Figure from paper (Hryniewska, 2021)

Main takeaways from this section:

  • There are no strict guidelines on how many classes a classification should distinguish. The reviewed studies used up to 5 different classes: normal, pneumonia, virus, bacteria, and COVID-19.
  • Segmentation can be regarded as an image preparation technique for the subsequent classification process. The lungs are segmented to remove the unnecessary background because, based on medical experience, the lesions caused by COVID-19 are not located outside the lungs.
  • A lesion segmentation, also called an infection mask, helps train the model to recognize infected regions and can be beneficial for later model assessment.
  • Multi-label classification is a very promising approach because, combined with XAI visualizations, it should make clearly visible to radiologists whether a model has learned to recognize the proper lesions. It assigns labels named after lesion types, indicating which changes in the lungs are present in the image.
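
To make the idea of segmentation as an image preparation step more concrete, here is a minimal sketch in Python with NumPy. It assumes a binary lung mask is already available (for example, produced by a segmentation network); the function name crop_to_lungs and the toy data are hypothetical and do not come from the reviewed studies.

```python
import numpy as np


def crop_to_lungs(image: np.ndarray, lung_mask: np.ndarray) -> np.ndarray:
    """Zero out everything outside the lung mask and crop to its bounding box."""
    if image.shape != lung_mask.shape:
        raise ValueError("image and mask must have the same shape")
    mask = lung_mask.astype(bool)
    if not mask.any():
        raise ValueError("lung mask is empty")
    # Rows and columns that contain at least one lung pixel.
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    # Remove the background, then keep only the bounding box of the lungs.
    masked = np.where(mask, image, 0)
    return masked[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


# Toy 6x6 "radiograph" with a 3x2 region marked as lung (illustrative only).
image = np.arange(36, dtype=np.float32).reshape(6, 6)
mask = np.zeros((6, 6), dtype=np.uint8)
mask[1:4, 2:4] = 1
cropped = crop_to_lungs(image, mask)
```

The quality of the result depends entirely on the mask: an inaccurate mask can cut away lesions, so the mask itself needs to be checked before it is used this way.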

Image preprocessing methods

The aim of preprocessing is to make images from different data sources look homogeneous and coherent. This process reduces the possibility of bias by eliminating artifacts, such as captions and annotations, which may deceive the model. The model should learn to differentiate labels by focusing on image features, not by recognizing which database an image comes from. This matters because some databases contain no COVID-19 cases at all, while others contain, for example, only severe cases, so the source database alone can hint at the label. Preprocessing therefore removes irrelevant image features that are easier to learn. Differences that are insignificant from a human perspective must be eliminated too: for a machine, even the fact that images from one data source are relatively darker might be relevant.

However, due to the large amount of data, preprocessing has to be automated. Preprocessing must not introduce any change to an image that adds or removes relevant information. Its purpose is to make it impossible to identify the machine or its characteristic calibration parameters, e.g., the exposure dose.

Table 3 lists the preprocessing techniques used in the reviewed studies. The most common was resizing images to the same size, the most basic operation needed to train a DNN when images have different sizes. Other frequently applied techniques were normalizing pixel intensity, changing the color space, eliminating noise, equalizing the histogram, and performing image enhancement.

Table from paper (Hryniewska, 2021)

Cropping, changing the color space, proportionally resizing, or zooming can help adjust images for training on a specific network architecture, and can be the easiest way to remove descriptions from the edges of images. If not required, resizing ought to be omitted. Normalizing pixel intensity or equalizing histograms is required to eliminate strong correlations with specific machine settings. Spot changes, such as noise removal, are not desirable: such techniques should be used only very carefully, so as not to remove important features such as lesions or parts of them.
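
The two intensity operations mentioned above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it uses min-max normalization and a plain global histogram equalization, which are not necessarily the variants used in the reviewed studies, and the function names and the synthetic "dark scan" are hypothetical.

```python
import numpy as np


def normalize_intensity(image: np.ndarray) -> np.ndarray:
    """Rescale pixel values linearly to the range [0, 1]."""
    image = image.astype(np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:
        # A constant image carries no contrast; return zeros instead of dividing by 0.
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)


def equalize_histogram(image: np.ndarray, bins: int = 256) -> np.ndarray:
    """Global histogram equalization for an image with values in [0, 1]."""
    hist, bin_edges = np.histogram(image.ravel(), bins=bins, range=(0.0, 1.0))
    cdf = hist.cumsum().astype(np.float64)
    cdf /= cdf[-1]  # normalize the cumulative distribution to end at 1
    # Map every pixel through the cumulative distribution of intensities.
    equalized = np.interp(image.ravel(), bin_edges[:-1], cdf)
    return equalized.reshape(image.shape)


# Synthetic "dark" 8-bit scan that only uses intensities 0..59 (illustrative only).
raw = (np.arange(32 * 32).reshape(32, 32) % 60).astype(np.uint8)
normalized = normalize_intensity(raw)
equalized = equalize_histogram(normalized)
```

Both functions are monotonic mappings of intensity, so they change how bright a structure appears but not the ordering of pixel values; spot operations such as denoising do not have this property and deserve more caution.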

To sum up, preprocessing is an important step preceding model training. It should reduce the possibility of bias and guarantee more homogeneous images without the elimination of any medically significant features.

Data augmentation

Data augmentation for ML is a technique that artificially multiplies the number of images by cropping and transforming existing images or by creating new synthetic images with generative adversarial networks. This procedure may help reduce model overfitting and the problem of class imbalance. It yields a larger training dataset and more robust models.

In Table 4, we summarize the data augmentation techniques from the reviewed studies. The most popular are affine transformations, such as rotation, scaling or zooming, flipping, and shifting or translation. In contrast, splitting a radiological image into overlapping patches or generating new content with a type of Generative Adversarial Network is rarely used.

Table from paper (Hryniewska, 2021)

However, not all of them are appropriate from a medical point of view. Before applying an augmentation, it is recommended to consider whether it is ‘safe’ for the chosen domain. For example, rotation should be done carefully, because some parts of the lungs, such as the costophrenic recesses, may end up outside the image. Likewise, brightness or contrast should be changed only to a limited extent, as greater manipulation may obscure lung structure. In predicting COVID-19, it is acceptable to crop or proportionally scale/zoom an image to such an extent that it displays only the lungs, without a background or other parts of the body.

It is also worth noting that, for CT and X-ray images, augmentation based on rotation or flipping generates images that cannot naturally appear in real datasets, because the process of acquiring the image is standardized. Horizontal flips should be done carefully, with some specific limitations. Most pathologies present similarly in the left or right lung, except for changes in the shape of the heart (as in dextrocardia) or pathologies affecting specific lobes, due to different lung anatomy (such as lobar pneumonia or lobar atelectasis). These limitations should be taken into consideration in model design.

In general, all augmentation methods should be discussed with radiologists, as domain knowledge is crucial. In every project, it is important to know the field of research to avoid a situation in which, instead of solving the problem, bias is accidentally introduced.
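
One way to encode such limitations is to make every augmentation bounded and to keep risky ones switched off by default. The sketch below is an assumption-laden illustration in NumPy: the function augment, its parameter names, and the numeric limits are hypothetical and would need to be agreed with radiologists for a real project.

```python
import numpy as np


def augment(
    image: np.ndarray,
    rng: np.random.Generator,
    max_brightness_shift: float = 0.05,
    max_contrast_change: float = 0.10,
    allow_horizontal_flip: bool = False,
) -> np.ndarray:
    """Apply mild, bounded augmentation to an image with values in [0, 1]."""
    out = image.astype(np.float64)
    # Draw small random changes; the bounds keep lung structure visible.
    contrast = 1.0 + rng.uniform(-max_contrast_change, max_contrast_change)
    brightness = rng.uniform(-max_brightness_shift, max_brightness_shift)
    mean = out.mean()
    out = (out - mean) * contrast + mean + brightness
    out = np.clip(out, 0.0, 1.0)
    # Flipping is opt-in, because it is not anatomically neutral in every case.
    if allow_horizontal_flip and rng.random() < 0.5:
        out = out[:, ::-1]
    return out


rng = np.random.default_rng(42)
image = np.linspace(0.0, 1.0, 64).reshape(8, 8)  # illustrative gradient image
augmented = augment(image, rng)
```

Keeping the limits as explicit parameters makes them easy to review with a domain expert and to report alongside the model.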

Model architecture

The studies applied different modeling approaches. Some used classical machine learning methods, whereas the rest used deep learning. In the first case, simple classifiers or their ensembles were applied: AdaBoost [41], KNN [41], Naive Bayes [41], SVM [41].

In the reviewed studies, lung-specific model architectures (own models) were relatively often used for classification, whereas the existing architectures were frequently fine-tuned. The following model architectures or their fine-tuned, modified versions were investigated:

  • ResNet [29] (ResNet18 [31, 42, 47, 50], ResNet34 [32, 42, 46], ResNet50 [30, 39, 40, 45, 46]),
  • DenseNet [29, 43, 48, 50] (DenseNet121 [40, 49], DenseNet-161 [42], DenseNet-201 [39, 46, 49]),
  • VGG [29, 50] (VGG-16 [27, 39, 49, 51], VGG-19 [39, 46, 49]),
  • Inception [50], InceptionV3 [35, 42],
  • InceptionResNetV2 [39, 42, 49],
  • MobileNet [49], MobileNetV2 [39, 49],
  • EfficientNet-B0 [46], Efficient TBCNN [40],
  • NASNetMobile [39, 49], NASNetLarge [49],
  • Res2Net [38],
  • Attention-56 [49],
  • ResNet152V2 [39], ResNet50V2 [44],
  • ResNeXt [42],
  • WideResNet [42],
  • Xception [49],
  • own model [28, 33, 41, 36, 37, 46, 34, 50].

It is clearly visible that there are numerous types of neural networks. Different neural networks can catch different dependencies in the data. For solving a problem, many types of model architectures are tested to find the best one for a specific task. Recommendations on how the explanations should look do not depend on the neural network architecture.

For segmentation, the following architectures were used:

  • U-net [34, 45, 50],
  • Nested version of Unet (Unet++) [50],
  • VB-Net [32],
  • VGG-16 backbone + enhanced feature module [38],
  • (FC)-DenseNet-103 [31],
  • AutoEncoder [47].

During the segmentation process, it is important that the lungs are accurately segmented. Otherwise, distorted border lines may be interpreted as an indication of pathology. In study [31], the authors were aware that their segmentation cut away pathological changes in the lungs. In study [50], the segmentation appears accurate to non-domain experts. However, radiologists noticed that other structures (e.g., bowel loops) were also interpreted as lungs in that segmentation.

There are multiple purposes for creating new model architectures. The most common is adjusting existing architectures for better explainability or scalability when training on medical COVID-19 imaging. The proposed architectures are usually smaller and have fewer trainable parameters than well-known DNN architectures.

Six studies published their code on GitHub: [31, 35, 37, 40, 43, 48]. Other studies did not include any reference to their code or model.

Predictions from multiple models are often combined to improve overall performance. Surprisingly, however, there were not many ensemble models in the reviewed studies.
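
A simple way of combining predictions is soft voting, in which the class probabilities of several models are averaged. The sketch below is a generic illustration with NumPy, not the ensembling scheme of any particular reviewed study; the function name soft_vote and the probability values are made up.

```python
import numpy as np


def soft_vote(probabilities: list) -> tuple:
    """Average class probabilities across models and pick the top class per sample."""
    stacked = np.stack(probabilities)  # shape: (n_models, n_samples, n_classes)
    mean_probs = stacked.mean(axis=0)
    return mean_probs, mean_probs.argmax(axis=1)


# Made-up outputs of three models for two images and three classes.
model_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
model_b = np.array([[0.4, 0.4, 0.2], [0.1, 0.3, 0.6]])
model_c = np.array([[0.5, 0.2, 0.3], [0.2, 0.2, 0.6]])

mean_probs, predicted = soft_vote([model_a, model_b, model_c])
```

In this toy example the second image is assigned to the third class even though one model disagrees, because the averaged probability for that class is the highest.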

Transfer learning

Transfer learning is an ML technique that reuses knowledge gained on one problem for a new one. In the reviewed studies, it is commonly used when the neural network has a large number of parameters or the number of collected samples is too small for a specific task. In such cases, fewer training epochs are needed to adjust the model to the particular task. Pre-trained neural networks of various architectures are available for several popular sources, such as ImageNet and NoisyStudent. Twelve of the 25 studies used a neural network pre-trained on ImageNet for transfer learning, so this is a very common procedure.

However, as (Cheplygina, 2018) shows, it is not clear whether using ImageNet for transfer learning in medical imaging is the best strategy. ImageNet consists of natural images, whereas medical images belong to an entirely different domain. (Kim, 2017) also stressed that features extracted by models pre-trained on ImageNet can introduce bias.

Only in three of the reviewed studies was transfer learning conducted on lung images. The chosen datasets included 112,120 in [48], 88,079 in [43], and 951 in [50] non-COVID-19 lung images. The study [29] did not perform any transfer learning because lung images lack colorful patterns, specific geometrical forms, or similar shapes. The amount of redundant information introduced by a network pre-trained with color images may seriously affect the learning process on gray level images. In study [40], the authors discovered that the model has better performance when pre-trained on ImageNet than without it. However, the authors found out that their models pre-trained on ImageNet were using irrelevant markers on lung images while making a prediction.

Especially when the model is trained on a small amount of data, the use of completely irrelevant features from a pre-trained model may increase the model's measured accuracy. For this reason, it is crucial to find a large database of images similar in domain and appearance, to limit the chance that irrelevant markers take part in a prediction. It is recommended to train a neural network on this database first and then use transfer learning to adjust it to the target task.

For transfer learning, it is recommended to consider the following X-ray data sources with DICOM images (note that some of them mix children's and adults' lungs):

For transfer learning on CT, the following data sources are available:

Training parameters

The selection of hyperparameters has a large impact on model results. Nevertheless, the process of tuning parameters is empirical and depends on the model architecture. For this reason, it is difficult to present a set of parameters adequate for every model architecture. However, there are several tips that can be used for most models.

The learning rate is often decreased during the training process. Callback functions are sometimes used to halt training when the model's result is optimal and, during training, to save and store the best model and its parameters. The most typically used optimizer is Adam. The batch size during model training ranged from 2 to 81, with 8 being the most common value.
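
The logic behind such callbacks can be sketched without any deep learning framework. The class below is a hypothetical, simplified illustration (the name PlateauMonitor, its parameters, and the validation losses are invented): it remembers the best epoch, halves the learning rate when the validation loss stops improving, and signals when training should stop.

```python
class PlateauMonitor:
    """Tracks validation loss, lowers the learning rate on a plateau, signals early stop."""

    def __init__(self, learning_rate, factor=0.5, lr_patience=2, stop_patience=5):
        self.learning_rate = learning_rate
        self.factor = factor
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def update(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            # New best model: a real training loop would save a checkpoint here.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
            return False
        self.epochs_without_improvement += 1
        if self.epochs_without_improvement % self.lr_patience == 0:
            self.learning_rate *= self.factor
        return self.epochs_without_improvement >= self.stop_patience


# Invented validation losses: improvement stops after the third epoch.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.64, 0.65, 0.66]
monitor = PlateauMonitor(learning_rate=1e-3, lr_patience=2, stop_patience=4)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if monitor.update(epoch, loss):
        stopped_at = epoch
        break
```

Deep learning frameworks ship their own callbacks for this purpose; the sketch only shows the bookkeeping they perform.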

The whole image dataset is typically divided into two or three sets, most commonly a training set (80%), a validation set (10%), and a testing set (10%). When only training and testing sets were used, the most frequent proportion was 80% to 20%, respectively.
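
An 80/10/10 division can be sketched with the Python standard library alone. The example below additionally keeps class proportions the same in every set (a stratified split), which is an assumption on our side rather than something reported for the reviewed studies; the function name and the label counts are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) lists of indices, preserving class proportions."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Invented, imbalanced label list with 100 images.
labels = ["covid"] * 20 + ["pneumonia"] * 30 + ["normal"] * 50
train_idx, val_idx, test_idx = stratified_split(labels)
```

A fixed seed makes the division reproducible, so that the same testing set can be reported across experiments.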

Study [39] recommends conducting external validation, meaning an evaluation on an independent database. Another public dataset is the best choice for such cross-database validation [31]. However, cross-validation is the most frequently used approach in the reviewed studies. It is a common choice when only a small amount of data is available. The problem that may occur during cross-validation is overfitting to the data. For this reason, validation on an external data source is the most trustworthy method.
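
For reference, k-fold cross-validation can be sketched as follows. This is a generic, standard-library illustration with a hypothetical function name; note that every fold is drawn from the same data source, which is exactly why it cannot replace validation on an external database.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(10, k=5))
```

Each sample is used for validation exactly once and for training in the remaining folds; an external dataset would simply never appear in any of these index lists.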

Model performance

Evaluation metrics are commonly used to compare different models. For DNN image classification, many metrics are frequently used for model quality assessment. In the reviewed studies, we discovered a large variability in the number of reported metrics. This is common because there are no detailed recommendations as to which performance metrics should be used. We recommend the instructions presented in (Albahri, 2020), but, unfortunately, in almost all the reviewed studies, at least one metric out of these recommendations was missing.

Based on the rules described in (Albahri, 2020), there are six evaluation criteria for binary classification: accuracy, precision, recall (sensitivity), F score, specificity, and AUC. For multi-class classification, there are eight criteria: average accuracy, error rate, the micro-averaged precision_mu, recall_mu, and F score_mu, and the macro-averaged precision_M, recall_M, and F score_M. For multi-label classification, there are four criteria: exact match ratio, labeling F score, retrieval F score, and Hamming loss.
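
Several of these criteria follow directly from confusion-matrix counts. The sketch below is a plain-Python illustration with invented numbers and hypothetical function names; it assumes that the _mu and _M suffixes denote micro- and macro-averaging, and it leaves out AUC and the labeling and retrieval F scores.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, specificity, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


def micro_macro_precision(per_class_counts):
    """per_class_counts: list of (tp, fp) pairs, one per class (each with tp + fp > 0)."""
    total_tp = sum(tp for tp, _ in per_class_counts)
    total_fp = sum(fp for _, fp in per_class_counts)
    micro = total_tp / (total_tp + total_fp)  # pools all decisions together
    macro = sum(tp / (tp + fp) for tp, fp in per_class_counts) / len(per_class_counts)
    return micro, macro


def hamming_loss(y_true, y_pred):
    """Fraction of individual labels that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total


def exact_match_ratio(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted correctly."""
    return sum(tr == pr for tr, pr in zip(y_true, y_pred)) / len(y_true)


# Invented counts and label vectors, for illustration only.
scores = binary_metrics(tp=40, fp=10, tn=45, fn=5)
micro_p, macro_p = micro_macro_precision([(90, 10), (5, 5)])
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
h_loss = hamming_loss(y_true, y_pred)
emr = exact_match_ratio(y_true, y_pred)
```

The micro/macro example shows why both are worth reporting: with one large, well-classified class and one small, poorly classified class, the micro-averaged precision stays high while the macro-averaged one drops noticeably.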

There is another important reason to use more than one evaluation metric: it provides the opportunity to compare model architectures and then choose the best one for a given problem. Nevertheless, the models in the reviewed studies were not trained on the same images. Some databases contained only severe cases, which were easier to classify [34]. Even if studies refer to the same data sources, the amount of data may have increased over time. For this reason, it is rather difficult to make a reliable comparison. The most trustworthy way to compare different model architectures is to look at studies that tested many of them, i.e., [34, 39, 41, 46, 49, 50].

Summary

The sudden outbreak of the COVID-19 pandemic has shown us how much we need effective tools to support physicians. The development of a model that analyzes lung images is a complex process. Deep neural networks can offer much in the analysis of lung images, but responsible modeling requires very thorough model validation. The preparation of responsible deep learning models requires collaboration with domain specialists, in this case radiologists.

The content of this article is taken from the work (Hryniewska, 2020). To read more about preparation for the model training, see this paper. If you use any part of this article, please cite:

Hryniewska, W., Bombiński, P., Szatkowski, P., Tomaszewska, P., Przelaskowski, A., & Biecek, P. (2021). Checklist for responsible deep learning modeling of medical images based on COVID-19 detection studies. Pattern Recognition, 118, 108035. https://doi.org/10.1016/j.patcog.2021.108035

This is the second part of the series about responsible deep learning modeling of medical images based on COVID-19 detection studies. To see the first part, please visit:

Weronika Hryniewska
ResponsibleML

PhD Student at the Warsaw University of Technology. Interested in XAI and DL in medicine.