Classification on medical images. What is good and what is bad?

Oct 24, 2023

Overview

In this article, I discuss how computer vision can help with medical diagnostics, using pneumonia detection from chest X-ray images as an example. I use chest X-ray images from a Kaggle dataset. This dataset contains three classes of images: “Normal (no pneumonia)”, “Pneumonia-bacteria”, and “Pneumonia-virus”.

I’m solving the following tasks using this data:

1. Create a system that determines whether an input chest X-ray image belongs to the class “Normal (no pneumonia)” or to the class “Pneumonia (bacteria or virus)”, i.e., a 2-class classifier.

2. Create a system that determines whether an input chest X-ray image belongs to the class “Normal (no pneumonia)”, “Pneumonia-bacteria”, or “Pneumonia-virus”, i.e., a 3-class classifier.

The goal of this research is to find the best computer-vision approach to diagnosis from chest X-ray images.

Training and test data. Input images preprocessing.

The dataset was split into a training set and a test set. The training set contained 3000 images — 1000 “Normal (no pneumonia)”, 1000 “Pneumonia-bacteria”, 1000 “Pneumonia-virus”, selected at random from their respective groups. The rest of the images composed the test set, which thus contained 2908 images — 576 “Normal (no pneumonia)”, 1777 “Pneumonia-bacteria”, 555 “Pneumonia-virus”.
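A split of this kind can be sketched as follows. This is a minimal illustration rather than the code used for the article: the function name `split_per_class`, the file names, and the fixed seed are assumptions; only the per-class counts come from the text above.

```python
import random

def split_per_class(files_by_class, n_train_per_class, seed=0):
    """Randomly pick n_train_per_class files per class for training; the rest go to test."""
    rng = random.Random(seed)  # fixed seed (an assumption) for a reproducible split
    train, test = [], []
    for label, files in sorted(files_by_class.items()):
        shuffled = list(files)
        rng.shuffle(shuffled)
        train += [(f, label) for f in shuffled[:n_train_per_class]]
        test += [(f, label) for f in shuffled[n_train_per_class:]]
    return train, test

# Hypothetical file lists sized like the dataset described above.
files_by_class = {
    "normal": [f"normal_{i}.jpeg" for i in range(1576)],
    "bacteria": [f"bacteria_{i}.jpeg" for i in range(2777)],
    "virus": [f"virus_{i}.jpeg" for i in range(1555)],
}
train, test = split_per_class(files_by_class, n_train_per_class=1000)
print(len(train), len(test))  # 3000 2908
```

Because the training set takes the same number of images from every class, the test set keeps the class imbalance of the source data.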

Image examples — left to right: “Normal (no pneumonia)”, “Pneumonia-bacteria”, “Pneumonia-virus” — are shown below:

All images from the source dataset, in both the “train” and “test” folders, were cropped to remove redundant areas and keep only the regions relevant for diagnosis. Examples of the cropped images (left to right: “Normal (no pneumonia)”, “Pneumonia-bacteria”, “Pneumonia-virus”):

The classifier code was implemented in PyTorch. To prepare an input image, the following torchvision transforms were applied to the input data img, which was initially a numpy array:

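import torch
from torchvision import transforms

img = transforms.ToTensor()(img)           # HxWxC numpy array -> CxHxW tensor
img = transforms.Resize((256, 256))(img)   # resize to 256x256
s, m = torch.std_mean(img, dim=(0, 1, 2))  # one std and one mean over the whole image
img = transforms.Normalize(m, 2*s)(img)    # subtract the mean, divide by twice the std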

The input image was converted to a torch tensor, resized to 256x256 resolution, and normalized. I found that normalizing the uniformly cropped input data improved classifier quality by ~1%.
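The effect of dividing by twice the standard deviation can be checked on a stand-in tensor. The sketch below is only an illustration: it uses a random tensor instead of a real X-ray (an assumption) and writes out the arithmetic that transforms.Normalize performs, namely (img - mean) / std.

```python
import torch

torch.manual_seed(0)
# Stand-in for an image after ToTensor() and Resize(): 3 x 256 x 256, values in [0, 1).
img = torch.rand(3, 256, 256)

s, m = torch.std_mean(img, dim=(0, 1, 2))  # one std and one mean over the whole image
normalized = (img - m) / (2 * s)           # same arithmetic as transforms.Normalize(m, 2*s)

print(normalized.mean().item(), normalized.std().item())
```

Each image ends up with a mean of about 0 and a standard deviation of about 0.5, whatever its original brightness and contrast, so the network sees inputs on a consistent scale.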

CNN models for classification tasks.

Below I report results from the two CNN models that showed the best quality for each classifier: the first contains 2 convolution blocks and the second contains 3 convolution blocks. More complicated models overfitted and gave worse results on the test set. Code samples for each model follow, with r_size = 256 and num_classes = 2 or 3 depending on the classes of interest:

2 convolution blocks:

import torch
import torch.nn as nn

class ChestClassifier(nn.Module):
    def __init__(self):
        super(ChestClassifier, self).__init__()
        nc = 32
        nc2 = nc * 2
        sz = int(r_size/16)  # each block (stride-2 conv + max pool) shrinks the side by 4

        self.cnn1 = nn.Sequential(
            nn.Conv2d(3, nc, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(nc),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size = 2, stride = 2),

            nn.Conv2d(nc, nc2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(nc2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size = 2, stride = 2),
            nn.Dropout(0.4),
        )
        self.fc1 = nn.Sequential(
            nn.Linear(sz*sz*nc2, nc2),
            nn.ReLU(inplace=True),
            nn.Linear(nc2, num_classes),  # one output per class
        )

    def forward(self, x):
        out1 = self.cnn1(x)
        out1 = torch.flatten(out1, 1)
        output = self.fc1(out1)
        return output

3 convolution blocks:

class ChestClassifier(nn.Module):
    def __init__(self):

        super(ChestClassifier, self).__init__()
        nc = 24
        nc2 = nc * 2
        nc4 = nc * 4
        sz = int(r_size/32)  # two stride-2 blocks shrink the side by 4 each, the stride-1 block by 2

        self.cnn1 = nn.Sequential(
            nn.Conv2d(3, nc, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(nc),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size = 2, stride = 2),

            nn.Conv2d(nc, nc2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(nc2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size = 2, stride = 2),
            nn.Dropout(0.2),

            nn.Conv2d(nc2, nc4, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(nc4),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size = 2, stride = 2),
            nn.Dropout(0.3),
        )
        self.fc1 = nn.Sequential(
            nn.Linear(sz*sz*nc4, nc2),
            nn.ReLU(inplace=True),
            nn.Linear(nc2, num_classes),  # one output per class
        )

    def forward(self, x):
        out1 = self.cnn1(x)
        out1 = torch.flatten(out1, 1)
        output = self.fc1(out1)
        return output
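The values of sz in both models follow from how each layer shrinks the feature map. As a quick check, the sketch below applies the standard output-size formula for convolution and pooling layers to the layer settings used above; the helper names `conv_out` and `feature_map_size` are introduced here only for illustration.

```python
def conv_out(n, kernel, stride, padding=0):
    """Output size of a Conv2d or MaxPool2d layer along one spatial dimension."""
    return (n + 2 * padding - kernel) // stride + 1

def feature_map_size(r_size, conv_strides):
    """Spatial size after a stack of [3x3 conv (padding 1) -> 2x2 max pool] blocks."""
    n = r_size
    for stride in conv_strides:
        n = conv_out(n, kernel=3, stride=stride, padding=1)  # nn.Conv2d(..., kernel_size=3, padding=1)
        n = conv_out(n, kernel=2, stride=2)                  # nn.MaxPool2d(kernel_size=2, stride=2)
    return n

r_size = 256
print(feature_map_size(r_size, [2, 2]))     # 2-block model: 16 = r_size / 16
print(feature_map_size(r_size, [2, 2, 1]))  # 3-block model: 8 = r_size / 32
```

The flattened input to the first linear layer is therefore 16*16*64 values for the 2-block model and 8*8*96 values for the 3-block model.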

I used the Adam optimizer with a variable learning rate that started at 0.001 and was lowered to 0.0001 during training. A note that applies to all models I trained and evaluated: since I didn’t want to spend data on a validation set, I saved the model every 10 epochs during training to see the trend, and chose the best checkpoint after training by verifying it on the test set.
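A training loop with this schedule might look like the sketch below. It is not the original training code: the tiny model, the random data, the number of epochs, the switch of the learning rate at epoch 20, and the checkpoint file names are all assumptions chosen so that the example runs in a second.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
# Tiny stand-ins (assumptions); in practice use ChestClassifier and real image batches.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2))
images = torch.randn(16, 3, 8, 8)
labels = torch.randint(0, 2, (16,))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Lower the learning rate from 0.001 to 0.0001; switching at epoch 20 is an assumed value.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.1)

checkpoint_dir = tempfile.mkdtemp()
saved = []
for epoch in range(1, 31):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()
    if epoch % 10 == 0:  # keep a snapshot every 10 epochs for later comparison
        path = os.path.join(checkpoint_dir, f"model_epoch_{epoch}.pt")
        torch.save(model.state_dict(), path)
        saved.append(path)

final_lr = optimizer.param_groups[0]["lr"]
print(len(saved), final_lr)
```

Each saved state dict can later be loaded with model.load_state_dict(torch.load(path)) and evaluated separately.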

2-class classifier to “Normal (no pneumonia)” or “Pneumonia (bacteria or virus)”. Trained model evaluation.

3000 images were used for training, and 2908 images were used for testing. I applied techniques to keep the training batches balanced (~50% of images from each class). In the results below “Class 0” means “Normal (no pneumonia)” and “Class 1” means “Pneumonia (bacteria or virus)”.
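The article does not say which balancing technique was used. One common option in PyTorch is WeightedRandomSampler, sketched below with dummy data: the zero-valued features stand in for images, and the batch size of 50 is an assumption.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)
# Binary labels for the 3000 training images: 1000 "Normal" (0) and 2000 "Pneumonia" (1).
labels = torch.cat([torch.zeros(1000, dtype=torch.long), torch.ones(2000, dtype=torch.long)])
# Dummy features stand in for the images (a real Dataset would load X-rays here).
dataset = TensorDataset(torch.zeros(len(labels), 1), labels)

class_counts = torch.bincount(labels)                # tensor([1000, 2000])
sample_weights = 1.0 / class_counts[labels].float()  # rarer class gets a larger sampling weight
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(dataset, batch_size=50, sampler=sampler)

seen = torch.cat([batch_labels for _, batch_labels in loader])
share_normal = (seen == 0).float().mean().item()
print(round(share_normal, 2))
```

With these weights each class is drawn with equal probability, so over an epoch roughly half of the sampled images are “Normal” even though that class makes up only a third of the 2-class training set.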

The results for the model with 2 convolution blocks:

Test set:
Class 0:
true positive for the class 0: 530, false negative for the class 0: 46
Accuracy: 92.01% on class 0

Class 1:
true positive for the class 1: 2280, false negative for the class 1: 52
Accuracy: 97.77% on class 1

true count for the whole set: 2810, false count for the whole set: 98
Mean Accuracy on all classes: 94.89%

F-measure: 0.947166128657885

The results for the model with 3 convolution blocks — the best:

Test set:
Class 0:
true positive for the class 0: 543, false negative for the class 0: 33
Accuracy: 94.27% on class 0

Class 1:
true positive for the class 1: 2279, false negative for the class 1: 53
Accuracy: 97.73% on class 1

true count for the whole set: 2822, false count for the whole set: 86
Mean Accuracy on all classes: 96.00%

F-measure: 0.954051320945519
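The article does not state how the summary numbers are computed. The reported values are reproduced if “Accuracy on class k” is the per-class recall, “Mean Accuracy” is the average of the per-class values, and “F-measure” is the macro-averaged F1 score; the sketch below assumes exactly that. The function name `binary_report` is introduced here for illustration.

```python
def binary_report(tp0, fn0, tp1, fn1):
    """Per-class accuracy (recall), their mean, and macro-averaged F1 for two classes."""
    acc0 = tp0 / (tp0 + fn0)
    acc1 = tp1 / (tp1 + fn1)
    # With two classes, a miss on one class is a false positive for the other.
    f1_0 = 2 * tp0 / (2 * tp0 + fn0 + fn1)
    f1_1 = 2 * tp1 / (2 * tp1 + fn1 + fn0)
    return acc0, acc1, (acc0 + acc1) / 2, (f1_0 + f1_1) / 2

# Counts copied from the two result listings above.
two_blocks = binary_report(tp0=530, fn0=46, tp1=2280, fn1=52)
three_blocks = binary_report(tp0=543, fn0=33, tp1=2279, fn1=53)
print([round(v, 4) for v in two_blocks])    # [0.9201, 0.9777, 0.9489, 0.9472]
print([round(v, 4) for v in three_blocks])  # [0.9427, 0.9773, 0.96, 0.9541]
```

Averaging over classes rather than over images keeps the much larger pneumonia class from dominating the score.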

2-class classifier to “Normal (no pneumonia)” or “Pneumonia (bacteria or virus)”. “Small” trained model evaluation.

The “small” trained model is a model trained on a small amount of data, which is a common constraint with medical images. I limited the training set to 1000 images in total: 500 from the “Normal (no pneumonia)” class and 500 from the pneumonia classes, with 250 from “Pneumonia-bacteria” and 250 from “Pneumonia-virus”. The test set was the same 2908 images. In the results below “Class 0” means “Normal (no pneumonia)” and “Class 1” means “Pneumonia (bacteria or virus)”.

The results for the model with 2 convolution blocks:

Test set:
Class 0:
true positive for the class 0: 542, false negative for the class 0: 34
Accuracy: 94.10% on class 0

Class 1:
true positive for the class 1: 2238, false negative for the class 1: 94
Accuracy: 95.97% on class 1

true count for the whole set: 2780, false count for the whole set: 128
Mean Accuracy on all classes: 95.03%

F-measure: 0.9332937637812435

The results for the model with 3 convolution blocks:

Test set:
Class 0:
true positive for the class 0: 542, false negative for the class 0: 34
Accuracy: 94.10% on class 0

Class 1:
true positive for the class 1: 2235, false negative for the class 1: 97
Accuracy: 95.84% on class 1

true count for the whole set: 2777, false count for the whole set: 131
Mean Accuracy on all classes: 94.97%

F-measure: 0.9318544993349989

The “small” model holds up well even against the model trained on 3000 images: its mean accuracy is about 95%, compared with 94.89% and 96.00% for the larger training set. This is because most class 0 images look clearly different from class 1 images. In such a case, even a few hundred images per class in the training set are enough for a good-quality model.

3-class classifier to “Normal (no pneumonia)” or “Pneumonia-bacteria” or “Pneumonia-virus”. Trained model evaluation.

3000 images were used for the training and 2908 images were used for the test. In the results below “Class 0” means “Normal (no pneumonia)”, “Class 1” means “Pneumonia-bacteria”, “Class 2” means “Pneumonia-virus”.

The results for the model with 2 convolution blocks:

Class 0:
true_count: 527, false_count: 49
Accuracy: 91.49%  on class 0
false assigning to the class 1: 9, false assigning to the class 2: 40 

Class 1:
true_count: 1279, false_count: 498
Accuracy: 71.98%  on class 1
false assigning to the class 0: 37, false assigning to the class 2: 461 

Class 2:
true_count: 390, false_count: 165
Accuracy: 70.27%  on class 2
false assigning to the class 0: 31, false assigning to the class 1: 134 

true_count: 2196, false_count: 712
Mean Accuracy on all classes: 77.91%

F-measure: 0.7463764556697652

The results for the model with 3 convolution blocks — the best:

Test set:
Class 0:
true_count: 537, false_count: 39
Accuracy: 93.23%  on class 0
false assigning to the class 1: 7, false assigning to the class 2: 32 

Class 1:
true_count: 1380, false_count: 397
Accuracy: 77.66%  on class 1
false assigning to the class 0: 31, false assigning to the class 2: 366 

Class 2:
true_count: 395, false_count: 160
Accuracy: 71.17%  on class 2
false assigning to the class 0: 36, false assigning to the class 1: 124 

true_count: 2312, false_count: 596
Mean Accuracy on all classes: 80.69%

F-measure: 0.7785463207941641
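The per-class counts above can be arranged into a confusion matrix, which makes it easy to see where the errors come from. The sketch below assumes that the reported F-measure is the macro-averaged F1 score (the reported values are consistent with that); the helper names are introduced here for illustration.

```python
def macro_f1(cm):
    """Macro-averaged F1 from a square confusion matrix (rows: true class, columns: predicted)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        actual = sum(cm[k])                          # images that truly belong to class k
        predicted = sum(cm[i][k] for i in range(n))  # images assigned to class k
        scores.append(2 * tp / (actual + predicted))
    return sum(scores) / n

def pneumonia_confusion_share(cm):
    """Fraction of all errors that are bacteria/virus mix-ups (classes 1 and 2)."""
    errors = sum(map(sum, cm)) - sum(cm[k][k] for k in range(len(cm)))
    return (cm[1][2] + cm[2][1]) / errors

# Counts copied from the two result listings above.
cm_two_blocks = [[527, 9, 40], [37, 1279, 461], [31, 134, 390]]
cm_three_blocks = [[537, 7, 32], [31, 1380, 366], [36, 124, 395]]

for cm in (cm_two_blocks, cm_three_blocks):
    print(round(macro_f1(cm), 4), round(pneumonia_confusion_share(cm), 3))
# 0.7464 0.836
# 0.7785 0.822
```

For both models, more than 80% of all misclassified test images are bacterial cases labeled as viral or the reverse.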

The 3-class results show that the two kinds of pneumonia are not separated with high accuracy. The majority of recognition errors come from poor differentiation between “Pneumonia-bacteria” and “Pneumonia-virus”. I asked my friend, a doctor, the following:

1. Are doctors able to distinguish “Pneumonia-bacteria” from “Pneumonia-virus” in chest X-ray images with high accuracy?

2. Is a classifier with 70+% accuracy on the “Pneumonia-bacteria” and “Pneumonia-virus” classes useful for diagnostics?

Here is her answer: “A standard viral infection is practically invisible on a regular X-ray; there may be an increase in the pulmonary pattern (this is almost always the case in healthy stooped people), and an enlargement of the hilar lymph nodes. A bacterial infection gives a rounded shadow. If there is no shadow but clinical symptoms are present, they do a CT scan. On computed tomography, a sane radiologist applies diagnostic criteria and writes ‘high probability’ of a viral or bacterial nature.” In her opinion, a high-accuracy 2-class classifier for “Normal (no pneumonia)” versus “Pneumonia (bacteria or virus)” is more useful for diagnostics.

Here are examples of classification results from the trained 2-class model (“Normal (no pneumonia)” versus “Pneumonia (bacteria or virus)”) on several test images. The images were normalized. The text above each image contains the predicted value, with the label value in brackets:

The conclusion.

When medical images actually look different for different problem types, and a physician can identify the problem in the image with high accuracy, it is possible to build a high-quality classifier for these problem types from several hundred input images per type. When a doctor cannot identify a problem from an image alone, we cannot expect good performance from a classifier based on those images. In such cases, more complicated methods, including segmentation of abnormal areas, are used for diagnostics.


Written by Olga Mindlina