Intracranial Haemorrhage Detection using Deep Learning

Gopal B
13 min read · Feb 18, 2020

In this blog, I present the work I did for a Kaggle competition aimed at detecting the subtypes of acute intracranial hemorrhage in head CT scans using deep learning. The dataset is provided by the Radiological Society of North America (RSNA). The link to the competition can be found here.

Introduction

Intracranial hemorrhage refers to bleeding that occurs inside the cranium; it is a severe health problem requiring rapid and often intensive medical treatment. For example, intracranial hemorrhages account for approximately 10% of strokes in the U.S., where stroke is the fifth-leading cause of death.¹

There are five hemorrhage subtypes: Intraparenchymal, Intraventricular, Subarachnoid, Subdural, and Epidural (refer to fig.1). Patients may exhibit more than one type of cerebral hemorrhage, which may appear in the same image.

Figure 1: Intracranial hemorrhage subtypes. [2]

While all acute (or new) hemorrhages appear dense (or white) on computed tomography (CT), the primary imaging features that help radiologists determine the subtype of hemorrhage are the location, shape, and proximity to other structures.²

The RSNA Intracranial Hemorrhage Detection challenge was launched on Kaggle in September 2019. The goal of the competition is to build an algorithm to detect acute intracranial hemorrhage and its subtypes.

As a patient can have more than one hemorrhage, this challenge boils down to a multi-label classification problem. In the next section, we give an overview of our approach.

Overview of Approach

The main goal was to perform a comparative study of popular deep learning architectures, more specifically, convolutional neural networks that have performed well in the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) over the past few years.

We compare the following models: VGG-16, ResNet50, Inception V3, MobileNetV2, Xception, and EfficientNet-B0, B2, and B3.

We built an end-to-end deep learning pipeline consisting of:
• Data handling, cleaning, and pre-processing.
• Model implementation using transfer learning.
• Model evaluation (using accuracy and AUC score).
• Comparative analysis of eight deep learning models.

The following section details the methodology for tackling this problem and building the model pipeline.

Methodology

Data description

The head CT scans are provided in DICOM format. DICOM (Digital Imaging and Communications in Medicine) is a standard for handling, storing, printing, and transmitting information in medical imaging.³

The training data is provided as a set of image IDs and multiple labels, one for each of the five subtypes of hemorrhage, plus an additional label, any, which should be true whenever any of the subtype labels is true. There is also a target column, Label, indicating the probability that the given type of hemorrhage exists in the indicated image. The size of the dataset is ≈ 180 GB.

A single sample of the training dataset.²
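For illustration, the labels come in a "long" format, one row per (image, subtype) pair. The sketch below pivots this into one row per image with six binary columns, which is the multi-label form used for training. It is a minimal sketch, not the project code: the file name and the exact column layout ("ID" strings like ID_<image>_<subtype> and a "Label" column) are assumptions based on the competition data description.

import pandas as pd

# Assumed long-format CSV: one row per image/subtype pair, columns "ID" and "Label".
df = pd.read_csv("stage_1_train.csv")  # hypothetical file name

# Split "ID_<image>_<subtype>" into an image ID and a subtype name.
df["image"] = df["ID"].str.rsplit("_", n=1).str[0]
df["subtype"] = df["ID"].str.rsplit("_", n=1).str[1]

# Pivot to one row per image with six binary label columns
# (any, epidural, intraparenchymal, intraventricular, subarachnoid, subdural).
labels = df.pivot_table(index="image", columns="subtype", values="Label").reset_index()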

Data cleaning and pre-processing

The following steps were performed in chronological order:

  1. Removal of duplicates (for an image ID) in the train set.
  2. Upsampling of the Epidural subtype by 8 times (by repeatedly concatenating Epidural-positive samples to the dataset). The training data exhibited substantial class imbalance. The percentage of positive samples in the dataset:

After upsampling the Epidural class (by 8 times):

Note:
(A) The percentage of positive samples for other subtypes also went up as the images can have more than one cerebral hemorrhage. Thus, some upsampling was observed in other subtypes, as well.
(B) The % of negative samples, i.e., images that don't contain any hemorrhage subtype, is 82.74%. In hindsight, the number of negative DICOM images could have been undersampled to facilitate learning in our models. In the experiments below, we have a 1:4.8 ratio of positive to negative samples. This should have been reduced to 1:2.

3. Size of the dataset: The total size of the DICOM image dataset is ≈ 180 GB. It was not feasible to work with such a large amount of data for this project due to the storage and RAM limitations of the machine in use.

Thus, using the MultilabelStratifiedShuffleSplit function provided by the iterative-stratification package, we obtained 10% of the original data, stratified on the multi-labels. This ensured that the distribution of the labels in the new dataset remained the same as in the original.

4. Train-Val split: This new dataset was further split into a training and a validation set using the MultilabelStratifiedShuffleSplit function. A 90–10 split was performed, yielding 65,859 images in the training set and 11,623 images in the validation set. A minimal sketch of steps 2–4 is given below.
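The sketch below is illustrative rather than the exact project code: the labels DataFrame and column names are assumptions, and the stratified splits use MultilabelStratifiedShuffleSplit from the iterative-stratification package mentioned above.

import pandas as pd
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

# `labels` is assumed to hold one row per image with six binary label columns.
label_cols = ["any", "epidural", "intraparenchymal",
              "intraventricular", "subarachnoid", "subdural"]

# Step 2: append extra copies of the Epidural-positive rows
# (approximating the 8x upsampling described above).
epidural_pos = labels[labels["epidural"] == 1]
labels = pd.concat([labels] + [epidural_pos] * 8, ignore_index=True)

# Step 3: keep a stratified 10% subset of the data.
msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.10, random_state=42)
_, subset_idx = next(msss.split(labels, labels[label_cols].values))
subset = labels.iloc[subset_idx]

# Step 4: 90-10 train/validation split, again stratified on the multi-labels.
msss_val = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.10, random_state=42)
train_idx, val_idx = next(msss_val.split(subset, subset[label_cols].values))
train_df, val_df = subset.iloc[train_idx], subset.iloc[val_idx]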

Windowing operation

Windowing, also known as grey-level mapping, contrast stretching, histogram modification, or contrast enhancement, is the process in which the grayscale component of a CT image is manipulated via the CT numbers; doing this changes the appearance of the picture to highlight particular structures (such as the brain or soft tissue).

The brightness of the image is adjusted via the window level (WL). The contrast is adjusted via the window width (WW). The WL is the midpoint of the range of CT numbers displayed, and the WW is the range of CT numbers that the image contains. Given a WW and WL, one can calculate the upper and lower grey levels, i.e., values above the upper bound will be white, and values below the lower bound will be black.⁴
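As a quick worked example (an illustrative helper, not part of the original pipeline), the grey-level bounds follow directly from the definitions above:

def window_bounds(window_level, window_width):
    # Values below `lower` render black; values above `upper` render white.
    lower = window_level - window_width / 2
    upper = window_level + window_width / 2
    return lower, upper

# Example: the brain-matter window (WL = 40, WW = 80) displays CT numbers in [0, 80].
print(window_bounds(40, 80))  # (0.0, 80.0)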

Using windows, we can highlight and emphasize specific voxels (3D-pixels). There are at least 5 windows that a radiologist goes through for each scan! These are:

1. Brain Matter: WW: 80; WL: 40
2. Blood/subdural: WW: 130–300; WL: 50–100
3. Soft tissue: WW: 350–400; WL: 20–60
4. Bone: WW: 2800; WL: 600
5. Grey-white differentiation: WW: 8; WL: 32⁵

Three of these windows (Brain Matter, Blood/Subdural, and Soft tissue) were used, and each was assigned to an image channel. Thus, while training, we construct images on the fly (with windows as channels) using a data generator.

Figure 2: Different windows highlight and capture distinct information. Each of these windows was used as a channel while feeding as input to the CNN (contrast with RGB channel in natural images).

The code for generating the above windows for a DICOM image is given below.¹⁰

import numpy as np

def window_image(dcm, window_center, window_width):
    # correct_dcm is a helper from the referenced Kaggle notebook [10] that fixes
    # scans stored with an incorrect rescale intercept.
    if (dcm.BitsStored == 12) and (dcm.PixelRepresentation == 0) and (int(dcm.RescaleIntercept) > -100):
        correct_dcm(dcm)

    # Convert raw pixel values to CT numbers (Hounsfield units), then clip to the window.
    img = dcm.pixel_array * dcm.RescaleSlope + dcm.RescaleIntercept
    img_min = window_center - window_width // 2
    img_max = window_center + window_width // 2
    img = np.clip(img, img_min, img_max)
    return img

def bsb_window(dcm):
    # Brain, subdural, and soft-tissue windows (window center, window width).
    brain_img = window_image(dcm, 40, 80)
    subdural_img = window_image(dcm, 80, 200)
    soft_img = window_image(dcm, 40, 380)

    # Normalize each window to [0, 1] and stack the three windows as channels.
    brain_img = (brain_img - 0) / 80
    subdural_img = (subdural_img - (-20)) / 200
    soft_img = (soft_img - (-150)) / 380
    bsb_img = np.array([brain_img, subdural_img, soft_img]).transpose(1, 2, 0)
    return bsb_img
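As a usage sketch (the file name is hypothetical), a DICOM file can be read with pydicom and converted into a three-channel windowed image:

import pydicom

dcm = pydicom.dcmread("ID_example.dcm")  # hypothetical file name
img = bsb_window(dcm)                    # shape: (H, W, 3), values in [0, 1]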

Thus, we were able to incorporate domain knowledge from medical image analysis into our deep learning solution by adopting the windowing technique radiologists use while analyzing CT scans.

Network Architectures

We now present the networks we used in our approach and highlight some of their key features. Table 1 lists all the deep learning architectures used in the project. It shows the number of parameters and performance results (Top-1 and Top-5 validation accuracy) on the popular ImageNet-1K dataset. We picked these networks to show the evolution in the architectures in chronological order. Some key features of these models are:

Table 1: Performance of the models on the ImageNet challenge.⁶

VGG16

The VGG-16 network was invented by researchers at the Visual Geometry Group (VGG), University of Oxford. It consists of 13 convolutional and 3 fully-connected layers. It was the runner-up at the ILSVRC 2014 competition. VGGNet became popular because of its uniform architecture: the convolutional layers in VGG-16 are all 3×3 convolutions with a stride of 1 and 'same' padding, and the pooling layers are all 2×2 pooling layers with a stride of 2. However, VGGNet has 138 million parameters, which can be challenging to handle.

VGG-16 network architecture
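To make the uniform architecture concrete, here is an illustrative Keras sketch of a single VGG-style stage (a sketch only, not the project code or the full 16-layer network): a stack of 3×3 convolutions with stride 1 and 'same' padding, followed by 2×2 max pooling with stride 2.

from tensorflow.keras import layers, models

def vgg_block(x, filters, n_convs):
    # A stack of 3x3 convolutions (stride 1, 'same' padding)...
    for _ in range(n_convs):
        x = layers.Conv2D(filters, (3, 3), strides=1, padding="same", activation="relu")(x)
    # ...followed by 2x2 max pooling with stride 2.
    return layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)

inputs = layers.Input(shape=(224, 224, 3))
x = vgg_block(inputs, 64, 2)  # first VGG-16 stage: two 64-filter convolutions
stage_one = models.Model(inputs, x)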

ResNet50

ResNet, or Residual Networks, by Kaiming He et al. introduced the concept of 'identity shortcut connections', a type of skip connection. They present a residual learning framework that allows the training of substantially deeper networks than those used previously. ResNets won first place in the ILSVRC 2015 classification task, achieving a top-5 error rate of 3.57% (after ensembling residual nets), which surpasses human-level performance on this dataset.

ResNet-50 network architecture.⁸
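To illustrate the identity shortcut connection, here is a simplified Keras sketch of a residual block (illustrative only; ResNet-50 actually uses a three-layer bottleneck variant with batch normalization):

from tensorflow.keras import layers

def residual_block(x, filters):
    # Identity shortcut connection (assumes x already has `filters` channels).
    shortcut = x
    y = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    # The stacked layers learn a residual F(x); the block outputs F(x) + x.
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)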

Inception V3

The winner of the ILSVRC 2014 competition was GoogLeNet (a.k.a. Inception V1) from Google. It achieved a top-5 error rate of 6.67%. Here, the Network-in-Network approach is heavily used. The Inception module introduced by Google was focused on building wider and deeper networks while keeping the computational budget constant, using 1×1 convolutions for dimensionality reduction. Their architecture (Inception V1) consisted of a 22-layer-deep CNN but reduced the number of parameters from 60 million (AlexNet) to 4 million.

Inception V3 is a successor to Inception V1, with 24M parameters. The motivation for Inception V2 and Inception V3 was to avoid representational bottlenecks (i.e., drastically reducing the input dimensions of the next layer) and to make computations more efficient by using factorization methods.⁷

Inception V3 architecture.⁹
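To illustrate how 1×1 convolutions keep the computational budget in check, here is a simplified sketch of an Inception-style module (the filter counts are illustrative, not GoogLeNet's exact configuration): 1×1 convolutions compress the channels before the expensive 3×3 and 5×5 branches, and the branch outputs are concatenated.

from tensorflow.keras import layers

def inception_module(x):
    # 1x1 convolutions reduce the channel dimension before the larger filters.
    b1 = layers.Conv2D(64, (1, 1), padding="same", activation="relu")(x)
    b2 = layers.Conv2D(32, (1, 1), padding="same", activation="relu")(x)
    b2 = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(b2)
    b3 = layers.Conv2D(16, (1, 1), padding="same", activation="relu")(x)
    b3 = layers.Conv2D(32, (5, 5), padding="same", activation="relu")(b3)
    b4 = layers.MaxPooling2D((3, 3), strides=1, padding="same")(x)
    b4 = layers.Conv2D(32, (1, 1), padding="same", activation="relu")(b4)
    # The parallel branches are concatenated along the channel axis.
    return layers.Concatenate()([b1, b2, b3, b4])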

MobileNetV2

MobileNet is a neural network architecture that runs very efficiently on mobile devices. It was created by researchers at Google. The main idea behind MobileNetV1 was to use depthwise separable convolutions, which do approximately the same thing as traditional convolutions but are much faster.

MobileNetV2 builds upon the ideas from MobileNetV1, using depthwise separable convolution as efficient building blocks. However, V2 introduces two new features to the architecture: 1) linear bottlenecks between the layers, and 2) shortcut connections between the bottlenecks.¹³

It has only 3.5M parameters, and yet its performance on ImageNet is on par with VGG16 (which has 138M parameters, almost 40 times more than MobileNetV2). This shows the great strides made in algorithmic improvements and network design over the years.

MobileNetV2 network architecture.¹⁴
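A simplified Keras sketch of the two ideas, a depthwise separable convolution followed by a linear bottleneck with a shortcut, is given below (illustrative only; the real MobileNetV2 block also uses ReLU6, batch normalization, and a channel expansion factor):

from tensorflow.keras import layers

def inverted_residual(x, expand_channels, out_channels):
    # 1x1 expansion followed by a cheap 3x3 depthwise convolution.
    y = layers.Conv2D(expand_channels, (1, 1), padding="same", activation="relu")(x)
    y = layers.DepthwiseConv2D((3, 3), padding="same", activation="relu")(y)
    # Linear bottleneck: 1x1 projection with no activation.
    y = layers.Conv2D(out_channels, (1, 1), padding="same")(y)
    # Shortcut connection between bottlenecks (assumes x already has out_channels channels).
    return layers.Add()([y, x])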

Xception

Xception, the eXtreme form of Inception, is an extension of the Inception architecture that replaces the standard Inception modules with depthwise separable convolutions (i.e., a depthwise convolution followed by a pointwise convolution). It provides significant performance benefits owing to the reduction in both parameters and multiply-add operations. It has a similar number of parameters to Inception V3 (23M).

Xception network architecture.¹¹

EfficientNet

EfficientNet by Google introduced the concept of Compound Model Scaling. They proposed a novel model scaling method that uses a simple yet highly
effective compound coefficient to scale up CNNs in a more structured manner. Unlike conventional approaches that arbitrarily scale network dimensions, such as width, depth, and resolution, their method uniformly scales each dimension with a fixed set of scaling coefficients. The EfficientNet models achieve both higher accuracy and better efficiency over existing CNNs, reducing parameter size and FLOPS by an order of magnitude.¹⁵

Model size vs. accuracy comparison. EfficientNet-B0 is the baseline network developed by AutoML MNAS, while EfficientNet-B1 to B7 are obtained by scaling up the baseline network.¹⁶
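Concretely, the EfficientNet paper¹⁵ scales depth, width, and resolution together with a single compound coefficient φ: depth grows as α^φ, width as β^φ, and resolution as γ^φ, with the constants found by a small grid search (roughly α ≈ 1.2, β ≈ 1.1, γ ≈ 1.15) under the constraint α·β²·γ² ≈ 2 so that the FLOPS roughly double for each increment of φ. A small sketch of this rule:

# Compound scaling as described in the EfficientNet paper [15].
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # approximate values from the paper's grid search

def compound_scaling(phi):
    depth_mult = ALPHA ** phi        # more layers
    width_mult = BETA ** phi         # more channels per layer
    resolution_mult = GAMMA ** phi   # larger input images
    return depth_mult, width_mult, resolution_mult

# Example: phi = 0 is the B0 baseline; larger phi yields B1, B2, ...
print(compound_scaling(0))  # (1.0, 1.0, 1.0)
print(compound_scaling(2))  # roughly (1.44, 1.21, 1.32)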

Results and Discussion

Table 2: Performance of the models on RSNA Intracranial Hemorrhage Detection.⁶

Table 2 lists all the deep learning architectures used in the project, their Top-1 accuracies, AUC scores, and number of parameters. Unfortunately, the differences in results are not as significant as expected. However, some trends can still be observed:

  1. EfficientNet-B3 was the best performing model in terms of accuracy and AUC score.
  2. From the family of EfficientNets, we observe that with an increase in model capacity (number of parameters) and scale (depth, width, and resolution), there is an improvement in both accuracy and AUC score. In simpler terms, the metrics improve from B0 to B2 to B3.
  3. VGG16 (the oldest network) performed the worst, even though it has the highest model capacity of 138M parameters.
  4. MobileNet outperformed VGG16 by almost 1% in the AUC score with 40 times fewer parameters.
  5. ResNet50 and EfficientNet-B0 show similar performance. EfficientNet-B0 has five times fewer parameters than ResNet50. These two networks have been compared in the EfficientNet paper and Table 1.
  6. The Xception model performed on par with EfficientNet-B2 and B3. However, it has more parameters than both the models combined.
  7. Xception, being an extension of Inception V3, shows better empirical performance than Inception V3.

Implementation Details

The code for the entire project can be found here. In this section, we dive into the nitty-gritty of the code implementation and the reasoning behind some decisions.

  1. Data Generator: This is used to circumvent the issue of being unable to load and process a massive dataset due to memory-space limitations. A data generator is used to generate batches of data in real-time and feed it to the network directly. In the data generator class implementation, we perform real-time image augmentation and Windowing operations as discussed earlier (i.e., obtaining the Brain, Subdural, and Soft tissue windows from the DICOM and using it as channels of the CT image).
  2. Image augmentation: We perform horizontal flipping with a probability of 0.25, vertical flipping with a lower probability of 0.10 and cropping with a probability of 0.25. It was decided not to use other augmentations such as color jitter, affine transformation, rotations, etc. due to the sheer size of the training dataset.
  3. Network modifications: The convolutional base of each deep learning model was used as a feature extractor. A single dense layer of 6 units (one unit for each class) was added to the convolutional base output (after global average pooling). ImageNet pre-trained weights were used for all models; thus, transfer learning is performed on our medical image dataset using ImageNet weights. Due to the significant difference in image domain between the ImageNet classes and head CT images, it was decided to fine-tune all the layers (i.e., all layer parameters were trainable and amenable to gradient updates). A sketch combining this and several of the training settings below is given after this list.
  4. Loss function: Binary cross-entropy (BCE) loss was used. As this is a multi-label classification problem, a sigmoid activation function is used for the output layer. If it were a multi-class rather than a multi-label classification problem, a softmax activation with categorical cross-entropy loss would have been used. However, in our case, a single image can contain more than one cerebral hemorrhage, so a sigmoid activation with BCE was used.
  5. Metrics: We track two metrics: accuracy and the AUC score. Due to substantial class imbalance, accuracy can be misleading, since even a dummy classifier (i.e., one that always predicts the majority class, in our case 0) achieves high accuracy. Thus, we also track the area under the ROC curve, which captures the tradeoff between the true-positive and false-positive rates. It essentially indicates how well the predicted probabilities for the positive class are separated from those for the negative class.
  6. Optimizer: Adam was used. The main difference between Adam and vanilla stochastic gradient descent (SGD) is that Adam adapts the learning rate per parameter, whereas SGD uses a single fixed learning rate. Adaptive methods like Adam keep track of exponentially decaying averages of past gradients (first moment) and past squared gradients (second moment). This leads to better updates and faster convergence while keeping the computational cost similar to first-order methods like SGD.
  7. Epochs: Each model was trained for only 10 epochs. We aimed to compare the performance of different models and not to maximize the validation accuracy for each model.
  8. Learning rate: We kept a learning rate of 0.0001. A low learning rate was picked because high learning rates lead to larger gradient updates and increase the risk of losing the knowledge encoded in the pre-trained parameters. Thus, for fine-tuning the network, we used a lower learning rate.
  9. Batch size: We kept a fixed batch size of 32. Much research has shown that a mini-batch size of 2–32 yields more stable and generalizable results on multiple benchmarks than large mini-batches.¹²
  10. Callbacks: We only used a ModelCheckpoint to save the model parameters having the best validation accuracy. No learning rate scheduler or early stopping was used.
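Putting several of these details together, here is a hedged Keras sketch of how one of the backbones was adapted and compiled. It is illustrative rather than the exact project code: the choice of ResNet50 as the example backbone, the input size, and the file name are assumptions. The convolutional base is kept fully trainable, a global average pooling layer and a 6-unit sigmoid dense layer are added, and training uses Adam with a learning rate of 0.0001, binary cross-entropy, accuracy and AUC metrics, and a ModelCheckpoint callback.

import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, callbacks
from tensorflow.keras.applications import ResNet50

# Convolutional base with ImageNet weights; all layers remain trainable (full fine-tuning).
base = ResNet50(include_top=False, weights="imagenet", input_shape=(224, 224, 3))

x = layers.GlobalAveragePooling2D()(base.output)
outputs = layers.Dense(6, activation="sigmoid")(x)  # one unit per label, sigmoid for multi-label
model = models.Model(base.input, outputs)

model.compile(
    optimizer=optimizers.Adam(learning_rate=1e-4),        # low LR for fine-tuning
    loss="binary_crossentropy",                           # BCE for multi-label classification
    metrics=["accuracy", tf.keras.metrics.AUC(name="auc")],
)

# Save the weights with the best validation accuracy; no LR scheduler or early stopping.
checkpoint = callbacks.ModelCheckpoint("best_model.h5", monitor="val_accuracy", save_best_only=True)

# train_gen / val_gen are assumed to be the windowing data generators described in item 1.
# model.fit(train_gen, validation_data=val_gen, epochs=10, callbacks=[checkpoint])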

Challenges Faced

  1. Dealing with and handling such a large dataset was a huge issue. Our computational workflow was very inefficient for the problem at hand, which cost us experimentation time.
  2. The final accuracies and AUC scores of the vast range of models were underwhelming. We had expected to see a wider range of results from the models and hoped to exhibit the difference in empirical performance between the older networks and the newer ones.
  3. The colossal training duration did not allow us to try different loss functions (such as weighted cross-entropy) or other upsampling and downsampling techniques. Also, the models were run for only a small number of epochs.
  4. We also could not perform any Ablation studies (due to training time) to see the difference between introducing/removing certain parts of the pipeline to understand the network’s behavior better.
  5. In hindsight, the dataset we had chosen was the source of all our challenges.

The code for the entire project can be found here.

References

[1] https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/overview/description

[2] https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/overview/hemorrhage-types

[3] https://www.dicomlibrary.com/dicom/

[4] https://radiopaedia.org/articles/windowing-ct?lang=gb

[5] https://radiopaedia.org/articles/ct-head-an-approach?lang=gb

[6] https://keras.io/applications/

[7] https://towardsdatascience.com/illustrated-10-cnn-architectures-95d78ace614d#c5a6

[8] https://www.codeproject.com/Articles/1248963/Deep-Learning-using-Python-plus-Keras-Chapter-Re

[9] https://medium.com/@sh.tsang/review-inception-v3-1st-runner-up-image-classification-in-ilsvrc-2015-17915421f77c

[10] Kaggle notebook for windowing: https://www.kaggle.com/akensert/inceptionv3-prev-resnet50-keras-baseline-model

[11] F. Chollet, "Xception: Deep Learning with Depthwise Separable Convolutions," 2017.

[12] D. Masters and C. Luschi, "Revisiting Small Batch Training for Deep Neural Networks," 2018.

[13] https://ai.googleblog.com/2018/04/mobilenetv2-next-generation-of-on.html#1

[14] M. Sandler et al., "MobileNetV2: Inverted Residuals and Linear Bottlenecks," 2018.

[15] M. Tan and Q. V. Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," 2019.

[16] https://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html
