Study of Vision Models for Chest X-Ray Analysis using Transfer Learning

Swathhy Yaganti
9 min read · Jun 20, 2024


In my previous blog, I discussed CNNs for chest X-ray analysis and their performance. If you haven't read it yet, here is the link. In this post, I will explore transfer learning, discussing its strategies and various pre-trained models. Additionally, we will examine how each of these pre-trained models performs on the task of chest X-ray analysis.

Transfer learning is a technique where knowledge gained from solving one task is applied to a different but related task. This approach saves time and computational resources by leveraging pre-trained models, which have already learned useful features from large datasets. By fine-tuning these models on specific tasks, we can improve performance when labeled data is limited, which is especially common in the medical domain.

Several pre-trained CNNs are widely used for image classification, including VGG16, VGG19, ResNet50, InceptionV3, Xception, DenseNet, and EfficientNetV2B0, among others. These models have been trained on extensive image datasets and can be fine-tuned for specific tasks like chest X-ray analysis, leveraging their powerful feature extraction capabilities to improve performance.

When we reuse a pre-trained model for our own task, we start by removing the original classifier, add a new classifier that fits our purpose, and finally fine-tune the model using one of three strategies:

1. Train the entire model,

2. Train some layers and leave the others frozen,

3. Freeze the convolutional base.

Transfer Learning Strategies (Source: Internet)
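As a concrete illustration, here is a minimal Keras sketch of strategies 2 and 3, assuming a binary Pneumonia/Normal head on top of VGG16; the head layers, learning rates, and dataset objects are illustrative placeholders rather than the exact setup used in my experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Strategy 3: freeze the convolutional base and train only the new classifier.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # Pneumonia vs. Normal
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Strategy 2: unfreeze only the last few convolutional layers and fine-tune
# with a small learning rate so the pre-trained weights are not destroyed.
for layer in base.layers[-4:]:
    layer.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
```

Strategy 1 (training the entire model) simply keeps `base.trainable = True` from the start, at the cost of more compute and a higher risk of overfitting on small datasets.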

I experimented with the chest X-ray pneumonia dataset from Kaggle, which comprises 5,863 images of size (224, 224, 3) split into two categories (Pneumonia/Normal), using a variety of pre-trained models: VGG16, VGG19, ResNet50, XceptionNet, EfficientNetV2B0, InceptionV3, InceptionResNetV2, DenseNet121, and MobileNetV2. Before presenting the experimental results, it is worth looking at each of these models and how they work.

VGG16 (Visual Geometry Group)

VGG16 is a renowned deep convolutional neural network designed primarily for image classification tasks. The architecture consists of 16 weight layers (13 convolutional and 3 fully connected) that process images progressively. VGG16 is characterized by its straightforward design principles: small 3x3 kernels, a stride of 1, and same padding to preserve spatial dimensions. It also includes max-pooling layers with a 2x2 filter size and a stride of 2, which downsample the feature maps while retaining important features. These design choices make VGG16 effective at capturing intricate features from images and a pivotal model in deep learning-based image analysis.

VGG16 Architecture (Source: Internet)
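To make these design principles concrete, here is a small sketch of a single VGG-style block in Keras; the filter counts are placeholders, and the full network simply stacks such blocks with increasing filter counts before the fully connected classifier.

```python
from tensorflow.keras import layers, models

def vgg_block(filters, num_convs):
    """A VGG-style block: num_convs 3x3 convolutions (stride 1, same padding)
    followed by 2x2 max pooling with stride 2."""
    block = models.Sequential()
    for _ in range(num_convs):
        block.add(layers.Conv2D(filters, (3, 3), strides=1,
                                padding="same", activation="relu"))
    block.add(layers.MaxPooling2D(pool_size=(2, 2), strides=2))
    return block
```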

VGG19 (Visual Geometry Group)

VGG19, like its predecessor VGG16, is a deep convolutional neural network designed for image classification tasks. The key difference lies in its depth: VGG19 has 19 weight layers (16 convolutional and 3 fully connected) compared to VGG16's 16. This additional depth allows VGG19 to potentially capture more intricate patterns and features in images, which can improve performance on tasks requiring high-level image understanding. Both models use small 3x3 kernels, a stride of 1, and same padding, so their basic structure is similar. The deeper VGG19, however, requires more computational resources and training time than VGG16, in exchange for a greater capacity to learn hierarchical representations. Overall, while VGG16 is efficient and widely used, VGG19 offers a deeper architecture that can yield better results on complex image classification tasks.

VGG19 Architecture (Source: Internet)

ResNet50 (Residual Networks)

ResNet50 is a widely acclaimed CNN architecture that has significantly advanced the field of deep learning. “ResNet” stands for residual network, a concept introduced to address the challenges of training very deep neural networks. The “50” in ResNet50 denotes its depth, specifically comprising 50 layers.

ResNet50 Architecture (Source: Internet)

Central to ResNet50's innovation are its residual blocks, which incorporate skip connections or shortcuts. These connections enable the network to skip one or more layers, allowing gradients to flow directly during training. This addresses the vanishing gradient problem, a common issue in very deep networks where gradients diminish as they propagate backward, hindering effective learning and causing accuracy to degrade as more layers are added.

Residual Block (Source: Internet)
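The idea is easy to see in code. Below is a minimal sketch of an identity residual block in Keras; it assumes the input already has `filters` channels (otherwise a 1x1 projection would be applied to the shortcut), and it is a simplification of the bottleneck blocks actually used in ResNet50.

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x  # the skip connection carries the input unchanged
    y = layers.Conv2D(filters, (3, 3), padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])  # output = F(x) + x, so gradients flow through the shortcut
    return layers.ReLU()(y)
```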

InceptionV3

InceptionV3 utilizes an Inception module that applies multiple convolutions of varying kernel sizes within the same layer. This allows the network to capture a wide range of features at different scales simultaneously, from fine details to broader patterns in images. By combining 1x1, 3x3, and 5x5 convolutions, among others (with the larger kernels factorized into stacks of smaller ones in later Inception versions), InceptionV3 efficiently learns hierarchical representations, enhancing its capability for accurate image classification tasks.

Inception V3 (Source: Internet)
Stem Block (Source: Internet)
Inception A Block (Source: Internet)
Inception B Block (Source: Internet)
Inception C Block (Source: Internet)
Reduction A Block (Source: Internet)
Reduction B Block (Source: Internet)

The Inception architecture, seen in its various versions (A, B, C) and reduction modules (A, B), optimizes feature extraction by employing diverse convolutional operations within each module. These modules enable the network to capture information at multiple scales and dimensions effectively.
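For intuition, here is a sketch of a naive Inception-style module in Keras: parallel 1x1, 3x3, and 5x5 branches plus a pooling branch, concatenated along the channel axis. The branch widths are placeholders, and the real InceptionV3 blocks additionally factorize the larger convolutions into stacks of smaller ones.

```python
from tensorflow.keras import layers

def inception_module(x, f1, f3, f5, fp):
    branch1 = layers.Conv2D(f1, (1, 1), padding="same", activation="relu")(x)
    branch3 = layers.Conv2D(f3, (3, 3), padding="same", activation="relu")(x)
    branch5 = layers.Conv2D(f5, (5, 5), padding="same", activation="relu")(x)
    pool = layers.MaxPooling2D((3, 3), strides=1, padding="same")(x)
    pool = layers.Conv2D(fp, (1, 1), padding="same", activation="relu")(pool)
    # Each branch sees the same input at a different receptive field size.
    return layers.Concatenate()([branch1, branch3, branch5, pool])
```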

InceptionResNetV2

InceptionResNetV2 integrates the principles of the Inception architecture with the residual connections technique. The network comprises multiple Inception modules, each containing convolutional and pooling layers.

Unlike InceptionV3, InceptionResNetV2 enhances the architecture by replacing the filter concatenation stage with residual connections. This modification enables the network to learn residual features, effectively addressing the challenge of vanishing gradients during training. By incorporating residual connections, InceptionResNetV2 optimizes the learning process and enhances its capability to capture and utilize deep feature representations in tasks such as image classification and object recognition.

InceptionResNetV2 Architecture (Source: Internet)

XceptionNet

XceptionNet, short for Extreme Inception, is a convolutional neural network architecture that emphasizes depthwise separable convolutions. The key innovation of XceptionNet lies in its use of depthwise separable convolutions, which decompose the standard convolution operation into two separate stages: depthwise convolution and pointwise convolution. Depthwise convolution applies a single filter to each input channel separately, while pointwise convolution combines the outputs of the depthwise convolution using 1x1 convolutions across all channels.

This separation of spatial and channel-wise operations significantly reduces the number of parameters and computational complexity compared to traditional convolutional layers.

DepthWise Separable Convolution (Source: Internet)

Let’s consider a standard convolutional layer with the following parameters:

Number of kernels: 256, Kernel size: 3x3, Input size: 8x8x3 (same padding, stride 1, so each kernel is applied at 8x8 = 64 positions)

For standard convolution, the number of multiplications is:

Number of kernels × kernel height × kernel width × input channels × output positions = 256 × 3 × 3 × 3 × (8 × 8) = 442,368

Now, let's calculate the number of multiplications for depthwise separable convolution using the same kernel size:

Depthwise Convolution (3x3):

One 3x3 filter per input channel × output positions = 3 × 3 × 3 × (8 × 8) = 1,728

Pointwise Convolution (1x1):

Number of kernels × kernel size × input channels × output positions = 256 × 1 × 1 × 3 × (8 × 8) = 49,152

Total multiplications for depthwise separable convolution: 1,728 + 49,152 = 50,880

As calculated, depthwise separable convolution reduces the number of multiplications by roughly a factor of 8.7 compared to standard convolution.
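A quick back-of-the-envelope check of these counts (assuming the 8x8x3 input with same padding and stride 1 described above); in Keras this factorization is what `layers.SeparableConv2D` implements:

```python
out_positions = 8 * 8                                   # 8x8 output locations
standard = 256 * (3 * 3 * 3) * out_positions            # 442,368 multiplications
depthwise = 3 * (3 * 3) * out_positions                 # 1,728 (one 3x3 filter per channel)
pointwise = 256 * (1 * 1 * 3) * out_positions           # 49,152
print(standard, depthwise + pointwise)                  # 442368 50880
print(round(standard / (depthwise + pointwise), 1))     # ~8.7x fewer multiplications
```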

EfficientNetV2B0

EfficientNetV2B0 builds on the compound scaling method introduced with EfficientNet, which optimizes the network by scaling depth, width, and resolution simultaneously. This balanced approach enhances both accuracy and efficiency across tasks like image classification. Using coefficients α, β, γ and a scaling factor φ, the model scales each dimension proportionally: depth scaling adds more layers, width scaling increases channels per layer, and resolution scaling enlarges input images. This ensures efficient resource utilization and strong performance, making EfficientNetV2B0 well suited for applications that need both high accuracy and efficiency in deep learning.

Scaling Techniques (Source: Internet)
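As a worked example, the sketch below scales the three dimensions with the coefficients reported in the original EfficientNet paper (α = 1.2, β = 1.1, γ = 1.15, chosen so that α·β²·γ² ≈ 2); the exact coefficients used inside the EfficientNetV2 variants may differ, so treat these values as illustrative.

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # coefficients from the original EfficientNet paper

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for a compound coefficient phi."""
    depth = alpha ** phi        # multiplier on the number of layers
    width = beta ** phi         # multiplier on channels per layer
    resolution = gamma ** phi   # multiplier on the input image size
    return depth, width, resolution

print(compound_scale(0))  # (1.0, 1.0, 1.0): the B0 baseline
print(compound_scale(1))  # (1.2, 1.1, 1.15): one step up, roughly doubling FLOPs
```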

DenseNet121

DenseNet, or Densely Connected Convolutional Networks, stands out among CNN architectures for its highly interconnected structure: within each dense block (Dn), every layer is connected to every subsequent layer, so each layer receives the feature maps of all preceding layers as input. This design promotes robust feature propagation and reuse. Additionally, DenseNet uses bottleneck layers within each dense block to reduce computational overhead: 1x1 convolutions compress the accumulated feature maps before the 3x3 convolutions, optimizing parameter efficiency without compromising feature learning capacity.

Transition blocks (Tn) are strategically placed between dense blocks to manage feature map dimensions and model complexity. These blocks typically include batch normalization, followed by 1x1 convolutions and 2x2 average pooling layers, which collectively downsample and prepare feature maps for the subsequent dense block. This architecture enhances both computational efficiency and model performance across various deep learning tasks.

DenseNet architecture (Source: Internet)
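The following Keras sketch shows one bottleneck layer inside a dense block and one transition block, assuming a growth rate `growth_rate` and a compression factor of 0.5; the exact number of layers per block in DenseNet121 is omitted for brevity.

```python
from tensorflow.keras import layers

def dense_layer(x, growth_rate):
    # Bottleneck: 1x1 conv compresses the accumulated feature maps before the 3x3 conv.
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(4 * growth_rate, (1, 1), padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(growth_rate, (3, 3), padding="same")(y)
    return layers.Concatenate()([x, y])  # dense connectivity: keep all previous features

def transition_block(x, compression=0.5):
    # 1x1 conv reduces the channels, 2x2 average pooling halves the spatial size.
    channels = int(x.shape[-1] * compression)
    y = layers.BatchNormalization()(x)
    y = layers.Conv2D(channels, (1, 1), padding="same")(y)
    return layers.AveragePooling2D((2, 2), strides=2)(y)
```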

MobileNetV2

MobileNetV2 is a lightweight convolutional neural network designed for efficient mobile and embedded vision applications. It enhances both performance and computational efficiency over its predecessor, MobileNet, making it ideal for resource-constrained devices and real-time applications.

MobileNetV2 introduces inverted residual blocks with linear bottlenecks to optimize network architecture:

1. Expansion Layer: a lightweight 1x1 convolution increases the depth of the input features, enhancing their representational capacity.

2. Depthwise Convolution: a 3x3 depthwise convolution then filters each channel separately, drastically reducing computational complexity while preserving feature richness.

3. Linear Bottleneck: a final 1x1 pointwise convolution projects the expanded features back down to a small number of channels; this projection uses a linear activation (hence "linear bottleneck") to avoid losing information in the low-dimensional space.

4. Residual Connection: when the input and output shapes match, a residual connection skips the entire block, facilitating the direct learning of residual features from input to output.

MobileNet Architecture (Source: Internet)
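Putting the four steps together, here is a minimal sketch of an inverted residual block in Keras, assuming stride 1 and an expansion factor of 6; as in the actual MobileNetV2 blocks, the skip connection is only applied when the input and output channel counts match.

```python
from tensorflow.keras import layers

def inverted_residual(x, filters, expansion=6):
    in_channels = x.shape[-1]
    # 1. Expansion: 1x1 conv widens the representation.
    y = layers.Conv2D(expansion * in_channels, (1, 1), padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(max_value=6.0)(y)
    # 2. Depthwise convolution: one 3x3 filter per channel.
    y = layers.DepthwiseConv2D((3, 3), padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(max_value=6.0)(y)
    # 3. Linear bottleneck: 1x1 projection back down, with no non-linearity.
    y = layers.Conv2D(filters, (1, 1), padding="same")(y)
    y = layers.BatchNormalization()(y)
    # 4. Residual connection when shapes match.
    if in_channels == filters:
        y = layers.Add()([x, y])
    return y
```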

Now that we’ve explored various pretrained models and their unique characteristics, it’s time to evaluate their performance in chest X-ray analysis.

Experimental results

The highest accuracy of 90.38% was achieved by VGG16, surpassing expectations. VGG19, with a slightly lower accuracy of 85.9%, likely suffered from overfitting due to its deeper architecture. Although I anticipated strong performance from ResNet50, InceptionV3, EfficientNetV2B0, and XceptionNet, the dataset used here, comprising only 5,863 images, is relatively small. This limited dataset size posed challenges for these more complex models, resulting in lower performance than simpler architectures like VGG16: their capacity may not have been fully supported by the data, which hurt training effectiveness and generalization.

DenseNet121, by contrast, performed well despite having complexity comparable to ResNet50 and InceptionV3. Its dense connectivity likely enabled the model to make effective use of the limited dataset: by maximizing information flow between layers, DenseNet121 improved feature learning and robustness. Additionally, its bottleneck layers reduced the parameter count, enhancing computational efficiency without sacrificing feature richness.

Comparing InceptionV3, ResNet50, and InceptionResNetV2 on the chest X-ray dataset, InceptionResNetV2 stood out even though it builds on the same ideas as the other two. Its notable performance can be attributed to its combination of Inception modules with residual connections: the residual connections facilitate smoother gradient flow during training, which likely helped with the challenges posed by the dataset's limited size.

I hope this analysis provides some insight into pre-trained models for chest X-ray analysis. Thanks for reading! In the next blog, we'll delve into visual attention mechanisms and their impact on model interpretability and performance in medical imaging tasks. Stay tuned to explore these advanced techniques further!

Feel free to connect with me on https://www.linkedin.com/in/swathhy-yaganti/
