Study of Vision Models for Chest X-Ray Analysis

Swathhy Yaganti
5 min read · Jun 19, 2024


This is a series of four blog posts exploring different vision models for chest X-ray analysis:

1. Convolutional Neural Networks (CNNs)

2. Transfer learning and pre-trained models

3. Visual attention models

4. Vision Transformer (ViT)

Vision models, specifically Convolutional Neural Networks (CNNs), play a pivotal role in analysing chest X-rays for medical diagnosis. Chest X-rays are a common and non-invasive imaging technique used to visualize the chest’s internal structures, including the lungs, heart, and bones. These images are essential for diagnosing a variety of conditions such as pneumonia, tuberculosis, and lung cancer. Vision models have revolutionized medical imaging by providing automated, accurate, and efficient analysis of these X-rays. Here, we’ll explore how these models work, focusing on their application to chest X-ray analysis.

Convolutional Neural Networks (CNNs)

What are Convolutional Neural Networks?

Convolutional Neural Networks (CNNs) are deep learning models specialized in processing visual data. An image is a grid-like structure of pixel values, each indicating the intensity of a pixel, and CNNs use convolutional layers to automatically extract hierarchical features from that grid. Loosely inspired by the structure and function of the human visual cortex, CNNs excel at tasks such as image classification, object detection, and semantic segmentation because they capture spatial dependencies and invariant features, and they perform remarkably well in medical image computing.

Basic Components of a CNN:

CNNs consist of several layers, each serving a specific function:

Convolutional Layers: This is the core block of a CNN, where the main mathematical operation called “convolution” takes place. Convolution is the process of sliding a kernel or filter across an image to recognize patterns such as curves, shapes, and edges, which are the features of an image. This kernel or filter is a weight matrix applied to the input image to perform element-wise multiplication followed by summation. The resulting matrix of values is a feature map, which highlights the presence and location of specific features within the image.

We use several filters of the same size on a single image to obtain different features. One filter might recognize curves, while another might detect edges. By using different filters, a CNN gathers various patterns in an image, enabling it to comprehensively understand and represent the visual information. This multi-filter approach allows the CNN to build a rich set of feature maps, each capturing distinct aspects of the image, which are crucial for accurate analysis and interpretation.

Process of Convolution Operation (Image Source: Internet)
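The multiply-and-sum step described above can be sketched in a few lines of NumPy. The image and kernel values below are toy numbers chosen purely for illustration, not taken from any real X-ray:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image`: element-wise multiply, then sum."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1  # output height
    ow = (iw - kw) // stride + 1  # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)  # multiply-and-sum
    return out

image = np.arange(16, dtype=float).reshape(4, 4)     # toy 4x4 "image"
edge_kernel = np.array([[1., -1.], [1., -1.]])       # toy vertical-edge filter
fmap = conv2d(image, edge_kernel)
print(fmap.shape)  # (3, 3) feature map
```

A real CNN learns the kernel weights during training rather than using hand-crafted filters like this one; the sliding mechanics, however, are exactly as shown.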

In addition to filters, the concept of “stride” is important in CNNs. Stride is the number of pixels the filter moves as it slides across the image. A stride of 1 produces detailed, overlapping feature maps, while a larger stride, such as 2, produces smaller, more compressed feature maps. Adjusting the stride controls the spatial resolution of the feature maps and balances detail against computational cost.

Strided Convolution (Image Source: Internet)
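For an unpadded (“valid”) convolution, the output size follows directly from these definitions: out = floor((in − kernel) / stride) + 1. A quick sketch, using 224-pixel inputs and a 3×3 kernel as example numbers:

```python
def conv_output_size(in_size, kernel_size, stride):
    """Spatial size of a valid (unpadded) convolution output."""
    return (in_size - kernel_size) // stride + 1

print(conv_output_size(224, 3, 1))  # 222 — stride 1: nearly full resolution
print(conv_output_size(224, 3, 2))  # 111 — stride 2: roughly halved
```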

After each convolution operation, a ReLU activation function is applied. ReLU helps the model learn non-linear relationships between image features. It is defined as f(x) = max(0, x), which simply zeroes out negative values.
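Applied element-wise to a feature map, that definition looks like this (toy values for illustration):

```python
import numpy as np

fmap = np.array([[-2.0, 3.0],
                 [ 0.5, -1.0]])   # toy feature map with negative entries
relu = np.maximum(0, fmap)        # ReLU: negatives become 0, positives pass through
print(relu)  # [[0.  3. ] [0.5 0. ]]
```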

Pooling Layer: After a convolutional layer, a pooling layer downsamples the feature maps, keeping the key features while reducing spatial size. Common pooling operations include:

  • Max Pooling: Selects the maximum value in each region of the feature map, highlighting the most active features.
  • Average Pooling: Computes the average of the values in each region, smoothing out the representation.
  • Sum Pooling: Adds up the values in each region, providing a cumulative measure of feature strength.
Different Pooling Operations (Source: Internet)
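All three pooling variants can be sketched with one small NumPy helper; the 4×4 feature map below uses made-up values and a 2×2 pooling window:

```python
import numpy as np

def pool2d(fmap, size=2, mode="max"):
    """Pool non-overlapping size x size regions of a 2-D feature map."""
    h, w = fmap.shape
    out = np.zeros((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            region = fmap[i*size:(i+1)*size, j*size:(j+1)*size]
            if mode == "max":
                out[i, j] = region.max()    # most active feature
            elif mode == "avg":
                out[i, j] = region.mean()   # smoothed representation
            else:
                out[i, j] = region.sum()    # cumulative feature strength
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 1., 2.],
                 [7., 2., 9., 1.],
                 [3., 4., 5., 6.]])
print(pool2d(fmap, mode="max"))  # [[6. 4.] [7. 9.]]
print(pool2d(fmap, mode="avg"))  # [[3.75 2.25] [4.   5.25]]
```

Note how each mode halves the spatial size while summarizing each region differently.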

Fully Connected Layer: In this final component of the CNN architecture, the output feature maps are flattened into a 1-D vector. This vector passes through one or more dense layers with ReLU activation, and finally a softmax activation produces the probabilities for the output labels.

Architecture of Convolutional Neural Network (Source: Internet)

The above figure illustrates the architectural overview of a Convolutional Neural Network for chest X-ray analysis.
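As a rough sketch of how such an architecture comes together in code, here is a small Keras model for 224×224×3 chest X-ray classification. The layer counts, filter counts, and kernel sizes are illustrative assumptions on my part, not the exact configurations from the experiments below:

```python
from tensorflow.keras import layers, models

# Illustrative CNN: conv + ReLU + pooling blocks, then flatten,
# dense layers, and a softmax over the two labels (Pneumonia/Normal).
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(2, activation="softmax"),  # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```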

For this study, I utilized a chest X-ray pneumonia dataset from Kaggle comprising 5,863 images of size (224, 224, 3) classified into two categories (Pneumonia/Normal). I experimented with nine different CNN models, as follows:

Model Configurations of Various CNN Architectures

This table presents the detailed configurations of different CNN architectures used in the study, showcasing variations in convolutional layers, filters, kernel sizes, fully connected layers, and additional techniques such as batch normalization and dropout.

Experimental Results of CNN Architectures

This table summarizes the experimental results in terms of accuracy, precision, recall, and F1 score for each configuration of CNN architectures tested in the study. The variations in convolutional layers, filters, kernel sizes, and additional techniques like batch normalization and dropout are reflected in the performance metrics.

Based on these experiments, my observations are as follows:

Depth of the Network (# Conv Layers):

  • Increasing the number of convolutional layers from 1 to 2 generally improved performance. This is seen when comparing Experiments 1 and 3, and Experiments 2 and 5.
  • A substantial improvement was observed with six convolutional layers, indicating that deeper models can learn more complex features, leading to better performance.

Filter Size:

  • Larger filters (5x5) generally provided better performance compared to smaller filters (3x3). This is evident from comparing Experiments 1 and 2, and Experiments 4 and 5.
  • Larger filters might be capturing more contextual information for pneumonia detection in chest X-rays.

Batch Normalization and Dropout:

  • Adding Batch Normalization and Dropout (Experiment 7) improved performance compared to the same architecture without these techniques (Experiment 6).
  • Regularization techniques (Batch Norm and Dropout) help in preventing overfitting and improving model generalization.
  • The combination of Batch Norm and Dropout in deep models (six layers) led to more stable and improved performance, especially in terms of precision and recall.
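To make the Batch Norm + Dropout idea concrete, here is a hedged Keras sketch of one convolutional block with both techniques inserted. The placement (Batch Norm before the activation) and the dropout rate are common defaults, not the exact values used in my experiments:

```python
from tensorflow.keras import layers, models

# One convolutional block with regularization, assuming a 56x56x64
# input from an earlier layer (shape chosen for illustration).
block = models.Sequential([
    layers.Input(shape=(56, 56, 64)),
    layers.Conv2D(64, (3, 3), padding="same"),
    layers.BatchNormalization(),   # normalize activations -> stabler training
    layers.Activation("relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),          # randomly zero units to curb overfitting
])
print(block.output_shape)  # (None, 28, 28, 64)
```

Stacking several such blocks gives the deeper six-layer models discussed above.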

In the next blog, I will explore transfer learning and the application of pre-trained models to chest X-ray analysis. Thank you for reading! I hope you found this helpful.

For collaborative work, please reach out to me at https://www.linkedin.com/in/swathhy-yaganti/
