Reviving the Classics: Why Convolutional Models Still Shine in the Age of Transformers

Iva @ Tesla Institute
Published in Artificialis
7 min read · May 17, 2024

The rise of Transformer models has taken the machine learning world by storm, overshadowing many other techniques that have proven their worth over time.

While transformers are powerful, it’s important to recognize that they are not the one-size-fits-all solution for every problem.

Convolutional Neural Networks (CNNs), for instance, remain highly effective for tasks involving image and spatial data.

Workflow

We have a scenario where we need to classify medical X-ray images to determine whether a bone is broken or not. In such tasks, convolutional neural networks shine due to their ability to learn spatial hierarchies from images.

TensorFlow Hub provides access to a wide range of pre-trained models in its repository, which can be easily integrated into your projects. Trained on large datasets, these models are optimized for tasks such as image classification, object detection, and natural language processing.

Using pre-trained models from TensorFlow Hub has three main advantages (a loading sketch follows the list):

  1. No cost: you can access powerful machine learning models without investing in the extensive computational resources needed to train them from scratch.
  2. Time savings: training deep learning models can take days or even weeks. With pre-trained models, you can skip this lengthy process and get started on your specific task immediately.
  3. Reliability: pre-trained models are built and fine-tuned by experts, so they typically offer strong performance out of the box.
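To give a feel for how little code this takes, here is a minimal sketch of loading a pre-trained image feature extractor from TensorFlow Hub. The model handle, input size, and classification head are illustrative assumptions, not the exact setup used later in this article:

import tensorflow as tf
import tensorflow_hub as hub

# Load a pre-trained feature extractor; its weights stay frozen.
feature_extractor = hub.KerasLayer(
    "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/5",
    trainable=False,
)

# Wrap it in a Keras model with a new classification head.
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(224, 224, 3)),
    feature_extractor,
    tf.keras.layers.Dense(1, activation="sigmoid"),
])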

Here’s a detailed walkthrough for training a classification model using TensorFlow Hub.

Problem Definition & Dataset

Our goal is to classify X-ray images into two categories: broken and not broken. We’ll use the MURA (musculoskeletal radiographs) dataset, specifically focusing on wrist bone X-rays.

MURA is one of the largest public radiographic image datasets, with images manually labeled as normal or abnormal by board-certified radiologists. This dataset provides a robust foundation for training our model.

Link to dataset:

Transfer Learning Concept

Transfer learning cuts down on the computational resources and time required to train a high-performing model. Instead of training a model from scratch, we take a pre-trained CNN and adapt it to our specific task by freezing the convolutional layers (which have already learned to detect edges, textures, and shapes) and adding new, task-specific layers on top.

This way, the model retains its powerful feature extraction capabilities while being fine-tuned to recognize the specific patterns associated with broken bones.

Pre-trained Model Selection

For this task, we will use the VGG-16 model, a classic among pre-trained models. It is trained on ImageNet, whose 1,000-class training set contains over a million labeled images, making it a robust feature extractor that we can further adapt.
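As a sketch, loading VGG-16 with its ImageNet weights takes only a couple of lines (224×224 is the standard VGG-16 input resolution):

from tensorflow.keras.applications import VGG16

# Load VGG-16 pre-trained on ImageNet, dropping its 1000-class head.
base_model = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base_model.trainable = False  # freeze the convolutional base for now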

Model Architecture

The VGG-16 model consists of 13 convolutional layers followed by 3 fully connected (FC) layers, for a total of 16 weight layers.

Initial Training

In the initial phase, all of the pre-trained layers are frozen, meaning their weights won't be updated. Training focuses on the new layers added on top of the pre-trained base.

The layers added:

The flatten layer converts the 3D feature maps to a 1D vector, followed by a dense, fully connected layer for classification.

The dropout layer helps with regularization by randomly setting input units to 0 during training, which prevents overfitting.

Early stopping monitors the validation loss and stops training if the loss doesn't improve for 10 consecutive epochs (the patience), so the model doesn't train for too long and start to overfit the data.

Since our task involves only two categories (broken and not broken bones), making it a binary classification problem, we'll replace the original output layer with a new one that has a single node activated by the sigmoid function (see the sketch after the note below):

NOTE:

  • The last layer of VGG-16, which classifies into 1000 classes, is removed.
  • The convolutional base is kept frozen to retain the learned weights.
  • A new dense layer with sigmoid activation is added on top of the VGG-16 base for binary classification.
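Putting these pieces together, a minimal sketch of the initial model could look like the following. The size of the dense layer is an assumption; the learning rate, dropout rate, and early-stopping patience match the values discussed in this article:

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Frozen VGG-16 base, repeated here for completeness.
base_model = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base_model.trainable = False

model = models.Sequential([
    base_model,
    layers.Flatten(),                       # 3D feature maps -> 1D vector
    layers.Dense(256, activation="relu"),   # new task-specific layer (size assumed)
    layers.Dropout(0.5),                    # regularization
    layers.Dense(1, activation="sigmoid"),  # single-node binary output
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Early stopping: halt training if validation loss stalls for 10 epochs.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True
)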

The Results and how to improve them

Initial training reached an accuracy of 0.6009 over 20 epochs.

Further improvements can be achieved through hyper-parameter tuning, fine-tuning the model, or using more advanced augmentation techniques.

In this section, we'll dig into the details of each step involved in improving the fracture detection model.

Hyper-parameter Tuning

For our model, we'll focus on two primary hyper-parameters, the learning rate and the dropout rate, in addition to data augmentation.

Learning Rate controls how much the model's weights are adjusted in response to the estimated error each time they are updated. A smaller learning rate means the weights change more slowly, which can lead to more stable convergence at the cost of longer training.

For the initial training, we set the learning rate to 0.001. This value strikes a balance between making progress toward minimizing the loss and keeping the adjustments from being so drastic that convergence suffers.

Dropout Rate is a regularization technique where randomly selected neurons are ignored during training: their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to those nodes on the backward pass.

We use a dropout rate of 0.5, meaning 50% of the neurons are dropped during each training step. This keeps the model from becoming too dependent on particular weights and helps it generalize better to unseen data.

Data Augmentation

Data augmentation is used to increase the diversity of the training set by applying random transformations to the training data.

The model will generalize better by exposing it to a wide variety of scenarios and variations of the input data.

We incorporated several augmentation techniques using ImageDataGenerator (a code sketch follows the list):

  • Random Rotations: Images are rotated randomly within a specified range (e.g., up to 40 degrees), making the model invariant to the orientation of the input images.
  • Width and Height Shifts: Images are shifted horizontally and vertically by a fraction of the total width and height (e.g., 20%). Simulating changes in framing helps the model focus on content rather than position.
  • Shear Transformations: Shearing shifts one part of an image in a different direction from the rest (e.g., by 20%), making the model more robust to distortions.
  • Zoom Transformations: Randomly zooming into the image simulates changes in the distance between the camera and the object, helping the model learn scale-invariant features.
  • Horizontal Flips: Randomly flipping images horizontally so the model can handle left-to-right variations.
  • Fill Mode: Determines how newly created pixels are filled when an image is transformed. We use the 'nearest' fill mode, which fills new pixels with the nearest pixel values from the original image.
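A sketch of this augmentation pipeline with the parameter values mentioned above (the rescaling step is an assumption about the preprocessing):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # assumed: scale pixel values to [0, 1]
    rotation_range=40,       # random rotations up to 40 degrees
    width_shift_range=0.2,   # horizontal shifts up to 20% of width
    height_shift_range=0.2,  # vertical shifts up to 20% of height
    shear_range=0.2,         # shear transformations
    zoom_range=0.2,          # random zooms
    horizontal_flip=True,    # random left-right flips
    fill_mode="nearest",     # fill new pixels with nearest original values
)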

Fine-tuning

After the initial training phase, we unfreeze the top 4 layers of the VGG-16 base model. These layers will now have their weights updated during training.

Fine-tuning is done with a very low learning rate (1e-5), because we don't want to drastically alter the weights of the pre-trained layers, only make subtle adjustments to the model's feature extraction capabilities.

This time we're updating the weights of the top 4 convolutional layers along with the new classification layers, so the model can better learn the specific features and patterns in our X-ray images that indicate broken bones.

We have now applied all three techniques: hyper-parameter tuning, data augmentation, and fine-tuning. Next, we'll wrap the model in a Gradio interface.

Create a Gradio interface

We will load the saved model weights and create a Gradio interface that accepts an X-ray image as input, runs it through the model, and outputs the prediction:

Install Gradio if you haven’t already:

!pip3 install gradio
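A minimal sketch of such an interface; the weights filename, input size, preprocessing, and label mapping are assumptions to adapt to your saved model:

import gradio as gr
import numpy as np
import tensorflow as tf

# Load the trained model (filename is an assumption).
model = tf.keras.models.load_model("fracture_classifier.h5")

def predict(image):
    # Resize and rescale the uploaded X-ray to match the training pipeline.
    img = tf.image.resize(image, (224, 224)) / 255.0
    # Assumes the sigmoid output is the probability of the "broken" class.
    prob = float(model.predict(np.expand_dims(img, axis=0))[0][0])
    return {"broken": prob, "not broken": 1.0 - prob}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Image(type="numpy"),
    outputs=gr.Label(num_top_classes=2),
    title="Wrist X-ray Fracture Classifier",
)
demo.launch()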

We can now host our new and improved model as a Gradio application, where the predicted results are visible instantly:

The results are more appealing with the Gradio interface.

Key Points

  • TensorFlow Hub provides free access to powerful pre-trained models.
  • Using pre-trained models reduces the computational resources and time required for training.
  • Integrating with Gradio unlocks the ability to present and explain the model’s performance interactively.

Conclusion

Despite the rise of Transformer models, Convolutional Neural Networks remain highly effective for tasks involving image and spatial data. Our example of identifying broken bones in X-ray images showcases the strengths of CNNs.

CNNs are designed for image processing, using convolutional layers to capture spatial hierarchies. They also run efficiently on hardware accelerators, which can make them faster and more cost-effective than Transformers for high-resolution images.

Our use of the VGG-16 model with transfer learning, improved through hyper-parameter tuning, data augmentation, and fine-tuning, demonstrates how CNNs can achieve high performance with limited data.

Incorporating Gradio as an interactive interface that lets end users upload X-ray images and get instant feedback improves the model's usability and accessibility.

THE CODE NOTEBOOK:
