How to Train a Custom Vision Transformer (ViT) Image Classifier to Help Endoscopists in Less than 5 min
Information: The code is available on the HugsVision repository
Author: Yanis Labrak, Research Intern — Machine Learning in Healthcare @ Zenidoc and Laboratoire Informatique d’Avignon
Our goal is to train an Image Classifier model based on the Transformer architecture to help endoscopists automate the detection of anatomical landmarks, pathological findings, and endoscopic procedures in the gastrointestinal tract.
I will cover the following material and you can jump in wherever you are in the process of creating your classifier model:
- Installing HugsVision
- Downloading the Kvasir V2 Dataset (~2.3 GB) and Loading It
- Choosing an Image Classifier model on HuggingFace
- About Vision Transformer (ViT) Architecture
- Setting Up the Trainer and Starting the Fine-Tuning
- Evaluating the Performance of the Model
- Using HuggingFace to Run Inference on images
- Conclusion & Citations
Installing HugsVision
HugsVision is an open-source, easy-to-use, all-in-one HuggingFace wrapper for computer vision.
The goal is to create a fast, flexible and user-friendly toolkit that can be used to easily develop state-of-the-art computer vision technologies, including systems for Image Classification, Semantic Segmentation, Object Detection, Image Generation, Denoising and much more.
Set up the Anaconda environment:
Anaconda is a good way to reduce compatibility issues between package versions across your projects by providing an isolated Python environment.
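For example, the following creates and activates a dedicated environment (the environment name and Python version are only suggestions):

```bash
# Create an isolated Python environment for the project and activate it
conda create --name hugsvision python=3.8 -y
conda activate hugsvision
```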
Install HugsVision from PyPI:
Installing HugsVision from our PyPI repository gives you a fast way to get the toolkit without worrying about dependency conflicts.
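Inside the activated environment:

```bash
pip install hugsvision
```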
Downloading the Kvasir V2 dataset (~2.3 GB)
Since we are doing supervised learning, we need a dataset to train on.
In our case, we will use an open-source one called the Kvasir Dataset v2, which weighs around 2.3 GB.
The dataset is composed of 8 classes for which we have 1,000 images each, for a total of 8,000 images.
The JPEG images are stored in separate folders named according to the class they belong to.
Each class shows anatomical landmarks, pathological findings, or endoscopic procedures in the gastrointestinal tract.
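A quick sketch of the download and extraction step; the URL below is the one listed on the Simula datasets page at the time of writing, so double-check it there if the link has moved:

```bash
# Download the Kvasir v2 archive (~2.3 GB) and extract it into ./data/
wget https://datasets.simula.no/downloads/kvasir/kvasir-dataset-v2.zip
unzip -q kvasir-dataset-v2.zip -d ./data/
```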
Loading the dataset
Once the dataset has been downloaded and extracted, we can start loading the data.
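Here is a minimal loading sketch following the HugsVision examples; the dataset path and the parameter values are assumptions to adapt to your setup:

```python
from hugsvision.dataio.VisionDataset import VisionDataset

# Build the train/test splits from the folder-per-class layout described above
train, test, id2label, label2id = VisionDataset.fromImageFolder(
    "./data/kvasir-dataset-v2/",  # path to the dataset folder
    test_ratio=0.15,              # share of the images held out for testing
    balanced=True,                # balance the number of images per class in the training set
    augmentation=True,            # enable data augmentation (random contrast changes)
)
```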
The first parameter is the path to the dataset folder.
The second defines the share of the data held out as the test set.
The third balances the number of images per class in the training set.
The last one enables data augmentation, which randomly changes the contrast of the images.
Choose an image classifier model on HuggingFace
The Hugging Face transformers package is a very popular Python library which provides access to the HuggingFace Hub where we can find a lot of pretrained models and pipelines for a variety of tasks in domains such as Natural Language Processing (NLP), Computer Vision (CV) or Automatic Speech Recognition (ASR).
Now we can choose our base model on which we will perform a fine-tuning to make it fit our needs.
Fine-tuning means continuing the training of a generic model that has already been pre-trained on a closely related task (image classification here) using a much larger amount of data.
In many tasks, this approach has shown better results than training a model from scratch using the targeted data.
In our case, ViT has been pre-trained on ImageNet-21k (14 million images, 21,843 classes) and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes).
Using a pre-trained model makes our training phase:
- Faster, because we only train the classification layer and freeze the other ones.
- More effective, thanks to the already trained embeddings.
So, to be sure that the model will be compatible with HugsVision, it needs to be exported in PyTorch and, obviously, to support the image-classification task.
Models matching these criteria are available here.
At the time of writing, I recommend using the following models:
- google/vit-base-patch16-224-in21k
- google/vit-base-patch16-224
- facebook/deit-base-distilled-patch16-224
- microsoft/beit-base-patch16-224
Note: Please specify ignore_mismatched_sizes=True for both the model and the feature_extractor if you aren't using the model used in the fine-tuning example below (google/vit-base-patch16-224-in21k).
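For instance, with a checkpoint whose classification head was trained on the 1,000 ImageNet classes, the flag tells Transformers to re-initialize the mismatched head for our 8 classes instead of raising an error; the checkpoint below is just one of the suggestions above:

```python
from transformers import ViTFeatureExtractor, ViTForImageClassification

checkpoint = "google/vit-base-patch16-224"

model = ViTForImageClassification.from_pretrained(
    checkpoint,
    num_labels=8,                  # our 8 Kvasir classes instead of the 1,000 ImageNet ones
    ignore_mismatched_sizes=True,  # re-initialize the mismatched classification head
)
feature_extractor = ViTFeatureExtractor.from_pretrained(
    checkpoint,
    ignore_mismatched_sizes=True,  # as recommended in the note above
)
```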
About Vision Transformer (ViT) Architecture
The Transformer architecture is based on the concept of self-attention, which allows the model to be more aware of context than previous architectures such as CNNs or LSTMs.
This translated into a huge increase in performance in natural language processing (NLP) after the release of the paper “Attention Is All You Need”, along with a huge bump in the number of parameters.
The release of models such as BERT or GPT-2 made this improvement accessible to everyone, whatever their background.
However, Computer Vision (CV) did not benefit from this improvement for a long time, due to the complexity of applying the attention mechanism to large images. Most previous attempts made the models almost unusable when the image dimensions needed to be relatively large.
This is where Vision Transformers (ViT) come in.
Overall, the ViT architecture is based on BERT’s and treats the classification task as an NLP problem.
It solves the dimension issue by splitting the image into fixed-size patches, which are transformed into 1D embeddings or passed through a CNN to extract patch feature maps.
These embeddings are then fed to the Transformer encoder, accompanied by “positional embeddings” that keep track of the patch order and a “classification token [CLS]” like in BERT.
The resulting sequence of tokens is then processed exactly as it would be in the NLP counterpart.
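To make the patch-and-token view concrete, here is a small sketch using a randomly initialized ViT from the Transformers library (no weights are downloaded), showing that a 224×224 image becomes 196 patches of 16×16 pixels, plus one [CLS] token:

```python
import torch
from transformers import ViTConfig, ViTModel

config = ViTConfig()      # defaults: image_size=224, patch_size=16, hidden_size=768
model = ViTModel(config)  # randomly initialized, only used here to inspect shapes

pixel_values = torch.rand(1, 3, 224, 224)  # one fake RGB image
outputs = model(pixel_values=pixel_values)

print((config.image_size // config.patch_size) ** 2)  # 196 patches of 16x16 pixels
print(outputs.last_hidden_state.shape)                # torch.Size([1, 197, 768]) = 196 patches + [CLS]
```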
Set up the Trainer and start the fine-tuning
So, once the model is chosen, we can build the Trainer and start the fine-tuning.
Note: Import the FeatureExtractor and ForImageClassification classes that match the model you chose.
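Here is a sketch of the Trainer setup following the HugsVision examples; the model name, output directory, number of epochs, and batch size are assumptions to adapt to your hardware, while train, test, id2label, and label2id come from the loading step above:

```python
from transformers import ViTFeatureExtractor, ViTForImageClassification
from hugsvision.nnet.VisionClassifierTrainer import VisionClassifierTrainer

huggingface_model = "google/vit-base-patch16-224-in21k"

trainer = VisionClassifierTrainer(
    model_name="KvasirV2",  # used to name the output sub-folder
    train=train,
    test=test,
    output_dir="./out/",
    max_epochs=10,
    batch_size=32,          # lower it if you run out of GPU memory
    model=ViTForImageClassification.from_pretrained(
        huggingface_model,
        num_labels=len(label2id),
        label2id=label2id,
        id2label=id2label,
        # add ignore_mismatched_sizes=True here if you chose another checkpoint (see the note above)
    ),
    feature_extractor=ViTFeatureExtractor.from_pretrained(huggingface_model),
)
```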
Evaluate the performance of the model
Using the F1-Score metric gives us a better picture of the predictions across all labels and lets us spot anomalies on a specific label.
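In HugsVision this is a single call on the trainer; I'm assuming, as in the HugsVision examples, that it also returns the reference labels and the predictions (check the order of the returned values in your version):

```python
# Prints a per-label classification report and returns the references and the predictions
ref, hyp = trainer.evaluate_f1_score()
```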
The output is a classification report giving the precision, recall, and F1-score of each of the eight classes.
But how is the F1-Score calculated?
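The F1-Score is the harmonic mean of precision and recall, computed for each label: F1 = 2 × (precision × recall) / (precision + recall).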
Draw the confusion matrix:
The F1-Score is a nice way to get an overview of the results, but it isn't enough to deeply understand the cause of the errors.
Errors can be caused by an imbalanced dataset, a lack of data, or even a high similarity between classes.
Knowing which classes get confused with each other can help us understand the decisions or fix the model.
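One way to draw it is sketched below with scikit-learn and matplotlib, assuming ref and hyp from the evaluation step contain the integer class ids and reusing the id2label mapping returned when loading the dataset:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Class names in id order (assumes the keys of id2label are the integer class ids)
labels = [id2label[i] for i in sorted(id2label)]

# Rows are the reference classes, columns the predicted classes
cm = confusion_matrix(ref, hyp)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels).plot(xticks_rotation="vertical")
plt.tight_layout()
plt.savefig("confusion_matrix.png")
```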
Using HuggingFace to Run Inference on images
First, rename the ./out/MODEL_PATH/config.json file in the model output folder to ./out/MODEL_PATH/preprocessor_config.json.
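A sketch of the inference step following the HugsVision examples; the model path keeps the MODEL_PATH placeholder from the renaming step above, and the image path is a placeholder to replace with one of your own test images:

```python
from transformers import ViTFeatureExtractor, ViTForImageClassification
from hugsvision.inference.VisionClassifierInference import VisionClassifierInference

model_path = "./out/MODEL_PATH/"  # folder containing the renamed preprocessor_config.json
img_path = "./data/kvasir-dataset-v2/dyed-lifted-polyps/example.jpg"  # placeholder image path

classifier = VisionClassifierInference(
    feature_extractor=ViTFeatureExtractor.from_pretrained(model_path),
    model=ViTForImageClassification.from_pretrained(model_path),
)

label = classifier.predict(img_path=img_path)
print("Predicted class:", label)
```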
Conclusion
I hope you enjoyed using HugsVision to train your custom Vision Transformer (ViT) Image Classifier!
Big thanks to Prof. Richard Dufour from the University of Nantes (LS2N) for helping me write this article.
If you found this tutorial informative, please consider clapping for it 👏👏👏
If you fall in love with HugsVision, please star ⭐ the project; more tutorials are available on the GitHub page. Don’t hesitate to share your feedback or create issues.
Stay tuned for our next article.
Citations
- KVASIR: A Multi-Class Image Dataset for Computer Aided Gastrointestinal Disease Detection. In Proceedings of the 8th ACM on Multimedia Systems Conference (pp. 164–169). ACM, 2017. Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen.
- ViT (from Google Research, Brain Team) released with the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
- DeiT (from Facebook AI and Sorbonne University) released with the paper Training data-efficient image transformers & distillation through attention by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
- BEiT (from Microsoft Research) released with the paper BEIT: BERT Pre-Training of Image Transformers by Hangbo Bao, Li Dong and Furu Wei.
- HuggingFace’s Transformers: State-of-the-art Natural Language Processing. (2020). Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush.
- F-score. (2021, August 24). In Wikipedia.