Exploring Vision Transformers (ViT) with 🤗 Huggingface

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2021)

Affandy Fahrizain
Data Folks Indonesia
Oct 14, 2022


Recently, I worked on a course project where we were asked to review a modern deep learning paper from a recent top-tier conference and run an experiment with our own dataset. So here I am, thrilled to share my exploration with you!


Background

As self-attention-based models like Transformers have become the standard in NLP, researchers have been motivated to adapt attention-based models to computer vision as well. Prior work has taken different directions, such as combining CNNs with self-attention or replacing convolutions entirely; the selected paper belongs to the latter approach.

Applying the attention mechanism to images requires each pixel to attend to every other pixel, which is computationally expensive. Hence, several workarounds have been proposed, such as applying self-attention only in local neighborhoods [2], replacing convolutions entirely with local multi-head dot-product self-attention blocks [3][4][5], or post-processing CNN outputs with self-attention [6][7]. Although these techniques show promising results, they are hard to scale and require complex engineering to run efficiently on hardware accelerators.

In contrast, the Transformer architecture offers greater computational efficiency and scalability, which has made it possible to train huge models with over 100B parameters.

Methods

General architecture of ViT. Taken from the original paper (Dosovitskiy et al., 2021)

The original Transformer treats its input as a sequence, which is a very different approach from CNNs; hence the input image has to be split into fixed-size patches, which are then flattened and linearly projected. Similar to BERT's [CLS] token, a learnable classification token is prepended to the sequence; its final hidden state serves as the image representation and is later fed into the classification head. Finally, to retain positional information, a positional embedding is added to each patch embedding.
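To make the patch arithmetic concrete, here is a tiny illustrative sketch (my own, not the authors' code): a 224x224 RGB image split into 16x16 patches yields (224/16)^2 = 196 patches, each flattened to 16*16*3 = 768 values before the linear projection.

```python
import torch

image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
patch_size = 16

# Cut the image into non-overlapping 16x16 patches and flatten each one.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

print(patches.shape)  # torch.Size([1, 196, 768]) -> 196 patch tokens of dimension 768
```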

The authors designed the model to follow the original Transformer as closely as possible. The proposed model is called the Vision Transformer (ViT).

Experiments

The authors released three variants of ViT: ViT-Base, ViT-Large, and ViT-Huge, which differ in the number of layers, hidden size, MLP size, attention heads, and total parameters. All of them are pretrained on large datasets such as ImageNet, ImageNet-21k, and JFT.

In the original paper, the authors compared ViT with ResNet-based models such as BiT. The results show that ViT outperforms the ResNet-based models while requiring fewer computational resources to pretrain.

The following section is the technical part, where we will use the 🤗 Huggingface implementation of ViT and finetune it on our selected dataset.

🤗 Huggingface in Action

Now, let's get to the interesting part. We will finetune ViT-Base on the Shoe vs Sandal vs Boot dataset, publicly available on Kaggle, and examine its performance.

First, let's load the dataset using 🤗 Datasets.
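A minimal sketch of how this can be done with the 🤗 Datasets imagefolder loader; the local path and the split ratios below are my assumptions, not necessarily the ones from the original notebook.

```python
from datasets import load_dataset

# Load the images from the extracted Kaggle archive; folder names become the labels.
# "shoe-vs-sandal-vs-boot-dataset" is a hypothetical local path.
dataset = load_dataset("imagefolder", data_dir="shoe-vs-sandal-vs-boot-dataset")

# Carve out validation and test splits (illustrative 80/10/10 split).
splits = dataset["train"].train_test_split(test_size=0.2, seed=42)
holdout = splits["test"].train_test_split(test_size=0.5, seed=42)
train_ds, valid_ds, test_ds = splits["train"], holdout["train"], holdout["test"]
```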

Let's examine a few samples from our dataset.
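One way to peek at a few samples, assuming matplotlib is available:

```python
import matplotlib.pyplot as plt

label_names = train_ds.features["label"].names  # e.g. ["Boot", "Sandal", "Shoe"]

# Show four random images with their folder-derived labels.
fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for ax, sample in zip(axes, train_ds.shuffle(seed=0).select(range(4))):
    ax.imshow(sample["image"])
    ax.set_title(label_names[sample["label"]])
    ax.axis("off")
plt.show()
```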

A few of our samples look like this:

Next, recall that the images need to be split into fixed-size patches, flattened, and combined with positional embeddings and the classification token. The patching and embeddings happen inside the model itself; what we have to do beforehand is resize and normalize each image into the tensor format the model expects. For that, we will use the 🤗 Huggingface Feature Extractor module, which takes care of this preprocessing for us!

This Feature Extractor plays the same role as a Tokenizer in NLP. Let's now load the feature extractor that matches the pretrained ViT checkpoint and examine the output of a processed image. Here we will use ViT with patch_size=16, pretrained on the ImageNet-21k dataset at a resolution of 224x224.
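A sketch of loading the matching feature extractor and running a single image through it (the checkpoint name follows the description above):

```python
from transformers import ViTFeatureExtractor

model_name = "google/vit-base-patch16-224-in21k"
feature_extractor = ViTFeatureExtractor.from_pretrained(model_name)

# Resize + normalize one sample image into the tensor format ViT expects.
encoded = feature_extractor(train_ds[0]["image"], return_tensors="pt")
print(encoded["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])
```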

Our extracted features look like this:

Note that our original image has a white background, which is why the extracted features contain a lot of 1. values. Don't worry, that's normal; everything will work fine :)

Let's proceed to the next step: applying the feature extractor to our whole dataset. Normally we could use the .map() function from 🤗 Datasets, but in this case it would be slow and time consuming. Instead, we will use the .with_transform() function, which applies the transformation on the fly!
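A sketch of the on-the-fly preprocessing, following the usual 🤗 pattern for image classification (the helper name transform is mine):

```python
def transform(batch):
    # Runs at access time on each requested batch instead of preprocessing everything up front.
    inputs = feature_extractor([img.convert("RGB") for img in batch["image"]], return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

train_ds = train_ds.with_transform(transform)
valid_ds = valid_ds.with_transform(transform)
test_ds = test_ds.with_transform(transform)
```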

OK, so far so good. Next, let's define our data collator function and evaluation metrics.
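One reasonable sketch of both pieces: a collator that stacks the tensors produced by the transform above, and accuracy plus macro F1 computed with the 🤗 Evaluate library (the original post may compute its metrics slightly differently).

```python
import numpy as np
import torch
import evaluate

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def collate_fn(batch):
    # Stack individual examples into model-ready batch tensors.
    return {
        "pixel_values": torch.stack([x["pixel_values"] for x in batch]),
        "labels": torch.tensor([x["labels"] for x in batch]),
    }

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=1)
    return {
        "accuracy": accuracy_metric.compute(predictions=preds, references=eval_pred.label_ids)["accuracy"],
        "f1": f1_metric.compute(predictions=preds, references=eval_pred.label_ids, average="macro")["f1"],
    }
```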

Now, let's load the model. Remember that our data has 3 labels, and we pass that to the model, so we get a ViT with a classification head of 3 outputs.
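A sketch of loading the checkpoint with a fresh 3-way head; the id2label/label2id mappings are optional but make predictions human-readable (the label order here is an assumption, in practice take it from the dataset features).

```python
from transformers import ViTForImageClassification

label_names = ["Boot", "Sandal", "Shoe"]  # assumed order; use train_ds.features in practice
model = ViTForImageClassification.from_pretrained(
    model_name,
    num_labels=len(label_names),
    id2label={i: name for i, name in enumerate(label_names)},
    label2id={name: i for i, name in enumerate(label_names)},
)
```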

Let's have some fun before we finetune our model! (This step is optional; if you want to jump straight to the fine-tuning step, you can skip this section.)

I am quite interested to see how ViT performs in a zero-shot scenario. In case you are unfamiliar with the term, zero-shot here simply means using the pretrained model to predict our new images without any finetuning. Keep in mind that most pretrained models are trained on large datasets, so in the zero-shot scenario we want to benefit from that large-scale pretraining and let the model recognize features in images it has never seen before. Let's see how it works in code!
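A sketch of this zero-shot evaluation, reusing the collator defined earlier; the batch size and the scikit-learn metric calls are my choices, not necessarily the original ones.

```python
import torch
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

loader = DataLoader(test_ds, batch_size=32, collate_fn=collate_fn)
all_preds, all_labels = [], []
with torch.no_grad():
    for batch in loader:
        # Forward pass with the not-yet-finetuned model; keep only the logits.
        logits = model(pixel_values=batch["pixel_values"].to(device)).logits
        all_preds.extend(logits.argmax(dim=-1).cpu().tolist())
        all_labels.extend(batch["labels"].tolist())

print("Accuracy:", accuracy_score(all_labels, all_preds))
print("F1-Score:", f1_score(all_labels, all_preds, average="macro"))
print(confusion_matrix(all_labels, all_preds))
```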

In short, we put our transformed data into a DataLoader, which applies the transformation on the fly. Then, for every batch, we pass the data to the pretrained model and take only the logits from the model output. Remember that our classification head has 3 outputs, so for each image we get 3 logit scores; we take the index of the maximum one using .argmax(). Finally, we plot the confusion matrix and print the accuracy and F1 score.

ViT confusion matrix in the zero-shot scenario

Surprisingly, we got unsatisfying scores: Accuracy 0.329 and F1-Score 0.307. OK, next let's fine-tune our model for 3 epochs and test the performance again. Here, I used a Kaggle environment to train the model.
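A minimal Trainer setup for 3 epochs; the hyperparameters below are illustrative assumptions, not necessarily the exact ones from the original run.

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./vit-shoe-sandal-boot",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=2e-4,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    remove_unused_columns=False,  # keep the raw "image" column so the on-the-fly transform still works
    report_to="none",             # set to "wandb" or "tensorboard" for a richer logging interface
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
    tokenizer=feature_extractor,  # saved alongside the model checkpoints
)

trainer.train()
```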

The code above trains our model. Note that we use the 🤗 Huggingface Trainer instead of writing our own training loop. Next, let's examine the loss, accuracy, and F1 score for each epoch. You can also set the Trainer parameter report_to to WandB or TensorBoard for a better logging interface. (Honestly, I used wandb for logging here, but for simplicity I skipped the explanation of that part.)

Model performance at each epoch

Impressive, isn't it? Our ViT model already reaches very high performance after the first epoch and stays quite stable afterwards! Finally, let's evaluate again on the test data and then look at the model's predictions for a few test samples.
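A sketch of the final evaluation and a quick look at individual predictions (printed here rather than plotted; the sample count of 6 follows the text):

```python
# Metrics on the held-out test split.
print(trainer.evaluate(test_ds))

# Inspect predictions for a handful of test images.
outputs = trainer.predict(test_ds)
pred_ids = outputs.predictions.argmax(axis=1)
for i in range(6):
    true_name = model.config.id2label[int(outputs.label_ids[i])]
    pred_name = model.config.id2label[int(pred_ids[i])]
    print(f"sample {i}: true={true_name}  predicted={pred_name}")
```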

Here are our prediction scores on the test data. Our finetuned model now performs far better than in the zero-shot scenario, and among the 6 sampled test images, it predicts all of them correctly. Super! ✨

Conclusion

We have reached the end of the article. To recap, we did a quick review of the original Vision Transformer (ViT) paper. We also ran both zero-shot and finetuning scenarios with a pretrained model on the publicly available Kaggle Shoe vs Sandal vs Boot dataset, which contains ~15K images. We saw that ViT's zero-shot performance wasn't great, while after finetuning the performance jumped in the first epoch and stayed stable afterwards.

If you found this article useful, please don't forget to clap and follow me for more Data Science / Machine Learning content. Also, if you spot anything wrong or interesting, feel free to drop it in the comments or reach out to me on Twitter or LinkedIn.

If you are interested in reading more, follow our Medium publication Data Folks Indonesia and don't forget to join Jakarta AI Research on Discord!

The full code is available on my GitHub repository, feel free to check it out 🤗.

NB: If you are looking for a deeper explanation, especially if you want to reproduce the paper yourself, check out this amazing article by Aman Arora.

References

  1. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  2. Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In ICML, 2018.
  3. Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. In ICCV, 2019.
  4. Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jon Shlens. Stand-alone self-attention in vision models. In NeurIPS, 2019.
  5. Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. In CVPR, 2020.
  6. Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In CVPR, 2018.
  7. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
