[Hands-On] Head-based Image Classification with ViT
This code is written for educational purposes.
(You can find the Korean version of the post at this link.)
This post is the second tutorial in our series on head-based classification techniques. In the previous post, we examined head-based classification for text.
This post will delve into head-based classification in images, utilizing the Vision Transformer (ViT). We will start by downloading a pre-trained model using the Hugging Face transformers library and explore how head-based classification is applied. We will use a dataset containing images of fruits such as apples and cherries for this tutorial.
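As a preview of the setup, the sketch below shows how a pre-trained ViT checkpoint can be downloaded with the Hugging Face transformers library. The `google/vit-base-patch16-224` checkpoint name is an assumption for illustration (any ViT classification checkpoint works the same way); the tutorial's own fruit classifier will swap in its own labels later.

```python
# Illustrative sketch: loading a pre-trained ViT from the Hugging Face Hub.
# The checkpoint name "google/vit-base-patch16-224" is an assumed example,
# not necessarily the one used later in this tutorial.
from transformers import ViTForImageClassification, ViTImageProcessor

checkpoint = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(checkpoint)   # resizing/normalization
model = ViTForImageClassification.from_pretrained(checkpoint)  # backbone + head

# The classification head on top of the backbone maps to this many labels:
print(model.config.num_labels)
```

The processor handles image preprocessing (resizing and pixel normalization), while the model bundles the transformer backbone together with the classification head we will focus on.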
What is ViT?
The Vision Transformer (ViT) successfully applies the principles of transformers, widely used in NLP, to image classification. Instead of processing images through individual pixels or convolutional features, ViT treats an image as a sequence of patches and applies a transformer model to that sequence. If you are familiar with BERT, you can think of ViT as an image version of BERT.
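The patch idea can be made concrete with a small NumPy sketch. This is illustrative, not the library's implementation: a 224x224 RGB image cut into 16x16 patches yields 196 flattened vectors of length 768, which play the role that token embeddings play in BERT (before the learned linear projection and the special [CLS] token are added).

```python
import numpy as np

# Illustrative sketch of ViT's patchification step (not the transformers code).
image = np.random.rand(224, 224, 3)  # H x W x C, a dummy RGB image
patch = 16                           # ViT-Base uses 16x16 patches

# Split into non-overlapping 16x16 patches, then flatten each patch.
grid = 224 // patch                  # 14 patches per side
patches = image.reshape(grid, patch, grid, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

print(patches.shape)  # (196, 768): 196 "tokens", each a flattened patch
```

Each of the 196 patch vectors is then linearly projected and fed to the transformer as a token, which is why the BERT analogy holds: a [CLS]-style token prepended to this sequence is what the classification head ultimately reads.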