[Hands-On] Head-based Image Classification with ViT

Hugman Sangkeun Jung
10 min read · Apr 14, 2024

This code is written for educational purposes.

(You can find the Korean version of the post at this link.)

This post is the second tutorial in our series on head-based classification techniques. In the previous post, we examined head-based classification for text.

This post delves into head-based classification for images using the Vision Transformer (ViT). We will start by downloading a pre-trained model with the Hugging Face transformers library and explore how head-based classification is applied on top of it. For this tutorial, we will use a dataset containing images of fruits such as apples and cherries.

What is ViT?

The Vision Transformer (ViT) successfully applies the transformer architecture, widely used in NLP, to image classification. Instead of processing an image through raw pixels or convolutional features, ViT treats the image as a sequence of patches and applies a transformer model to that sequence for classification. If you are familiar with BERT, you can think of ViT as an image version of BERT.
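The patch-sequence idea can be made concrete with a minimal sketch. The code below splits an image into non-overlapping patches and flattens each one, which is the first step before ViT's linear projection and transformer layers. The 224×224 image size and 16×16 patch size are assumptions chosen to match the common ViT-Base configuration, not values taken from this post's dataset.

```python
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches.

    Each row of the result is one patch of patch_size * patch_size * C values,
    mirroring how ViT turns an image into a token sequence.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    patches = (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
        .transpose(0, 2, 1, 3, 4)   # group the two patch-grid axes together
        .reshape(-1, patch_size * patch_size * c)
    )
    return patches

# Dummy 224x224 RGB image (ViT-Base's usual input resolution, assumed here).
image = np.random.rand(224, 224, 3)
patches = image_to_patches(image)
print(patches.shape)  # (196, 768): a 14x14 grid of patches, each 16*16*3 values
```

In the real model, each flattened patch is linearly projected to an embedding, a learnable `[CLS]` token is prepended, and position embeddings are added before the transformer encoder runs, just as BERT does for word tokens.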

What is Head-based Classification?


Hugman Sangkeun Jung is a professor at Chungnam National University, with expertise in AI, machine learning, NLP, and medical decision support.