[Hands-On] Head-based Image Classification with ViT
This code is written for educational purposes.
(You can find the Korean version of the post at this link.)
This post is the second tutorial in our series on head-based classification techniques. In the previous post, we examined head-based classification for text.
This post will delve into head-based classification in images, utilizing the Vision Transformer (ViT). We will start by downloading a pre-trained model using the Hugging Face transformers library and explore how head-based classification is applied. We will use a dataset containing images of fruits such as apples and cherries for this tutorial.
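As a preview of the setup, the sketch below shows how a pre-trained ViT checkpoint can be downloaded with the Hugging Face transformers library. The `google/vit-base-patch16-224` checkpoint name is an assumption for illustration (any ViT classification checkpoint works the same way); the tutorial's own fruit classifier will swap in its own labels later.

```python
# Illustrative sketch: loading a pre-trained ViT from the Hugging Face Hub.
# The checkpoint name "google/vit-base-patch16-224" is an assumed example,
# not necessarily the one used later in this tutorial.
from transformers import ViTForImageClassification, ViTImageProcessor

checkpoint = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(checkpoint)   # resizing/normalization
model = ViTForImageClassification.from_pretrained(checkpoint)  # backbone + head

# The classification head on top of the backbone maps to this many labels:
print(model.config.num_labels)
```

The processor handles image preprocessing (resizing and pixel normalization), while the model bundles the transformer backbone together with the classification head we will focus on.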
What is ViT?
The Vision Transformer (ViT) successfully applies the principles of transformers, widely used in NLP, to image classification. Instead of processing images through individual pixels or convolutional features, ViT treats an image as a sequence of patches and applies a transformer model to that sequence. If you are familiar with BERT, you can think of ViT as an image version of BERT.
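The patch idea can be made concrete with a small NumPy sketch. This is illustrative, not the library's implementation: a 224x224 RGB image cut into 16x16 patches yields 196 flattened vectors of length 768, which play the role that token embeddings play in BERT (before the learned linear projection and the special [CLS] token are added).

```python
import numpy as np

# Illustrative sketch of ViT's patchification step (not the transformers code).
image = np.random.rand(224, 224, 3)  # H x W x C, a dummy RGB image
patch = 16                           # ViT-Base uses 16x16 patches

# Split into non-overlapping 16x16 patches, then flatten each patch.
grid = 224 // patch                  # 14 patches per side
patches = image.reshape(grid, patch, grid, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

print(patches.shape)  # (196, 768): 196 "tokens", each a flattened patch
```

Each of the 196 patch vectors is then linearly projected and fed to the transformer as a token, which is why the BERT analogy holds: a [CLS]-style token prepended to this sequence is what the classification head ultimately reads.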