Hugging Face Transformer: A Game Changer in Image Classification

Jigisha Barbhaya
Published in The Power of AI
4 min read · Jan 13, 2023

In this blog post, we will delve into the world of cutting-edge image classification using the powerful and efficient Vision Transformer model available through Hugging Face. This pre-trained model is a game changer for image classification tasks and can classify images with mind-blowing accuracy. Fine-tuning it on a specific dataset can give you state-of-the-art results, outperforming traditional methods such as convolutional neural networks.

This post will give you an overview. Be sure to check out the associated guided project, which walks you through implementing this model in Python to classify cartoon characters and beans, making it easy for anyone to start using it in their own projects. We will also explore the architecture of the Vision Transformer and how its performance is evaluated, giving you a full understanding of this powerful tool.

This guided project is a must-read for anyone looking to take their image classification skills to the next level and achieve unparalleled results.

Suppose you have different cartoon images of Mr. Bean, Tom, Jerry, and Mickey Mouse, and you have to classify them with the correct labels, as shown below, with the help of a computer vision model.

In the above scenario, you could use a deep neural network such as a CNN (convolutional neural network) or another classification method, but training one from scratch is time-consuming. So let's try something new: the Vision Transformer model available through Hugging Face. Because it has already been trained on enough data, it performs admirably, outperforming a comparable state-of-the-art CNN while requiring four times less computational power.

So, let's go over the Vision Transformer model in greater detail using another example: bean image classification. Here we are not talking about Mr. Bean, but about actual beans and their leaves.

Beans Image Classification

The quality of beans differs based on the geographic location of their source. Bean quality is conventionally determined by visual inspection, which is subjective, requires considerable effort and time, and is prone to error. This calls for the development of an alternative method that is precise, non-destructive, and objective.

The characteristics of currency are durability, portability, divisibility, uniformity, limited supply, and acceptability; all of these also describe beans. Suppose you are the founder of a crypto company called BeanStock that uses beans to back its crypto tokens. The token has exploded in popularity, so you need different beans for different tokens.

An example of Bean data is shown in the image below.

How does the Hugging Face model work?

The base classes PreTrainedModel, TFPreTrainedModel, and FlaxPreTrainedModel implement the common methods for loading and saving a model, either from a local file or directory, or from a pre-trained model configuration provided by the library (downloaded from Hugging Face's AWS S3 repository); see the short sketch after the list below.

PreTrainedModel and TFPreTrainedModel also implement a few methods which are common among all the models to:

  • resize the input token embeddings when new tokens are added to the vocabulary
  • prune the attention heads of the model.
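Here is a minimal sketch of that loading/saving workflow; the checkpoint name is just an example from the Hub, and the save directory is a hypothetical local path.

from transformers import ViTForImageClassification

# Download the configuration and weights from the Hub (or load them from a local folder).
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224-in21k")

# Save the model's weights and configuration to a local directory...
model.save_pretrained("./my-vit-model")

# ...and reload them later from that same directory.
model = ViTForImageClassification.from_pretrained("./my-vit-model")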

Vision Transformer

The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pre-trained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at resolution 224x224. We can use the Vision Transformer Model for binary classification as well as multi-class image classification.
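For a quick sense of what the ImageNet-fine-tuned checkpoint can do out of the box, here is a minimal sketch using the transformers image-classification pipeline; the image file name is a placeholder you would replace with your own file.

from transformers import pipeline

# ImageNet-fine-tuned ViT checkpoint wrapped in an image-classification pipeline.
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# "cat.jpg" is a placeholder; you can pass a local path, a URL, or a PIL image.
for prediction in classifier("cat.jpg"):
    print(prediction["label"], round(prediction["score"], 3))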

Architecture of the Vision Transformer

The Vision Transformer (ViT) is a vision model based as closely as possible on the Transformer architecture that was originally created for text-based tasks. ViT represents an input image as a sequence of image patches, analogous to the sequence of word embeddings used when applying Transformers to text, and directly predicts the class labels for that image.
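To make the patch idea concrete: with a 224x224 input and 16x16 patches, the image is split into 14 x 14 = 196 patches, and a [CLS] token is prepended, giving a sequence of 197 embeddings. Below is a minimal sketch that checks this; the image file name is a placeholder.

import torch
from PIL import Image
from transformers import ViTFeatureExtractor, ViTModel

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

# "leaf.jpg" is a placeholder; the extractor resizes the image to 224x224 and normalizes it.
image = Image.open("leaf.jpg").convert("RGB")
inputs = feature_extractor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One [CLS] token plus 196 patch embeddings, each of size 768 for the base model.
print(outputs.last_hidden_state.shape)  # torch.Size([1, 197, 768])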

How does it work?

from transformers import ViTFeatureExtractor, ViTForImageClassification

# Feature extractor that resizes and normalizes input images for ViT.
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

# id2label / label2id map class indices to label names, defined from your dataset; num_labels must match the number of classes.
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224-in21k", num_labels=2, id2label=id2label, label2id=label2id)
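Continuing from the snippet above, inference is a single forward pass. This is a minimal sketch: the image path is a placeholder, and the id2label dictionary is whatever you defined from your own dataset.

import torch
from PIL import Image

# "bean.jpg" is a placeholder; use any image from your dataset.
image = Image.open("bean.jpg").convert("RGB")
inputs = feature_extractor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring class index back to its label name.
predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])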

Evaluation of the Model

Here we use a confusion matrix, which is a specific table layout that allows visualisation of the performance of an algorithm.

Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa.


from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# y_true / y_pred: ground-truth and predicted class indices; labels: human-readable class names.
cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(xticks_rotation=45)
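The y_true and y_pred arrays above can come from wherever you run inference. For example, if the model was fine-tuned with the Trainer API, one common pattern looks like the sketch below; trainer and test_ds are assumptions from such a setup, and the "labels" feature name matches the beans dataset.

# Predictions on the held-out split; label_ids holds the ground truth.
outputs = trainer.predict(test_ds)
y_true = outputs.label_ids
y_pred = outputs.predictions.argmax(1)

# Human-readable class names for the confusion-matrix axes.
labels = test_ds.features["labels"].names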

Results

The Vision Transformer models available through Hugging Face are already trained on large amounts of data, so one of the major advantages of using Hugging Face's tools is that you can reduce the training time, resources, and environmental impact compared with creating and training a model from scratch. If you want to create your own classification model using Hugging Face, you can refer to this guided project.
