Comparing ViT and EfficientNet on image classification problems

Roman S · Exness Tech Blog · Oct 3, 2022 · 12 min read

Aiming to create a filter on the quality of renovations for rental property ads, we conducted the following research, comparing two of the most popular architectures for solving computer vision problems.

Abstract

The problem of image classification is one of the most common and well-studied problems in computer vision. In computer vision in general, and in image classification in particular, the concept of a convolutional neural network (CNN) has gained popularity and proved its effectiveness. New architectures for solving computer vision problems are also emerging, for example, transformers (neural networks built on the attention mechanism) and their component parts, such as encoders. The Transformer architecture has become the basis of state-of-the-art solutions for working with unstructured data and sequences in natural language processing and understanding.

Looking at the "Image Classification on ImageNet" leaderboard by PapersWithCode, we can see that the top 10 solutions are based on either the Transformer (encoder-decoder) or the EfficientNet architecture. Although the Transformer architecture significantly surpasses EfficientNet in accuracy, the latter is still actively used due to its lower computing power requirements.

A variation of the Transformer architecture for solving computer vision problems is the architecture of the Vision Transformer (ViT) encoder model. In this article, we will compare EfficientNet and ViT in solving two image classification problems.

Business aims and objectives

When relocating to Cyprus, Exness employees often face the problem of finding an optimal property to rent. To help them with their housing search, we developed a free service that aggregates ads from local rental websites and allows employees to apply various filters to them. One of these filters is based on the quality of renovations. To create it, we conducted the research described below, solving computer vision problems related to image classification.

Literature overview

The task of classifying images by renovation quality is neither new nor unique. The article "Image-based Renovation Progress Inspection with Deep Siamese Networks" describes various approaches to classifying the different stages of the apartment renovation process. Its authors use a convolutional neural network, ResNet, which is also present in the aforementioned leaderboard of neural networks for image classification. It is important to note that the authors separately consider the results of their model for different rooms in apartments. In addition, the authors pay special attention to the computation time of the most effective approaches.

A number of blogs also describe approaches to classifying renovations by style, quality, and other subjective attributes. In particular, CIAN's materials describe an approach to binary classification of repair quality into "bad" or "good" using neural networks of the EfficientNet and ResNet architectures in different sizes. The author concludes that EfficientNet is preferable, as it shows better results than ResNet, and that it is optimal to use a larger version of EfficientNet: EfficientNet_b4. Separately, the author draws attention to the difficulty of preparing a training dataset, given the subjective perception of good and bad renovations. As a result, a set of criteria for "bad" (old) renovations was developed.

Experiments and data collection

To develop our own filter on the quality of renovations, we decided to compare the ViT and EfficientNet_b4 architectures on two tasks: finding and determining the room in an image (since real estate rental ads often include photos of the exterior of the building and its surroundings, as well as rooms that do not characterize the quality of repairs, for example, a balcony), and classifying the renovation quality in each of the selected photos. The comparison of the selected architectures is based on accuracy, performance, and image processing speed.

Data collection

To train deep neural networks of the selected architectures, a large training dataset is necessary. For the problem of finding and determining the room in an image, we used the "Houses dataset", which contains images of a bedroom, a bathroom, a kitchen, and a frontal view of the house for each property. We also used the Places365 dataset (with an image size of 256 x 256 pixels).

From the Places365 dataset, we selected images corresponding to the classes that, according to our observations, may be present in rental ad photos (a detailed list of such classes can be found in Appendix A). We then combined the Houses dataset with the selected images from the Places365 dataset. The total size of the resulting dataset was about 230,000 images, 10% of which were moved to the test dataset.
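
The assembly can be sketched as follows. This is a minimal illustration in PyTorch; the directory paths and the assumption that both sources are organized under one shared class scheme are ours, not details from the original pipeline:

    import torch
    from torch.utils.data import ConcatDataset, random_split
    from torchvision import datasets, transforms

    # Minimal sketch: paths are placeholders, and we assume both image folders
    # are already organized under a single, shared set of class subdirectories.
    tf = transforms.Compose([transforms.Resize((256, 256)), transforms.ToTensor()])
    houses = datasets.ImageFolder("data/houses", transform=tf)
    places = datasets.ImageFolder("data/places365_selected", transform=tf)

    full = ConcatDataset([houses, places])
    n_test = int(0.1 * len(full))  # 10% held out for testing
    train_ds, test_ds = random_split(
        full, [len(full) - n_test, n_test],
        generator=torch.Generator().manual_seed(42),
    )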

Preparing data for classifying images by renovation quality is much more complicated. Annotation of the images by several employees showed a big difference in what people consider "good" or "poor" repair. Attempts to formulate a system of criteria for poor repair led to a complex multi-criteria, tree-like decision-making system whose implementation would largely rely on neural networks that detect objects in images and then classify them. We therefore decided to prepare a dataset by extracting images from real estate rental and sale websites that have filters facilitating the selection of target images, avoiding manual screening.

We extracted photos from several rental websites that provided the following filters:

  • Good repair: expensive 1- or 2-bedroom apartments where the ad author indicated the quality of the repair as "High" ("Designer", "Euro");
  • Poor repair: cheap 1- or 2-bedroom apartments where the ad author did not indicate the quality of the repair as "High" (any condition except "Designer", "Euro").

As a result, we collected a balanced dataset containing about 8,000 unique photos, 10% of which were moved to the test dataset. When building the models, this dataset was additionally cleaned and augmented.

Models

As part of the experiments, we trained EfficientNet and Vision Transformer to solve two tasks:

  1. Classification of images into categories that may be present in the rental or sale ad photos;
  2. Binary classification of images into rooms representing "good" and "bad" repair. An important digression should be made here: building a repair classification model without removing images that do not show a room (for example, the exterior of a house, a view from a window, a corridor, etc.) results in predictions close to random. We cleaned the data accordingly in advance, using the model that solves task (1), as sketched below.
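
A minimal sketch of this pre-filtering step follows. The room class indices match Appendix A, while the loader that yields images together with their file paths and the helper name are our assumptions:

    import torch

    # Room classes from Appendix A whose images are kept for task (2):
    # bathroom, bedroom, kitchen, closet, dining_room, living_room,
    # television_room, utility_room.
    ROOM_CLASSES = {1, 2, 3, 7, 9, 13, 20, 21}

    @torch.no_grad()
    def keep_room_images(room_model, loader):
        """Return paths of images that the task (1) model labels as rooms."""
        kept = []
        for images, paths in loader:  # assumed: batches of (tensors, file paths)
            preds = room_model(images).argmax(dim=1)
            kept += [p for p, c in zip(paths, preds.tolist()) if c in ROOM_CLASSES]
        return kept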

For each task, we assembled a separate dataset in accordance with the approach described earlier. For task (1), we did not use data augmentation. For task (2), we performed augmentation at training time using crops, vertical reflections, and minor changes in the brightness and saturation of the image. A more detailed description of each task, the data preparation, and the models is given in Appendix A.

EfficientNet is a family of convolutional neural networks for image classification. As part of our research, we used a pre-trained extended version of EfficientNet called EfficientNet_widese_b4. This version of the architecture is characterized by its larger size and, consequently, a larger number of trainable parameters compared to the basic version.

Vision Transformer (ViT) is a relatively new architecture for solving computer vision problems, based on the architecture of the transformer encoder. The main distinguishing features of this model are the division of an image into disjoint patches, the use of positional embeddings to reflect the sequence of patches in the image, and the use of the attention mechanism. To solve the tasks under consideration, we used the pre-trained ViT base architecture.
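
Both pre-trained backbones are publicly available. A loading sketch follows; the specific sources (NVIDIA's torch.hub catalog and timm) are our assumptions, since the article does not state which hubs were used:

    import torch
    import timm

    # EfficientNet_widese_b4, pre-trained, from NVIDIA's torch.hub catalog
    effnet = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                            "nvidia_efficientnet_widese_b4", pretrained=True)

    # ViT base, pre-trained, from timm; 35 output classes for task (1)
    vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=35)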

Model Overview

As the optimizer for EfficientNet, we used Adam with a learning rate of 1e-4. To optimize ViT, we used the SAM optimizer in its PyTorch implementation with a learning rate of 5e-3.
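
SAM differs from ordinary optimizers in that it performs two forward/backward passes per update. A sketch of both setups, assuming the widely used community PyTorch port of SAM (github.com/davda54/sam); the SGD base optimizer and the momentum value are our assumptions:

    import torch
    from sam import SAM  # community PyTorch implementation of SAM

    # Adam for EfficientNet, as described above
    effnet_opt = torch.optim.Adam(effnet.parameters(), lr=1e-4)

    # SAM for ViT: it wraps a base optimizer and updates in two passes
    vit_opt = SAM(vit.parameters(), torch.optim.SGD, lr=5e-3, momentum=0.9)

    def sam_step(model, criterion, x, y):
        criterion(model(x), y).backward()
        vit_opt.first_step(zero_grad=True)   # ascend to the nearby "sharp" weights
        criterion(model(x), y).backward()
        vit_opt.second_step(zero_grad=True)  # update the original weights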

The tables below show the metric characterizing the learning process and the learning outcomes of each model on each task under consideration. Training was performed on graphics accelerators with a batch size of 4, and we used accuracy on the test dataset as the quality metric for all models. Training continued until the model converged on the test dataset, but for no more than 5 epochs.

Task (1). Cross Entropy loss
Task (2). Binary Cross Entropy loss with 0.5 threshold

The tables above show that EfficientNet achieved accuracy comparable to ViT on task (1), while on task (2) ViT significantly surpassed EfficientNet. Such a difference on task (2) may be because EfficientNet cannot capture local features, focusing mainly on global ones, while ViT's division of the image into patches and its attention mechanism allow both global and local features to be extracted simultaneously. As a result, EfficientNet is less robust to the noise introduced by data augmentation.

The result is not unexpected, since ViT is a more modern and significantly more complex architecture than EfficientNet. The obvious disadvantage of ViT is resource-intensive, long calculations. At the same time, although the EfficientNet architecture we use contains 33 million parameters while ViT contains more than 80 million, their inference times turned out to be comparable: one epoch of validation of ViT and EfficientNet on the same dataset took 14 and 14.5 minutes, respectively.

Two-headed ViT

Using the transformer encoder architecture in ViT makes it possible to further improve the model's inference performance through architecture engineering, in addition to classical quantization methods. Transformers are known for being able to solve several tasks simultaneously using a single model body with different heads. Even though models trained for individual tasks have somewhat higher accuracy, a single model body still achieves sufficiently high accuracy while saving resources. One of the most striking examples of using transformers to solve several tasks simultaneously is described in the article on the T5 transformer model.

As part of our research, we decided to build a single encoder-based model to solve both of our image classification tasks, (1) and (2): a two-headed ViT that has one body and two heads. Schematically, the corresponding architecture can be represented as follows:

Two-headed ViT architecture
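
A minimal sketch of this architecture in PyTorch; the class and attribute names, the hidden layer size, and the timm backbone are our assumptions rather than the exact implementation:

    import torch
    import torch.nn as nn
    import timm

    class TwoHeadedViT(nn.Module):
        """One shared ViT body, two task-specific heads."""

        def __init__(self, n_room_classes=35, hidden=512):
            super().__init__()
            # num_classes=0 makes timm return pooled features instead of logits
            self.body = timm.create_model("vit_base_patch16_224",
                                          pretrained=True, num_classes=0)
            dim = self.body.num_features  # 768 for ViT base
            # each head: two Linear layers with ReLU after the first one
            self.room_head = nn.Sequential(
                nn.Linear(dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_room_classes))
            self.repair_head = nn.Sequential(
                nn.Linear(dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1))

        def forward(self, x):
            features = self.body(x)  # shared representation
            return self.room_head(features), self.repair_head(features)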

Such a model shares weights in the ViT body, while the heads are trained separately, depending on the task being solved for the objects of each individual mini-batch at the training stage. It is important to note that the two tasks use different, non-overlapping datasets. To save training time, the experiment with this model relied on un-augmented data for both task (1) and task (2).

We built the learning process in such a way that we first thoroughly pre-trained the presented two-headed architecture; after that, the weights of the ViT body were frozen and each head was fine-tuned separately for its task, which gave an additional slight increase in accuracy on each task. SAM with a learning rate of 0.003 was used as the optimizer, together with a step learning rate scheduler with a step of two epochs and a gamma of 0.1.
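
The second stage can be sketched as follows, reusing the SAM optimizer and the attribute names from the sketches above; train_one_epoch is a hypothetical helper that performs the two-pass SAM step shown earlier:

    # Freeze the shared ViT body; only the head parameters remain trainable.
    for p in model.body.parameters():
        p.requires_grad = False

    head_params = [p for p in model.parameters() if p.requires_grad]
    optimizer = SAM(head_params, torch.optim.SGD, lr=3e-3, momentum=0.9)
    # Step LR scheduler: step of two epochs, gamma 0.1, as described above.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

    for epoch in range(4):  # a few epochs; the exact number is not stated
        train_one_epoch(model, optimizer)  # hypothetical helper, see SAM step above
        scheduler.step()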

We did not use a task prefix in the model input, as is done in the T5 model; nor did we change the embedding of the CLS token to indicate which problem is being solved, as could be done in a multitask model. Thus, we got two answers in one run of calculations: which class the image belongs to, and whether the image shows a good or a bad repair.
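
In use, a single forward pass therefore looks like this (names follow the sketch above; the 0.5 threshold matches the tables):

    # One forward pass yields both answers.
    room_logits, repair_logit = model(images)
    room_class = room_logits.argmax(dim=1)           # task (1): room/scene class
    good_repair = torch.sigmoid(repair_logit) > 0.5  # task (2): repair quality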

The architecture of the heads also differs from the default image classification head used in the model. While the default head contains only one Linear layer, each of the heads in our model has two Linear layers, with ReLU activation applied to the result of the first one. The accuracy we achieved with two-layer heads is comparable to that of separately trained models.

We conducted additional experiments to ensure a fairer comparison between our two-headed ViT model, which was trained without augmentation, and EfficientNet: we measured the accuracy of both models trained without data augmentation on the augmented test dataset (train dataset without augmentation, augmented test dataset). A summary of the results is shown in the table below:

Summary of the results

A dash in the table means that the corresponding experiment was not carried out. Time measurements, the number of parameters, and the size of the model are given for each model under consideration. It is important to note that the two-headed ViT model, unlike the other models listed above, solves both tasks simultaneously in one run.

As a result, we tried several approaches with different models, and the final two-headed ViT model best fits our accuracy and performance requirements and business needs.

Business application

After selecting the optimal threshold for the binary classification task, we conducted additional testing: we deployed a web API with the resulting model for users to test. Below are examples of the classification of several photos of the interior spaces of an apartment:

Class label — 1 (Bathroom), renovation quality — 0.67 (with effective threshold = 0.48, where 1 is good renovation)
Class label — 1 (Bathroom), renovation quality — 0.12 (with effective threshold = 0.48, where 1 is good renovation)
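
The article does not specify the serving stack; below is a minimal sketch of such an endpoint, assuming FastAPI and reusing the hypothetical preprocess transform and two-headed model from the earlier sketches:

    import io

    import torch
    from fastapi import FastAPI, UploadFile
    from PIL import Image

    app = FastAPI()

    @app.post("/score")
    async def score(file: UploadFile):
        img = Image.open(io.BytesIO(await file.read())).convert("RGB")
        x = preprocess(img).unsqueeze(0)  # hypothetical transform pipeline
        with torch.no_grad():
            room_logits, repair_logit = model(x)
        quality = torch.sigmoid(repair_logit).item()
        return {
            "class": int(room_logits.argmax()),
            "renovation_quality": round(quality, 2),
            "good_renovation": quality > 0.48,  # effective threshold from above
        }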

Conclusion

The research we conducted was part of the creation of a renovation quality filter for our real estate rental service. The filter will allow our employees to search for an apartment of a certain renovation quality. The backend of the filter is based on scoring images from different real estate ads with computer vision models.

During our research, we trained several models with different architectures to compare their accuracy and other performance metrics. After comparing the "vanilla" models, we created a custom multi-head version of ViT to achieve both high accuracy and fast computation.

As a result, we found that the two-headed ViT model achieves the best balance of resources, calculation time, and accuracy.

Appendix A. Detailed description of the tasks

Task (1) was to classify images into classes that can be found in real estate ad photos. A general list of such classes:

'apartment_building/outdoor': 0, 'bathroom': 1, 'bedroom': 2, 'kitchen': 3,
'balcony/interior': 4, 'corridor': 5, 'clean_room': 6, 'closet': 7,
'childs_room': 8, 'dining_room': 9, 'dressing_room': 10, 'elevator_lobby': 11,
'entrance_hall': 12, 'living_room': 13, 'lobby': 14, 'mezzanine': 15,
'parking_lot': 16, 'parking_garage/outdoor': 17, 'parking_garage/indoor': 18,
'storage_room': 19, 'television_room': 20, 'utility_room': 21,
'botanical_garden': 22, 'downtown': 23, 'driveway': 24, 'forest_road': 25,
'forest_path': 26, 'fountain': 27, 'ocean': 28, 'park': 29, 'picnic_area': 30,
'street': 31, 'swimming_pool/outdoor': 32, 'swimming_pool/indoor': 33,
'village': 34

For a single separate ViT trained in accordance with the described approach, the accuracy on the test dataset was about 80%.

The noisiest part of the confusion matrix for some presented classes on the test dataset
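
For reference, a confusion matrix like the one above can be computed as in the following sketch; model and test_loader are assumed from the earlier sketches:

    import torch
    from sklearn.metrics import confusion_matrix

    y_true, y_pred = [], []
    with torch.no_grad():
        for x, y in test_loader:
            y_pred += model(x).argmax(dim=1).tolist()
            y_true += y.tolist()
    cm = confusion_matrix(y_true, y_pred)  # rows: true classes, columns: predictions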

Task (2) was to classify bad and good repairs, which is not possible if the dataset contains images corresponding to the "street" or "park" classes. Therefore, we selected images corresponding to the following classes:

'bathroom': 1, 'closet': 7, 'bedroom': 2, 'living_room': 13,
'television_room': 20, 'kitchen': 3, 'utility_room': 21, 'dining_room': 9

The resulting dataset size was about 6,000 images, 10% of which were selected for the test. Since the amount of data for this task is relatively small, augmentation is of particular importance. For augmentation, we used TenCrop with 3/4 image proportions, RandomPerspective, RandomRotation, and ColorJitter from Torchvision. We applied augmentation separately to the train and test datasets, so augmented images from the train dataset could not leak into the test set. As a result of training with data augmentation, ViT achieved an accuracy of about 98% with a threshold of 0.5.
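
These transforms can be combined as in the sketch below; the parameter values are our assumptions, including the (192, 256) crop size standing in for the 3/4 image proportions and vertical_flip=True reflecting the vertical reflections mentioned earlier:

    import torch
    import torchvision.transforms as T

    augment = T.Compose([
        T.RandomPerspective(distortion_scale=0.2, p=0.5),
        T.RandomRotation(degrees=10),
        T.ColorJitter(brightness=0.2, saturation=0.2),
        # TenCrop returns a tuple of 10 crops (corners + center, plus flips)
        T.TenCrop((192, 256), vertical_flip=True),
        T.Lambda(lambda crops: torch.stack([T.ToTensor()(c) for c in crops])),
    ])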

At the same time, the classification accuracy of EfficientNet on the same dataset was only 63%. We then trained EfficientNet on the same dataset but without augmentation; in this case, EfficientNet achieved 96% accuracy.

Such a difference in the behavior of EfficientNet on augmented and non-augmented data may be explained by the insufficient capacity of the model to identify local features. EfficientNet can identify some global patterns, for example, the presence of large objects and particular colors that characterize poor repair, but augmentation distorts such global patterns. Since the data collection process selected examples of obviously good and obviously bad repair, we assumed that it is precisely augmentation, together with the model's attention to local features, that would allow the model to classify objects lying between the obviously bad and obviously good examples.
