Should you even bother with pre-trained vision models?

Michał Sokólski
Published in ResponsibleML · Feb 4, 2021

An introduction to representation learning and Visual Task Adaptation Benchmark.

Let’s say you need to solve a computer vision task. A natural choice is to leverage recent developments in deep learning and use some kind of neural network since they seem to handle visual tasks really well. There are two ways you can proceed with neural models:

  1. train your model from scratch, or
  2. use a pre-trained model.

So what should you do?

In a recent paper by Google Research, A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark, 18 different methods of representation learning were tested on a brand-new benchmark, the titular VTAB.

Representation learning is a type of learning where your goal is to learn a useful representation of the data. What do I mean by representation? The best way to describe it is with an example.

Imagine you are training a typical CNN architecture, like ResNet50 or EfficientNet, for a classification task. You could then divide this network into two parts: a backbone and a classification head. The classification head is the last layer, which returns class predictions; the backbone is everything before it. The representation is the output of the backbone, the penultimate layer's activations, before it is passed to the classification head to actually classify the input. This representation hopefully contains a lot of information about the input image, since it is all that a (very simple) classification head has to work with. To sum up, a representation is a kind of embedding: a way to represent input images as vectors of real numbers.
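To make this concrete, here is a minimal PyTorch sketch (my own illustration, not code from the paper) of splitting a ResNet50 into a backbone and a head and reading off the representation:

```python
import torch
import torchvision.models as models

# Load a ResNet50 pre-trained on ImageNet.
model = models.resnet50(pretrained=True)

# The backbone is everything except the final fully connected layer.
backbone = torch.nn.Sequential(*list(model.children())[:-1])

# The classification head is just that last layer.
head = model.fc

# The representation is the backbone's output, flattened into a vector.
x = torch.randn(1, 3, 224, 224)           # a dummy input image
representation = backbone(x).flatten(1)   # shape: (1, 2048)
logits = head(representation)             # shape: (1, 1000)
```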

What I described in the previous paragraph was a representation obtained in the typical supervised learning paradigm, but that is not the only way. It is believed that special kinds of training might result in better representations.

One family of those techniques is called self-supervised learning. A good example is the method introduced in Unsupervised Representation Learning by Predicting Image Rotations. It works in the following way: images in the dataset are randomly rotated by 0, 90, 180, or 270 degrees, and the network is tasked with predicting which rotation was applied to the source image. This boils down to a 4-class classification task where the labels are generated on the fly. Obviously, this task is made up, but then again, what we are after are the representations learned along the way. There are many other representation learning techniques.
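As a rough sketch of how such pretext-task labels can be generated (assuming PyTorch; this is my illustration, not the paper's code):

```python
import torch

def rotation_batch(images: torch.Tensor):
    """Rotate each image in a (N, C, H, W) batch by a random multiple
    of 90 degrees and return the rotation class as the label:
    0 -> 0°, 1 -> 90°, 2 -> 180°, 3 -> 270°."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([
        torch.rot90(img, k=int(k), dims=(1, 2))  # rotate in the H/W plane
        for img, k in zip(images, labels)
    ])
    return rotated, labels  # train a 4-way classifier on these
```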

Rotation-pred, an example of a self-supervised learning algorithm. Source: Unsupervised Representation Learning by Predicting Image Rotations

Now that all the terms are hopefully clear, I can come back to VTAB. As I said before, the paper focuses on 18 different representation learning algorithms and benchmarks them on 19 diverse vision tasks.

First, they obtain different representations by applying each representation learning algorithm to the ImageNet dataset.

Then they fine-tune all representations on all of the vision tasks.
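In PyTorch terms, the fine-tuning step might look roughly like this (a hypothetical sketch; the data loader and class count are my assumptions, not the paper's setup):

```python
import torch
import torchvision.models as models

# Start from a network pre-trained on ImageNet: its weights encode
# the learned representation.
model = models.resnet50(pretrained=True)

# Replace the ImageNet head with one sized for the target task.
num_target_classes = 10  # hypothetical downstream task
model.fc = torch.nn.Linear(model.fc.in_features, num_target_classes)

# Fine-tune: every weight is updated, but training starts from the
# pre-trained representation rather than a random initialization.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

for images, targets in target_task_loader:  # hypothetical DataLoader
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
```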

Finally, they measure how well the fine-tuned representations perform. Here are some of their results:

Mean accuracy across all 19 tasks for each representation learning algorithm (fine-tuned on a 1K-example training set and on the full training set for each task). Source: A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark

As you can see, apart from generative models, training from scratch performed worse than leveraging a pre-trained representation, even if the source dataset was very different from the target task!

VTAB tasks are divided into three main categories: Natural, Specialized, and Structured. Natural tasks involve photos taken by typical cameras. Specialized tasks consist of images from unusual equipment, for example medical imaging or satellite photos. Structured tasks use mostly computer-generated images. Below is a graph comparing results for each type of representation learning algorithm and each task category:

Mean accuracy across each task class (Natural, Specialized and Structured) and each representation learning algorithm type. Source: A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark

Semi-supervised and supervised algorithms tend to perform the best. Keep in mind that all of the models were trained using the ImageNet dataset, which fits into the Natural category, yet there are clear improvements for all task types.

This might suggest that using a pre-trained model is a good idea, even if it was trained on something vastly different.

Here another question might arise: okay, we have compared different representation learning algorithms, but what about particular architectures? There are hundreds of pre-trained models, so which one should you choose? What if ResNets result in great representations, but EfficientNets are rubbish? This is part of ongoing research in our lab, so stay tuned!

If you are interested in other posts about explainable, fair, and responsible ML, follow #ResponsibleML on Medium.
