Rethinking Pre-training and Self-training

Aakash Nain
Jun 28, 2020 · 10 min read


In late 2018, researchers at FAIR published the paper Rethinking ImageNet Pre-training, which was subsequently presented at ICCV 2019. The paper presented some very interesting results regarding pre-training. I didn’t write a post about it then, but we had a long discussion on it in our KaggleNoobs Slack. Researchers at Google Research, Brain Team have now come up with an extended version of the same concept. This new paper not only talks about pre-training but also investigates self-training, and how it compares to pre-training and self-supervised learning on the same set of tasks.

Introduction

Before we dive into the details presented in the paper, let’s take a step back and discuss a few terms first. Pre-training is a very common practice across domains like computer vision, NLP, and speech. In computer vision, we expect a model pre-trained on one dataset to help with another task or dataset. For example, supervised ImageNet pre-training is a widely used initialization for object detection and segmentation models. Transfer learning and fine-tuning are the two common techniques for putting a pre-trained model to work.
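To make the fine-tuning recipe concrete, here is a minimal Keras sketch of the usual transfer-learning workflow. The EfficientNet-B0 backbone, the 224 x 224 input, and the 10-class head are placeholders I chose for brevity, not the setup used in the paper.

```python
import tensorflow as tf

# Load an ImageNet pre-trained backbone and freeze its weights.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", pooling="avg"
)
backbone.trainable = False

# Attach a new head for the downstream task and train only that head first;
# the backbone can later be unfrozen for full fine-tuning at a small learning rate.
inputs = tf.keras.Input(shape=(224, 224, 3))
features = backbone(inputs, training=False)
outputs = tf.keras.layers.Dense(10, activation="softmax")(features)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(target_task_dataset, epochs=...)  # fine-tune on the target data
```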

Self-training, on the other hand, tries to improve the model’s performance by incorporating the model’s own predictions on unlabeled data as additional information during training. For example, we can use ImageNet to improve a COCO object detection model. The model is first trained on the COCO dataset. It is then used to generate pseudo-labels for ImageNet (the original ImageNet labels are discarded). The pseudo-labeled ImageNet data and the labeled COCO data are then combined to train a new model.

Self-supervised learning is another popular pre-training technique, one that doesn’t use labels at all. The aim isn’t just to learn high-level features for a single task. Instead, we want the model to learn better, more robust universal representations that work across a wider variety of tasks and datasets.

Enough chit-chat! Fancy definitions aside, what does any of this have to do with the paper? Or are we here just to learn definitions?

Motivation

We have been using these techniques for a long time. The authors set out to answer the following questions:

  1. To what extent does pre-training help? When is pre-training not useful?
  2. Can we use self-training instead of pre-training and get similar or better results compared to pre-training and self-supervised learning?
  3. If self-training is superior to pre-training (assuming for now that it is), to what degree is it better?
  4. In which scenarios is self-training better than pre-training?
  5. How flexible and scalable is self-training?

Setup

Datasets and models

  1. Object detection: The authors used the COCO dataset (118K images) as the labeled data for supervised learning. ImageNet (1.2M images) and OpenImages (1.7M images) were used as the unlabeled datasets. A RetinaNet detector with EfficientNet-B7 as the backbone was used. The image resolution was kept at 640 x 640, with pyramid levels P3-P7 and 9 anchors per pixel (the setup is summarized in the sketch after this list).
  2. Semantic segmentation: The PASCAL VOC 2012 segmentation train set (1.5K images) was used for supervised learning. For self-training, the authors used the augmented PASCAL dataset (9K images), COCO (240K labeled as well as unlabeled images), and ImageNet (1.2M images). A NAS-FPN model with EfficientNet-B7 and EfficientNet-L2 as backbones was used.
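For reference, here is the detection setup from item 1 collected into a plain Python dictionary. The key names are mine, purely for illustration, and are not the authors’ configuration format.

```python
# Hypothetical summary of the object detection setup; key names are illustrative.
detection_setup = {
    "detector": "RetinaNet",
    "backbone": "EfficientNet-B7",
    "image_size": (640, 640),
    "pyramid_levels": ["P3", "P4", "P5", "P6", "P7"],
    "anchors_per_pixel": 9,
    "labeled_data": "COCO (118K images)",
    "unlabeled_data": ["ImageNet (1.2M images)", "OpenImages (1.7M images)"],
}
```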

For more details like batch size, learning rate, etc, please refer to section 3.2 in the paper.

Data Augmentation

Four different augmentation policies of increasing strength were used across all the experiments for both detection and segmentation. These four policies, in increasing order of strength, are:

  1. Augment-S1: The standard flip-and-crop policy, consisting of horizontal flips and scale jittering. The jittering operation resizes an image to between 0.8 and 1.2 of the target image size and then crops it.
  2. Augment-S2: AutoAugment combined with flips and crops.
  3. Augment-S3: Large scale jittering, AutoAugment, and flips and crops. The jittering range is increased to (0.5, 2.0). A rough sketch of the jittering operation is shown after this list.
  4. Augment-S4: A combination of RandAugment, flips and crops, and large scale jittering. The jittering range here is the same as in Augment-S3.
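The sketch below shows how scale jittering plus crop and flip might be implemented with TensorFlow image ops. It is my own assumed implementation for the image only (for detection, boxes and masks would need the same transform), not the authors’ code.

```python
import tensorflow as tf

def random_scale_jitter(image, target_size=640, min_scale=0.5, max_scale=2.0):
    """Resize by a random factor in [min_scale, max_scale], then crop/pad back."""
    scale = tf.random.uniform([], min_scale, max_scale)
    new_size = tf.cast(scale * target_size, tf.int32)
    image = tf.image.resize(image, [new_size, new_size])
    # Pad with zeros if the image shrank below the target size ...
    padded_size = tf.maximum(new_size, target_size)
    image = tf.image.pad_to_bounding_box(image, 0, 0, padded_size, padded_size)
    # ... then take a random crop of the target size and apply a horizontal flip.
    image = tf.image.random_crop(image, [target_size, target_size, 3])
    return tf.image.random_flip_left_right(image)
```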

Pre-training

For studying the effectiveness of pre-training, ImageNet pre-trained checkpoints were used. EfficientNet-B7 was the architecture used for evaluation, and two different checkpoints of this model were compared. These are denoted as:

  1. ImageNet: EfficientNet-B7 checkpoint trained with AutoAugment that achieves 84.5% top-1 accuracy on ImageNet.
  2. ImageNet++: EfficientNet-B7 checkpoint trained with the Noisy Student method which utilizes an additional 300M unlabeled images and achieves 86.9% top-1 accuracy.

Training from a random initialization is denoted by Rand Init.

Self-training

The self-training implementation is based on Noisy Student and has three steps (a toy sketch follows the list):

  1. A teacher model is trained on the labeled data, e.g. COCO dataset.
  2. The teacher model is then used to generate pseudo labels on the unlabeled data, e.g. ImageNet.
  3. A student model is trained to optimize the loss on human labels and pseudo labels jointly.
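Here is a toy, runnable illustration of those three steps, with scikit-learn classifiers standing in for the detector. The real setup uses RetinaNet on COCO/ImageNet, strong augmentation as noise on the student, and a loss-normalization scheme to balance human and pseudo labels, none of which is shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_labeled, y_labeled = X[:500], y[:500]   # stands in for labeled COCO
X_unlabeled = X[500:]                     # stands in for ImageNet, labels discarded

# 1. Train a teacher model on the labeled data.
teacher = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# 2. Use the teacher to generate pseudo labels on the unlabeled data.
pseudo_labels = teacher.predict(X_unlabeled)

# 3. Train a student jointly on human labels and pseudo labels.
X_joint = np.concatenate([X_labeled, X_unlabeled])
y_joint = np.concatenate([y_labeled, pseudo_labels])
student = LogisticRegression(max_iter=1000).fit(X_joint, y_joint)
```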

Can we please look into some experiments now for God’s sake?

Experiments

Effects of augmentation and labeled dataset size on pre-training

The authors used ImageNet for supervised pre-training and varied the size of the labeled COCO dataset to study the effects of pre-training. Not only is the size of the labeled data varied, but augmentations of different strengths are also used while training RetinaNet with an EfficientNet-B7 backbone. The authors observed the following:

  1. Pre-training hurts performance when stronger data augmentation is used: The authors noticed that with standard augmentation (Augment-S1, as described above), pre-training helps. But as they increased the strength of the augmentation, pre-training helped less and less. In fact, they observed that with the strongest augmentation (Augment-S3), pre-training actually hurts performance by a large amount.
  2. More labeled data diminishes the value of pre-training: This isn’t a new finding. We all know that pre-training helps in the low-data regime, but with enough labeled data, training from scratch performs just as well. The authors found the same, and this finding is consistent with the FAIR paper.

The point regarding stronger augmentation and drop in performance is a pretty interesting finding. What do you think? Why is this happening?

My take: Most of the models trained on ImageNet don’t use such heavy augmentation. When you add heavy augmentations, the model may not converge properly. In fact, models can sometimes overfit a bit to certain augmentations, though this needs a proper detailed study.

Effects of augmentation and labeled dataset size on self-training

Now that we have seen the effects of pre-training, it’s time to check the results with the same task of interest (COCO object detection in this case), with the same model (RetinaNet detector with EfficientNet-B7 backbone) but this time with self-training. The authors used the ImageNet dataset for self-training (the labels for ImageNet are discarded in this case). The authors observed the following:

  1. Self-training helps in high data/strong augmentation regimes, even when pre-training hurts: The authors found that when self-training is added to a randomly initialized model and heavy augmentation is used, it not only boosts the baseline results but also surpasses the results achieved with pre-training (the exact numbers are in the results table in the paper).

  2. Self-training works across dataset sizes and is additive to pre-training: Another interesting aspect the authors found is that self-training is additive to pre-training. Simply put, using self-training with either a randomly initialized model or a pre-trained model always boosts performance, and the gain is consistent across different data regimes.

Wait a sec! When ImageNet++ init is used, the gain is small compared to the gain in Rand init and ImageNet init. Any specific reason?

Yes. The ImageNet++ init comes from the Noisy Student checkpoint, which was already trained with an additional 300M unlabeled images, so a good part of the benefit that self-training would otherwise provide has already been baked into those weights.

Self-supervised pre-training vs self-training

We saw that supervised ImageNet pre-training hurts performance in the highest data regime with strong data augmentation. But what about self-supervised pre-training? The primary goal of self-supervised learning, pre-training without labels, is to build universal representations that transfer to a wider variety of tasks and datasets.

Hold on a sec! Let me guess this one. Because self-supervised learning learns better representations, it should be at least on par with self-training, if not better.

Sorry to disappoint you, but the answer is NO. To investigate the effects of self-supervised learning, the authors used the full COCO dataset and the strongest augmentation. The goal was to compare random initialization against a model pre-trained with a state-of-the-art self-supervised algorithm. The SimCLR checkpoint, taken before it was fine-tuned on ImageNet, was used in this experiment. Because SimCLR uses ResNet-50, the backbone of the RetinaNet detector was replaced with ResNet-50. Here is what they found:

Even in this case, we observe that self-supervised pre-training hurts performance but self-training still boosts it.

What did we learn?

Pre-training and universal feature representations

We saw that pre-training (supervised as well as self-supervised) doesn’t always lead to better performance; in these experiments it consistently underperforms self-training. Why is that? Why isn’t ImageNet pre-training that helpful for COCO object detection? Why did the representations learned via self-supervised pre-training fail to provide a performance boost?

In my opinion, most computer vision researchers already have this intuition, which is again pointed out by the authors: pre-training isn’t aware of the task of interest and can fail to adapt.

Think of ImageNet: classification is a much easier problem than object detection. Does a network pre-trained for the classification task learn all the information needed for the localization task? Here is how I like to put it: different tasks need different levels of granularity, even if one task is a subset of the other.

Joint-training

As the authors point out, one of the strengths of the self-training paradigm is that it jointly trains on the supervised and self-training objectives, thereby addressing the mismatch between them. One can then argue: instead of looking for another technique to address the mismatch between tasks, why not simply train on both tasks jointly, e.g. train on ImageNet and COCO at the same time?

The authors used the same setup as in self-training for this experiment and found that ImageNet pre-training yields a +2.6AP improvement, but using random initialization and joint-training gives a bigger gain of +2.9AP. Moreover, pre-training, joint-training, and self-training are all additive. Using the same ImageNet data source, ImageNet pre-training gets a +2.6AP improvement, pre-training + joint-training gets a further +0.7AP, and pre-training + joint-training + self-training achieves a further +3.3AP improvement.
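A heavily simplified sketch of what joint-training looks like in code: one shared backbone, one head per task, and a weighted sum of the losses as the joint objective. The paper pairs ImageNet classification with COCO detection; to keep the example short and runnable, both heads below are plain classifiers and all layer sizes are placeholders.

```python
import tensorflow as tf

# Shared backbone used by both tasks.
backbone = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
])

inputs = tf.keras.Input(shape=(64, 64, 3))
features = backbone(inputs)
cls_out = tf.keras.layers.Dense(1000, name="imagenet_head")(features)  # classification task
det_out = tf.keras.layers.Dense(80, name="coco_head")(features)        # stand-in for the detection head

model = tf.keras.Model(inputs, [cls_out, det_out])
model.compile(
    optimizer="adam",
    loss={
        "imagenet_head": tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        "coco_head": tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    },
    # The joint objective is a weighted sum of the per-task losses.
    loss_weights={"imagenet_head": 1.0, "coco_head": 1.0},
)
```

In practice the two datasets are sampled into mixed batches and the detection head has its own losses; the point here is only the shared backbone trained with a summed objective.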

The importance of task alignment

As we saw above, task alignment is important for getting a performance boost. Similar findings were reported in the paper: pre-training on Open Images hurts performance on COCO, despite both of them being annotated with bounding boxes. This means that not only do we want the task to be the same, we also want the annotations to be of the same kind for pre-training to be really beneficial. The authors noted two more interesting things:

  1. ImageNet pre-training, even with additional human labels, performs worse than self-training.
  2. With strong data augmentation (Augment-S4), training with the PASCAL train + aug sets actually hurts accuracy, while pseudo labels generated by self-training on the same data improve accuracy.

Scalability, generality and flexibility of self-training

From all the experiments the authors conducted, we can conclude that:

  1. In terms of flexibility, self-training works well in every setup studied: low data regime, high data regime, weak data augmentation, and strong data augmentation.
  2. Self-training is neither architecture-dependent nor dataset-dependent. It works well with different architectures (ResNet, EfficientNet, SpineNet, etc.) as well as with different datasets (ImageNet, COCO, PASCAL, etc.).
  3. In terms of generality, self-training works well both when pre-training fails and when pre-training succeeds.
  4. In terms of scalability, self-training performs well as we get more labeled data and better models.

This is good. Some of the points listed here raise a lot of questions about how we have all been using pre-training. But anything that has pros comes with cons as well. You must be hiding some important point, right?

Limitations of self-training

Although self-training provides benefits, it has a few limitations as well.

  1. Self-training requires more compute than fine-tuning on a pre-trained model.
  2. The speedup from pre-training ranges from 1.3x to 8x, depending on the pre-trained model quality, strength of data augmentation, and dataset size.
  3. Self-training isn’t a complete replacement for transfer learning and fine-tuning. Both these techniques will be heavily used in the future as well.

Conclusion

In my opinion, this paper raises a lot of fundamental questions regarding pre-training, joint-training, task awareness, and universal representations. Answering these questions is far more important than building models with billions of parameters. Working on problems like this can help us build better intuition about the decisions made by deep neural networks.
