Review — Rethinking ImageNet Pre-training (Object Detection, Semantic Segmentation)

Training From Scratch Not Worse Than ImageNet Pre-Training

Sik-Ho Tsang
Published in Nerd For Tech
4 min read · Feb 21, 2021


A ResNet50-FPN model using GN, trained from random initialization, needs more iterations to converge, but converges to a solution that is no worse than its fine-tuning counterpart.

In this story, Rethinking ImageNet Pre-training, by Facebook AI Research (FAIR), is briefly reviewed.

Pre-training has been preferred over training from scratch in many papers. However, is the pre-trained knowledge really useful when transferred to other computer vision tasks?

The paper reports two main findings:

  • Training from random initialization is surprisingly robust. The results hold even when: (i) using only 10% of the training data, (ii) for deeper and wider models, and (iii) for multiple tasks and metrics.
  • ImageNet pre-training speeds up convergence early in training, but does not necessarily provide regularization or improve final target task accuracy.

This is a paper in 2019 ICCV with over 350 citations. (Sik-Ho Tsang @ Medium)

(The paper contains many details on the experimental setup to keep the comparison fair. I skip some of the details and results to keep the story short. If interested, please feel free to read the paper.)


  1. Number of Training Images & Setup
  2. Training from Scratch to Match Accuracy
  3. Training from Scratch with Less Data
  4. Discussions

1. Number of Training Images & Setup

1.1. Number of Training Images Involved

Total numbers of images, instances, and pixels seen during all training iterations, for pre-training + fine-tuning (green bars) vs. from random initialization (purple bars).
  • Typical ImageNet pre-training involves over one million images iterated for one hundred epochs. In addition to any semantic information learned from this large-scale data, the pre-training model has also learned low-level features.
  • On the other hand, when training from scratch, the model has to learn both low-level features and high-level semantics, so more iterations may be necessary for it to converge well.
  • As shown above, if counting image-level samples, the from-scratch case sees considerably fewer samples than its fine-tuning counterpart.
  • Actually, the sample numbers only get closer if we count pixel-level samples.
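The comparison above can be reproduced as rough arithmetic. The sketch below is only illustrative: the ImageNet size (1.28M images), the crop and input sizes (224×224 and 800×1333), and the batch size of 16 are assumptions not stated in this review, while the 180k fine-tuning and 540k from-scratch iteration counts are the schedules described in Section 1.2.

```python
# Rough count of training samples seen, at image level and at pixel level.
# Constants marked "assumed" are illustrative and are not given in this review.
IMAGENET_IMAGES = 1_280_000    # assumed ImageNet training set size
IMAGENET_EPOCHS = 100          # "one hundred epochs", from the text
IMAGENET_PIXELS = 224 * 224    # assumed pre-training crop size
COCO_BATCH = 16                # assumed images per training iteration
COCO_PIXELS = 800 * 1333       # assumed detection input size

def images_seen(iterations, batch_size=COCO_BATCH):
    """Number of image-level samples processed over a training run."""
    return iterations * batch_size

pretrain_images = IMAGENET_IMAGES * IMAGENET_EPOCHS
finetune_images = images_seen(180_000)   # fine-tuning schedule
scratch_images = images_seen(540_000)    # from-scratch schedule

# Pre-training + fine-tuning totals.
pretrained_total_images = pretrain_images + finetune_images
pretrained_total_pixels = (pretrain_images * IMAGENET_PIXELS
                           + finetune_images * COCO_PIXELS)

# From-scratch totals.
scratch_total_pixels = scratch_images * COCO_PIXELS

image_ratio = scratch_images / pretrained_total_images
pixel_ratio = scratch_total_pixels / pretrained_total_pixels

print(f"from-scratch / pre-trained, image level: {image_ratio:.3f}")
print(f"from-scratch / pre-trained, pixel level: {pixel_ratio:.3f}")
```

Under these assumptions the from-scratch run sees only about 7% as many images as the pre-training + fine-tuning pipeline, but about 97% as many pixels, which matches the qualitative picture in the figure.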

1.2. Setup

  • Mask R-CNN is used, with ResNet or ResNeXt plus Feature Pyramid Network (FPN) backbones.
  • GN or SyncBN replaces every ‘frozen BN’ layer. SyncBN computes BN batch statistics across multiple GPUs.
  • The models are fine-tuned for 90k iterations (the ‘1× schedule’) or 180k iterations (the ‘2× schedule’); schedules extend up to a so-called ‘6× schedule’, which has 540k iterations.
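The schedule names are simply multiples of the 90k-iteration baseline. A minimal sketch of that mapping follows; the helper name `schedule_iterations` is hypothetical and introduced only for illustration.

```python
BASE_ITERATIONS = 90_000  # the 1x schedule, as stated in the text

def schedule_iterations(multiplier):
    """Return the iteration count for an 'Nx' schedule (hypothetical helper)."""
    if multiplier < 1:
        raise ValueError("schedule multiplier must be at least 1")
    return BASE_ITERATIONS * multiplier

# Print the schedules mentioned in this review.
for m in (1, 2, 6):
    print(f"{m}x schedule: {schedule_iterations(m):,} iterations")
```

This makes the gap explicit: a model trained on the 540k-iteration schedule runs three times as long as one fine-tuned on the 2× schedule.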

2. Training from Scratch to Match Accuracy

Learning curves of bounding-box AP on COCO val2017 using Mask R-CNN with R101-FPN and GN
  • Typical fine-tuning schedules (2×) work well for the models with pre-training to converge to near optimum. But these schedules are not enough for models trained from scratch, which appear inferior when trained only for a short period.

Models trained from scratch can catch up with their fine-tuning counterparts, if a 5× or 6× schedule is used. When they converge to an optimum, their detection AP is no worse than their fine-tuning counterparts.

3. Training from Scratch with Less Data

Training with 10k COCO images
  • A smaller training set of 10k COCO images (i.e., less than 1/10th of the full COCO set) is used.
  • The model with pre-training reaches 26.0 AP with 60k iterations, but has a slight degradation when training more.

The counterpart model trained from scratch has 25.9 AP at 220k iterations, which is comparably accurate.

4. Discussions

  • Based on the above experiments, the authors discuss the following questions.

4.1. Is ImageNet pre-training necessary?

  • No, if we have enough target data.
  • This suggests that collecting annotations of target data (instead of pretraining data) can be more useful for improving the target task performance.

4.2. Is ImageNet Useful?

  • Yes.
  • ImageNet pre-training reduces research cycles, leading to easier access to encouraging results, and fine-tuning from pretrained weights converges faster than from scratch.

4.3. Is Big Data Helpful?

  • Yes.
  • But a generic large-scale, classification-level pre-training set is not ideal if we take into account the extra effort of collecting and cleaning data.
  • If the gain of large-scale classification-level pre-training becomes exponentially diminishing, it would be more effective to collect data in the target domain.

4.4. Shall We Pursue Universal Representations?

  • Yes.
  • The authors believe learning universal representations is a laudable goal.
  • The study suggests that the community should be more careful when evaluating pre-trained features.


