Review: Self-training with Noisy Student improves ImageNet classification — Series 1 of 3

Guan

Published in

工人智慧

13 min read · Feb 15, 2021

Self-training with Noisy Student improves ImageNet classification

We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is…

arxiv.org

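This is the first post in a series on Google's self-training papers. Unlike the FixMatch line of work and SimCLR, both of which train on relatively small datasets or on only part of the data, the idea behind this series is to challenge the ImageNet SOTA with the largest dataset, the most compute, and the strongest algorithm.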

The next two posts will cover:
1. Rethinking Pre-training and Self-training
2. Meta Pseudo Labels

Meta Pseudo Labels recently became the new ImageNet SOTA. Before getting to it, let's look at its predecessor, which is this paper, Noisy Student.

Introduction

First, the motivation behind this paper sets it apart:

By showing the models only labeled images, we limit ourselves from making use of unlabeled images available in much larger quantities to improve accuracy and robustness of SOTA models.

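As a plausible candidate for Skynet, Google is without doubt one of the giants sitting on the most data on Earth, so how to use unlabeled data effectively becomes a big question. Hiring part-timers to label it is one option, but Google also has the most TPUs and the powerful EfficientNet, so…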

Why not ask EfficientNet to do the labeling? That is the teacher-student framework.

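I personally think this comes from three intuitions:
1. A model labels faster and more cheaply than people do
2. For non-ImageNet data, humans find it hard to assign images accurately to ImageNet classes
3. Labels (guesses) produced by a model may be easier for another model to understand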

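Some critics (Note 1) argue that reaching the ImageNet SOTA by bringing in a huge amount of extra, non-public JFT-300M data is nothing to brag about, since it matches intuition so closely (though turning an intuition into a working result can be hard too). The value to focus on here is how to use plentiful but otherwise useless unlabeled data efficiently. The paper is better read from a data perspective than judged as an algorithm.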

Noisy Student Training

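The algorithm is quite simple. A pre-trained teacher model labels the unlabeled data, which is then handed to a student model for training together with the labeled data. Once the student outperforms the teacher, that student becomes the teacher for a new student, and the cycle repeats.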
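To make the loop concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption for illustration: a 1-D nearest-centroid classifier stands in for EfficientNet, Gaussian jitter stands in for RandAugment, and the names `train_centroids`, `predict`, and `noisy_student` are hypothetical. It also promotes the student to teacher every round without checking that it actually improved.

```python
import random

def train_centroids(points, labels):
    """Fit a toy 1-D nearest-centroid classifier: one mean per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))

def noisy_student(labeled_x, labeled_y, unlabeled_x, rounds=3, noise=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: train the teacher on labeled data only.
    teacher = train_centroids(labeled_x, labeled_y)
    for _ in range(rounds):
        # Step 2: the teacher pseudo-labels the unlabeled data (clean inputs).
        pseudo_y = [predict(teacher, x) for x in unlabeled_x]
        # Step 3: the student trains on labeled + pseudo-labeled data,
        # with input noise applied to the student's inputs only.
        all_x = list(labeled_x) + list(unlabeled_x)
        all_y = list(labeled_y) + pseudo_y
        noisy_x = [x + rng.gauss(0.0, noise) for x in all_x]
        student = train_centroids(noisy_x, all_y)
        # Step 4: the student becomes the teacher for the next round.
        teacher = student
    return teacher

model = noisy_student([0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1],
                      [0.5, 1.5, 2.0, 8.0, 8.5, 9.5])
print(predict(model, 1.0), predict(model, 9.0))
```

The point of the sketch is only the data flow: labels come from the un-noised teacher, and noise enters on the student side.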

This teacher-student framework is not new to this paper. The paper's real focus is adding noise:

In typical self-training with the teacher-student framework, noise injection to the student is not used by default, or the role of noise is not fully understood or justified.

The paper applies three kinds of noise:
1. RandAugment (input noise), applied only to the input of the student model
2. Dropout (model noise) for both models
3. Stochastic depth (model noise) (Note 2) for both models

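RandAugment needs little explanation; the latter two are the interesting ones. The authors argue that dropout and stochastic depth help the teacher model act as an ensemble model, rather than reading them as regularization.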

when dropout and stochastic depth function are used as noise, the teacher behaves like an ensemble at inference time (when it generates pseudo labels), whereas the student behaves like a single model. In other words, the student is forced to mimic a more powerful ensemble model.

One small puzzle: in the source code, the teacher and the student carry the same dropout, so "the student behaves like a single model" here does not mean the student goes without dropout.

In addition, data filtering, data balancing, and soft pseudo labels all give better results; I return to them in the ablation study.

Experiments

This section mainly covers some experimental details:

1. Results with the public dataset YFCC100M are compared against JFT-300M; YFCC can stand in for JFT.

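2. Only unlabeled data on which the teacher model's confidence is > 0.3 is used (data filtering), which picks out 81M images from JFT. The unlabeled data is then balanced by class: images in classes with fewer than 130K images are duplicated until the class reaches 130K, which grows the unlabeled set from 81M to 130M.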
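The filtering and balancing step can be sketched as follows. This is a small illustration, not the paper's pipeline: `filter_and_balance` is a hypothetical name, `per_class=4` replaces the real 130K, and keeping the highest-confidence images when a class has too many is my recollection of the paper, so treat it as an assumption.

```python
import random
from collections import defaultdict

def filter_and_balance(predictions, threshold=0.3, per_class=4, seed=0):
    """predictions: list of (image_id, predicted_class, confidence) from the teacher."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    # Data filtering: keep only images the teacher is confident about.
    for image_id, cls, conf in predictions:
        if conf > threshold:
            by_class[cls].append((conf, image_id))
    balanced = {}
    for cls, items in by_class.items():
        # Cap each class at per_class images, highest confidence first (assumption).
        items.sort(reverse=True)
        chosen = [image_id for _, image_id in items[:per_class]]
        # Data balancing: duplicate random images until the class is full.
        while len(chosen) < per_class:
            chosen.append(rng.choice(chosen))
        balanced[cls] = chosen
    return balanced

preds = [("a", "cat", 0.9), ("b", "cat", 0.2), ("c", "cat", 0.5),
         ("d", "dog", 0.8), ("e", "dog", 0.7), ("f", "dog", 0.6),
         ("g", "dog", 0.95), ("h", "dog", 0.4)]
print(filter_and_balance(preds, per_class=3))
```

Here "cat" has only two confident images, so one of them is duplicated to reach three, while "dog" has five and is cut down to its top three.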

3. EfficientNet-B7 and EfficientNet-L2 are used. The latter takes about 5 times as long to train as the former, and needs six days on a 2048-core TPU pod.

EfficientNet-L2, needs to be trained for 6 days on a Cloud TPU v3 Pod, which has 2048 cores, if the unlabeled batch size is 14x the labeled batch size.

EfficientNet-B7 and EfficientNet-L2 Details

4. Iterative training
EfficientNet-B7 is used first as the teacher with EfficientNet-L2 as the student, and the process is then iterated.

5. ImageNet Results

6. Robustness Results on ImageNet-A, ImageNet-C and ImageNet-P (Note 3)
Beyond the accuracy gain, robustness is another highlight.

ImageNet-A is made up mainly of images that are hard for ImageNet classifiers, while ImageNet-C and ImageNet-P mainly test a model's resistance to corruptions and perturbations.

ImageNet-A reaches top-1 83.7% (ResNet50 0%, EfficientNet-L2 49.6%)
ImageNet-C reaches top-1 77.8% (EfficientNet-L2 66.6%)
ImageNet-P reaches top-1 86.4% (EfficientNet-L2 81.6%)
For detailed numbers, see Table 3, Table 4, and Table 5 in the paper.

Below is a comparison of Noisy Student (black text) and supervised EfficientNet-L2 (red text) on the three datasets.

ImageNet-A, C, P and EfficientNet Predictions

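The paper also tests an FGSM attack, which takes a single gradient step on the input image and sets the update on every pixel to a fixed magnitude epsilon. It is a rather brutal attack, yet Noisy Student still holds up very well.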
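For readers who have not met FGSM, the update is x_adv = x + epsilon * sign(dLoss/dx). Below is a minimal sketch on a toy logistic-regression "image" classifier. The model, weights, and function names are made up for illustration, and pixel clipping is omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_input_grad(w, b, x, y):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad_x = [(p - y) * wi for wi in w]  # dLoss/dx for logistic regression
    return loss, grad_x

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, epsilon):
    """One FGSM step: move every pixel by epsilon in the direction that raises the loss."""
    _, grad_x = loss_and_input_grad(w, b, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad_x)]

w, b = [1.0, -2.0, 0.5], 0.1          # hypothetical trained weights
x, y = [0.2, 0.4, 0.6], 1             # a three-"pixel" input with true label 1
x_adv = fgsm(w, b, x, y, epsilon=0.1)
print(loss_and_input_grad(w, b, x, y)[0], loss_and_input_grad(w, b, x_adv, y)[0])
```

Every pixel moves by exactly epsilon, which is why the attack is harsh even though it is a single step.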

Ablation Study

1. The Importance of Noise in Self-training

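As mentioned earlier, the teacher-student framework is not the focus of this paper; adding noise is. Because the student learns from the teacher's pseudo labels, once the labels the student predicts equal the pseudo labels, the CE loss sits at its minimum (zero for hard labels) and there is no training signal left.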

So if the student is to surpass the teacher, perfectly reproducing the teacher's answers is not enough, and noise has to play an important role.
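A quick numeric check of this argument, with made-up probabilities: cross-entropy against a hard pseudo label is exactly zero when the student copies it, and against a soft pseudo label it bottoms out at the entropy of the teacher's distribution. In both cases a student that already matches the teacher gets no further gradient.

```python
import math

def cross_entropy(target, pred):
    """H(target, pred) = -sum_i target_i * log(pred_i); zero-probability targets contribute 0."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)

teacher_soft = [0.7, 0.2, 0.1]   # soft pseudo label (illustrative values)
teacher_hard = [1.0, 0.0, 0.0]   # hard pseudo label (argmax of the above)

# A student that copies the teacher exactly sits at the minimum of the loss.
loss_copy = cross_entropy(teacher_soft, teacher_soft)
# Any other prediction scores worse against the same soft label.
loss_other = cross_entropy(teacher_soft, [0.5, 0.3, 0.2])
# With a hard label, a perfect copy drives the loss all the way to zero.
loss_hard = cross_entropy(teacher_hard, teacher_hard)

print(loss_copy, loss_other, loss_hard)
```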

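So here RandAugment, dropout, and stochastic depth are removed step by step and the training result is observed. Any difference before and after is the benefit these three kinds of noise bring.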

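Comparing the full Noisy Student with the student trained without Aug, SD, and Dropout in the two settings of the ablation table, 83.9% - 83.2% = 0.7% and 85.1% - 84.3% = 0.8% (the latter with 130M unlabeled images) are the boost from adding noise.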

There are three small details here:

(a) For Noisy Student w/o Aug, SD, Dropout with 130M images, B5 still improves slightly, from 84% to 84.3%. The authors attribute this to the randomness of SGD.

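(b) Even allowing for the absence of iterative training, the gain here is still fairly small (85.1% - 84% = 1.1%). A likely reason is that, to save training time, the ratio of labeled data to unlabeled data is set to 1:1 here rather than 1:14 or 1:28 (Finding #7).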

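(c) For the third Noisy Student variant, teacher w. Aug, SD, Dropout: in the source code the teacher already uses SD and Dropout, and its architecture is identical to the student's. Considering that Dropout there cannot simply be used for training and then left out at inference, this seems reasonable, so the gap between 84.4% and 85.1% should come down only to whether Aug is applied to the teacher's input images.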

2. Additional Ablation Study Summarization
The paper lists several findings here; I pick out the most unusual one to discuss.

Finding #8: Training the student from scratch is sometimes better than initializing the student with the teacher and the student initialized with the teacher still requires a large number of training epochs to perform well.

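Finding #8 asks whether, before running Noisy Student, the student should first be initialized from the already-trained teacher and then trained on for a while with the unlabeled data, instead of starting from scratch.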

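The answer is no, or at least not always: just run Noisy Student directly.
Besides taking almost as much time as plain Noisy Student, initializing from the teacher may even leave the student model stuck in a local minimum: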

For example, when we use EfficientNet-B7 with an accuracy of 86.4% as the teacher, the student model initialized with the teacher achieves an accuracy of 86.4% halfway through the training but gets stuck there when trained for 210 epochs, …

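The other findings are good too, but most were already touched on above. For reasons of space, I recommend reading the details in the original paper.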

Conclusion

We found that self-training is a simple and effective algorithm to leverage unlabeled data at scale. We improved it by adding noise to the student, hence the name Noisy Student Training, to learn beyond the teacher’s knowledge

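Noisy Student is indeed a fine piece of work: a small improvement to the teacher-student framework yields very good results while keeping the algorithm simple and elegant. The problem is that its use case is a rare one. Leaving aside whether repeatedly retraining models for a 1% - 5% gain is worth the cost, the data and compute it needs are beyond what ordinary mortals can imagine.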

Even so, setting aside the scenario in the paper, I think it still has many uses within mortal reach, such as model compression, model capacity research, and adversarial training.

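In the next post we will look at Rethinking Pre-training and Self-training, which was inspired by Rethinking ImageNet Pre-training from Kaiming He 何恺明. We already know that ImageNet pre-training does not always help training, so can self-training replace pre-training? To be continued.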

Reference

[1] Source Code of Noisy Student, credits to Google Research

Note

Note 1: reddit, or the Zhihu discussion of Meta Pseudo Labels

Note 2: Stochastic depth comes from Deep Networks with Stochastic Depth and is used in ResNet: draw a random Bernoulli variable (1 or 0) and multiply the residual branch by it, so that the residual is added at random.

Note 3: ImageNet-A consists of images that ResNet50 cannot classify correctly; ImageNet-C and ImageNet-P both consist of images with corruptions. For details, see ImageNet Dataset Advancements.
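The stochastic depth mechanism from Note 2 can be sketched in a few lines. This is a plain-Python illustration, not the EfficientNet code: `residual_block` and `survival_prob` are names I made up, and scaling the branch by its survival probability at inference follows the original stochastic depth paper.

```python
import random

def residual_block(x, residual_fn, survival_prob=0.8, training=True, rng=random):
    """Compute x + gate * f(x), with gate ~ Bernoulli(survival_prob) during training."""
    fx = residual_fn(x)
    if training:
        # Draw one 0/1 gate per block; 0 skips the residual branch entirely.
        gate = 1.0 if rng.random() < survival_prob else 0.0
        return [xi + gate * fi for xi, fi in zip(x, fx)]
    # At inference the branch is always kept, scaled by its survival probability.
    return [xi + survival_prob * fi for xi, fi in zip(x, fx)]

double = lambda x: [2.0 * v for v in x]   # stand-in for the residual branch
print(residual_block([1.0, 2.0], double, survival_prob=0.5, training=False))
```

Each training step therefore runs a randomly shallower network, which is what lets the noise-free teacher be read as an ensemble of such sub-networks.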