Self-training with Noisy Student

Aakash Nain
Nov 16, 2019


2019 has been a year in which much of the research focus has been on designing efficient deep learning models, self-supervised learning, learning with limited amounts of data, new pruning strategies, and more. Although self-training isn't new, this latest paper from the Google Brain team uses the approach to not only surpass the top-1 ImageNet accuracy of SOTA models by 1%, but also to show that the robustness of the model improves as well.

What is self-training?

Self-training is one of the simplest semi-supervised methods. The main idea is to find a way to augment the labeled dataset with the unlabeled dataset; after all, getting labeled data is very costly, and annotating data is mundane work.

Self-training first uses the labeled data to train a good teacher model, then uses the teacher model to label the unlabeled data. Since not all of the teacher's predictions on the unlabeled data are reliable, classical self-training selects a subset of the unlabeled data by filtering the predictions (aka pseudo-labels) with a threshold on the confidence score. This subset is then combined with the original labeled data, and a new model, the student model, is jointly trained on the combined data. The whole procedure can be repeated n times until convergence is reached.
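Here is a toy sketch of that classical loop using scikit-learn; the random data, logistic-regression models, confidence threshold, and number of rounds are all placeholder choices for illustration, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(100, 20))            # small labeled set
y_labeled = (X_labeled[:, 0] > 0).astype(int)
X_unlabeled = rng.normal(size=(1000, 20))         # larger unlabeled pool

THRESHOLD = 0.8      # confidence threshold for keeping pseudo-labels
N_ROUNDS = 3         # in practice, repeat until convergence

X_train, y_train = X_labeled, y_labeled
for _ in range(N_ROUNDS):
    # 1. Train a teacher on the current labeled set.
    teacher = LogisticRegression().fit(X_train, y_train)

    # 2. Pseudo-label the unlabeled pool with the teacher.
    probs = teacher.predict_proba(X_unlabeled)
    confidence, pseudo_labels = probs.max(axis=1), probs.argmax(axis=1)

    # 3. Keep only confident pseudo-labels and merge with the labeled data.
    keep = confidence >= THRESHOLD
    X_train = np.concatenate([X_labeled, X_unlabeled[keep]])
    y_train = np.concatenate([y_labeled, pseudo_labels[keep]])

# The final student is trained on the combined data.
student = LogisticRegression().fit(X_train, y_train)
```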

A nice approach to make use of unlabeled data. But the paper emphasizes something else: a noisy student. What's the deal with that? Is it different from the classical approach?

Yes, you are correct. The authors found that for this method to work at scale, the student model should be noised during its training, while the teacher model should not be noised when generating the pseudo-labels. Because noise is an essential piece of the whole idea (which we will look into in detail in a minute), they called the method Noisy Student.

The Algorithm

The algorithm is similar to classical self-training with some minor differences.

Noisy Student algorithm

The main difference is the addition of noise to the student via techniques such as dropout, stochastic depth, and data augmentation. It should be noted that the teacher is not noised when it generates the pseudo-labels.
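A minimal PyTorch sketch of that asymmetry, assuming a tiny made-up classifier: the teacher predicts in eval mode on clean inputs (no dropout), while the student trains with dropout active on an augmented view. The paper uses RandAugment, dropout, and stochastic depth on EfficientNets; everything below, including the Gaussian input noise standing in for augmentation, is a stand-in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical tiny classifier standing in for an EfficientNet.
def make_model(dim=32, num_classes=10):
    return nn.Sequential(
        nn.Linear(dim, 128), nn.ReLU(),
        nn.Dropout(p=0.5),                 # model noise: dropout
        nn.Linear(128, num_classes),
    )

teacher, student = make_model(), make_model()
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)

x_unlabeled = torch.randn(64, 32)

# Teacher is NOT noised: eval mode (dropout off), clean inputs.
teacher.eval()
with torch.no_grad():
    pseudo_labels = teacher(x_unlabeled).argmax(dim=1)

# Student IS noised: train mode (dropout on), augmented inputs.
student.train()
x_augmented = x_unlabeled + 0.1 * torch.randn_like(x_unlabeled)  # stand-in for RandAugment
loss = F.cross_entropy(student(x_augmented), pseudo_labels)
loss.backward()
optimizer.step()
```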

Are you mad? You are saying that just adding dropout, of course, which no one would have thought of (LOL), produced SOTA on ImageNet. And this paper is from Google Brain team, right?

Noise, when applied to unlabeled data, enforces smoothness in the decision function. Different kinds of noise have different effects. For example, when augmentation is used, the model is forced to classify a normal image and its augmented counterpart into the same category. Similarly, when dropout is used, the model acts like an ensemble of models. Hence a student trained with noise is a more powerful model.

Apart from noise, two more things are very important for the noisy student to work well.

  1. Even though the architectures of the teacher and the student can be the same, the capacity of the student model should be higher. Why? Because it has to fit a much larger dataset (labeled as well as pseudo-labeled). Hence the student model should be bigger than the teacher model.
  2. Balanced data: the authors found that the student model works well when the number of unlabeled images for each class is the same. I don't see any specific reason for this beyond the fact that all classes in ImageNet have a similar number of labeled examples.

This is a high-level overview of what has to be done for the student model during training, but there must be finer details about the experiments carried out, right?

Experiment Details

We need to look into the details of the labeled data and, more importantly, the unlabeled data. We will also take a look at the architectures used for training the teacher and the student.

Labeled dataset

For the labeled dataset, the famous ImageNet dataset was used.

Unlabeled dataset

The unlabeled images were obtained from the JFT dataset, which has around 300M images. Although these images have labels in the original dataset, the labels were discarded so the images could be treated as unlabeled. To perform filtering and balancing on this dataset, an EfficientNet-B0, originally trained on ImageNet, was used to predict labels, and only images whose predicted label had a confidence score higher than 0.3 were kept. For each class, 130K samples were selected, and for classes with fewer than 130K samples, some of the images were randomly duplicated.
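A rough NumPy sketch of that filtering and balancing step; the function name, array shapes, and sampling details are made up for illustration:

```python
import numpy as np

def filter_and_balance(image_ids, probs, per_class=130_000,
                       confidence_threshold=0.3, seed=0):
    """probs: (num_images, num_classes) softmax outputs of the teacher."""
    rng = np.random.default_rng(seed)
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = confidence > confidence_threshold        # filtering step

    selected_ids, selected_labels = [], []
    for c in range(probs.shape[1]):                 # balancing step
        idx = np.where(keep & (labels == c))[0]
        if len(idx) >= per_class:
            idx = rng.choice(idx, per_class, replace=False)
        elif len(idx) > 0:
            # Duplicate random images to reach the per-class budget.
            extra = rng.choice(idx, per_class - len(idx), replace=True)
            idx = np.concatenate([idx, extra])
        selected_ids.append(image_ids[idx])
        selected_labels.append(np.full(len(idx), c))
    return np.concatenate(selected_ids), np.concatenate(selected_labels)
```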

Architecture

  1. The authors used EfficientNets as their baseline models. Given that EfficientNets are much better than ResNets and have higher capacity, it makes sense to use them as baseline models.
  2. The authors also scale up EfficientNet-B7 and obtain three different nets: EfficientNet-L0, L1, and L2.
  3. EfficientNet-L0 is wider and deeper than EfficientNet-B7 but uses a lower resolution making the training speed for this net similar to B7.
  4. EfficientNet-L1 is scaled up from EfficientNet-L0 by increasing the width.
  5. Remember the compound scaling idea introduced with EfficientNets? The authors used the same compound scaling technique to scale up EfficientNet-L1 and obtain EfficientNet-L2. EfficientNet-L2 is so large that its training time is ~5x that of EfficientNet-B7. (A quick refresher on compound scaling follows this list.)
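As a refresher, compound scaling ties depth, width, and input resolution to a single coefficient φ. The base multipliers below are the ones reported for EfficientNet-B0 in the EfficientNet paper; the exact coefficients used to go from L1 to L2 are not given here, so treat this as an illustration of the idea rather than the actual recipe.

```python
# EfficientNet compound scaling: depth ~ alpha^phi, width ~ beta^phi,
# resolution ~ gamma^phi, with alpha * beta^2 * gamma^2 ≈ 2.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(base_depth, base_width, base_resolution, phi):
    return (base_depth * ALPHA ** phi,
            base_width * BETA ** phi,
            round(base_resolution * GAMMA ** phi))

# Example: scaling a hypothetical base network by phi = 2.
print(compound_scale(base_depth=1.0, base_width=1.0, base_resolution=224, phi=2))
```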

Training details

There is not much to the training details. The authors used a batch size of 2048 for training, but found that the performance is the same with a batch size of 512, 1024, or 2048. Student models larger than EfficientNet-B4, including EfficientNet-L0, L1, and L2, are trained for 350 epochs, and smaller models for 700 epochs. The labeled and pseudo-labeled data are concatenated and an average cross-entropy loss is computed over the combined batch.
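A sketch of how that combined loss might look, with a linear layer standing in for the student and random tensors standing in for real batches:

```python
import torch
import torch.nn.functional as F

# Placeholder batches: real labels for labeled images,
# teacher-generated pseudo-labels for unlabeled images.
x_labeled, y_labeled = torch.randn(32, 32), torch.randint(0, 10, (32,))
x_unlabeled, y_pseudo = torch.randn(32, 32), torch.randint(0, 10, (32,))

student = torch.nn.Linear(32, 10)      # stand-in for an EfficientNet student

# Concatenate both batches and compute one average cross-entropy loss.
x = torch.cat([x_labeled, x_unlabeled])
y = torch.cat([y_labeled, y_pseudo])
loss = F.cross_entropy(student(x), y)
loss.backward()
```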

Iterative Training

The best model is a result of iterative training. How is it performed?

  1. The accuracy of EfficientNet-B7 is first improved by using it both as the teacher and as the student model.
  2. The improved EfficientNet-B7 is now used as a teacher and EfficientNet-L0 is used as a student model.
  3. EfficientNet-L0 is now used as the teacher while EfficientNet-L1, which is wider than L0, is used as a student model.
  4. EfficientNet-L1 is now used as a teacher and EfficientNet-L2, which is the largest model, is used as a student.
  5. EfficientNet-L2 is now used both as the teacher and as the student model.

Very well! Can we see some of the results as well?

Results

Top-1 and top-5 accuracy on ImageNet
Robustness Results

Hmm... the results are interesting, but I have a counter-argument. You are increasing the capacity of the models, and you are also providing them with more data to learn from. I suspect that adding "noise" to the student might not even be necessary in this case. Did the authors confirm this by any chance?

That’s a very good question. The authors carried out two different experiments in this regard. Before we jump into them, let’s take a step back and think about the situation where the same model is used both as the teacher and as the student. Since the pseudo-labels were generated by that same model, a reasonable expectation is that the cross-entropy loss of the student on the unlabeled data would be zero. The training signal would vanish, and the student would end up learning nothing new. This is where the noise hypothesis kicks in: noising the student makes the task much harder, so the student cannot merely replicate the teacher's knowledge.

Investigating further, the authors carried out the two experiments on two unlabeled datasets of different sizes. In both cases, the noise is gradually removed for the unlabeled images while being kept for the labeled images, so that the influence of noise on the unlabeled data can be isolated while still preventing overfitting on the labeled data. Here are the results:

Ablation study on noising

We can see that as the noise is removed, the performance of the model drops consistently.

Conclusion

Overall, this is a really good paper. Not only does it show that there is much to explore in self-training, it also once again proves the importance of augmentation, dropout, and stochastic depth. There is only one thing I didn’t like about this paper, and I quote from the paper itself:

“Our largest model, EfficientNet-L2, needs to be trained for 3.5 days on a Cloud TPU v3 Pod, which has 2048 cores.”

Given the amount of compute used, I think many people won’t be able to replicate the results or tweak the method for further experimentation.

I hope you enjoyed reading this summary!

References

  1. https://arxiv.org/pdf/1911.04252.pdf
  2. https://arxiv.org/pdf/1503.02531.pdf
  3. https://arxiv.org/pdf/1909.13788.pdf
  4. https://arxiv.org/pdf/1603.09382.pdf
  5. https://arxiv.org/pdf/1909.13719.pdf
