When Does Label Smoothing Help?

Aakash Nain
6 min read · Jul 25, 2019


In late 2015, a team at Google came up with a paper, “Rethinking the Inception Architecture for Computer Vision”, in which they introduced a new technique for more robust modelling, termed “Label Smoothing”. Since then, this technique has been used in many state-of-the-art models, including ones for image classification, language translation, and speech recognition. Despite its widespread use, label smoothing is still poorly understood, and it is hard to answer why and when it works. A recent paper from the Google Brain team tries to demystify it by observing how the representations learned by the penultimate layer of the network change when the network is trained with label smoothing.

What is Label Smoothing?

Label smoothing, in a nutshell, is a way to make our model more robust so that it generalizes well. For example, for a multi-class classification problem, we can write the predictions of a neural network as a function of the activations in the penultimate layer as:

p_k = exp(x^T w_k) / Σ_l exp(x^T w_l)

Predictions as a function of the activations of the penultimate layer, where:

p_k: Likelihood the model assigns to the k-th class
w_k: Weights and biases of the last layer
x: Vector containing the activations of the penultimate layer

We then use the cross-entropy as the loss function and try to minimize it to maximize the log-likelihood.
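To make these two steps concrete, here is a minimal NumPy sketch; the arrays `x`, `W`, and `y` are hypothetical stand-ins for the penultimate activations, the last-layer weights (bias folded in), and the one-hot targets:

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical shapes: a batch of 4 examples, 8 penultimate units, 5 classes
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # activations of the penultimate layer
W = rng.normal(size=(8, 5))      # last-layer weights (bias folded in)
y = np.eye(5)[[0, 2, 1, 4]]      # hard (one-hot) targets

logits = x @ W                   # x^T w_k for every class k
p = softmax(logits)              # p_k: likelihood assigned to the k-th class

# Cross-entropy with hard targets = negative log-likelihood of the true class
loss = -np.sum(y * np.log(p), axis=-1).mean()
print(loss)
```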

Fine, fine! What’s the problem with this?
The problem is the hard targets. The model has to produce a large logit value for the correct label, which encourages the difference between the largest logit and all the others to become large. This, combined with the bounded gradient, reduces the ability of the model to adapt, resulting in a model that is too confident about its predictions. That over-confidence, in turn, can lead to over-fitting.

So you are saying that hard targets are bad?
It depends! If your only goal is to maximize the likelihood, then hard targets aren’t bad, and we have seen that they work well. But if your goal is to build a robust model that generalizes well, then yes!

Let me guess, label smoothing will solve this, right? Show me how!
Yes! We introduce a smoothing parameter α and modify the targets like this:

y_k^LS = y_k * (1 - α) + α / K

Applying label smoothing to the hard targets, where K is the number of classes and α is the smoothing parameter

Now, instead of minimizing the cross-entropy with the hard targets y_k, we minimize it with these soft targets y_k^LS.
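As a minimal sketch of this transformation (with a hypothetical batch of 4 hard labels, K = 5 classes, and α = 0.1):

```python
import numpy as np

K, alpha = 5, 0.1
y = np.eye(K)[[0, 2, 1, 4]]           # hard (one-hot) targets for a batch of 4

# y_k^LS = y_k * (1 - alpha) + alpha / K
y_ls = y * (1.0 - alpha) + alpha / K
print(y_ls[0])                        # correct class gets 0.92, every other class 0.02

# The loss is then the usual cross-entropy, just against y_ls instead of y:
# loss = -np.sum(y_ls * np.log(p), axis=-1).mean()   # p: model probabilities as before
```

Most frameworks ship this out of the box; in Keras, for example, `tf.keras.losses.CategoricalCrossentropy` accepts a `label_smoothing` argument that performs exactly this rescaling of the targets.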

What happens when you apply label smoothing?

Remember what happens when we train a network with hard targets? As we discussed above, the logit of the correct class becomes much larger than any of the incorrect logits, and the incorrect logits end up very different from one another. Training a network with label smoothing helps to avoid these two problems. How?

  • It encourages the difference between the logit of the correct class and the logits of the incorrect classes to be a constant that depends on α.
  • It encourages the activations of the penultimate layer to be close to the template of the correct class and equally distant from the templates of the incorrect classes.
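A quick way to see the first point is to sweep a single “logit gap” between the correct class and the (tied) incorrect classes and find the gap that minimizes the cross-entropy, once with hard targets and once with smoothed ones. This is only an illustrative sketch, not an experiment from the paper:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def best_gap(alpha, K=5):
    # Smoothed target distribution: the true class gets 1 - alpha + alpha/K,
    # every incorrect class gets alpha/K
    target = np.full(K, alpha / K)
    target[0] = 1.0 - alpha + alpha / K
    # Logits: the correct class gets `g`, the K-1 incorrect classes get 0
    gaps = np.linspace(0.0, 20.0, 2001)
    losses = [-np.sum(target * np.log(softmax(np.r_[g, np.zeros(K - 1)])))
              for g in gaps]
    return gaps[int(np.argmin(losses))]

print(best_gap(alpha=0.0))   # hits the edge of the search range (20.0): hard
                             # targets keep pushing the gap toward infinity
print(best_gap(alpha=0.1))   # settles at a finite gap determined by alpha (~3.8)
```

With hard targets, the loss keeps improving as the gap grows, so training keeps pushing the correct logit away from the rest; with α = 0.1, the optimum settles at a finite gap.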

In order to prove this property, the authors proposed a visualization scheme that consists of the following steps:

  1. Pick three classes.
  2. Find an orthonormal basis of the plane crossing the templates of these three classes.
  3. Project the activations of the penultimate layer from these three classes onto this plane.
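Here is a rough sketch of steps 2 and 3, where `templates` and `acts` are hypothetical placeholders for the last-layer weight rows of the three chosen classes and the penultimate activations of examples from those classes:

```python
import numpy as np

# Hypothetical data: 3 class templates and 300 penultimate activations, dim 64.
# In practice, `templates` would come from the trained last-layer weight matrix
# and `acts` from a forward pass over examples of the three classes.
rng = np.random.default_rng(0)
templates = rng.normal(size=(3, 64))
acts = rng.normal(size=(300, 64))

# Step 2: the plane through the three templates is spanned by two difference
# vectors; a QR decomposition gives an orthonormal basis for that plane.
diffs = templates[1:] - templates[0]      # shape (2, 64)
basis, _ = np.linalg.qr(diffs.T)          # shape (64, 2), orthonormal columns

# Step 3: project the activations onto the plane to get 2-D coordinates
coords = (acts - templates[0]) @ basis    # shape (300, 2)
print(coords.shape)
```

The 2-D `coords` can then be scattered and colored by class to reproduce the kind of plots shown in the paper.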

So much jibber-jabber! Show me the results... sorry, the “visualizations”. Huh!

If you look at these visualizations carefully (Figure 1 in the paper), you will observe that:

  • When label smoothing is applied, the clusters are much tighter, because label smoothing encourages each example in the training set to be equidistant from the templates of all the incorrect classes.
  • With hard targets, the clusters for semantically similar classes (for example, different breeds of dogs in ImageNet) are isotropic, whereas with label smoothing the clusters lie in an arc, as shown in the third row. If you mix two semantically similar classes with a third, semantically different class, the clusters are still much better separated than the ones obtained with hard targets, as shown in the fourth row.

This makes sense and looks good, but I am looking at the paper and I can’t see a huge difference in accuracy when a network is trained with soft targets. Were you just wasting my “valuable” time by telling me this story?

Agreed, but we made this clear in the introduction itself: label smoothing helps to make the model robust so that it generalizes well and doesn’t overfit the training set. But is that always true? Before we discuss that, let us move on to another important benefit of label smoothing.

Implicit Model Calibration

It has already been shown, in an earlier paper on the calibration of modern neural networks, that modern networks are poorly calibrated and over-confident despite performing better than the better-calibrated models of the past. The Expected Calibration Error (ECE) was used to demonstrate this. To reduce the ECE, we generally use Temperature Scaling, a method in which the logits are scaled before applying the softmax.
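For reference, ECE bins the predictions by confidence and averages the gap between confidence and accuracy, while temperature scaling divides the logits by a single scalar T tuned on a held-out set. A minimal sketch with hypothetical, deliberately over-confident validation logits and labels:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ece(probs, labels, n_bins=15):
    # Expected Calibration Error: bin predictions by confidence and average
    # the |accuracy - confidence| gap, weighted by the fraction of samples per bin
    conf, pred = probs.max(axis=-1), probs.argmax(axis=-1)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            acc = (pred[mask] == labels[mask]).mean()
            err += mask.mean() * abs(acc - conf[mask].mean())
    return err

def nll(logits, labels, T):
    # Negative log-likelihood of the held-out set at temperature T
    p = softmax(logits / T)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

# Hypothetical (and deliberately over-confident) validation logits and labels
rng = np.random.default_rng(0)
val_logits = rng.normal(scale=5.0, size=(1000, 10))
val_labels = rng.integers(0, 10, size=1000)

# Temperature scaling: pick the single scalar T that minimizes validation NLL,
# then divide the logits by T before the softmax.
temps = np.linspace(0.5, 10.0, 96)
T = temps[int(np.argmin([nll(val_logits, val_labels, t) for t in temps]))]
print("ECE before:", ece(softmax(val_logits), val_labels))
print("ECE after: ", ece(softmax(val_logits / T), val_labels))
```

Note that dividing all the logits by the same T never changes the arg-max, so accuracy is untouched; only the confidence of the predictions is recalibrated.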

Now here comes the interesting part. If we apply label smoothing, we don’t need Temperature Scaling for calibration: models trained with label smoothing tend to be self-calibrated. Of course, you have to search for the optimal value of α, though a default value of 0.1 works very well in most cases.

Hmm, this is a good insight and makes sense (unlike you). The need for calibration can be very task-specific: we calibrate when calibration directly impacts the metric we are optimizing, and for image classification this isn’t the case most of the time. Can you give me an example where calibration plays an important role, and how label smoothing helps there?

That’s a good point. Consider language translation, where the network’s outputs are the inputs to a second algorithm, beam search, which is affected by calibration. Beam search approximates a maximum-likelihood sequence detection algorithm: it ranks candidate sequences by their accumulated token probabilities, so a properly calibrated model produces more reliable next-token probabilities and we expect improved performance. The authors verified this with machine translation experiments (the BLEU and calibration numbers are reported in the paper).
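To see where calibration enters, here is a toy sketch of the beam-search scoring mechanics only; the per-step distributions are fixed arrays rather than being conditioned on the prefix, as they would be in a real translation model:

```python
import numpy as np

def beam_search(step_logprobs, beam_width=3):
    # step_logprobs: (T, V) log-probabilities over a vocabulary of V tokens at
    # each of T steps. Candidate sequences are ranked by summed log-probability,
    # which is why miscalibrated (over-confident) probabilities distort the ranking.
    beams = [((), 0.0)]                       # (token sequence, total log-prob)
    for step in step_logprobs:
        candidates = [(seq + (v,), score + step[v])
                      for seq, score in beams
                      for v in range(len(step))]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]       # keep the best partial sequences
    return beams[0]                           # best full sequence and its score

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 8))              # 5 steps, vocabulary of 8 tokens
logprobs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
print(beam_search(logprobs))
```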

I am convinced! But I have a question for you. We have two things now: temperature scaling and label smoothing. Can I tune both? I expect that it would result in superior performance. Ha!

When we apply label smoothing, the model is automatically calibrated. In fact, the authors found that adding temperature scaling on top of it degrades both the calibration and the BLEU score.

Okay, this was an important point; I would have wasted days on that experiment. Everything you have explained so far is in favor of label smoothing, but there should be some catch as well, right? Tell me!

You are right. There is one specific case where label smoothing fails and actually performs worse than hard labels: Knowledge Distillation. The authors found that distillation produces a much worse student when the teacher has been trained with label smoothing (see the distillation results reported in the paper).

The reason is the erasure of the relative information between the logits when the teacher is trained with label smoothing. Because label smoothing encourages the examples of each class to lie in tight, equally separated clusters, as shown in the visualizations discussed above, every example of one class has very similar proximities to the templates of the other classes. This is not the case when the targets are hard. So a teacher trained with label smoothing may have better accuracy, but that doesn’t necessarily mean it will distill better.
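For context, the standard distillation objective (reference 3) trains the student to match the teacher’s temperature-softened distribution in addition to the hard labels; it is exactly this soft part that carries the relative information between the logits, and that label smoothing washes out. A minimal NumPy sketch, with hypothetical teacher and student logits and an assumed weighting `lam`:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, lam=0.9):
    # Soft part: cross-entropy between the teacher's and the student's
    # temperature-softened distributions (scaled by T^2 so the gradient
    # magnitude stays comparable across temperatures).
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = -np.sum(p_teacher * np.log(p_student), axis=-1).mean() * T ** 2
    # Hard part: the usual cross-entropy with the ground-truth labels
    p = softmax(student_logits)
    hard = -np.log(p[np.arange(len(labels)), labels]).mean()
    return lam * soft + (1.0 - lam) * hard

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(8, 10))   # hypothetical teacher logits
student_logits = rng.normal(size=(8, 10))   # hypothetical student logits
labels = rng.integers(0, 10, size=8)
print(distillation_loss(student_logits, teacher_logits, labels))
```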

Conclusion

Overall, this is a very good paper: it is well written and provides meaningful insights about label smoothing that weren’t explored earlier. Despite its positive effect on generalization and calibration, label smoothing can hurt distillation.

Implementing and experimenting is one thing, what really matters is whether you understand the why and when.

References

  1. “When Does Label Smoothing Help?” (Müller, Kornblith, and Hinton): https://arxiv.org/pdf/1906.02629v1.pdf
  2. “Rethinking the Inception Architecture for Computer Vision” (Szegedy et al.): https://arxiv.org/pdf/1512.00567.pdf
  3. “Distilling the Knowledge in a Neural Network” (Hinton, Vinyals, and Dean): https://arxiv.org/pdf/1503.02531.pdf
