Overfitting and probability calibration are two issues that arise when training deep learning models. Deep learning offers many regularization techniques to address overfitting; weight decay, early stopping and dropout are among the most popular. Platt scaling and isotonic regression, on the other hand, are used for model calibration. But is there one method that fights both overfitting and over-confidence?
Label smoothing is a regularization technique that perturbs the target variable to make the model less certain of its predictions. It is viewed as a regularization technique because it restrains the largest logits fed into the softmax function from becoming much bigger than the rest. Moreover, the resulting model is better calibrated as a side effect.
In this story, we define label smoothing, implement a cross-entropy loss function that uses this technique, and put it to the test. If you want to read more about model calibration, please refer to the story below.
The why, when and how of model calibration for classification tasks
Imagine that we have a multiclass classification problem. In such problems, the target variable is usually a one-hot vector, with 1 in the position of the correct class and 0s everywhere else. Label smoothing changes the target vector by a small amount ε. Thus, instead of asking our model to predict 1 for the right class, we ask it to predict roughly 1-ε for the correct class and to spread the remaining ε over the classes, so that the target still sums to 1. So, the cross-entropy loss function with label smoothing is transformed into the formula below.
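Written out with the symbols defined just below, and assuming the common formulation in which the smoothing mass ε is spread uniformly over all N classes (an assumption here, since only the symbol definitions are given), the loss can be stated as:

```latex
\[
\mathrm{loss} = (1-\varepsilon)\,\mathrm{ce}(i) + \varepsilon \sum_{j=1}^{N} \frac{\mathrm{ce}(j)}{N}
\]
```

Under this form the correct class i effectively receives a target of 1-ε+ε/N and every other class receives ε/N.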
In this formula, ce(x) denotes the standard cross-entropy loss of x, ε is a small positive number, i is the correct class and N is the number of classes.
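To make the definitions concrete, here is a small worked example in plain Python. It is only a sketch: the logits, the value of ε and the function names are hypothetical, and it assumes the smoothing mass is spread uniformly over all N classes. It checks that combining ce(i) with the average of ce(j) gives the same number as the cross-entropy computed against the smoothed target vector.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_targets(correct, n_classes, eps):
    # Mix the one-hot target with a uniform distribution over all classes.
    return [(1 - eps) * (1.0 if j == correct else 0.0) + eps / n_classes
            for j in range(n_classes)]

def label_smoothing_loss(logits, correct, eps):
    # ce(j) = -log p(j); combine the true-class term with the uniform average.
    ce = [-math.log(p) for p in softmax(logits)]
    n = len(logits)
    return (1 - eps) * ce[correct] + eps * sum(ce) / n

logits = [2.0, 0.5, -1.0]  # hypothetical scores for a 3-class problem
targets = smoothed_targets(correct=0, n_classes=3, eps=0.1)
loss = label_smoothing_loss(logits, correct=0, eps=0.1)

# Cross-entropy against the smoothed targets gives the same number.
ce = [-math.log(p) for p in softmax(logits)]
direct = sum(t * c for t, c in zip(targets, ce))
print(targets, loss, direct)
```

The two quantities agree because the smoothed target is just a weighted mix of the one-hot vector and the uniform distribution.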
Intuitively, label smoothing restrains the logit value for the correct class, keeping it closer to the logit values for the other classes. In this way, it acts both as a regularization technique and as a method to fight model over-confidence.
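One way to see this restraining effect, sketched under the assumption of uniform smoothing and a single example with unconstrained logits: the loss is minimized when the softmax output equals the smoothed target, so the gap between the correct-class logit and any other logit settles at a finite value instead of growing without bound. The function name below is illustrative.

```python
import math

def optimal_logit_gap(n_classes, eps):
    # At the minimum, softmax(logits) equals the smoothed target, so the
    # logit gap is the log-ratio of the two target probabilities.
    p_correct = (1 - eps) + eps / n_classes
    p_other = eps / n_classes
    return math.log(p_correct / p_other)

for eps in (0.2, 0.1, 0.01):
    print(eps, round(optimal_logit_gap(n_classes=10, eps=eps), 3))
```

As ε shrinks toward zero the gap grows, which recovers the behavior of hard one-hot targets, where the loss keeps rewarding ever larger logits for the correct class.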
The implementation of a label smoothing cross-entropy loss function in PyTorch is pretty straightforward. First, let us use a helper function that computes a linear combination between two values:
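A minimal sketch of such a helper follows. The name `linear_combination` and the argument order are assumptions made for illustration; the function works on plain numbers and, because it only uses arithmetic operators, on tensors as well.

```python
def linear_combination(x, y, epsilon):
    # Weight x by epsilon and y by (1 - epsilon).
    return epsilon * x + (1 - epsilon) * y

print(linear_combination(2.0, 4.0, 0.25))
```

With `epsilon` set to 0 the helper returns `y` unchanged, which corresponds to the plain cross-entropy loss once it is used inside the loss function.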
Next, we implement the new loss function as a PyTorch module:
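The following is one possible sketch of such a loss, not necessarily the exact code used for the results reported here. The class name `LabelSmoothingCrossEntropy`, the default `epsilon=0.1` and the `reduction` argument are assumptions, and uniform smoothing over all classes is assumed. The helper is repeated so that the block runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_combination(x, y, epsilon):
    # Helper repeated here so this block runs on its own.
    return epsilon * x + (1 - epsilon) * y

class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, epsilon=0.1, reduction="mean"):
        super().__init__()
        self.epsilon = epsilon
        self.reduction = reduction

    def _reduce(self, loss):
        # Aggregate per-example losses according to the chosen reduction.
        if self.reduction == "mean":
            return loss.mean()
        if self.reduction == "sum":
            return loss.sum()
        return loss

    def forward(self, preds, target):
        n_classes = preds.size(-1)
        log_preds = F.log_softmax(preds, dim=-1)
        # Uniform term: average of ce(j) = -log p(j) over all classes.
        uniform = self._reduce(-log_preds.sum(dim=-1) / n_classes)
        # Standard term: ce(i) for the correct class i.
        nll = F.nll_loss(log_preds, target, reduction=self.reduction)
        return linear_combination(uniform, nll, self.epsilon)

torch.manual_seed(0)
logits = torch.randn(4, 3)           # hypothetical batch: 4 examples, 3 classes
labels = torch.tensor([0, 2, 1, 0])  # hypothetical class indices
criterion = LabelSmoothingCrossEntropy(epsilon=0.1)
loss = criterion(logits, labels)
print(loss.item())
```

Recent PyTorch releases also accept a `label_smoothing` argument on `nn.CrossEntropyLoss`, which may be the simpler option when it is available.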
We can now drop this class into our code as is. For this example, we use the standard fast.ai pets example.
We transform the data into a format ready to be used by the model, choose a ResNet architecture and aim to optimize the label smoothing cross-entropy loss. After four epochs the results are summarized below.
We get an error rate of 7.5%, which is more than acceptable for ten or so lines of code, where, for the most part, we use the default settings.
There are many things that we could tweak to make our model perform better: different optimizers, hyper-parameters, model architectures, etc. For instance, you can read how to take the ResNet architecture a bit further in the story below.
xResNet From Scratch in Pytorch
Squeeze a little extra from your ResNet architecture.
In this story, we saw what label smoothing is, when to use it and how to implement it in PyTorch. We then trained a state-of-the-art computer vision model to recognize different breeds of cats and dogs in ten lines of code.
Model regularization and calibration are two important concepts. Having a better understanding of the tools that combat variance and over-confidence will make you a better deep learning practitioner.
My name is Dimitris Poulopoulos and I’m a machine learning researcher at BigDataStack and PhD(c) at the University of Piraeus, Greece. I have worked on designing and implementing AI and software solutions for major clients such as the European Commission, Eurostat, IMF, the European Central Bank, OECD, and IKEA. If you are interested in reading more posts about Machine Learning, Deep Learning and Data Science, follow me on Medium, LinkedIn or @james2pl on Twitter.