Mixup Training Procedure

moodayday™
AI³ | Theory, Practice, Business
2 min read · Sep 9, 2019

Today, let’s brush up on one of the most influential ideas published in the field of AI in recent history (April 2018). The revolutionary concept introduced in the mixup paper is that of mixup as a regularization technique.

To get there, we’ll start from the basics and work our way up to a clear view of the superior performance that mixup boasts.

Empirical Risk Minimization

What is this thing again?

The core idea is that we cannot know exactly how well an algorithm will work in practice (the true “risk”) because we don’t know the true distribution of data that the algorithm will work on, but we can instead measure its performance on a known set of training data (the “empirical” risk).

This makes sense. We can’t tell how well our model will do in production before it actually goes into production.
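To make the idea concrete, here is a minimal sketch of empirical risk as just the average loss over a finite training set. The `empirical_risk` function, the toy linear "model", and the squared-error loss are all hypothetical names chosen for illustration, not from the paper.

```python
import numpy as np

# Empirical risk: the average loss over the training set we *do* have,
# standing in for the true risk over the data distribution we *don't*.
def empirical_risk(model, loss_fn, inputs, targets):
    losses = [loss_fn(model(x), y) for x, y in zip(inputs, targets)]
    return float(np.mean(losses))

# Toy usage with a hypothetical linear "model" and squared-error loss.
model = lambda x: 2.0 * x
loss_fn = lambda pred, y: (pred - y) ** 2
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 7.0])
print(empirical_risk(model, loss_fn, xs, ys))  # mean of [0, 0, 1] = 1/3
```

Minimizing this quantity over the model’s parameters is exactly what “empirical risk minimization” means: we optimize the proxy we can measure.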

Now the problem is that performance on the training data can be a poor predictor of performance in production. That’s where mixup comes in: it helps us reduce overfitting.

What is overfitting?

Overfitting is simply the term used for when a trained model gets way too comfortable handling the data it was trained on. Clearly, that increases the chance that such a model will perform poorly on other data in production.

Regularization technique?

These are just the techniques used to reduce overfitting.

What is mixup?

In essence, as the paper’s abstract puts it: mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favor simple linear behavior in between training examples. The authors’ experiments on the ImageNet-2012, CIFAR-10, CIFAR-100, Google commands and UCI datasets show that mixup improves the generalization of state-of-the-art neural network architectures. They also find that mixup reduces the memorization of corrupt labels, increases the robustness to adversarial examples, and stabilizes the training of generative adversarial networks.

You can read more in the very interesting paper that introduced this powerful concept.

Let’s say we’re on CIFAR10 for instance, then instead of feeding the model the raw images, we take two (which could be in the same class or not) and do a linear combination of them: in terms of tensors, it’s

new_image = t * image1 + (1-t) * image2

where t is a float between 0 and 1. Then the target we assign to that image is the same combination of the original targets:

new_target = t * target1 + (1-t) * target2

assuming your targets are one-hot encoded (which usually isn’t the case in PyTorch).
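The two formulas above can be sketched end to end in a few lines. This is a minimal NumPy illustration, not the paper’s reference implementation; the function name `mixup` and the one-hot toy data are my own, and per the paper the mixing weight t is drawn from a Beta(alpha, alpha) distribution (alpha = 0.4 here is just an illustrative choice).

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(image1, target1, image2, target2, alpha=0.4):
    # The paper samples the mixing weight from Beta(alpha, alpha).
    t = rng.beta(alpha, alpha)
    # Same linear combination for the inputs and for the (one-hot) targets.
    new_image = t * image1 + (1 - t) * image2
    new_target = t * target1 + (1 - t) * target2
    return new_image, new_target

# Toy usage: two 2x2 "images" with one-hot labels for a 3-class problem.
img1, img2 = np.ones((2, 2)), np.zeros((2, 2))
y1 = np.array([1.0, 0.0, 0.0])
y2 = np.array([0.0, 1.0, 0.0])
mixed_img, mixed_y = mixup(img1, y1, img2, y2)
print(mixed_y)  # soft label: weights on classes 0 and 1 sum to 1
```

Because t and (1 - t) sum to 1, the mixed target is still a valid probability distribution over classes, which is why one-hot (or otherwise dense) targets are required.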

And that’s as simple as this!
