[Paper] Mixup: Beyond Empirical Risk Minimization (Image Classification)

Outperforms ERM Variants Using Networks DenseNet, ResNeXt, Pre-Activation ResNet, WRN, & ResNet

Sik-Ho Tsang

Published in The Startup, Nov 21, 2020

**mixup** (Image from https://blog.airlab.re.kr/2019/11/mixup)

In this story, mixup: Beyond Empirical Risk Minimization, by MIT and FAIR, is briefly reviewed. In this paper:

mixup trains a neural network on convex combinations of pairs of examples and their labels.
By doing so, mixup regularizes the neural network to favor simple linear behavior in-between training examples.

This is a 2018 ICLR paper with over 1000 citations. (Sik-Ho Tsang @ Medium)

Outline

1. Empirical Risk Minimization (ERM)
2. mixup
3. Experimental Results

1. Empirical Risk Minimization (ERM)

In supervised learning, we are interested in finding a function f that describes the relationship between a random feature vector X and a random target vector Y, which follow the joint distribution P(X, Y).
A loss function is defined that penalizes the differences between predictions f(x) and actual targets y.
Then, the average of the loss function is minimized over the data distribution P, also known as the expected risk:
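
In the paper's notation, with ℓ denoting the loss function (the symbol ℓ is not named in the text above), the expected risk can be written as:

```latex
R(f) = \int \ell\big(f(x), y\big) \, \mathrm{d}P(x, y)
```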

Unfortunately, the distribution P is unknown in most practical situations.
Using the training data D, we may approximate P by the empirical distribution:
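
In the paper's notation, with training data D = {(x_i, y_i)} for i = 1, …, n, and δ(x = x_i, y = y_i) a Dirac mass centered at (x_i, y_i), the empirical distribution can be written as:

```latex
P_\delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta(x = x_i, \, y = y_i)
```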

Using the empirical distribution Pδ, we can now approximate the expected risk by the empirical risk:
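
In the paper's notation, with ℓ the loss function and the empirical distribution placing mass 1/n on each of the n training pairs (x_i, y_i), the integral collapses to an average over the training set:

```latex
R_\delta(f) = \int \ell\big(f(x), y\big) \, \mathrm{d}P_\delta(x, y)
            = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)
```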

Learning the function f by minimizing the empirical risk is known as the Empirical Risk Minimization (ERM) principle.

While efficient to compute, the empirical risk monitors the behaviour of f only at a finite set of n examples.

2. mixup

The contribution of this paper is a generic vicinal distribution (a distribution over virtual examples in the vicinity of each training example), called mixup:
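
In the paper's notation, with (x̃, ỹ) denoting a virtual feature-target pair and δ a Dirac mass, the distribution reads:

```latex
\mu(\tilde{x}, \tilde{y} \mid x_i, y_i)
  = \frac{1}{n} \sum_{j=1}^{n} \mathbb{E}_{\lambda}
    \Big[ \delta\big( \tilde{x} = \lambda \cdot x_i + (1 - \lambda) \cdot x_j, \;
                      \tilde{y} = \lambda \cdot y_i + (1 - \lambda) \cdot y_j \big) \Big]
```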

where λ~Beta(α,α), for α ∈ (0,∞).
Sampling from the mixup vicinal distribution produces virtual feature-target vectors:
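
In the paper's notation, the virtual pair (x̃, ỹ) is the same convex combination applied to both the inputs and the targets:

```latex
\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad
\tilde{y} = \lambda y_i + (1 - \lambda) y_j
```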

where (xi, yi) and (xj, yj) are two feature-target vectors drawn at random from the training data, and λ ∈ [0, 1]. The mixup hyper-parameter α controls the strength of interpolation between feature-target pairs, recovering the ERM principle as α → 0.
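
To make the sampling step concrete, here is a minimal sketch in plain Python of drawing one virtual example from two training pairs. The function name `mixup_pair`, the toy vectors, and the use of one-hot label lists are assumptions made for illustration; they are not taken from the paper's code.

```python
import random

def mixup_pair(x_i, y_i, x_j, y_j, alpha, rng=random):
    """Build one virtual example from two (feature, one-hot label) pairs.

    Hypothetical helper for illustration; vectors are plain lists of floats.
    """
    lam = rng.betavariate(alpha, alpha)  # lambda ~ Beta(alpha, alpha), in [0, 1]
    # The same lambda is used for the features and for the labels.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam

rng = random.Random(0)  # seeded only to make the sketch repeatable
x_tilde, y_tilde, lam = mixup_pair(
    [1.0, 0.0, 2.0], [1.0, 0.0],  # example i, class 0 (toy values)
    [3.0, 4.0, 0.0], [0.0, 1.0],  # example j, class 1 (toy values)
    alpha=0.4, rng=rng,
)
print(lam, x_tilde, y_tilde)
```

Because the labels are mixed with the same λ as the inputs, the virtual target is a soft label whose entries still sum to 1.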

Figure (from the paper). Left: the few lines of code necessary to implement mixup. Right: mixup leads to decision boundaries that transition linearly from class to class, providing a smoother estimate of uncertainty.

The paper reports several findings about mixup:

First, in preliminary experiments, the authors find that convex combinations of three or more examples, with weights sampled from a Dirichlet distribution, do not provide further gain but increase the computation cost of mixup.
Second, their implementation uses a single data loader to obtain one minibatch, and mixup is then applied between that minibatch and a randomly shuffled copy of itself. This strategy is found to work equally well while reducing I/O requirements.
Third, interpolating only between inputs with equal label did not lead to the performance gains of mixup reported in the experiments.
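
The single-minibatch strategy from the second finding can be sketched as follows. This is a plain-Python illustration, not the authors' implementation: the names `mixup_batch` and `_mix` are hypothetical, the toy batch is made up, and drawing a single λ per minibatch is an assumption of this sketch.

```python
import random

def _mix(u, v, lam):
    """Convex combination lam * u + (1 - lam) * v of two equal-length lists."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(u, v)]

def mixup_batch(xs, ys, alpha, rng=random):
    """Mix a minibatch with a randomly shuffled copy of itself.

    xs: list of feature vectors; ys: list of one-hot label vectors.
    Assumption of this sketch: one lambda is drawn for the whole minibatch.
    """
    lam = rng.betavariate(alpha, alpha)
    perm = list(range(len(xs)))
    rng.shuffle(perm)  # perm[i] is the partner index of example i
    mixed_x = [_mix(xs[i], xs[j], lam) for i, j in enumerate(perm)]
    mixed_y = [_mix(ys[i], ys[j], lam) for i, j in enumerate(perm)]
    return mixed_x, mixed_y, lam, perm

rng = random.Random(0)  # seeded only to make the sketch repeatable
xs = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]  # toy features
ys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # toy one-hot labels
mixed_x, mixed_y, lam, perm = mixup_batch(xs, ys, alpha=0.4, rng=rng)
print(perm, lam)
```

Only one batch has to be loaded per step, which is where the I/O saving comes from. An example can occasionally be paired with itself by the shuffle, in which case it passes through unchanged.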

3. Experimental Results

3.1. ImageNet

Standard data augmentation practices are used: scale and aspect ratio distortions, random crops, and horizontal flips.
For mixup, the authors find that α ∈ [0.1, 0.4] leads to improved performance over ERM, whereas for large α, mixup leads to underfitting.
Models trained with mixup consistently outperform their ERM counterparts across different networks: ResNet and ResNeXt.
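
One way to build intuition for the sensitivity to α is to look at how λ ~ Beta(α, α) is distributed. The sketch below uses only the Python standard library; the helper name `frac_near_endpoints` and the specific values 0.2 and 8.0 are chosen here for illustration and are not from the paper. For small α most draws of λ fall close to 0 or 1, so most virtual examples stay close to a real training example. For large α, λ concentrates around 0.5 and the examples are heavily blended, which may be related to the underfitting reported for large α.

```python
import random

def frac_near_endpoints(alpha, n=10000, margin=0.1, seed=0):
    """Fraction of lambda ~ Beta(alpha, alpha) draws within `margin` of 0 or 1.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    draws = (rng.betavariate(alpha, alpha) for _ in range(n))
    return sum(1 for lam in draws if min(lam, 1.0 - lam) < margin) / n

small_alpha = frac_near_endpoints(0.2)  # U-shaped Beta: mass near 0 and 1
large_alpha = frac_near_endpoints(8.0)  # bell-shaped Beta: mass near 0.5
print(small_alpha, large_alpha)
```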

3.2. CIFAR

Across Pre-Activation ResNet, WRN, and DenseNet, the models trained using mixup significantly outperform their counterparts trained with ERM.
Also, mixup and ERM converge at a similar speed to their best test errors.

The paper also reports that mixup improves performance on speech data, reduces the memorization of corrupt labels, increases the robustness to adversarial examples, and stabilizes the training of generative adversarial networks. If interested, please feel free to read the paper.