# [Paper] Mixup: Beyond Empirical Risk Minimization (Image Classification)

## Outperforms ERM Variants Using DenseNet, ResNeXt, Pre-Activation ResNet, WRN, & ResNet

In this story, mixup: Beyond Empirical Risk Minimization, by MIT and FAIR, is briefly presented. In this paper:

• mixup trains a neural network on convex combinations of pairs of examples and their labels.
• By doing so, mixup regularizes the neural network to favor simple linear behavior in-between training examples.

This is a 2018 ICLR paper with over 1000 citations. (Sik-Ho Tsang @ Medium)

# Outline

1. Empirical Risk Minimization (ERM)
2. mixup
3. Experimental Results

# 1. Empirical Risk Minimization (ERM)

• In supervised learning, we are interested in finding a function f that describes the relationship between a random feature vector X and a random target vector Y, which follow the joint distribution P(X, Y).
• A loss function ℓ is defined that penalizes the differences between predictions f(x) and actual targets y.
• Then, the average of the loss function is minimized over the data distribution P, also known as the expected risk (written out after this list).
• Unfortunately, the distribution P is unknown in most practical situations.
• Using the training data D, we may approximate P by the empirical distribution.
• Using the empirical distribution Pδ, we can now approximate the expected risk by the empirical risk.
• Learning the function f by minimizing the empirical risk is known as the Empirical Risk Minimization (ERM) principle.
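
Following the standard definitions (consistent with the notation above), the expected risk, the empirical distribution, and the empirical risk can be written as:

```latex
% Expected risk over the true joint distribution P(X, Y)
R(f) = \int \ell(f(x), y) \, dP(x, y)

% Empirical distribution built from the training data D = \{(x_i, y_i)\}_{i=1}^{n},
% where \delta is a Dirac mass centred at (x_i, y_i)
P_\delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta(x = x_i, y = y_i)

% Empirical risk obtained by plugging P_\delta into the expected risk
R_\delta(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)
```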

While efficient to compute, the empirical risk monitors the behaviour of f only at a finite set of n examples.

# 2. mixup

• The contribution of this paper is to propose a generic vicinal distribution, called mixup: μ(x̃, ỹ | xi, yi) = (1/n) Σj Eλ[δ(x̃ = λ·xi + (1−λ)·xj, ỹ = λ·yi + (1−λ)·yj)],
• where λ ~ Beta(α, α), for α ∈ (0, ∞).
• Sampling from the mixup vicinal distribution produces virtual feature-target vectors: x̃ = λxi + (1−λ)xj and ỹ = λyi + (1−λ)yj,
• where (xi, yi) and (xj, yj) are two feature-target vectors drawn at random from the training data, and λ ∈ [0, 1]. The mixup hyper-parameter α controls the strength of interpolation between feature-target pairs, recovering the ERM principle as α → 0.
• The paper shows that mixup can be implemented in only a few lines of code (a sketch is given after this list).
• The paper also shows that mixup leads to decision boundaries that transition linearly from class to class, providing a smoother estimate of uncertainty.
• There are several findings for mixup:
1. First, in preliminary experiments, the authors find that convex combinations of three or more examples, with weights sampled from a Dirichlet distribution, do not provide further gain, but increase the computation cost of mixup.
2. Second, their implementation uses a single data loader to obtain one minibatch, and then applies mixup to the same minibatch after random shuffling. They found this strategy works equally well, while reducing I/O requirements.
3. Third, interpolating only between inputs with equal labels did not lead to the performance gains of mixup discussed later.
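
Below is a minimal sketch of a mixup training step in the spirit of the few-line PyTorch implementation shown in the paper; the `model`, `optimizer`, and minibatch names are placeholders, not the authors' exact code.

```python
# Minimal mixup training step (PyTorch-style sketch).
# `model`, `optimizer`, `x`, and `y` are assumed to exist;
# `alpha` is the mixup hyper-parameter (e.g. 0.2).
import numpy as np
import torch
import torch.nn.functional as F

def mixup_step(model, optimizer, x, y, alpha=0.2):
    # Sample the mixing coefficient lambda ~ Beta(alpha, alpha).
    lam = np.random.beta(alpha, alpha) if alpha > 0 else 1.0

    # Shuffle the same minibatch to obtain the second set of examples (x_j, y_j).
    index = torch.randperm(x.size(0), device=x.device)
    mixed_x = lam * x + (1.0 - lam) * x[index]

    # Loss on mixed inputs: lam * loss(y_i) + (1 - lam) * loss(y_j),
    # which equals the cross-entropy against the mixed one-hot labels.
    outputs = model(mixed_x)
    loss = lam * F.cross_entropy(outputs, y) + (1.0 - lam) * F.cross_entropy(outputs, y[index])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```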

# 3. Experimental Results

## 3.1. ImageNet

• Standard data augmentation practices are used: scale and aspect ratio distortions, random crops, and horizontal flips (a sketch of such a pipeline is given after this list).
• For mixup, the authors find that α ∈ [0.1, 0.4] leads to improved performance over ERM, whereas a large α leads to underfitting.
• mixup consistently outperforms the corresponding ERM baselines across different networks: ResNet and ResNeXt.
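
A possible torchvision version of the standard ImageNet augmentation described above (scale/aspect-ratio distortion, random crop, horizontal flip); the exact parameters used in the paper may differ:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),   # random crop with scale and aspect-ratio distortion
    transforms.RandomHorizontalFlip(),   # random horizontal flip
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```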