FAST AI JOURNEY: COURSE V3. PART 2. LESSON 14.1.

Documenting my fast.ai journey: PAPER REVIEW. MIXUP: BEYOND EMPIRICAL RISK MINIMIZATION.

SUREN HARUTYUNYAN
6 min read · May 9, 2019

For the Lesson 14 Project, I decided to dive into the 2018 paper mixup: Beyond Empirical Risk Minimization, by Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz, which was published as a conference paper at ICLR 2018. In particular, we will focus on Section 2 of the paper. The authors have made the source code needed to replicate their CIFAR-10 experiments available here.

Our objective here is to understand Section 2, named From Empirical Risk Minimization to mixup, and describe it using the concepts we have learned so far.

1. From Empirical Risk Minimization To Mixup.

The authors start by stating that in supervised learning we want to find a function f ∈ F that captures the relationship between a random feature vector X and a random target vector Y, which follow the joint distribution P(X, Y).

So we need a loss function ℓ that penalizes the difference between the prediction f(x) and the actual target y, for examples (x, y) ∼ P.

Next, we want to minimize the average of the loss function ℓ over the distribution P, a quantity known as the expected risk, which has the following form:

Expected Risk: R(f) = ∫ ℓ(f(x), y) dP(x, y). Source: https://arxiv.org/pdf/1710.09412.pdf.

Since in practice we do not know the distribution P, we approximate it with the empirical distribution computed from our training data D.

Recall that our training data D has the following form:

Training Data: D = {(x_i, y_i)}, i = 1, …, n, with (x_i, y_i) ∼ P. Source: https://arxiv.org/pdf/1710.09412.pdf.

And the empirical distribution takes the following form:

Empirical Distribution: P_δ(x, y) = (1/n) Σ_i δ(x = x_i, y = y_i). Source: https://arxiv.org/pdf/1710.09412.pdf.

where δ(x = x_i, y = y_i) is a Dirac mass centered at the point (x_i, y_i). If you want more information on the Dirac Delta and the Dirac Measure, follow this link and this one, respectively.

Using the empirical distribution P_δ, we can approximate the expected risk by the empirical risk, which has the following form:

Equation 1. Empirical Risk: R_δ(f) = ∫ ℓ(f(x), y) dP_δ(x, y) = (1/n) Σ_i ℓ(f(x_i), y_i). Source: https://arxiv.org/pdf/1710.09412.pdf.

Learning the function f by minimizing the empirical risk is known as the Empirical Risk Minimization (ERM) principle.
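
To make Equation 1 concrete, here is a minimal PyTorch sketch (my own illustration, not code from the paper) that computes the empirical risk of a classifier, using cross-entropy as the loss ℓ; minimizing this quantity over the model's weights is exactly the ERM principle:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def empirical_risk(model, X, Y):
    """Empirical risk (Equation 1): the average of the loss l(f(x_i), y_i)
    over the n training examples, here with cross-entropy as l."""
    logits = model(X)                      # f(x_i) for every training example
    return F.cross_entropy(logits, Y)      # mean over i = 1, ..., n

# Tiny illustration with made-up data and a stand-in model:
model = nn.Linear(20, 5)
X, Y = torch.randn(100, 20), torch.randint(0, 5, (100,))
risk = empirical_risk(model, X, Y)         # minimizing this over the weights is ERM
```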

ERM is efficient to compute, but note that it only monitors the behaviour of f at a finite set of n examples.

As you may have guessed, this is not enough when we consider functions with a large number of parameters, such as neural networks: one trivial way to minimize the empirical risk in these cases is simply to memorize the training data, which leads to undesirable behaviour of the function f outside the training data.

Having said this, we must recall that the empirical distribution P_δ is only one of the many choices available to us for approximating the true distribution P.

A principle called Vicinal Risk Minimization (VRM), introduced by Chapelle et al. in 2000, approximates the distribution P as follows:

Vicinal Distribution: P_ν(x̃, ỹ) = (1/n) Σ_i ν(x̃, ỹ | x_i, y_i). Source: https://arxiv.org/pdf/1710.09412.pdf.

where ν is the vicinity distribution, which measures the probability of finding the virtual feature-target pair:

Virtual Feature-Target Pair: (x̃, ỹ). Source: https://arxiv.org/pdf/1710.09412.pdf.

in the vicinity of the training feature-target pair:

Training Feature-Target Pair: (x_i, y_i). Source: https://arxiv.org/pdf/1710.09412.pdf.

In the original paper, Chapelle et al. used what are known as Gaussian vicinities, which take the form:

Gaussian Vicinity: ν(x̃, ỹ | x_i, y_i) = N(x̃ − x_i, σ²) δ(ỹ = y_i). Source: https://arxiv.org/pdf/1710.09412.pdf.

which is equivalent to augmenting the training data with additive Gaussian noise. To apply VRM, we sample from the vicinal distribution to construct a dataset D_ν of the form:

Sampled Vicinal Dataset: D_ν := {(x̃_i, ỹ_i)}, i = 1, …, m. Source: https://arxiv.org/pdf/1710.09412.pdf.

After building this dataset, we minimize the empirical vicinal risk, which has the following form:

Empirical Vicinal Risk: R_ν(f) = (1/m) Σ_i ℓ(f(x̃_i), ỹ_i). Source: https://arxiv.org/pdf/1710.09412.pdf.
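
To illustrate the Gaussian case (again my own sketch, not the paper's code; the noise level σ is an assumed hyper-parameter), sampling from the Gaussian vicinity simply adds Gaussian noise to the inputs while keeping the labels, and the empirical vicinal risk is then the average loss over those virtual examples:

```python
import torch
import torch.nn.functional as F

def sample_gaussian_vicinity(X, Y, sigma=0.1):
    """Draw one virtual example per training example from the Gaussian vicinity:
    the input is perturbed with additive Gaussian noise, the label is kept."""
    X_tilde = X + sigma * torch.randn_like(X)
    return X_tilde, Y

def empirical_vicinal_risk(model, X_tilde, Y_tilde):
    """Average loss over the m virtual examples, with cross-entropy as l."""
    return F.cross_entropy(model(X_tilde), Y_tilde)
```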

In this paper, the authors propose a generic vicinal distribution, which they call mixup, and which takes the following form:

mixup Vicinal Distribution: μ(x̃, ỹ | x_i, y_i) = (1/n) Σ_j E_λ [ δ(x̃ = λ·x_i + (1−λ)·x_j, ỹ = λ·y_i + (1−λ)·y_j) ]. Source: https://arxiv.org/pdf/1710.09412.pdf.

where λ ∼ Beta(α, α), for α ∈ (0, ∞), i.e. λ is drawn from a Beta distribution. For more information on the Beta distribution, follow this link.

As we can see, sampling from the mixup vicinal distribution produces virtual feature-target vectors of the following form:

Virtual Feature-Target Vectors: x̃ = λ·x_i + (1−λ)·x_j and ỹ = λ·y_i + (1−λ)·y_j. Source: https://arxiv.org/pdf/1710.09412.pdf.

Here (x_i, y_i) and (x_j, y_j) are two feature-target pairs drawn at random from our training data, and λ ∈ [0, 1], i.e., it takes values between 0 and 1.
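
Putting the last two equations into code, here is a minimal sketch (my own, assuming one-hot or soft label vectors so that the targets can be interpolated, and an assumed value of α = 0.4) of how a single virtual pair could be constructed:

```python
import torch

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.4):
    """Build one virtual feature-target pair from two training pairs.
    y_i and y_j are assumed to be one-hot (or soft) label vectors so that
    they can be interpolated; alpha = 0.4 is an assumed value."""
    lam = torch.distributions.Beta(alpha, alpha).sample()   # lambda ~ Beta(alpha, alpha)
    x_tilde = lam * x_i + (1.0 - lam) * x_j
    y_tilde = lam * y_i + (1.0 - lam) * y_j
    return x_tilde, y_tilde
```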

Finally, the authors state that the hyper-parameter α is a way of controlling:

[…] the strength of interpolation between feature-target pairs, recovering the ERM principle as α → 0.

In Figure 1, we can see an implementation of mixup in PyTorch.

Figure 1. Source: https://arxiv.org/pdf/1710.09412.pdf.
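
Since the figure image does not render here, below is a self-contained sketch in the spirit of the paper's Figure 1a, rewritten for current PyTorch, so the exact names and the toy model are my own assumptions: λ is drawn from Beta(α, α), two minibatches are mixed, and the network is trained on the mixed inputs and one-hot targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy setup, assumed for illustration only: a tiny classifier and random data.
torch.manual_seed(0)
net = nn.Linear(20, 5)                                   # stand-in for a real network
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
alpha = 0.4                                              # assumed mixup hyper-parameter

# Two minibatches, as if drawn from two data loaders; labels are one-hot.
x1, x2 = torch.randn(32, 20), torch.randn(32, 20)
y1 = F.one_hot(torch.randint(0, 5, (32,)), num_classes=5).float()
y2 = F.one_hot(torch.randint(0, 5, (32,)), num_classes=5).float()

# One mixup training step.
lam = torch.distributions.Beta(alpha, alpha).sample()    # lambda ~ Beta(alpha, alpha)
x = lam * x1 + (1 - lam) * x2                            # mix the inputs
y = lam * y1 + (1 - lam) * y2                            # mix the one-hot targets

optimizer.zero_grad()
log_probs = F.log_softmax(net(x), dim=1)
loss = -(y * log_probs).sum(dim=1).mean()                # cross-entropy with soft targets
loss.backward()
optimizer.step()
```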

The authors also discuss some alternative design choices:

  1. Convex combinations of three or more examples, with weights sampled from a Dirichlet distribution, do not provide further gains; they only increase the computational cost of mixup. For more information on the Dirichlet distribution, follow this link.
  2. Their implementation uses a single data loader to obtain one minibatch, and then applies mixup to that minibatch and a randomly shuffled copy of it. The authors state that this strategy works equally well, while reducing the I/O requirements (see the sketch after this list).
  3. Interpolating only between inputs with the same label does not lead to the performance gains of mixup. More empirical comparisons can be found in Section 3.8 of the paper.
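
Here is a minimal sketch of the single-loader variant mentioned in point 2 (my own illustration with assumed names, close in spirit to the authors' released CIFAR-10 code rather than a copy of it): the minibatch is mixed with a randomly permuted copy of itself, and because the labels stay hard, the loss is taken against both label vectors weighted by λ, which is equivalent to cross-entropy against the mixed one-hot targets.

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_minibatch(x, y, alpha=1.0):
    """Mix a minibatch with a randomly shuffled copy of itself
    (single data loader variant; alpha=1.0 is an assumed default)."""
    lam = float(np.random.beta(alpha, alpha))
    index = torch.randperm(x.size(0))            # random shuffle of the batch
    mixed_x = lam * x + (1 - lam) * x[index]
    return mixed_x, y, y[index], lam

def mixup_criterion(logits, y_a, y_b, lam):
    """Cross-entropy against both sets of hard labels, weighted by lambda."""
    return lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)
```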

1.1. What Is Mixup Doing?

The authors interpret the mixup vicinal distribution as a data augmentation technique, which:

[…] encourages the model (f) to behave linearly in between training examples.

The authors state that this linear behaviour is what reduces the amount of unwanted oscillation when predicting outside the training examples. Concretely, behaving linearly in between training examples means that f(λ·x_i + (1−λ)·x_j) ≈ λ·f(x_i) + (1−λ)·f(x_j) for λ ∈ [0, 1].

They also argue that, since linearity is one of the simplest possible behaviours, it is a good inductive bias from the perspective of Occam's razor.

In Figure 1b, we can observe that with mixup the decision boundary transitions linearly from one class to another, which provides a smoother estimate of uncertainty.

If we look at Figure 2, we can observe the behaviour of two neural networks trained on the CIFAR-10 dataset, one using ERM and the other mixup. Both models have the same architecture, were trained with the same procedure, and are evaluated at the same randomly sampled points in between training examples.

Figure 2. Source: https://arxiv.org/pdf/1710.09412.pdf.

We can clearly see that the model trained with mixup is more stable in terms of both predictions and gradient norms in between training samples.
