Deep Learning Through the Lens of Example Difficulty — Paper Summary

Gowthami Somepalli
ML Summaries
Feb 13, 2022


Link: https://openreview.net/forum?id=fmgYOUahK9
Authors: Robert John Nicholas Baldock, Hartmut Maennel, Behnam Neyshabur
Tags: Deep Learning, Example Difficulty, Dataset Difficulty, Curriculum learning
Code: —
Video: https://papertalk.org/papertalks/37152
Misc. info: Accepted to NeurIPS’21

What?

In this paper, the authors try to understand the “difficult” examples in a training set. Difficult here means: given an example, how much model capacity is needed to fit it to a given label? The authors also try to understand what makes a given example difficult. Is the example ambiguous? Does it look like a different class? And so on.

Why?

Most deep learning research focuses on understanding the inductive biases of deep learning architectures or on proposing new networks to improve performance on benchmark tasks. We take the datasets for granted. This paper takes the dataset perspective and tries to draw conclusions about the different types of images present in the data.

How?

Experimental setup: For a given dataset (say CIFAR-10), the authors keep the network constant (say ResNet-18). They experiment with multiple datasets (CIFAR-10, CIFAR-100, FMNIST, SVHN) and with different architectures (ResNet-18, VGG16, and an MLP). The MLP consists of 7 hidden layers of width 2048 with ReLU activations.

Let’s define a few terms before we move on to the main results of the paper.

Example difficulty: There are many ways to define the difficulty of an example. In this paper, the authors define difficulty as the smallest depth beyond which the representation of a given example is consistently classified into the correct class. The authors use a k-NN probe to do the classification at each layer.
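
To make this concrete, here is a minimal sketch of a per-example prediction-depth computation with scikit-learn, assuming we already have per-layer feature matrices for a probe set and for the example of interest. The function name, the input layout, and the choice of k are my own assumptions, not the paper’s exact setup.

```python
# A minimal sketch, not the paper's exact setup: names, input layout, and k
# are assumptions. `probe_feats_per_layer[l]` is an [n_probe, d_l] array of
# probe-set representations at layer l, and `example_feats_per_layer[l]` is
# the [d_l] representation of the example of interest at that layer.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def prediction_depth(probe_feats_per_layer, probe_labels,
                     example_feats_per_layer, target_label, k=30):
    # k-NN probe prediction for the example at every probed layer
    preds = []
    for feats, x in zip(probe_feats_per_layer, example_feats_per_layer):
        knn = KNeighborsClassifier(n_neighbors=k).fit(feats, probe_labels)
        preds.append(knn.predict(np.asarray(x).reshape(1, -1))[0])
    # Smallest depth from which the probe predicts `target_label` at that
    # layer and every deeper one (len(preds) if that never happens).
    depth = len(preds)
    for layer in range(len(preds) - 1, -1, -1):
        if preds[layer] == target_label:
            depth = layer
        else:
            break
    return depth
```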

Consistency score: For a given example, suppose we train ‘n’ models with different initialization seeds on the training data excluding that example. The consistency score is the percentage of models whose prediction matches the assigned label.

Consensus class: For a given example and ‘n’ trained models, the consensus class is the class predicted by the majority of the classifiers.

Consensus-consistency score: For a given example, this is the percentage of models whose prediction matches the consensus class.

Prediction entropy: For a given example, when we predict its class using the ‘n’ models, we compute the entropy of the resulting distribution of predicted classes.

Example: To understand the above concepts, let’s take a simple example. Take a cat image, train a classifier with the cat image included in the training data, and run the k-NN probe at different depths. If at depth 6 and beyond the image’s representation is always classified as ‘cat’, then the example difficulty (prediction depth) is 6.

Now let’s train 100 more classifiers with different initialization seeds, with this cat image omitted from the training data. Suppose 70 classifiers predict this image as ‘dog’ and 30 predict it as ‘cat’. The consistency score is then 0.3, the consensus class is ‘dog’, and the consensus-consistency score is 0.7.

The prediction entropy is -(0.7 log₂ 0.7 + 0.3 log₂ 0.3) ≈ 0.88 bits [since the remaining classes received 0 predictions].
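
As a quick sanity check on the arithmetic, here is a toy sketch of the four quantities above; the function and its signature are hypothetical, not from the paper.

```python
# A toy sketch of the four quantities above; `predictions` holds the class
# predicted by each of n models trained without this example.
from collections import Counter
import numpy as np

def consistency_metrics(predictions, assigned_label):
    n = len(predictions)
    counts = Counter(predictions)
    consensus_class, consensus_count = counts.most_common(1)[0]
    consistency = counts[assigned_label] / n       # agreement with the assigned label
    consensus_consistency = consensus_count / n    # agreement with the consensus class
    probs = np.array(list(counts.values())) / n
    prediction_entropy = -np.sum(probs * np.log2(probs))  # entropy in bits
    return consistency, consensus_class, consensus_consistency, prediction_entropy

# The cat example above: 70 models say 'dog', 30 say 'cat'.
print(consistency_metrics(['dog'] * 70 + ['cat'] * 30, 'cat'))
# -> (0.3, 'dog', 0.7, ~0.881)
```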

Main analyses and takeaways:

  1. Some datasets are harder than others, in the sense that most of their examples fall in the higher prediction-depth range. CIFAR-100 is much harder than Fashion MNIST. (See Fig 1, left)
  2. The prediction depths of a given dataset are more strongly correlated between two models if the architectures share similar inductive biases; e.g., ResNet and VGG are more correlated than VGG and MLP, since the former pair are both convolutional architectures. (Fig 1, right)
  3. For easy examples, the consistency can be either high or low, but the consensus-consistency is high! This implies the model can fit these examples very easily; however, the predicted label might not match the ground-truth label when the example’s label is withheld from training. (Fig 2)
  4. Among hard examples (ones with higher prediction depth), some can be fit confidently to the correct label, while others are hard to fit in the absence of a label. (Fig 2)
  5. The prediction depth can act as a lower bound on the consensus-consistency of all examples (i.e., prediction depth gives a rough estimate of an example’s uncertainty, but it is uncertainty around the consensus class, not the true class). (Fig 2)
  6. Easy examples are also learned earlier in training. (Fig 3, left)
  7. Examples with smaller prediction depths have both larger input and output margins on average, and the variance of the input and output margins decreases as the prediction depth increases. (Fig 3; center, right) [The output margin is the difference between the largest and second-largest logits, and the adversarial input margin is the smallest-norm adversarial perturbation of the input needed to change the model’s class prediction; a small sketch follows the figure captions below.]
  8. The prediction entropy is lowest for examples with lower prediction depth. (Fig 4)
Fig 1. (Left) The fraction of points at different prediction depths; the distribution changes depending on the dataset. (Right) How the prediction depths of all points correlate across different models when the points are in the train or validation split. When a model is compared with itself, the two copies are initialized with different seeds.
Fig 2. (Left) 250 ResNet-18 models trained with different train/validation splits. Each point in the plot is an example evaluated on all the models for which that example falls in the validation split. (Center, Right) Condensed plots of points for different datasets; bubble size indicates the number of points. Easy examples are more consistent and more likely to be accurate.
Fig 3. (Left) Are the easy examples also learned quickly? It appears so; the figure shows results on VGG. (Center, Right) Plotting example margins against prediction depth shows they are correlated too; easy examples tend to have larger margins. On VGG, CIFAR-10.
Fig 4. When the uncertainty of predictions on an example is high, the entropy is higher. Intuitively, most points have low entropy in FMNIST and higher entropy in CIFAR-100.
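
For reference, the output margin from point 7 is easy to compute from the logits. Below is a minimal sketch (my own helper, not from the paper); the adversarial input margin would additionally require searching for the smallest-norm perturbation that flips the prediction, which is omitted here.

```python
# A small helper of my own for the output margin described above:
# the gap between the largest and second-largest logits of one example.
import numpy as np

def output_margin(logits):
    top_two = np.sort(np.asarray(logits))[-2:]   # two largest logit values
    return top_two[1] - top_two[0]

print(output_margin([2.1, 5.3, 4.9, -0.2]))      # -> ~0.4
```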

Why are some examples difficult?

The authors also try to understand the difficulty of examples by asking the following 3 questions.

“Does this example look mislabeled?”; “Is classifying this example only easy if the label is given?”; “Is this example ambiguous both with and without its label?”.

To answer these questions, we need to look at example difficulty both when the example is part of the training data and when it is not. The following plots show the average prediction depth over 250 ResNet-18 models (with 90:10 train/val splits, each example is absent from training in ~25 models and present in ~225 models).

We can decompose the left plot into two plots: one where the consensus class equals the ground truth (GT) and one where it does not. Let’s examine the right plot. The points with high prediction depth in both the train and val splits are the most ambiguous ones; with or without their labels, they are hard to recognize. We see some such examples from the bird class.

When we look at examples where the consensus class != GT, there are two types: (1) points easily fit in val but not in train — these are points that look like a different class, or are perhaps mislabeled [probably points closest to the decision boundary]; in the figure above, for instance, the birds look very much like a plane. (2) Points easily fit in train but not in val — these are ambiguous without a label. They are easy to fit when the label is given, but hard to classify when similar points don’t exist in the training data.

Hence we can divide all points into roughly 4 categories [PD = prediction depth]; a small sketch of this bucketing follows the list:

  1. Easy examples: (low PD_val, low PD_train)
  2. Looks like a different class: (low PD_val, high PD_train)
  3. Ambiguous unless the label is given: (high PD_val, low PD_train)
  4. Ambiguous: (high PD_val, high PD_train)
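
A hypothetical helper that buckets an example this way might look as follows; the hard low/high threshold is my own simplification, since the paper works with continuous average prediction depths rather than a cutoff.

```python
# A hypothetical bucketing helper; `threshold` (splitting "low" from "high"
# prediction depth) is my own simplification of the paper's continuous plots.
def categorize_example(pd_val, pd_train, threshold):
    low_val, low_train = pd_val <= threshold, pd_train <= threshold
    if low_val and low_train:
        return "easy"
    if low_val and not low_train:
        return "looks like a different class"
    if not low_val and low_train:
        return "ambiguous unless the label is given"
    return "ambiguous"
```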

Comments:

It is an interesting analysis. The paper essentially introduces a new difficulty metric for examples, and the authors build a story around it, trying to intuitively explain what the edge-case points are.

The paper is a bit too dense though. There are a looooot of experiments and a lot of plots, but I wish the authors had cut down on the analysis and spent more time building intuition for why we observe these trends.

Another concern I had: we calculate the consensus class from a set of models and then use the same models to calculate consensus-consistency, which somehow feels like cheating. But then again, we are not working with binary classes but with multi-class problems, so it might be OK...
