“Do CIFAR-10 Classifiers Generalize to CIFAR-10?” — On adaptivity in machine learning

Gal Yona · Published in Comet · Aug 7, 2018

What is adaptive data analysis? The crux of statistical learning theory is guaranteeing generalization, i.e., a small gap between accuracy with respect to the true data distribution (which is what we actually want to optimize) and training accuracy (which is what we optimize in practice). A crucial assumption in these generalization theorems, however, is that the hypothesis class (the class from which you will eventually return a function) is chosen before the data is collected and the machine learning takes place. This is known as the non-adaptive setting. In the adaptive setting, the hypothesis class is chosen after the data has been collected, and from a generalization perspective all bets are off. In fact, there are extreme examples showing that in the adaptive setting one can overfit remarkably quickly.
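To get a feel for how badly things can go wrong, here is a small simulation (my own sketch, not from the paper): we label a "test set" with fair coin flips, so no classifier can truly do better than 50%, and then adaptively combine random classifiers that happened to look good on that same test set.

```python
# Sketch of adaptive overfitting on a fixed test set (illustrative, not from the paper).
import numpy as np

rng = np.random.default_rng(0)
n_test, n_classifiers = 1000, 500

# Labels are fair coin flips: any fixed classifier has true accuracy exactly 0.5.
y_test = rng.integers(0, 2, size=n_test)

# Each "classifier" is just a vector of random predictions on the test set.
preds = rng.integers(0, 2, size=(n_classifiers, n_test))

# Adaptive step: keep only the classifiers that beat chance on this particular
# test set, then take their majority vote.
accs = (preds == y_test).mean(axis=1)
selected = preds[accs > 0.5]
majority_vote = (selected.mean(axis=0) > 0.5).astype(int)

print(f"best single classifier on the test set: {accs.max():.3f}")          # modestly above 0.5
print(f"majority vote of selected classifiers:  {(majority_vote == y_test).mean():.3f}")  # well above 0.5
# The measured test accuracy climbs well above chance, even though the true
# accuracy of this adaptively built classifier on fresh data is still 0.5.
```

The only "learning signal" here is the test set itself; the apparent improvement is pure overfitting introduced by peeking at the test results before choosing the model.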

The reason we should be concerned about this is that in the current machine learning landscape, adaptivity is all around us. The same few datasets are used repeatedly, with new hypotheses tested on them: researchers observe reported results in papers, and use these to inform future choices regarding models, architectures and so forth.

In particular, one dataset is heavily exploited: CIFAR10. The CIFAR10 dataset is a labeled subset of the 80 Million Tiny Images dataset. Most deep learning models are evaluated on vision tasks, and CIFAR10 serves as the main “benchmarking” dataset. Why? Because it’s slightly more difficult than MNIST, but doesn’t require huge compute like ImageNet. What makes things even worse is that CIFAR10’s test set remains fixed (i.e., the same train-test split is always used), and it is fairly small (roughly 10K examples).
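As a quick sanity check (assuming you have torchvision installed), the fixed split is easy to verify: 50,000 training images and the single 10,000-image test set that everyone evaluates against.

```python
# Load the standard CIFAR10 split and confirm its sizes.
from torchvision.datasets import CIFAR10

train = CIFAR10(root="./data", train=True, download=True)
test = CIFAR10(root="./data", train=False, download=True)
print(len(train), len(test))  # 50000 10000
```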

This tells us that we should at the very least be concerned. But does this actually happen in practice? Did five years of “hammering” the CIFAR10 test set really destroy generalization? In other words: are all the recent improvements reported on this dataset (with a record 97% accuracy by Shake-Shake) a result of pure overfitting to this one test set?

This question is the motivation behind “Do CIFAR-10 Classifiers Generalize to CIFAR-10?”, a recent paper by Recht et al. The only way to verify the above hypothesis is to get a truly “clean” test set for CIFAR10, namely one that none of the previous methods has ever seen. Luckily, this is somewhat possible, because CIFAR10 was actually derived from a larger and significantly noisier dataset, TinyImages. The protocol for creating CIFAR10 from TinyImages was well documented in the original paper. With this information at hand, the authors manually tagged another 2,000 images from TinyImages, while attempting to keep their distribution (D’) as close as possible to the original CIFAR10 distribution (D). They named this dataset CIFAR10.1. Why did the authors only tag 2K images? Apparently, it’s a large enough number to guarantee reasonable confidence intervals, but not so big as to risk choosing harder images from TinyImages (and thereby drifting away from the original distribution).
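To see why 2K images are already enough, here is a back-of-the-envelope calculation (mine, not the paper’s): the 95% confidence interval for an accuracy estimate on n = 2,000 test images, using the standard normal approximation.

```python
# Normal-approximation 95% CI half-width for an observed accuracy p on n images.
import math

def ci_halfwidth(p, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for a Bernoulli mean."""
    return z * math.sqrt(p * (1 - p) / n)

for p in (0.85, 0.90, 0.95):
    print(f"p={p:.2f}, n=2000 -> +/- {100 * ci_halfwidth(p, 2000):.1f} points")
# Roughly +/- 1 to 1.6 percentage points: small relative to the accuracy drops
# one would need to detect.
```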

TinyImages: the noisy ancestor of CIFAR10

With CIFAR10.1, the authors could now compare the performance of many different deep learning models on the original test set (CIFAR10) versus the new test set (CIFAR10.1). The results show a decrease in accuracy for all models. But the interesting thing is that the ordering was preserved: models that were better on the original test set were also better on the new test set. This is in stark contrast to what we might anticipate, because the more recent models are the result of significantly more adaptivity.
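Here is a rough sketch of the kind of comparison involved: given each model’s accuracy on both test sets, check whether the ranking is preserved and fit a line. The accuracy values below are placeholders for illustration, not the paper’s numbers.

```python
# Compare per-model accuracies on the original and the new test set.
import numpy as np
from scipy import stats

acc_original = np.array([0.93, 0.95, 0.96, 0.97])  # hypothetical accuracies on CIFAR10
acc_new      = np.array([0.85, 0.88, 0.90, 0.92])  # hypothetical accuracies on CIFAR10.1

rho, _ = stats.spearmanr(acc_original, acc_new)
slope, intercept, r, _, _ = stats.linregress(acc_original, acc_new)
print(f"rank correlation: {rho:.2f}")  # 1.0 means the ordering is identical
print(f"linear fit: new ~= {slope:.2f} * old + {intercept:.2f} (r^2 = {r**2:.2f})")
```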

A good linear fit between accuracy w.r.t. the original test set and the new test set.

This seems to disprove the original hypothesis: that is, despite years of adaptivity, there is no empirical evidence that we are overfitting to CIFAR10.

These preliminary results should be interpreted with caution. Here is a partial list of potential objections: Maybe CIFAR10 is just too easy a dataset, one that is simply hard to overfit? Maybe the increased standards of rigour in publications mean that recent papers are evaluated more carefully, and so tend to overfit less? Maybe there are other effects at play (e.g., the new models are just inherently better) that mask some of the effect of adaptivity? There is room for much more empirical work before we can draw clear conclusions.

If not the adaptivity issue, why was there a decrease in accuracy? The authors seem to think this is due to a residual shift in distribution from D to D’. This isn’t the focus of the paper, but it is actually very bad news: humans still get 100% on the new test set, so this is yet more evidence of how brittle even our best machine learning models are. This seems to call for more work on robustness, not just in relation to security issues (adversarial examples), but as a crucial component of making our models perform well in the real world.
