It is Not Only Deep Learning That Requires Rethinking Generalization

Takuya Akiba
3 min read · Mar 1, 2017


TL;DR: It is not only DNNs that can fit random labels in a setting where the correct labels can be learned with small generalization error.

The recent paper “Understanding deep learning requires rethinking generalization” by Zhang et al. has attracted a great deal of interest, as it shows the unintuitive and surprising fact that standard DNN models can (over)fit even random labels. I think the paper deserves the highest score in the ICLR’17 review, as it gives us great insight. However, since only DNNs (and some kinds of linear models) are discussed in the paper, a natural question arises: what about other machine learning algorithms?

So, I conducted brief experiments, similar to those in the original paper, with our favorite tree-based machine learning algorithms: random forest (RF), extra trees (ET), and gradient boosting decision trees (GBDT).

Experiment 1: RF, ET, and GBDT

The following figure shows the result of the first experiment, where I trained and tested the three methods on the MNIST dataset using correct and random labels.

I used 60,000 examples for training and the remaining 10,000 examples for testing. For the random-label setting, I applied a random permutation to the target labels of the training examples. Solid and dashed lines correspond to training and testing accuracy, respectively. I used scikit-learn for random forest and extra trees, and XGBoost for GBDT, with default values for all parameters. The code used for the experiments is available here.
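A minimal sketch of this setup (not the original script, which is linked above) might look like the following. Loading MNIST via scikit-learn’s fetch_openml, the random seed, and the exact looping structure are my own choices; the 60,000/10,000 split and the default parameters follow the experiment described above.

    import numpy as np
    from sklearn.datasets import fetch_openml
    from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
    from xgboost import XGBClassifier

    # Load MNIST and use the standard 60,000/10,000 train/test split.
    X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
    y = y.astype(int)
    X_train, y_train = X[:60000], y[:60000]
    X_test, y_test = X[60000:], y[60000:]

    # Random-label setting: randomly permute the training labels.
    rng = np.random.RandomState(0)  # seed chosen arbitrarily
    y_random = y_train[rng.permutation(len(y_train))]

    # Default parameters for all three methods, as in the experiment.
    models = {
        'RF': RandomForestClassifier(),
        'ET': ExtraTreesClassifier(),
        'GBDT': XGBClassifier(),
    }

    for label_name, labels in [('correct', y_train), ('random', y_random)]:
        for name, model in models.items():
            model.fit(X_train, labels)
            train_acc = model.score(X_train, labels)
            test_acc = model.score(X_test, y_test)
            print(f'{name} ({label_name}): train={train_acc:.3f}, test={test_acc:.3f}')

The actual experiment additionally varies the number of trees to produce the curves in the figure; this sketch only reports a single configuration per method.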

From the result with correct labels (the upper plot), we observe that all three methods can learn the original MNIST task with small generalization error (i.e., a small gap between training and testing accuracy). On the other hand, in the random-label setting (the lower plot), the training accuracy of RF and ET becomes quite high with moderate numbers of trees, which indicates that they can fit random labels too, under the same configurations. This is the same phenomenon as the one reported for DNNs in the original paper.

Experiment 2: varying a parameter of GBDT

In the result above, GBDT seemed to be an exception, i.e., it did not fit the random labels. However, we will see that this is not the case. The figure below shows the result of the same experiment with varying values of the max_depth parameter of GBDT.

We see that, for relatively large max_depth values, GBDT fits random labels, while it has small generalization errors for correct labels.
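A sketch of this sweep is below; the depth values are hypothetical (the actual grid is in the linked code), and X_train and y_random are reused from the setup sketch above.

    from xgboost import XGBClassifier

    # Hypothetical max_depth grid; X_train and y_random come from the setup sketch above.
    for max_depth in [3, 6, 10, 20]:
        model = XGBClassifier(max_depth=max_depth)
        model.fit(X_train, y_random)
        print(f'max_depth={max_depth}: '
              f'training accuracy on random labels = {model.score(X_train, y_random):.3f}')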

On the other hand, by using different parameters, I could also prevent RF from fitting random labels while keeping it able to learn correct labels well.
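Limiting tree depth and leaf size is one way to restrict the capacity of RF; the specific values below are hypothetical and not necessarily the parameters I used (X_train, y_train, y_random, X_test, and y_test are reused from the setup sketch above).

    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical capacity restriction; not necessarily the parameters used in the experiment.
    for label_name, labels in [('correct', y_train), ('random', y_random)]:
        model = RandomForestClassifier(max_depth=10, min_samples_leaf=5)
        model.fit(X_train, labels)
        print(f'{label_name}: train={model.score(X_train, labels):.3f}, '
              f'test={model.score(X_test, y_test):.3f}')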

Summary and Discussions

The key takeaway of this article is as follows: it is not only DNNs that can fit random labels in a setting where correct labels can be learned with small generalization errors. We observed the same phenomenon with random forests, extra trees, and gradient boosting decision trees.

Possible points of discussion are listed below.

  • My experiments only show that there exist configurations under which these methods can fit random labels while still learning correct labels well. It is not clear whether such configurations are common.
  • It might be more insightful to use the total number of tree nodes as the x-axis (instead of the number of trees), as it would be more relevant to the ‘capacity’ of the models; a sketch of how to count nodes follows this list.
  • Also, it might be interesting to conduct a quantitative investigation of the relationship between generalization error and the ability to fit random labels.
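For the scikit-learn forests, counting the total number of nodes is straightforward; a small sketch is below, where `forest` is any fitted RandomForestClassifier or ExtraTreesClassifier, e.g., from the setup sketch above.

    def total_node_count(forest):
        # Sum the node counts of all individual trees in a fitted scikit-learn forest.
        return sum(tree.tree_.node_count for tree in forest.estimators_)

    print(total_node_count(models['RF']))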

The code used for the experiments is available here.
