Deep Nets Don’t Learn via Memorization

Krueger, David, Nicolas Ballas, Stanislaw Jastrzebski, Devansh Arpit, Maxinder S. Kanwal, Tegan Maharaj, Emmanuel Bengio, et al. n.d. “Deep Nets Don’t Learn via Memorization.” …

Main Points:

  • A previous paper claims that DNNs can easily fit a randomly labeled dataset using standard SGD, and that regularization doesn’t help.
  • This paper claims that DNNs don’t simply memorize the dataset, and finds that DNNs trained on real data learn simpler representations. The authors support this by showing that as the noise level in the labels (or inputs) increases, a bigger network is needed to reach a similar level of accuracy.
  • They also found that gradients are “sharper” for networks trained with higher noise levels. (Keskar, Nitish Shirish, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.” arXiv [cs.LG]. arXiv.)
  • The authors found that regularization (e.g. dropout, L2) can limit training accuracy on noisy datasets.
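
To make the "noise level" in the points above concrete, here is a minimal sketch of one way label noise could be injected before training. These notes do not record the paper's exact procedure, so the uniform-replacement scheme, the function name `corrupt_labels`, and its parameters are all assumptions made for illustration.

```python
import random


def corrupt_labels(labels, noise_level, num_classes, seed=0):
    """Replace a fraction `noise_level` of labels with uniformly random classes.

    Hypothetical helper: the uniform-replacement scheme is an assumption,
    not necessarily the procedure used in the paper.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_corrupt = round(noise_level * len(noisy))
    # Pick distinct example indices to corrupt.
    idx = rng.sample(range(len(noisy)), n_corrupt)
    for i in idx:
        # A random draw can match the original label by chance, so the
        # fraction of labels that actually change is at most noise_level.
        noisy[i] = rng.randrange(num_classes)
    return noisy, idx


# Toy example: 1000 labels over 10 classes, half of them randomized.
clean = [i % 10 for i in range(1000)]
noisy, idx = corrupt_labels(clean, noise_level=0.5, num_classes=10)
```

Sweeping `noise_level` from 0 to 1 and recording the smallest network that still fits the training set would be one way to reproduce the kind of comparison described above.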

My takeaways:

  • It’s no surprise that noise in the labels is harder for the network to learn than noise in the images.
  • I feel that the following conclusion requires more evidence: a real dataset requires a smaller network and a noisy dataset requires a bigger network; therefore, DNNs learn a simpler hypothesis on real data.
  • I don’t think memorization is always a bad thing. In fact, I think some degree of memorization should help a network learn discriminating features (especially in extreme cases such as one-shot learning and memory-augmented networks).
  • More work needs to be done on measuring effective representation power.