Do neural networks learn via memorization?

The recent paper “Deep Nets Don’t Learn via Memorization” shows an interesting figure on the generalization capability of several regularization methods: dropout, Gaussian noise at the input, masked binary noise at the input, and weight decay.

Weight decay regularizes a network by adding a penalty proportional to the L1 or L2 norm of its weights. Gaussian noise and masked binary noise inject randomness into the input. Dropout injects randomness into the network itself, randomly zeroing connections between nodes in adjacent layers.
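The four mechanisms can be sketched as simple array operations. This is a minimal, framework-free illustration in numpy; the shapes, noise levels, and rates are arbitrary placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weight matrix and input batch (hypothetical shapes, for illustration).
W = rng.standard_normal((4, 3))
x = rng.standard_normal((8, 4))

# Weight decay: a penalty on the L1 or L2 norm of the weights, added to the loss.
l2_penalty = 0.01 * np.sum(W ** 2)
l1_penalty = 0.01 * np.sum(np.abs(W))

# Gaussian noise at the input: perturb each feature with zero-mean noise.
x_gauss = x + rng.normal(0.0, 0.1, size=x.shape)

# Masked binary noise at the input: randomly zero out input features.
keep = rng.random(x.shape) > 0.2
x_masked = x * keep

# Dropout: randomly zero activations during training, scaling the survivors
# so the expected value of each unit is unchanged (inverted dropout).
p = 0.5
h = x @ W
drop_mask = rng.random(h.shape) > p
h_dropped = h * drop_mask / (1.0 - p)
```

The common thread is that each method limits how precisely the network can fit individual training examples, either by shrinking the weights or by making the training signal noisy.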

What would one expect if a neural network learned via memorization? Suppose one trains two networks: one on the real data (the A-network) and one on a copy of the data with randomized labels (the B-network). We can tune the amount of regularization or injected randomness on the B-network; then, for a given performance level of the B-network, we measure the performance of the A-network under that same amount of regularization.
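The protocol above can be sketched with a toy model. This is an assumption-laden illustration of the experimental setup only, using a small logistic-regression "network" with L2 weight decay instead of a deep net; the dataset, sweep values, and training hyperparameters are all invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: the A-network sees the true labels; the B-network sees a
# shuffled copy, which destroys any real input-label structure.
X = rng.standard_normal((200, 10))
true_w = rng.standard_normal(10)
y_real = (X @ true_w > 0).astype(float)    # A-network's labels
y_random = rng.permutation(y_real)         # B-network's labels

def train_logreg(X, y, weight_decay, steps=500, lr=0.1):
    """Minimal logistic regression trained by gradient descent with L2 decay."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) / len(y) + weight_decay * w
        w -= lr * grad
    return w

def train_accuracy(w, X, y):
    return float(np.mean(((X @ w) > 0) == (y > 0.5)))

# Sweep the regularization strength; for each setting, record the training
# performance of both networks, as in the paper's protocol.
results = []
for wd in [0.0, 0.1, 1.0]:
    acc_a = train_accuracy(train_logreg(X, y_real, wd), X, y_real)
    acc_b = train_accuracy(train_logreg(X, y_random, wd), X, y_random)
    results.append((wd, acc_a, acc_b))
```

Plotting `acc_a` against `acc_b` across the sweep would give one curve of the kind shown in the paper's figure.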

Intuitively, if learning happened via memorization, then as randomness increases one would expect the A-network's performance to drop whenever the B-network's does; i.e., a tilted curve.

Contrary to this hypothesis, the experimental results in the figure from the paper show that these methods produce curves that are flat to some extent: the A-network's performance holds up even as the B-network's degrades. Therefore, the argument goes, neural networks are not learning via memorization; they find something intrinsic to real data.

However, each curve in the plot compares two different networks, trained on different data. Is this a problem? I think so …

So what neural networks actually learn is still open to debate.

See the figure below.