Deep Regularization

On over-parameterized learning

Nicholas Teague
From the Diaries of John Henry
4 min read · Dec 19, 2020


This will be a short essay; I just wanted to document a theory that I think is a helpful way to think about deep learning. There is an open question in research as to why deep over-parameterized models have a regularizing effect even when the number of parameters exceeds the number of training data points, which intuition might suggest would result in a model simply memorizing the training points; in practice this type of deep learning instead successfully achieves a kind of generalization. I will present here a concise explanation that draws on an analogy to geometry in high dimensions.

First consider the equation for a three dimensional unit sphere: x² + y² + z² = 1, where the volume is simply 4/3 * pi * r³, which for the unit sphere is 4*pi/3. Now consider a hypersphere where we increase the number of dimensions, governed by the similar formula v² + w² + x² + y² + z² + … = 1. Long story short, we find that both the volume and the surface area briefly increase with increasing dimensions until they reach a peak, after which point they progressively shrink toward an asymptote at zero. I believe this type of property is general to other geometric figures (for instance, I have seen a similar demonstration for hypercubes), and the conjecture below is based on this assumption.

Volume and surface area of unit hypersphere with increasing dimensions (image via wikipedia)
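For reference, the unit hypersphere's volume and surface area have closed-form expressions in terms of the gamma function, so the peak-then-shrink behavior is easy to verify numerically. Here is a minimal sketch (plain Python, function names purely illustrative):

```python
import math

def unit_ball_volume(n):
    # V_n = pi^(n/2) / Gamma(n/2 + 1), volume of the unit ball in n dimensions
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

def unit_sphere_surface(n):
    # S_n = 2 * pi^(n/2) / Gamma(n/2), surface area of the unit ball in n dimensions
    return 2 * math.pi ** (n / 2) / math.gamma(n / 2)

for n in range(1, 21):
    print(f"n={n:2d}  volume={unit_ball_volume(n):8.4f}  "
          f"surface={unit_sphere_surface(n):8.4f}")
```

Running this shows the volume peaking around five dimensions and the surface area around seven, after which both decay toward zero.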

One way to think about the loss function of a neural network is as an unconstrained formula with the weights as the variables, e.g. for cross entropy loss the formula is J(w) = −Σᵢ yᵢ * log(ŷᵢ(w)), summed over the training points with ŷ the model's predicted probabilities, and through backpropagation we are trying to minimize J(w). When you consider that the fitness landscape will in general have a single global minimum, backpropagation shifts the loss in the direction J(w) -> L, where L approaches a constant. However, this also applies to any given value of L: for any given loss, the formula J(w) = L is a constrained formula in which each weight has some distribution of potential values associated with that loss, similar to how in a geometric figure there is some distribution of each variable associated with a specific volume. Thus J(w) can be approximated as a constrained formula around the weight set associated with the global minimum, as well as around the losses of the backpropagation states preceding the global minimum, and the distribution of each weight should shrink as the loss approaches the global minimum.
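To make the "loss as an unconstrained formula in the weights" framing concrete, here is a minimal toy sketch, assuming a simple logistic model with binary cross entropy; the data, dimensions, and names are placeholders rather than anything from the essay:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))            # fixed training inputs (toy data)
y = (X @ rng.normal(size=10) > 0) * 1.0   # fixed binary training labels

def J(w):
    # binary cross entropy as a function of the weight vector w alone
    p = 1.0 / (1.0 + np.exp(-X @ w))      # sigmoid predictions
    eps = 1e-12                           # numerical guard for log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

w = rng.normal(size=10)
print(J(w))  # any weight set maps to a scalar loss; gradient descent
             # (backpropagation in a deep network) pushes J(w) toward L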

Now drawing further on our geometry analogy, translating the behavior of volume and surface area under increasing dimensions to our high dimensional loss function J(w) is really just another way of saying that as parameterization adds dimensions, the degrees of freedom associated with each weight for a given loss value will be diminished. This is somewhat similar to what happens with L1 regularization, which promotes collective sparsity of a weight set. However, here I am not talking about the sparsity of the collective weight set at the global minimum; I am referring to the distribution of each weight as found within proximity of a given loss value, corresponding to some range of states in the fitness landscape. For example, for a given weight wᵢ, consider the distribution of wᵢ across weight sets corresponding to a given loss value: the sparsity (narrowness) of that wᵢ distribution should increase with an increasing number of collective weights / parameterization. This is somewhat of a conjecture.
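To give a rough numerical feel for the geometric side of this conjecture, here is a Monte Carlo sketch that uses the unit hypersphere as a stand-in for a level set J(w) = L (an analogy only, not the actual loss surface) and measures how the spread of a single coordinate tightens as dimensions are added:

```python
import numpy as np

rng = np.random.default_rng(0)

def coordinate_spread(n_dims, n_samples=20_000):
    # uniform samples on the unit hypersphere via normalized gaussians
    w = rng.normal(size=(n_samples, n_dims))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    return w[:, 0].std()  # spread of a single coordinate w_i at fixed "loss"

for n in (3, 10, 100, 1000):
    print(f"dims={n:5d}  std of w_i on unit sphere = {coordinate_spread(n):.4f}")
```

The standard deviation of an individual coordinate falls off roughly as 1/sqrt(n), which is the sense in which each weight's admissible range at a fixed "loss level" narrows as parameterization grows.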

What I am trying to demonstrate here is that, with increasing parameterization, the trend toward decreasing volume and surface area of geometric figures is analogous to a narrowing of each weight's distribution associated with each loss value. That narrowing enforces a kind of regularization on collective weight sets by decreasing the degrees of freedom of each weight's distribution, which is my proposed explanation for the regularizing effect of deep over-parameterized networks.

Yeah, now I just need to figure out how to test this conjecture.

Chopin’s Valse in D flat

