Global Minima
Data types matter
One
Looking back on the ICML conference, one of the eye-opening takeaways for this author came from no more than a brief comment by a presenter. In Weinan E’s keynote Towards a Mathematical Theory of Machine Learning [1], he shared a slide commenting on the global minima of a neural network’s fitness landscape. In the context of a machine learning research conference that is not exactly earthshaking on its own, but it was still something of a revelation to my mental model.
Specifically, Weinan noted that the global minimum of an overparameterized model will not be a single point. At some scale of parameters it transitions to becoming a submanifold, as in having a range of possible weight values that all share a common loss value at the minimum of the optimization’s fitness landscape. He also noted that continued training after reaching that submanifold will result in oscillations, but that alone wasn’t really interesting to me; after all, the stochastic nature of SGD suggests such oscillations are bound to arise.
The revelation came from considering what could cause such a submanifold to arise in the context of the geometric regularization conjecture that I have documented in two previous essays.
Long story short, I think there is a very simple explanation, one that arises from the geometric regularization phenomenon in conjunction with the practical limits of the numeric representations of weights, activations, and gradients that collectively parameterize a loss function.
It isn’t that geometric regularization is shrinking the count of available function representations; it is just squishing them toward numeric values that fall below the capacity of the data type representing these parameters. (I expect they are not squished into underflow territory. More likely, the delta updates from gradient steps fall below the smallest increment that the data type’s bit registers can represent around a parameter’s current value.)
Simple.
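To make the step size point concrete, here is a minimal sketch, assuming NumPy and standard IEEE floating point types. The weight value and the update size are made up for illustration and are not taken from any real training run.

```python
import numpy as np

# Spacing between adjacent representable values near 1.0 for common dtypes.
for dtype in (np.float16, np.float32, np.float64):
    print(dtype.__name__, np.spacing(dtype(1.0)))

# A hypothetical weight and a gradient step far smaller than the float32 spacing.
weight = np.float32(1.0)
tiny_update = np.float32(1e-9)
updated = weight - tiny_update

# The subtraction rounds back to the same float32 value, so the weight does not move.
update_was_lost = bool(updated == weight)
print("update lost in float32:", update_was_lost)

# The same step is still representable in float64.
update_survives_in_float64 = bool(
    np.float64(1.0) - np.float64(1e-9) != np.float64(1.0)
)
print("update survives in float64:", update_survives_in_float64)
```

Near 1.0 the float32 spacing is about 1.2e-07, so a step of 1e-09 is rounded away without any underflow being involved. Under the conjecture above, a whole range of nearby weight settings would then look identical to the optimizer.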
Two
This is another phenomenon that I have been meaning to write about for a while, and it might be at least tangentially related. I have found what is possibly a new form of gradient update that can easily be applied once a model has reached a state of overfit, in order to recover and, with a few more epochs, achieve even better performance. I call it the slingshot maneuver. It basically amounts to training a model to overfit, multiplying all weight values by a global constant (e.g. ~0.2) to shrink them, and then simply training for a few more epochs to recover the original scale of weights. In the context of a Keras model, that weight scaling amounts to applying:
# get_weights() returns a list of arrays with different shapes, so scale each array separately
model.set_weights([w * 0.2 for w in model.get_weights()])
I expect the reason that it works is that the higher frequency components being modeled are the aspects contributing most to overfit, and that by shrinking the weights collectively, those higher frequency components fall off of the data type register, leaving a model of lower frequency components (frequencies in the sense of a Fourier representation).
I have found that the approach appears to work best with smaller data sets and models, so it might not be universally beneficial. Still, it is an interesting phenomenon that I wanted to report in case others would like to conduct more extensive experiments.
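For anyone who wants to try it, the three steps of the slingshot maneuver could be sketched roughly as follows. This is a minimal illustration rather than a tested recipe: `shrink_weights` is a hypothetical helper name, the example arrays stand in for what Keras `model.get_weights()` returns, and the variable names in the commented usage (`x_train`, `y_train`, `overfit_epochs`, `recovery_epochs`) are placeholders.

```python
import numpy as np

def shrink_weights(weights, factor=0.2):
    """Return a new list with every weight array multiplied by `factor`.

    `weights` is a list of arrays with differing shapes, the form that
    Keras `model.get_weights()` returns.
    """
    return [np.asarray(w) * factor for w in weights]

# Stand-in for model.get_weights(): a kernel and a bias of different shapes.
example_weights = [
    np.ones((3, 2), dtype=np.float32),
    np.ones(2, dtype=np.float32),
]
shrunk = shrink_weights(example_weights, factor=0.2)
print([w.shape for w in shrunk])

# Hypothetical use with a compiled Keras model (not executed here):
#   model.fit(x_train, y_train, epochs=overfit_epochs)       # 1. train to overfit
#   model.set_weights(shrink_weights(model.get_weights()))   # 2. shrink all weights
#   model.fit(x_train, y_train, epochs=recovery_epochs)      # 3. train a few more epochs
```

Scaling each array in the list separately avoids having to stack arrays of different shapes into a single NumPy array. The 0.2 factor is just the example value from above; a different multiplier may work better for a given model.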
Perhaps another way to think about it is that it is like tuning a guitar. If you start by making a string really flat, then you know you have but one direction to turn that knob to get the string to pitch.
For further readings please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
References
[1] Weinan E. (2022, July 19). Towards a Mathematical Theory of Machine Learning [Conference Invited Talk]. International Conference on Machine Learning, Baltimore, MD. URL: https://icml.cc/virtual/2022/invited-talk/18430