Dimensional Framings of Overparameterization

Getting from geometric regularization to double descent

Nicholas Teague
From the Diaries of John Henry
36 min read · Jun 29, 2022


Part 1 — Geometric Regularization from Overparameterization

Background

This essay is meant to document some of the author's more recent musings on the subject of the geometric regularization conjecture, a novel (and somewhat speculative) framing of emergent regularization properties that may arise from overparameterized models. The conjecture was explored in some depth in the paper Geometric Regularization from Overparameterization Explains Double Descent and Other Findings, which was more recently turned down for publication at the ICML conference (for reasons of "insufficient equations" as far as I can tell), and surprisingly turned down again by the "TAG" workshop for geometric learning at the same conference.

As the conference's review process required effectively swearing an oath of silence around social media distribution, the paper largely went unnoticed at the time of publication. Hopefully these expanded discussions may serve the purpose of drawing additional attention, as I have come to understand that attention is all I need (sorry, bad pun). Through the progression of this admittedly somewhat ridiculously long dialogue we'll try to establish an improved explanation for both the geometric regularization conjecture and its relevance to the double descent phenomenon. Rather than restating everything in its entirety, we'll first provide here the preprint link for a reader's perusal. Y'all come back now, ya hear?

The conference review process was not totally without merit. I did find that the reviewers were helpful in evaluating the relative merit of different aspects of the paper, and I provide here my submitted one-page response, which summarizes the edits that went into the subsequent workshop draft. (Just in case future grandkids may someday be curious, I'm trying to present my side of the account :).

Response to Reviewers:

Thank you reviewers. We’ll first jointly address your feedback followed by direct response to specific points. In short, we propose significant scope reduction to address concerns.

Broadly, we are interpreting your feedback to suggest that positive aspects of the paper are that the geometric regularization conjecture is novel, significant, would be of value to the community, and that the related work section is thorough. However in current form detractions include limited theoretical backing and the histogram demonstrations are considered unconvincing or not of significance. There is also concern that some discussions surrounding histogram interpretations aren’t well supported.

The histogram demonstrations were included to illustrate characteristic patterns of weight distribution volume transience that would be experienced as backpropagation traverses a path of decreasing loss, with trends demonstrated for variations on width/depth parameterization, sample counts, and initializations. We acknowledge that surrounding discussions were at times speculative, which was a product of the limited fidelity of the histograms. Loss manifold histograms have not been considered in the literature to our knowledge, and each setup was used to explore some distinct patterns that became apparent at small scales. We drew some inspiration from Stephen Wolfram’s explorations of cellular automata and hyper-graphs, in which he surveyed and catalogued various patterns that arose across simulated configurations. Such exploration was the only intent of the appendix, as we found the exercise helped us gain intuitions.

We expect the research community would be benefited by distribution of the geometric regularization conjecture, especially considering the implications as model scalings progress into the trillions of parameters. We propose dropping all material related to histograms and their interpretation and limiting scope to the conjecture and related work (Sections 2 & 3) with corresponding edits to the introduction and conclusion (and dropping phrase “and other findings’’ from the title). We hope such an abbreviated form would be considered more appropriate to the venue.

Reviewer #5 “The authors miss recent work on understanding neural network loss surfaces such as (Golblum et al., ICLR 2020), (Li et al, NeurIPS 2018).”

We added reference to both papers in the related work, making note of Li’s observation on prevalence of non-convexities in wide vs deep networks and Golblum’s discussions around the neural tangent kernel in presence of skip connections or batch normalization.

Reviewer #6 “There is no mathematical definition of the geometrical implicit regularization. It is not clear what it means”

We describe a loss manifold's weight distributions per J(w)=L as a geometric figure, and infer by relating to hyperspheres how the volume corresponding to a given loss J(w)=L_i should shrink with increasing parameterization. We note the main idea of regularization theory is to restrict the class of admissible solutions by introducing a priori constraints.

Reviewer #6 “J(w)=?”

Added clarification “which abbreviation is meant to generalize the conjecture across loss metrics e.g. mean absolute error or cross entropy”.

Reviewer #3 “The authors draw similarities between surface area and volume of hyperspheres, but do not properly explain why we should expect the loss manifold to behave similarly.”

We note that “zero asymptotic volume convergence appears likely to arise when a shape is constrained through dimensional adjustment to a single scale, such as for a hypersphere could be the unit radius or for a loss manifold could be a specific loss value.”

Reviewer #6 “in double descent there are two peaks, and in the hypersphere example there is only one.”

We propose adding the following clarifying sentence at line 118 in Section 2 after the phrase …explaining the emergence of double descent: “This can be inferred because we know that at a global minimum the loss manifold distribution at J(w) = L_min, as a point, will have effectively zero dimensions, so we expect that through training the path will traverse a surrounding loss manifold tendril of shrinking effective dimensionality, with re-emergence of geometric regularization when that surrounding volume retracts across its own peak in a volume versus dimensions curve.”

Reviewer #3 “The suggested hypothesis is only a conjecture, which lack proper theoretical basis.”

For additional theoretical context, we now relate our work to the generalized Poincaré conjecture as “The generalized Poincaré Conjecture, proven for dimensions >4 (Smale) and later extended to 4 (Freedman) and 3 (Perelman), suggests that every simply connected, closed n-manifold is homeomorphic to the n-sphere, which demonstrates that topologically related manifolds have one-to-one point correspondence. Increased regularization with overparameterization could be interpreted as empirically suggestive of a similar dimensionally asymptotic volumetric correspondence.”

We also added to the related work a reference to overparameterization in quantum neural networks (Larocca et al., 2021) and “Recent work from (Hoffmann et al., 2022) has demonstrated one can balance parameterization scale with number of training tokens in large language models for purposes of optimizing training-compute cost and performance. This is consistent with the geometric regularization conjecture as the asymptotic trend of a flattening volume curve suggests a point of diminishing returns from overparameterization.”

Back to the Present

It was perhaps foolish to do so, but in the time since the disappointment surrounding the paper rejection, the author has not devoted a great deal of attention to the conjecture. This was partly a product of prioritizing research surrounding the Automunge library (which has potential applications in mission critical environments), but also arose from a lack of clarity about how to extend the inquiry.

Slowly, and at times intermittently while thinking about other things, musings around the conjecture have dripped into my notebook, which I think have begun to coalesce into at least the beginnings of a cohesive framework of dimensional framings surrounding overparameterization. As I admittedly lack the skills that might be needed for a rigorous theoretical backing, I offer here these musings in the form of prose, with the hope that those more skilled than myself at symbolic manipulations may be able to translate them into a form more suitable for gatekeepers the likes of ICML reviewers. Some of this will be speculative, but we have attempted to maintain a grounding in accepted theory and otherwise call out those conjectures yet to be substantiated. Yeah so here it goes.

Part 2 — The Long Road to Double Descent

Benjamin Franklin quote inspired by essay from science writer Steve Aaronson
No Code — Pearl Jam (album)

Data Set Dimensionality

Revisiting one of our comments responding to the very smart reviewer #6, we tried in our response to expand on the conjecture as framed in the original preprint by way of refining considerations surrounding the double descent phenomenon. More specifically, we considered the contraction of intrinsic dimension of the range of possible modeled data generating functions with decreasing loss value, with the following bold text added in the draft submitted to the workshop:

The expectation is, if an adjacent regularizer is not dominant, a phase change to geometric regularization will emerge as a training path reaches loss values approaching a global minimum due to the overparameterized geometry aligning with the distribution volume reductions from fewer weight sets capable of achieving such loss, explaining the emergence of double descent. This can be inferred because we know that at the global minimum the loss manifold of L_min will have one intrinsic dimension, so we expect that through training the path will traverse a loss manifold tendril of progressively shrinking dimensionality, imposing progressive geometric regularization once that surrounding intrinsic dimensionality retracts across its own peak in a volume versus dimensions curve.

In hindsight we may have erred on the side of brevity in presenting this comment, especially considering the significance of the implications. So let’s try here to expand on what we were trying to demonstrate.

Intrinsic dimension, as a formal term, we suspect may at times be subject to a little ambiguity in interpretation between researchers. As a first agenda item in this dialogue, let's try to consider all of the ways that the dimensionality of a data set may be framed or constrained. I find that for describing considerations in all kinds of ML theory, it helps to frame an analogy in the tabular modality (i.e. feature set columns and observation rows), which is simply a more intuitive setting than various other ML-targeted data types.

So what does it mean to talk about the intrinsic dimensionality of a tabular data set? Well first consider that a data set of any interest will not have purely random entries. There will be some underlying distribution of possible entries. We could look at it in a few ways depending on whether we had knowledge of surrounding features or labels, e.g. the distribution of possible entries found within a feature without knowledge of corresponding labels, the distribution of possible entries found within a feature with knowledge of corresponding labels, or both scenarios again considering prior knowledge or not of surrounding features. Given the risk that all of these alternate framings may overwhelm the narrative, let’s back up and just consider a single case.

Let's say we have a tabular data set with a single feature containing numeric float entries of an unbounded range. Without knowledge of labels or surrounding features, we could describe this data set as having an intrinsic dimension of 1, as entries will fall somewhere along the line from negative to positive infinity. Now let's assume that we have visibility of labels (the same thought experiment also works, with a small reframing, assuming we have visibility of an adjacent feature). If our single feature is fully describable by the value of a corresponding label (e.g. y = mx + b), then even with knowledge of labels, we don't gain any additional degrees of freedom for the feature/label set value, and thus we suggest that our pair of a feature and corresponding labels retains an intrinsic dimensionality of 1.

This observation is getting a little ahead of ourselves, but we noted in the geometric regularization writeup that it has been demonstrated by (Dhifallah & Lu, 2021) that incorporating some number of additional perturbation vectors of injected noise into a data set appears to result in a proportional increase to the number of parameters needed to reach the overparameterization regime. We'll attempt to tie this into the thought experiment in more detail below; for now let's use it as an excuse to consider the impact of a noise perturbation vector added to our simple numeric feature in a manner independent of the corresponding label (i.e. with the label still derived from the feature prior to the injected noise). In short, we'll find that the added degree of freedom around the feature value in relation to the label means that our prior single (1) intrinsic dimension of the feature and label set is insufficient to describe the degrees of freedom found after the noise. In other words, injecting noise into the feature increases the resulting intrinsic dimension of the system consistent with the number of degrees of freedom associated with the noise profile, what Dhifallah & Lu refer to as the number of perturbation vectors. (Where for example noise sampled from a single Gaussian distribution has a single perturbation vector, or noise sampled from the additive combination of two differently scaled Gaussian samplings would represent two perturbation vectors.)
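As a minimal sketch of this point (on synthetic data of our own construction, with the effective rank of the centered feature/label matrix standing in as a crude proxy for intrinsic dimension), the following snippet shows how a feature fully determined by its label contributes no additional degrees of freedom, while a single injected Gaussian perturbation vector adds one:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
label = rng.normal(size=n)

# feature fully determined by the label (y = mx + b): no added degrees of freedom
feature_clean = 2.0 * label + 1.0
# feature after one injected Gaussian perturbation vector: one added degree of freedom
feature_noisy = feature_clean + rng.normal(scale=0.5, size=n)

def effective_rank(columns, tol=1e-8):
    """Count singular values carrying non-negligible variance (a crude intrinsic dimension proxy)."""
    X = np.column_stack(columns)
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    return int((s / s.max() > tol).sum())

print(effective_rank([feature_clean, label]))  # 1: the pair spans a single dimension
print(effective_rank([feature_noisy, label]))  # 2: the noise vector adds a dimension
```

This linear-rank proxy obviously only captures the simplest case of our thought experiment, but it illustrates the direction of the effect.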

Now backing up from noise considerations, let's say that our feature isn't an unbounded continuous numeric, but instead a categoric feature with a bounded range of distinct possible values. Without knowledge of labels, how can we describe the intrinsic dimension of that feature? I would suggest that if the categoric feature had an infinite range of values, it would be comparable to our continuous numeric feature, and thus retain an intrinsic dimension of 1. At the other extreme, if the categoric feature only had one possible value, it would have an intrinsic dimension of 0. Thus on its own a categoric feature will add to the collective intrinsic dimensions of a feature set a float somewhere between 0 and 1 in an aggregated dimensional count. Now if we have label visibility, and a single categoric feature can be directly translated to that label, then we return to a similar conclusion as the numeric case, as the addition of the label set does not change the previously derived intrinsic dimension of the categoric feature. If however the label derivation is not fully captured by the categoric feature value, such as would be the case if knowledge of the feature only gave us clarity of the label within some statistical confidence, then the addition of the label set to our framing will increase the resulting collective intrinsic dimensionality in aggregate due to the added degrees of freedom within the framing. (Another way to phrase the term "degrees of freedom" as used here could be "uncertainty", as could be more directly evaluated by a Shannon entropy metric derived as a negative sum of probability-weighted log likelihoods, or for quantum neural networks by the von Neumann entropy, see e.g. (Witten, 2018).)
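For a concrete illustration of the "degrees of freedom as uncertainty" framing (using a small hypothetical categoric feature and boolean label of our own invention), the following sketch computes a Shannon entropy for the feature alone and the conditional entropy remaining once labels are visible:

```python
import numpy as np

def shannon_entropy(values):
    """Entropy in bits: negative sum of p * log2(p) over the empirical distribution."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# hypothetical categoric feature with three distinct values and a boolean label
feature = np.array(["a", "a", "b", "b", "c", "c", "c", "a"])
label = np.array([1, 1, 0, 0, 0, 1, 0, 1])

h_feature = shannon_entropy(feature)
# conditional entropy H(feature | label): uncertainty remaining with label visibility
h_given_label = sum(
    (label == y).mean() * shannon_entropy(feature[label == y]) for y in np.unique(label)
)
print(h_feature, h_given_label)  # label visibility lowers the remaining uncertainty
```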

Turning to other modalities, let's consider extending this type of thought experiment to image data types. For simplicity imagine a set of gray-scale pixels, but with high pixel density, where our boolean labels are associated with detecting the presence of a uniformly shaded circle against a white noise background. Now even though a feature set for a single sample image may have thousands of pixels, if the captured representation is of a simple geometric figure like a uniformly shaded circle, then the pixels capturing the circle representation will have grayscale values fully correlated with their surrounding neighbors. We could thus say that for those affected pixels, the intrinsic dimensionality of the data generating function, i.e. of the circle, will be based on the dimensionality of the circle (as could be derived from a single metric of the circle's radius) as opposed to the dimensionality of the pixels, while the dimensionality of the white noise background environment will simply be additive from pixel counts. This finding can actually be generalized to any modality, which we can summarize by stating that the intrinsic dimensionality of a data generating function will likely fall well below the dimensionality of the environment, or more succinctly, the intrinsic dimensionality of the data generating function in most natural cases can be expected to be less than or equal to the dimensionality of the environment. (Considering one boundary condition for our thought experiment, if the background was a uniform color instead of white noise, the intrinsic dimensionality of the data generating function would converge to that of the environment; as another boundary condition, if we invert to a solid color background with a white noise filled circle, the data generating function would actually have a higher intrinsic dimensionality than the environment.)
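As a toy sketch of this circle-on-white-noise setup (with arbitrary choices of image size, radius, and shade, none of which come from the original benchmarks), the following snippet generates such an image and contrasts the per-pixel degrees of freedom of the noise background with the handful of parameters that fully describe the circle:

```python
import numpy as np

rng = np.random.default_rng(0)

def circle_on_noise(size=64, radius=12, shade=0.5):
    """Grayscale image of a uniformly shaded circle over a white noise background."""
    yy, xx = np.mgrid[:size, :size]
    mask = (xx - size / 2) ** 2 + (yy - size / 2) ** 2 <= radius ** 2
    img = rng.random((size, size))  # background: one degree of freedom per pixel
    img[mask] = shade               # circle: describable by radius and shade alone
    return img, mask

img, mask = circle_on_noise()
print("pixels in the environment:", img.size)
print("background degrees of freedom:", img.size - int(mask.sum()))
print("circle degrees of freedom: roughly 2 (radius and shade), independent of pixel count")
```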

Now consider that for our image modality framing, this array of pixel values can actually be represented as a matrix, and thus can be equated to the tabular modality that we noted earlier, so all of the same lessons should apply to both. (You may be used to the convolution conventions for image modality of an inductive bias based on adjacent pixel correlations, where such adjacent correlations will likely be missing in the tabular modality, but the calculation of intrinsic dimensionality need not consider order of adjacent pixels / features, simply degrees of freedom (uncertainty) of values across aggregate features.)

To consider more generally what it would mean to extend the lessons from our dimensional assessment of an image of a gray circle on a white noise background to the tabular modality, let's consider another representative composition: a tabular data set with three features and a binary label, where we can simplify and make them all categoric features, each with a range of three potential distinct values, and the labels a boolean set. We would expect that with visibility of our boolean labels, we would be able to lower the degrees of freedom / uncertainty of what entries would be found in the corresponding feature sets, and further that there would often be some kind of correlation between the entries in one feature and corresponding entries in an adjacent feature. In the simplest case, let's say a boolean label of True would give us full certainty of what values would be present in the three features, an example that resembles the image case where we had a circle on a uniform color background, as each would have an intrinsic dimension matched between the environment and the data generating function. Now for this same case, if a boolean label of False only gave us partial clarity of what values would be found in the corresponding features, then the intrinsic dimensions of the False label composition would be higher than those of the True label case. The aggregate intrinsic dimensions across labels I expect could be considered in a kind of additive fashion (which I expect would converge to the dimensionality of the system without visibility of labels), and I defer to readers who may be more talented in theoretic derivations to consider this formally.

Summarizing the last few paragraphs, we thus have hopefully established some amount of intuition based on simple configurations and boundary conditions about what it may mean to talk about intrinsic dimensions of a data set. More particularly, we’ve considered intrinsic dimensions of a data generating function in comparison to that of the environment, extended to scenarios of label visibility and equated between tabular and image modalities. We’ve also noted that the dimensions of this data generating function can be increased in an arbitrary fashion by way of compositions of injected stochastic noise into the feature representations. As an asterisk, please note that in the language of information theory, the act of signal transmission will have its own component of signal interference that could be framed as additional vectors of noise interference (which may include factors like limitations of sensor precision, label annotation accuracy, or other environmental factors).

Modeling a Data Generating Function

Now let’s try zooming in from another direction. Instead of considering intrinsic dimensions of the data generating function within context of an environment, let’s instead consider intrinsic dimensions of a modeled data generating function, as in what would be represented by the function transforming feature samples to label predictions in the inference operation for a trained neural network. It is not a great leap to expect that the configuration of the neural network, such as considerations like activation layers’ width and depth, may influence dimensionality of the modeled function, which we’ll later attempt to draw from in discussions surrounding the geometric regularization conjecture for overparameterized models — but first let’s try to lay a foundation, again based on what can be inferred from simple configurations and boundary conditions.

Restating some of the discussions from the preprint linked above, let's first consider the difference between parameterization arising from deviations in layer width (number of activations in a layer) versus layer depth (number of layers). Note that when (Roberts & Yaida, 2022) sought to formalize a theory behind overparameterization, they considered the ratio of depth to width in a network as a central factor in their derivations. We believe that our paper has identified a more granular framing that is perhaps more informative about the mechanisms behind the influence of width versus depth configurations. We refer to this framing as the "ratio of influence", which has the benefit of available derivations for configurations of mixed width layers, skip connections, and other exotic setups. Since this framing should correlate with the more basic ratio of depth to width, we hope it could be directly substituted into those authors' derivations in an ad hoc manner. Here again is the description of the "ratio of influence" that we offered in the preprint:

Consider the toy example where you have 3 features fed to 2 dense layers of 6 neurons. In this case the 18 weights from the 6 first layer neurons influence the 36 weights of the 6 second layer neurons, resulting in a 36/(18 + 36) = 0.67 ratio of influence. Now consider the alternate configuration with equivalent parameter count where you have 3 dense layers of 4 neurons. In this case the 12 weights from the 4 neurons of the first layer influence a combined 32 weights of the layer 2 and 3 neurons, and the 16 weights of the 4 second layer neurons also influence 16 weights of the 4 third layer neurons, resulting in a (32 + 16)/(12 + 16 + 16) = 1.09 ratio of influence from the same number of weights. We suspect each of these channels of influence adds an additional piece of complexity capacity to the model. For the two configurations, when realizing the same loss value they are modeling from the same set of candidate functions. Because the deeper network has more complexity potential, it can represent the same set of functions with sparser weight sets.

As an additional clarification to the ratio of influence, note that it could be evaluated in two ways. From an overall model capacity standpoint, it could be evaluated by taking into account all weights present in the architecture. In a more granular approach, it could alternatively be re-calculated after omitting any activations found in a zero state (as ReLU activations output 0 when their pre-activation input is negative).
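To make the calculation concrete, here is a minimal sketch of the architecture-level (first) variation of the ratio of influence for plain dense stacks, using a hypothetical helper of our own naming; it ignores biases and the zero-state activation refinement, and reproduces the two toy figures quoted above:

```python
def ratio_of_influence(n_features, layer_widths):
    """Architecture-level ratio of influence for a stack of dense layers (biases ignored)."""
    fan_ins = [n_features] + layer_widths[:-1]
    # weight count per layer: fan_in * width
    weights = [fan_in * width for fan_in, width in zip(fan_ins, layer_widths)]
    # each layer's weights influence all weights in the layers downstream of it
    influenced = sum(sum(weights[i + 1:]) for i in range(len(weights)))
    return influenced / sum(weights)

print(round(ratio_of_influence(3, [6, 6]), 2))     # 0.67 for two dense layers of six neurons
print(round(ratio_of_influence(3, [4, 4, 4]), 2))  # 1.09 for three dense layers of four neurons
```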

One of the novel aspects we are trying to convey with this ratio of influence framing is a consideration of the complexity capacity of a modeled function as could be compared between different configurations at a common number of parameters. In the real world we don't have infinitely wide neural networks, and so some balance could likely be identified between deeper versus wider networks, where some have reported finding deeper networks more difficult to train than wider ones (Jacot et al., 2018). We offer here a speculative conjecture that there may be some benefit in ease of training from attempting, within some range of precision, to match the complexity capacity of the network by way of its depth to width configuration to the complexity capacity of the actual data generating function, particularly from a resource utilization standpoint at inference.

Complexity Capacity of a Network

Backing up for a little bit, we've introduced the term complexity capacity, which is not a term in common use in mainstream ML research, so let's try to expand a little on what is meant. We note that (Lloyd, 2001) offered a succinct although possibly non-exhaustive survey of complexity measures that have been considered broadly; as used here we are leaning towards the framing of complexity considering "how hard is it to describe?". (Using a measure of Fisher information to describe a network's information capacity has definitely been considered in prior work; we recall discussions at the Information Theory in Machine Learning workshop at NeurIPS 2019 for instance.)

We suggest that it could likely be demonstrated in simple experiments that, for a common number of parameters, a deeper network will be capable of capturing more complexity than a wider configuration. We would consider it a formal validation of this "ratio of influence" framing if some way could be found to directly correlate a calculated ratio of influence metric, across variations in architecture at a fixed number of parameters, to a resulting complexity metric, and thus hereby offer a formal prize of a slice of pizza and a soda to any entrepreneurial graduate student who may get such a formalization published at a formal venue (to make the prize slightly more appealing, the toppings on the pizza slice can optionally be sourced from anything on the eatery's menu; we suggest finding somewhere that also sells lobster or filet mignon).

Having considered complexity capacity of a network, and getting back to the intent of the essay to expand on the geometric regularization conjecture, that leaves us with the question of whether we can relate the complexity capacity of a network's parameterization / configuration to the intrinsic dimension of that network's modeled data generating function. I find it helpful to consider boundary conditions, so let's consider this question separately at model initialization and then again at the state realized after completion of training, which we will later use to infer what related conditions may be experienced by a training path traversed through backpropagation.

Backpropagation’s Boundary Conditions

Questions surrounding appropriate distributions for weight sampling at model initialization are well explored in the literature. We noted in our preprint that He initialization (where Gaussian scaling is based on the width of the preceding layer) has by some been considered superior to Xavier scaling (in which Gaussian scaling is based on width of both preceding and subsequent layer) as it has been found to lie closer to the boundary of the well-performing regime (Song et al., 2021). We won’t go down that rabbit hole further other than to note that in each of these initialization scenarios, which are broadly followed in mainstream practice, the result is simply that the weights received by training are sampled from a symmetric Gaussian distribution of a layer-specific derived scaling. (Noting that outside of the infinite width case or other forms of inductive bias the resulting initialized function will likely have no relation to the actual function we will try to approach through training.)
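For reference, a minimal sketch of the two scaling conventions as described above (He scaling the Gaussian by the preceding layer's width, Xavier scaling it by both the preceding and subsequent widths); the function names are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_weights(fan_in, fan_out):
    # Gaussian scaling based on the widths of both the preceding and subsequent layer
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_out, fan_in))

def he_weights(fan_in, fan_out):
    # Gaussian scaling based only on the width of the preceding layer
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

print(xavier_weights(512, 512).std())  # ~0.044
print(he_weights(512, 512).std())      # ~0.063, the larger scale for equal widths
```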

Following a set of Gaussian sampled entries, a naive intuition might be that their complexity capacity resembles that of white noise, as in a maximum modeled information density relative to the model architecture's complexity capacity. However, after giving it some thought, we suggest that the maximum complexity capacity for a sampled initialization will actually correspond to sampling from a uniform distribution of unbounded (+/- infinite) range, and that such capacity will approach zero with a progressively shrinking range of uniform sampling or a progressively shrinking scale of Gaussian sampling. At such fully random states of initialization, there is no predictability of the direction and magnitude of a first training epoch's update step, and thus we can say that prior to the sampling's "measurement", the initial backpropagation update step is in a "superposition" of all update directions and magnitudes available within the boundaries of the complexity capacity from the initialization scaling and the parameterization count.
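As a rough illustration of that scaling intuition (using differential entropy in nats as a stand-in for the "complexity capacity" of a sampling distribution, which is our own hedged choice of measure), the following sketch shows how the entropy of both uniform and Gaussian sampling grows with scale and shrinks as the scale narrows:

```python
import numpy as np

def uniform_entropy(half_range):
    """Differential entropy of Uniform(-a, a) in nats: log(2a)."""
    return np.log(2.0 * half_range)

def gaussian_entropy(sigma):
    """Differential entropy of N(0, sigma^2) in nats: 0.5 * log(2 * pi * e * sigma^2)."""
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)

for scale in (0.01, 1.0, 100.0):
    print(scale, round(uniform_entropy(scale), 3), round(gaussian_entropy(scale), 3))
# both measures grow without bound as the sampling scale widens and shrink as it narrows
```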

Another boundary condition we are interested in is that achieved at the conclusion of training. Assuming that we have simply the best "graduate student descent" resources at our disposal, let's say the fruition of training successfully reaches the absolute global minimum of a loss function's fitness landscape, a sort of best case scenario. In this case attempts at further progression of backpropagation would not realize any weight updates, as a gradient flow in the direction of decreasing loss wouldn't be present. Thus, the degrees of freedom for training updates at this point are effectively, and specifically, 0. (Although, as an asterisk that we'll highlight later in this dialogue, aspects of stochasticity present in the training algorithm or insufficient data type representation capacity may mean that we never reach an exact global minimum; please consider the preceding an ideal case.)

A Training Path’s Degrees of Freedom

Now that we have our boundary conditions, let's explore the conditions experienced throughout a training path from sampled initialization to reaching a global minimum. Consider first the types of stochasticity present in gradient updates, associated with e.g. minibatch compositions, conditional directional deviations from a pure gradient signal arising from learning rate / momentum schemes (e.g. Adam, AdaGrad, etc.), dropout regularization, or the various other forms of stochasticity that may be found in training. In order to simplify, we could aggregate all of those sources of stochasticity into a common term, which we'll consider as a stochastic gradient signal perturbation vector of exotic composition. (This might be an oversimplification for dropout; if that bothers you, pretend we are training without dropout.)
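As a minimal illustration of one such source (minibatch composition only, on a synthetic linear regression of our own construction), the following sketch measures the deviation of a single minibatch gradient from the full-batch gradient, which is the kind of term we are lumping into the aggregated perturbation vector:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1024, 8
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)

def mse_gradient(Xb, yb, w):
    """Gradient of mean squared error for a linear model."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = mse_gradient(X, y, w)
batch_idx = rng.choice(n, size=32, replace=False)    # one minibatch draw
mini_grad = mse_gradient(X[batch_idx], y[batch_idx], w)
perturbation = mini_grad - full_grad                 # deviation from the pure gradient signal
print(np.linalg.norm(perturbation) / np.linalg.norm(full_grad))
```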

Now first restating what we identified above around initialization, we can consider a training path's degrees of freedom before and after sampled initialization. Prior to the sampling, and while the initialization is still in a "superposition", the degrees of freedom of our first training step will be effectively infinite, although still constrained by the complexity capacity of the form of initialization (e.g. the difference in complexity capacity between +/-(inf) uniform sampling versus a layer-specific scaled Gaussian). Once we sample our initialized weights, there will be a "superposition collapse" for an immediate step change to our first training update's degrees of freedom.

These terms borrowed from the quantum realm like "superposition", "collapse", and "measurement" are merely there to support consideration of states of knowledge at various points in time, which can be translated into the degrees of freedom available to backpropagation at different stages of a training path, although the intent is that a reader should simplify their interpretation to a classical probability framing and omit complications like entanglement. We first introduced the term "non-quantum superposition" on this blog in a somewhat whimsical paper titled Dropping Out for Fun and Profit (Teague, 2019).

Now consider that these quantum analogues will also come into play when we consider the aggregated perturbation vectors from all of those sources of stochasticity in training (referring again to sources like minibatch compositions, dropout, etc). After all, without any form of stochasticity, backpropagation could be considered a deterministic process, as each update step would merely follow a path predestined by the initialization and learning rate, following a minimum gradient signal until reaching a local minimum and realizing effectively zero degrees of freedom along the entire route. However, when we include the various sources of stochasticity that may be present in training, suddenly a training path, after taking account of the superposition of possible update paths that may be realized through training, takes place within a superposition spanning a wide range of variational freedom: a nondeterministic process for any minimally sufficient source of randomness entropy seeding, as would be available in most operating system environments. (As an aside, we now speculate that this expansion of the training path superposition is part of what is being recovered in dropout regularization, as the overlap between a set of stochastically sampled thinner layers' gradient signal and the original composition's gradient signal enforces a reduction in such superpositional degrees of freedom, resulting in a regularization effect versus what would be realized without dropout.)

We've used the term "degrees of freedom" pretty liberally in this section. Earlier in the dialogue we equated it to "uncertainty", and even extended it to measures of information content like Shannon entropy. The intent is that such an interpretation can be extended to this non-quantum superposition of training path updates, as by limiting a point of inspection to a specific gradient step at a point of prior knowledge, we expect there is a way to frame the range of possible update steps in terms of a negative sum of log probabilities. We consider the phenomenon as directly related to geometric regularization, which we'll get further into below. In the preprint we pulled a nice quote from (Webb, 1994), "the main idea of regularization theory is to restrict the class of admissible solutions by introducing a priori constraints on possible solutions", which we think sums up the relevance quite well. The point is that restricting the degrees of freedom available to a (non-quantum) superposition of possible update steps enforces a regularization in such classical superposition space, lowering the uncertainty of future gradient update steps even in the context of various sources of stochasticity.

The main idea of regularization theory is to restrict the class of admissible solutions by introducing a priori constraints on possible solutions. — (Webb, 1994)

A Few Related Work Tangents

Having laid a lot of groundwork, the hard part now is relating that framing to the double descent phenomenon. Let's restate the update made to the geometric regularization preprint that served as our response to reviewer #6 (updated text shown in bold):

The expectation is, if an adjacent regularizer is not dominant, a phase change to geometric regularization will emerge as a training path reaches loss values approaching a global minimum due to the overparameterized geometry aligning with the distribution volume reductions from fewer weight sets capable of achieving such loss, explaining the emergence of double descent. This can be inferred because we know that at the global minimum the loss manifold of L_min will have one intrinsic dimension, so we expect that through training the path will traverse a loss manifold tendril of progressively shrinking dimensionality, imposing progressive geometric regularization once that surrounding intrinsic dimensionality retracts across its own peak in a volume versus dimensions curve.

First, as a kind of correction, we offered in this quoted excerpt of our arxiv preprint that at the global minimum the loss manifold will have one dimension. I think we framed this slightly incorrectly. Perhaps a better way to put it is that at the global minimum, a training path will have effectively zero degrees of freedom available, although it should be noted that in the realm of classical superposition associated with sources of stochasticity there may still be some uncertainty surrounding whether an additional update step may realize a (probably smallish) change in weights. (This correction was included in the draft we subsequently submitted to the TAG workshop.)

The interesting part, and where we will later attempt to tie this all together, is when we start to consider the degrees of freedom associated with a training path in relation to the complexity capacity of the network and the intrinsic dimensions of the data generating function. After all, double descent has been demonstrated to arise as a result of exceeding the weight count of the interpolation threshold, which has been approximated as the state where the number of weights equals the number of training samples, although we note again that the findings of (Dhifallah & Lu, 2021) (and particularly their Figure 1(b) shown shortly) suggest to us that increasing the intrinsic dimensions of a data generating function by way of additional noise perturbation vectors added to training data has a proportionate effect on the ratio of weights to samples needed to reach the overparameterization regime.

Ok, sorry to do this, but we opened a can of worms by bringing the interpolation threshold into play. As a brief rehash, "interpolation threshold" refers to a derivation commonly considered around the double descent phenomenon which considers model complexity based on the ratio of the aggregate number of weights to the number of available training samples, see (Belkin et al., 2019), and describes the model parameterization capacity needed to achieve the double descent phenomenon. We do not believe the findings of Dhifallah & Lu have been widely considered in research around the double descent phenomenon, so let's take a brief tangent and try to consider the implications.

(Dhifallah & Lu, 2021) Figure 1(b) — (k=parameter count, n=sample count, l=count of noise perturbation vectors)

We interpret that by demonstrating their Figure 1(b), Dhifallah & Lu have revealed that the ratio of weights to training samples contains insufficient information to fully derive the location of an interpolation threshold. Given what we demonstrated earlier in the dialogue, that each noise perturbation vector increases the intrinsic dimension of the data generating function by the degrees of freedom of the noise vector (e.g. an increase of 1 for a univariate Gaussian), this suggests that the correct way to formally derive an interpolation threshold would need to also consider the intrinsic dimension of the data generating function. (Perhaps more interestingly, we wonder if the correlation could be evaluated in the reverse direction, using the presence of an interpolation threshold across variations in parameter count as a form of derivation of the intrinsic dimension of an unknown data generating function.)

We suspect that the demonstrated effectiveness of approximating the threshold by the ratio of parameters to samples may partially be associated with phenomena found in regimes of training data scale corresponding to partially represented data generating functions. We suspect that our prolonged focus on the tabular modality over the last few years may have given us some unique insights into considerations surrounding underserved training data, as we believe that common benchmark sets in the tabular modality are uniquely capable of fully (or at least nearly fully) representing a data generating function, especially in comparison to other modalities like image or text where it may take orders of magnitude more samples to make such a claim. For example, consider the deep learning data augmentation benchmarks on the Higgs data set (Baldi et al., 2014) that we demonstrated in our paper Numeric Encoding Options with Automunge (Teague, 2020), in which data augmentation by noise injection was applied at progressively shrinking scales of data representation by way of fewer presented training samples before augmentation. We found that for the fully represented data set the data augmentation had a nearly negligible effect, but as the original scale of samples was reduced, data augmentation had a proportionately more pronounced positive impact. We speculate that this phenomenon could be considered related, or at least tangent, to the impact of the scale of training samples on an approximation of the interpolation threshold.

Excerpted from Numeric Encoding Options with Automunge (Teague, 2020) => the final column demonstrates impact of data augmentation by noise injection at different scales of training data representation

We speculate that this distinction of demonstrated benefit of data augmentation in different regimes of data representation (i.e. training data samples at a scale of under represented to fully represented data generating functions) may yet be found to be a kind of boundary condition towards the usefulness of the common ratio of parameters to training samples, where beyond this regime of full functional representation additional parameters may not be needed.

Consider a possibly related phenomenon found in quantum neural networks in which model complexity with increasing parameterization eventually experiences a phase change towards a saturated curve (Larocca, 2022) & (Haferkamp, 2022). (Note that quantum neural networks targeting a quantum data set may differ from those targeting a classical data set in that we believe the data generating function corresponding to the sensor dimensionality will always be fully represented within the scope of the sensor superposition.)

Tweet featuring image excerpts from (Haferkamp) and (Larocca) QIP 2022 presentations

Now backing up from quantum neural networks, in further related work surrounding questions of under-represented training data in the context of noise injection applied towards data augmentation, a dedicated section of the paper Stochastic Perturbations of Tabular Features for Non-Deterministic Inference (Teague, 2021) looked at noise injection's impact on intrinsic dimensions when noise is used for data augmentation. There we described what we considered a reasonable inference that part of the benefit was likely associated with the noise diverting a model's complexity capacity away from modeling what we referred to as "spurious dimensions associated with gaps in the fidelity of the data generating function from underrepresented training samples" and towards modeling "tractable dimensions associated with the additional perturbation vectors from injected noise". (We don't have rigorous theoretical backing for this explanation other than that it appears to align with Occam's razor.)

These various points of related work are interesting, but they possibly don't conclusively give us a way to describe what dimensionality is being considered by the complexity capacity of the ratio of parameters to training samples. Let's briefly pretend, for the sake of discussion, that what we noted above (about the potential for a model's complexity ratio to saturate once a data generating function is fully represented in the available training samples) is more than just a conjecture, so as to see what it would imply. If it can be found that the impact of the ratio of parameters to training samples on the interpolation threshold eventually saturates in a manner resembling what has been found for quantum neural networks, then perhaps we need to consider two framings of model complexity, what we will refer to as 1) complexity capacity of representation and 2) applied complexity capacity of memorization.

What we noted above for a tendency of a model's complexity capacity to increase with increasing "ratio of influence" at constant parameterization does not imply an eventual saturation of representation complexity capacity on its own; however, once the representation complexity capacity of the network exceeds the complexity associated with the intrinsic dimensions of the actual data generating function, it is reasonable that it will eventually saturate in applied complexity capacity of memorization. We speculate the reason this effect may not yet have been demonstrated empirically is that experiments have yet to be conducted on fully represented data generating functions. In cases of underserved training data, the "spurious dimensions" associated with gaps in the fidelity of the data generating function from underrepresented training samples are not a static entity; increasing model representational complexity will often only result in identifying and attempting to represent even more diverse or finely grained spurious dimensions, such that the only way the applied memorization capacity could saturate would be in cases where the data generating function is fully represented by the presented training samples. This regime would likely be quite difficult to achieve in most modalities without synthetic data or simplified setups (like the circle on white noise framing), however we hope it may be demonstrated in deep learning applications applied to tabular benchmark sets like the Higgs data set (Baldi et al., 2014), which we think our benchmarks in (Teague, 2020) suggest is much closer to a fully representative set of training samples in aggregate.

Tying in to Double Descent

Ok, thank you for humoring us with your attention through those tangents. Now let's turn back to the main question surrounding the origination of the double descent phenomenon. To elaborate on what was proposed in our preprint we'll first need to go on another tangent though. Sorry bout that. Yeah so if you're up to speed on the geometric regularization conjecture from our preprint, then you're familiar with the volumetric trends corresponding to increasing the dimensionality of a unit hypersphere; here is the figure we included in the preprint linked in part 1 of this dialogue.

Volume (V) and Surface Area (S) of unit hypersphere with dimensionality (n), Image via (Cmglee, 2018)
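For readers without the figure at hand, the underlying curves are straightforward to reproduce; a short sketch of the standard unit hypersphere volume and surface area formulas follows, showing the peak and subsequent decay toward zero with increasing dimension:

```python
from math import gamma, pi

def unit_ball_volume(n):
    """Volume of the n-dimensional unit ball: pi^(n/2) / Gamma(n/2 + 1)."""
    return pi ** (n / 2) / gamma(n / 2 + 1)

def unit_sphere_surface(n):
    """Surface area of the unit sphere in n dimensions: 2 * pi^(n/2) / Gamma(n/2)."""
    return 2 * pi ** (n / 2) / gamma(n / 2)

for n in range(1, 21):
    print(n, round(unit_ball_volume(n), 4), round(unit_sphere_surface(n), 4))
# volume peaks near n = 5 and surface area near n = 7, with both decaying toward zero beyond
```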

Restating the essence of the geometric regularization conjecture in its simplest form (thanks again reviewer #6:)

We describe a loss manifold's weight distributions per J(w)=L as a geometric figure, and infer by relating to hyperspheres how the volume corresponding to a given loss J(w)=L_i should shrink with increasing parameterization. We note the main idea of regularization theory is to restrict the class of admissible solutions by introducing a priori constraints.

So where does this volumetric peak manifest in double descent? Recall again our clarification to the reviewer:

we know that at a global minimum the loss manifold distribution at J(w) = L_min, as a point, will have effectively zero dimensions, so we expect that through training the path will traverse a surrounding loss manifold tendril of shrinking effective dimensionality, with re-emergence of geometric regularization when that surrounding volume retracts across its own peak in a volume versus dimensions curve.

Here we're again referring to the degrees of freedom available to a training path's update step, but this time, instead of describing them in terms of entropy, we consider the distribution of possible update steps, after taking into account any stochasticity included, as a geometric figure with a dimensionality of its own. This differs from the geometric figure associated with the original broader geometric regularization conjecture, which was simply the distribution of all weights associated with a loss value, as in this case we need to take account of the specific weight configuration at the point where a next update step is considered for a training path. We use the term "tendril" to suggest that as the training path reaches loss values approaching the point associated with the global minimum, the surrounding intrinsic dimensionality of this smaller geometric figure will continue shrinking until reaching effectively zero dimensions at the global minimum. At some point along that training path, just like would be experienced by a hypersphere, the dimensionality of this geometric figure representing the distribution of possible update steps will retract across a peak in a volume versus dimensions curve, below which a re-emergence of geometric regularization will be realized, explaining the emergence of the double descent phenomenon.

quod erat demonstrandum

Part 3 — Conclusion

The Principles of Deep Learning Theory

Before closing, we would like to highlight some very interesting recent work on questions surrounding the behavior of neural networks in the context of the overparameterization regime, which does a much better job than our work of considering theoretical aspects in a rigorous manner, referring of course to the newly released book The Principles of Deep Learning Theory by Daniel Roberts and Sho Yaida. Upon our first review of their work we considered the various explorations into overparameterization, like the infinite width neural tangent kernel framing, as entirely complementary to the geometric regularization conjecture. Although we suspect that our "ratio of influence" metric may prove to be a more useful (more tractable in exotic network configurations) representation than the depth to width ratio discussed throughout that work, we don't think it disagrees with their work; we just consider their depth to width ratio a less granular version of our metric. Several aspects of their work we consider as supportive of, or at least aligned with, various aspects of the geometric regularization conjecture, for example:

  • The deeper a network, the less the infinite width Gaussian distributed weights description applies due to what they describe as “accumulation of fluctuations”, and eventually more depth causes perturbation theory to break down (pg 64)
  • The inductive bias of overparameterization can only be overcome below infinite width (pg 177) => we believe this is consistent with an asymptotic zero volume convergence
  • Deeper layers have an inductive bias to build more neural associations (pg 181) => we believe this is consistent with the ratio of influence framing
  • We can actually fully train infinite-width networks in one theoretical gradient-descent step. That is, we can take a giant leap right to the minimum of the loss (pg 252), which can be achieved independent of the applied algorithm (pg 257)
  • At infinite width the fully-trained mean network output is just a linear model based on random features. In this sense, infinite width neural networks are rather shallow in terms of model complexity, however deep they may appear (pg 289)
  • Complexity is not really captured by the counting of model parameters, instead it is expressed in the structure of the model (pg 322)
  • The output of a fully-trained network does depend on the details of the algorithm used for optimization. (pg 357)
  • A nearly-sparse model complexity is perhaps the most important inductive bias of deep learning (pg 399)
  • Note that the book offers a full appendix section devoted to information theory in deep learning, we recommend this as a potential resource for a theoretic deep dive. (pg 401)

As a note to these authors, regarding their questions surrounding the empirical performance of ReLU activations in relation to tanh (pg 245), we suggest that our histogram experiments in the geometric regularization preprint may actually be uniquely illuminating on this point, as it was demonstrated that ReLUs tend to result in a loss manifold histogram with a much larger density of low loss configurations than is presented by tanh.

Additional Related Work

We refer the reader also to other work that has sought to relate data scaling to parameterization from the likes of (Kaplan et al 2020, Hoffmann et al 2022, Zhai et al 2022), which is another tangent to this writeup. Such works have sought to challenge the conventional wisdom that increased parameterization is basically universally beneficial at any scale. Instead these works appear to demonstrate that one can seek to balance the scale of parameterization against the number of training tokens in order to optimize the compute cost of training versus performance.

We believe, as noted in the draft of our paper that was submitted to the ICML TAG workshop, that such works are aligned with the geometric regularization conjecture. After all, the asymptotic trend of a flattening volume curve suggests a point of diminishing returns from overparameterization.

In Closing

In closing to this long and dense ramble of an essay, we offer to the reader that a big open question not answered by the geometric regularization conjecture is associated with yet another boundary condition. It has been established that infinitely wide networks converge to a linear kernel, and infinite parameterization converges to zero volume of weight distributions. That leaves us with the third point in the trifecta. What would be the inductive bias found in an infinitely deep network of infinitely wide layers?

We offer to the reader in only a partially playful manner that perhaps this is the solution that Einstein was looking for all along. Perhaps an infinitely deep network of infinite width layers would converge to an inductive bias corresponding to the fundamental laws of physics.

Hey theoreticians, would you get on that please?

Cheers.

References

Baldi, P., Sadowski, P., and Whiteson, D. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5(1), jul 2014. doi: 10.1038/ncomms5308. URL https://doi.org/10.1038%2Fncomms5308.

Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019. doi: 10.1073/pnas.1903070116. URL https://www.pnas.org/doi/abs/10.1073/pnas.1903070116.

Cmglee. Hypersphere volume and surface area graphs, 2018. URL https://en.wikipedia.org/wiki/File:Hypersphere_volume_and_surface_area_graphs.svg.

Dhifallah, O. and Lu, Y. M. On the inherent regularization effects of noise injection during training, 2021. URL https://arxiv.org/abs/2102.07379.

Haferkamp, J., Faist, P., Kothakonda, N. B. T., Eisert, J., and Halpern, N. Y. Linear growth of quantum circuit complexity. Nature Physics, 18(5):528–532, mar 2022. doi: 10.1038/s41567-022-01539-6. URL https://doi.org/10.1038%2Fs41567-022-01539-6.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.

Larocca, M., Ju, N., García-Martín, D., Coles, P. J., and Cerezo, M. Theory of overparametrization in quantum neural networks, 2021. URL https://arxiv.org/abs/2109.11676.

Lloyd, S. Measures of complexity: a nonexhaustive list. IEEE Control Systems Magazine, 21:7–8, 2001.

Roberts, D. A., Yaida, S., and Hanin, B. The Principles of Deep Learning Theory. Cambridge University Press, may 2022. doi: 10.1017/9781009023405. URL https://doi.org/10.1017%2F9781009023405.

Song, C., Ramezani-Kebrya, A., Pethick, T., Eftekhari, A., and Cevher, V. Subquadratic overparameterization for shallow neural networks. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=NhbFhfM960.

Teague, N. Dropping out for fun and profit, 2019. URL https://medium.com/from-the-diaries-of-john-henry/dropping-out-for-fun-and-profit-152a5539e616.

Teague, N. J. Stochastic perturbations of tabular features for non-deterministic inference with automunge, 2022a. URL https://arxiv.org/abs/2202.09248.

Teague, N. J. Geometric regularization from overparameterization explains double descent and other findings, 2022b. URL https://arxiv.org/abs/2202.09276.

Teague, N. J. Numeric encoding options with Automunge, 2022c. URL https://arxiv.org/abs/2202.09496.

Webb, A. Functional approximation by feedforward networks: A least-squares approach to generalization. Neural Networks, IEEE Transactions on, 5:363–371, 06 1994. doi: 10.1109/72.286908. URL https://ieeexplore.ieee.org/document/286908.

Witten, E. A mini-introduction to information theory. La Rivista del Nuovo Cimento, 43(4):187–227, mar 2020. doi: 10.1007/s40766-020-00004-5. URL https://doi.org/10.1007%2Fs40766-020-00004-5.

Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12104–12113, June 2022.

Zhao, S., Song, J., Choi, K., Kalluri, P., Han, Y., Jiao, J., Dimakis, A., Poole, B., Weissman, T., and Ermon, S. Neurips 2019 workshop on information theory and machine learning, 2019. URL https://sites.google.com/view/itml19.

Books that were referenced here or otherwise inspired this post:

The Principles of Deep Learning Theory — Daniel Roberts & Sho Yaida



For further readings please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
