Geometric Regularization from Overparameterization
An archive of the arXiv
Authored by Nicholas Teague in 2021, 2022 (without use of language models), formally distributed at arXiv:2202.09276.
Abstract
The volume of the distribution of weight sets associated with a loss value may be the source of implicit regularization from overparameterization, owing to the phenomenon of contracting volume with increasing dimensionality demonstrated by hyperspheres. We introduce the geometric regularization conjecture and extend it to an explanation for the double descent phenomenon by considering a similar property arising from the shrinking intrinsic dimensionality of the distribution of potential weight set updates available along a training path, where if that distribution retracts across a volume versus dimensionality curve peak when approaching the global minimum we could expect geometric regularization to re-emerge. We illustrate how data fidelity representational complexity may influence model capacity double descent interpolation thresholds. The existence of epoch wise and model capacity double descent curves originating from different geometric forms may imply a universality of closed n-manifolds having dimensionally adjusted n-sphere volumetric correspondence.
1. Introduction
The aggregate geometry of weight configuration distributions corresponding to a loss value has, to our knowledge, been an unexplored property of neural networks, likely due to the intractability of its derivation at such high dimensions. If one could model the full geometry of a loss manifold then backpropagation would not be required. We attempt to circumvent that challenge by considering meta properties of distributional geometry that can be inferred independent of fine grained details.
A key contribution of this work is identifying the relationship between the extent of overparameterization and the volume of such geometry by relating it to a well understood property of hyperspheres, which have a zero volume asymptotic trend with increasing dimensionality. Such contracting volume should serve as a form of regularization by restricting the degrees of freedom available to weight sets along a training path, which we refer to as geometric regularization. We believe double descent is due to an additional correspondence to hypersphere volumes at lower dimensions, associated with a peak in volume traversed when the distribution of possible weight updates available along a training path follows a path of shrinking intrinsic dimensionality at loss values approaching the global minimum. We expect that advancing theory for derivation of an interpolation threshold may need to consider the intrinsic dimension of a training corpus at different fidelities of representation.
2. Hyperspheres
Consider the equation for a three dimensional unit sphere, x^2 + y^2 + z^2 = 1, where the volume of a sphere of radius r is 4πr^3/3, which for the unit sphere is 4π/3. Now consider a hypersphere where we increase the number of dimensions governed by the similar formula w^2 + x^2 + y^2 + z^2 + ... = 1. To visualize in an abstract way, consider the difference between a perfect sphere and a collection of fronds at the top of a palm tree. Perhaps surprisingly, for hyperspheres both the volume and surface area briefly increase with increasing dimensions until they reach a peak, after which point they progressively shrink toward an asymptote at zero [Fig 1]. The paradox of hyperspheres is that despite this decreasing volume and surface area, the expected distance between two sampled points actually increases with parameterization Tu & Fischbach (2002). Due to the curse of dimensionality, once a manifold reaches thresholds beyond the order of 10 dimensions, evaluating fine-grained structure from random sampling becomes exponentially hard Erba et al. (2019). Mathematicians currently have a better understanding of hyperspheres than of other high dimensional objects, even simple shapes like hypercubes Granata & Carnevale (2016); however, this type of zero asymptotic volume convergence appears likely to arise whenever a shape is constrained through dimensional adjustment to a single scale, which for a hypersphere could be the unit radius or for a loss manifold some loss value. We just don't know where the peak of the volume curve would occur. It is possible the machine learning community may have found another framing that can approximate the location of such a peak by way of the double descent interpolation threshold.
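As a concrete illustration of the volume trend in [Fig 1], the closed form for the volume of a unit n-ball, V_n = π^(n/2) / Γ(n/2 + 1), can be evaluated directly. The short Python sketch below (not part of the paper's original materials) prints the peak near five dimensions and the subsequent collapse toward zero.

```python
# A minimal sketch of the hypersphere volume trend referenced above: the volume
# of a unit n-ball, V_n = pi^(n/2) / Gamma(n/2 + 1), briefly rises with dimension
# before asymptotically collapsing toward zero.
from math import pi, gamma

def unit_ball_volume(n):
    """Volume of the unit n-ball in n dimensions."""
    return pi ** (n / 2) / gamma(n / 2 + 1)

for n in range(1, 21):
    print(f"dimensions={n:2d}  volume={unit_ball_volume(n):.6f}")
# The printed volumes peak near n = 5 (about 5.26) and fall below 0.1 around n = 18,
# illustrating the contraction that motivates the geometric regularization conjecture.
```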
One way to think about the loss function of a neural network is as an unconstrained formula with weights (and other hyperparameters) as the variables, for example a loss function J may be derived as a function of weights J(w) = ? (where the abbreviation is meant to generalize across loss metrics like mean absolute error or cross entropy), and through backpropagation we are trying to minimize J(w). However, when you consider that a fitness landscape will in general have a global minimum, backpropagation shifts the loss function in the direction of a minimum loss Lmin as J(w) → Lmin. This also applies to any given value of L, that is, for any given loss, the formula J(w) = L is a constrained formula where each weight has some distribution of potential values associated with that loss, similar to how in a hypersphere there is some distribution of each variable associated with a specific radius. Thus J(w) can be approximated as a constrained formula around the weight set associated with the global minimum as well as for losses in the backpropagation states preceding the global minimum, where the volume of the distribution of weights is expected to contract as the loss approaches the global minimum, since fewer weight sets are capable of achieving better performance, and the volume converges to a point (a single weight set) at the global minimum.
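As a hedged toy illustration of this volume contraction (using a stand-in quadratic loss, not a claim about real loss manifolds), the fraction of randomly sampled weight sets satisfying J(w) ≤ L can be estimated by Monte Carlo and observed to collapse as the parameter count grows:

```python
# A toy Monte Carlo illustration (not the paper's method): for a quadratic stand-in
# loss J(w) = ||w||^2, estimate what fraction of randomly initialized weight sets
# satisfy J(w) <= L as the parameter count grows. The constrained set here is exactly
# a ball, so the fraction collapses with dimension, mirroring the volume argument.
import numpy as np

rng = np.random.default_rng(0)
L = 1.0            # the fixed loss level playing the role of a hypersphere radius
n_samples = 200_000

for dims in (2, 5, 10, 20, 40):
    w = rng.uniform(-1.0, 1.0, size=(n_samples, dims))  # stand-in for initialization sampling
    losses = np.sum(w ** 2, axis=1)                      # toy loss J(w)
    fraction = np.mean(losses <= L)
    print(f"parameters={dims:3d}  fraction with J(w) <= {L}: {fraction:.5f}")
# The fraction shrinks rapidly with dimension; for a real network the geometry is far
# more complicated, but this is the volumetric intuition the conjecture appeals to.
```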
Gaps in loss manifold volume refer to directions of update along a training path that would result in increased loss relative to the prior epoch, which will not be considered a viable path barring some form of momentum or other deviation from a pure gradient signal. The volume of a loss manifold can be considered in two framings: the volume of weight set distributions that are capable of realizing an exact loss value, or the volume of weight set distributions that are capable of achieving a loss lower than the prior epoch, in which case the second framing's volume plus the volume of the corresponding gaps will equal the volume of the full range of initialization sampling. Note that an initialization sampled from a normal distribution means that there will be a (vanishingly small) probability of an infinite range of possible weight values sampled at initialization, thus when we talk about the volume of initialization sampling it needs a probabilistic element to be meaningful, e.g. the volume of initialized weights for which P(w) > 10^−10.
A hypersphere aligned volumetric transience through increasing dimensions, translated to our high dimensional loss function J(w), is really just another way of saying that with increasing dimensions from parameterization the degrees of freedom available to each weight corresponding to a given loss value will be diminished, somewhat similar to what happens with L1 regularization, which promotes collective sparsity of a weight set Bengio (2012). However, here we are not talking about the sparsity of a single collective weight set, but rather the sparsity of weight set distributions corresponding to a loss value (a loss manifold). This implies that for an individual weight wi, the distribution of wi corresponding to a given loss value will also become sparser with parameterization. With correlations, dimensionality's influence on individual weight distribution sparsity will be harder to see than sparsity across weights.
The main idea of regularization theory is to restrict the class of admissible solutions by introducing a priori constraints on possible solutions Webb (1994). Thus, with increasing dimensionality, a trend toward decreasing volume and surface area of hyperspheres could imply a corresponding trend towards increasing sparsity of each weight’s distribution associated with a loss value, which would enforce a kind of regularization by constraining degrees of freedom for weight sets traversed through a training path, explaining the inherent regularization of overparameterized networks. [Appendix B] surveys related phenomena.
The preceding considers asymptotic dimensionality. The double descent phenomenon Belkin et al. (2019) may be associated with an additional correspondence to hypersphere volumes at lower dimensions, where hyperspheres exhibit a peak in volume across dimensions [Fig 1]. Consider the distribution of possible update steps available to an epoch, after taking into account any stochasticity included in the training loop, as a geometric figure with a dimensionality of its own. This differs from the geometric figure associated with the broader geometric regularization conjecture, which was simply the distribution volume of all weights associated with a loss value: in this case we need to account for the specific weight configuration at the point where a next epoch update step is considered for a training path, or more particularly the distribution of possible update steps available from each point, which distributions will have an intrinsic dimensionality Berezniuk et al. (2020) of their own that varies along a training path tendril. We use the term "tendril" to suggest that as the training path reaches loss values approaching the point of global minimum in the loss manifold, the surrounding intrinsic dimensionality of this smaller geometric figure will continue shrinking until reaching effectively zero dimensions at the global minimum, as in the ideal case the loss manifold distribution at J(w) = Lmin, being a point, has effectively zero dimensions. With correspondence to hyperspheres, at some point along the training path the intrinsic dimension of this distribution of possible gradient steps will retract across a peak in a volume to dimensions curve, below which geometric regularization will re-emerge, visible when not masked by an adjacent regularizer, which explains the emergence of an epoch wise double descent.
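One hedged sketch of how the intrinsic dimensionality of such an update step distribution might be probed empirically is given below, using a toy linear regression and a participation ratio of the mini-batch gradient covariance spectrum; the model, data, and estimator choices are illustrative assumptions rather than a procedure used in this paper.

```python
# A hedged sketch, not the paper's procedure: collect gradients from many random
# mini-batches at a fixed weight configuration of a toy linear model and compute a
# participation-ratio effective dimension of their covariance spectrum.
import numpy as np

rng = np.random.default_rng(1)
n, d = 512, 30
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def minibatch_gradient(w, batch_idx):
    """Gradient of mean squared error over one mini-batch for a linear model."""
    Xb, yb = X[batch_idx], y[batch_idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(batch_idx)

def effective_dimension(w, n_batches=200, batch_size=32):
    """Participation ratio of the covariance eigenvalues of sampled gradient steps."""
    grads = np.stack([
        minibatch_gradient(w, rng.choice(n, batch_size, replace=False))
        for _ in range(n_batches)
    ])
    eig = np.clip(np.linalg.eigvalsh(np.cov(grads, rowvar=False)), 0.0, None)
    return eig.sum() ** 2 / (eig ** 2).sum()

w = rng.normal(size=d)   # a hypothetical point early on a training path
print("effective dimension of the update distribution:", effective_dimension(w))
# Tracking this quantity along a training path would be one conceivable probe of the
# shrinking-dimensionality narrative; this toy only shows the bookkeeping and does not
# by itself validate the conjecture.
```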
A characteristic feature of double descent is the interpolation threshold for overparameterization, which can be approximated as a minimum boundary where the number of parameters exceeds the number of training samples, and below which threshold double descent does not manifest. We offer further speculative musings on its origination in [4].
It has been observed that the global minimum of an overparameterized model may not be a single point: at some scale of parameters it transitions to becoming a submanifold, as in having a range of possible weight sets all sharing a common loss value as the minima of the optimization's fitness landscape Weinan (2022). We suspect a very simple explanation arising from the geometric regularization phenomenon coupled with the practicality of the numeric representations of weights, activations, and gradients that collectively parameterize a loss function: it isn't that geometric regularization is shrinking the count of available function representations, it is just that it is squishing them to numeric values falling below the capacity of the data type representing these parameters. (Not squished to underflow territory, but rather resulting in delta updates from gradient steps falling below the step size available to increment a value within the precision capacity of a data type representation's bit registers.)
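The numeric precision point can be illustrated in a few lines: a delta well above underflow can still vanish when added to a stored weight because it falls below the float32 spacing at that magnitude (the values below are arbitrary placeholders).

```python
# A small illustration of the numeric-precision point above: a gradient delta that is
# well above underflow can still be absorbed without effect, because it falls below
# the spacing between adjacent float32 values at the weight's magnitude.
import numpy as np

w = np.float32(0.25)          # a hypothetical trained weight value
delta = np.float32(1e-9)      # a hypothetical tiny gradient update
print(np.spacing(w))          # distance to the next representable float32 (~3e-8)
print(w + delta == w)         # True: the update vanishes, the stored weight is unchanged
# Many distinct "ideal" weight sets can thus collapse onto the same stored representation,
# consistent with the submanifold-of-minima observation discussed above.
```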
The generalized Poincaré Conjecture, proven for dimensions > 4 Smale (1961) and later extended to 4 Freedman (1982) and 3 Perelman (2002), states that every simply connected, closed n-manifold that is homotopy equivalent to the n-sphere is homeomorphic to the n-sphere, which demonstrates that topologically related manifolds have one-to-one point correspondence. Increased regularization with overparameterization could be interpreted as empirically suggestive of a similar dimensionally adjusted volumetric correspondence. This is somewhat of a conjecture, but with the supporting evidence of well understood hypersphere geometry, extensive empirical demonstration, and the reality of a double descent found in both epoch wise and model capacity wise curves, we believe it is the most credible hypothesis to date.
3. Related Work
It has been known for some time that hyperspheres have asymptotic volumetric trends with increasing dimensionality Smith & Vamanamurthy (1989). Our understanding of other high dimensional geometries is less precise. Another channel of hypersphere theoretic research has been associated with finding optimal hypersphere packing densities at different dimensions, including notable recent results of proved optimal packing densities at 8 dimensions Viazovska (2017) and 24 dimensions Cohn et al. (2017). To our knowledge the primary published work from other authors seeking to relate properties of hyperspheres to neural network regularization applied a constraint to weight updates in order to achieve hyperspherical uniformity in a manner analogous to L2 regularization Liu et al. (2021), which does not overlap with our conjecture.
Overparameterization is commonly considered as the set of neural network architectures with a number of weights exceeding the complexity threshold where the count of weights equals that of training samples, although distinctions such as mild overparameterization versus heavy overparameterization may also come into play. A notable unexpected property of the overparameterization regime is that the conventional wisdom for the bias-variance tradeoff in training appears to be contradicted, with emergence of an "epoch wise" double descent training curve in which progressing through epochs initially manifests overfit that reaches a peak before continued training results in a recovery of test performance, realizing better generalization than was achieved prior to the overfit state Nakkiran et al. (2019). The interpolation threshold beyond which the double descent phenomenon appears can be illustrated by charting performance curves for train and test data as a function of model complexity capacity [Figure 2] Belkin et al. (2019), which exhibit what we call a "model capacity" double descent, directly related to but distinct from the epoch wise double descent. Model capacity is often approximated by a ratio of weight to sample counts.
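As a rough sketch of this heuristic (with placeholder widths and sample count, not values from any cited experiment), a parameter count for a fully connected architecture can be tallied and compared against the training set size:

```python
# A rough helper for the conventional heuristic described above: count the parameters
# of a fully connected architecture and compare against the number of training samples.
# The layer widths and sample count are arbitrary placeholders.
def dense_param_count(layer_widths):
    """Weights plus biases for consecutive fully connected layers."""
    return sum(w_in * w_out + w_out
               for w_in, w_out in zip(layer_widths[:-1], layer_widths[1:]))

widths = [20, 256, 256, 1]     # hypothetical input, two hidden layers, single output
n_samples = 10_000             # hypothetical training set size
params = dense_param_count(widths)
print(params, params / n_samples)
# A ratio above 1 is the usual shorthand for the overparameterized regime, though the
# paper argues this ratio alone may not universally locate the interpolation threshold.
```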
The overparameterization convention appears to have several benefits. Empirical studies demonstrate reduced risk of overfit Li & Liang (2018) and remarkably small generalization error Zhang et al. (2017). The benefits appear to manifest across frameworks and modalities of application. Overparameterized models appear to result in smoother fitness landscapes with a smaller ratio of saddle points to global minima Simsek et al. (2021). The resulting models appear more robust to covariate shift, meaning distributional discrepancies between train and test data with retained label correlations Tripuraneni et al. (2021), and their interpolations are smoother with a smaller Lipschitz constant Bubeck & Sellke (2021). Although increasing parameters will have resource and latency impacts to inference, the resulting models can often be pruned with little or no cost to generalization Barsbey et al. (2021).
While there is often a material increase in the time, cost, and complexity of training these models, the resulting performance characteristics appear to more than offset those costs, and progressively higher thresholds of overparameterized transformers Vaswani et al. (2017) have led to natural language implementations with few shot learning capabilities like GPT-3 Brown et al. (2020) and several emerging foundation models since Bommasani et al. (2021).
The source of this phenomenon has been somewhat of a mystery to researchers. It has been demonstrated empirically that size alone does not explain it, and that some form of capacity control or implicit regularization is at play Neyshabur et al. (2015). The phase transition to a double descent phenomenon has been directly linked to varying the ratio between the number of parameters and samples in unregularized networks Derezinski et al. (2020). There appears to be some relevance to model initialization considerations as models tend to learn a network close to the initialized random weights Li & Liang (2018). Counter to classical stochastic optimization theory, these models appear to train better with a constant SGD learning rate without momentum Sankararaman et al. (2020). Perhaps even more perplexing, aspects of the phenomenon are not limited to neural networks, with a similar double descent curve and generalization benefits with increasing model complexity capacity being demonstrated in other paradigms like kernel methods, nearest neighbors Belkin et al. (2018), decision tree paradigms like random forest and gradient boosting Belkin et al. (2019), and quantum neural networks Larocca et al. (2021).
Noise injections to training features result in an increased threshold for number of parameters needed to reach the overparameterization regime [Figure 3] Dhifallah & Lu (2021), which we speculate is associated with additional perturbation vectors causing an increase to the intrinsic dimension of the modeled transformation function in a manner similar to data augmentation’s impact to intrinsic dimension Marcu & Prügel-Bennett (2021).
The practice has mostly positive impacts to performance, although it does have the possibility of lazy training where a model converges exponentially fast to zero training loss to recover a linear model Chizat et al. (2019), which may occur under some choices of hyperparameters. Properly sampled initializations benefit the optimization Sutskever et al. (2013) and help to avoid the undesirable lazy training phenomenon Stöger & Soltanolkotabi (2021). There have been reports of degradation of performance on under-represented subgroups, however this appears to resolve with better calibration to a classification output layer Menon et al. (2021).
Theoretical study of the phenomenon has followed several branches; this paper's survey is not exhaustive. An influential channel of inquiry considers neural layers approaching the infinite width limit, where the network's modeled function becomes a Gaussian distributed process at initialization, an assumption that underlies the neural tangent kernel equivalency Jacot et al. (2018) that can be used to represent networks as a kernel function. This equivalent kernel's positive definiteness Fasshauer (2011) can be used to evaluate network convergence properties, although this finding alone may not be sufficient to explain the impact of overparameterization since it appears to abate in the presence of skip connections or batch normalization Goldblum et al. (2020). Other researchers have attempted to reason about translations in fitness landscape properties between different parameter regimes, which are closer aligned to the theme of this work.
An overparameterized network is capable of learning any function represented by a corresponding network of fewer parameters Sun et al. (2021). This property appears to extend to networks of discrete activations learning the functions of smaller networks of smooth activations as has been proven for three layer networks modeling two layers Allen-Zhu et al. (2019). Networks have a universal connectivity property, so that if a modeled function may exist in parameter space it at least has the potential to be reached in training from a random initialization Draxler et al. (2018), which finding has also been extended to ReLU activations Freeman & Bruna (2017). It has been considered that a point in the fitness landscape of a network will translate to a corresponding manifold in the fitness landscape of a larger network Simsek et al. (2021), which observation is the closest we’ve seen in the literature to reasoning about geometry of loss manifold distributions as is the focus of this paper. An influential tangent line of inquiry has considered geometric properties of feature manifolds Bronstein et al. (2017).
There also appear to be some differences depending on whether the source of overparameterization is increasing network width or depth. It has been observed that wider networks are easier to train Jacot et al. (2018), while deeper models have an implicit bias towards sparsity Gissin et al. (2020) and may exhibit loss manifolds with an increased prevalence of non-convexities Li et al. (2018a). The parameterization expected to reach generalization has been considered higher for deeper than for wider networks, although it has recently been suggested that a mild overparameterization can also suffice for deep networks Chen et al. (2021). Deeper models have been shown to be more efficient at modeling higher complexity functions than shallow networks Eldan & Shamir (2016). Gradient confusion refers to negatively correlated gradients between mini-batches, which has been shown to trend higher with deeper networks Sankararaman et al. (2020).
Implementing overparameterization in practice involves several considerations. Theorists have suggested choosing a width based on the point at which learning algorithms can provably learn a zero loss in non convex training, and then if increasing the number of training samples the parameterization can be increased by widening layers in a corresponding manner Song et al. (2021). The duration of training may be balanced between the scale of parameters and training tokens Zhai et al. (2022). Deeper networks may realize similar benefits to wider networks with a common degree of mild overparameterization Chen et al. (2021). Non-smooth activations like ReLU are still appropriate and train faster than smooth activations Panigrahi et al. (2020). As noted above, vanilla SGD with a constant learning rate has been found to outperform scheduled learning rate methods Sankararaman et al. (2020). The He initialization heuristic appears to lie at the boundary of the well-performing regime Song et al. (2021), which, since these models tend to learn a solution close to initialization, is an important consideration. Note that overparameterization can even be achieved by introducing intermediate linear layers that after training can be contracted algebraically to realize a more compact model Guo et al. (2020).
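The final point about contracting intermediate linear layers can be sketched in a few lines of numpy; this is a minimal illustration in the spirit of Guo et al. (2020) rather than their implementation.

```python
# A minimal sketch of the algebraic contraction mentioned above: two consecutive linear
# layers with no intervening nonlinearity compose exactly into a single linear layer.
import numpy as np

rng = np.random.default_rng(2)
d_in, d_mid, d_out = 8, 64, 4

W1, b1 = rng.normal(size=(d_mid, d_in)), rng.normal(size=d_mid)   # expanded (overparameterized) form
W2, b2 = rng.normal(size=(d_out, d_mid)), rng.normal(size=d_out)

W_compact = W2 @ W1            # contracted weights
b_compact = W2 @ b1 + b2       # contracted bias

x = rng.normal(size=d_in)
expanded = W2 @ (W1 @ x + b1) + b2
compact = W_compact @ x + b_compact
print(np.allclose(expanded, compact))   # True: same function, far fewer parameters at inference
```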
Recent work from Hoffmann et al. (2022) has demonstrated one can balance parameterization scale with number of training tokens in large language models for purposes of optimizing training-compute cost and performance. This is consistent with the geometric regularization conjecture as the asymptotic trend of a flattening volume curve may suggest a point of diminishing returns from overparameterization.
4. Interpolation Threshold
Please consider the dialogue of this section as more speculative in nature. We provided in [2] a plausible conjecture for both the regularizing properties of overparameterized learning and a corresponding explanation for the epoch wise double descent phenomenon. Missing from the theory were considerations surrounding why the various phenomena of overparameterized learning have been found to have a distinct boundary, known as the interpolation threshold, which is traditionally approximated as the condition where the number of model parameters exceeds the number of presented training samples Belkin et al. (2019), manifesting a model capacity double descent [Fig 2].
We have already pointed out in [3] that the ratio of the number of parameters to the number of training samples alone may be insufficient to universally derive an interpolation threshold, as feature properties of additional noise perturbation vectors also play a role [Fig 3]. It has also been demonstrated by Roberts et al. (2022) that the dynamics of training a model are not just a function of parameter count but also of the architecture, e.g. considerations like depth and width. We offer as explanation that for two models of equal parameter count, the deeper model will have a higher ratio of influence (ratio of parameter interactions / parameter count) between upstream and downstream weight updates during backpropagation, and that such channels of influence should contribute to increasing the complexity density of a neuron; assuming a constant model complexity from parameter count, this will translate to sparsity in weights, as such trends have been found for deeper architectures Gissin et al. (2020). We also suggest that although the ratio of parameter count to training samples may have become a useful heuristic, a formal definition for the threshold would need to also account for the intrinsic dimensionality of the data generating function, the complexity capacity of the model, and the complexity of the presented function at a given fidelity of representation. The dominant impact of the number of training samples on the threshold should saturate if the data is fully represented, from which point we now attempt to extrapolate.
Let us note a few properties of a more recent paradigm of learning, that associated with quantum neural networks Broughton et al. (2020) comprised of parameterized quantum circuits Farhi & Neven (2018). Consider that a quantum sensor by definition will give our network access to the full superposition within the scope of input, at least prior to any measurement collapse. This differs from classical learning where the fidelity of a data generating function presented to a network may have representational gaps due to collected training data not covering the full surface of this manifold. (Or to put in plain English, the training samples may not capture all of the scenarios applicable to label generation.)
Various phenomena of overparameterization are not limited to classical neural nets; as we noted in [3] they also arise both in legacy paradigms of learning and in next generation quantum neural networks. However, some recent work applicable to quantum neural networks has found results that appear unique to the quantum setting, for which we will shortly try to find parallels in classical networks. Quantum neural networks have a characteristic curve [Figure 4] Larocca et al. (2021) between model effective dimension (demonstrated by quantum Fisher information matrix rank) and the number of parameterized qubits.
One may expect a similar curve for classical networks with the saturation point aligned to the interpolation threshold; however, in the classical setting the spectrum of the Fisher information matrix is often more degenerate than for quantum networks, with more low magnitude eigenvalues, conditions that mainly arise in quantum networks with barren plateaus Abbas et al. (2021). We expect this is associated with different fidelities of received signals between quantum and classical learning. Before losing correlation at the smallest scales, decreasing the number of training samples in a classical corpus should present fidelities resembling a progression of the 2nd law of thermodynamics from a disorder standpoint and a negative "coffee progression" from a representational complexity standpoint; consider Appendix B.4 of Abbas et al. (2021) in conjunction with Figure 2 of Aaronson et al. (2014).
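For readers wanting to experiment with the classical side of this comparison, a hedged toy computation of an empirical Fisher information matrix for logistic regression is sketched below; this is not the quantum Fisher information analysis of the cited works, and the data and thresholds are arbitrary assumptions.

```python
# A hedged classical toy for the effective-dimension discussion above: build an
# empirical Fisher matrix for logistic regression from per-sample gradient outer
# products and count how many eigenvalues carry non-negligible magnitude.
import numpy as np

rng = np.random.default_rng(3)
n, d = 400, 25
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

w = rng.normal(size=d) * 0.1                  # hypothetical model parameters
p = 1.0 / (1.0 + np.exp(-X @ w))              # predicted probabilities
per_sample_grads = (p - y)[:, None] * X       # gradient of the log-loss per sample
fisher = per_sample_grads.T @ per_sample_grads / n

eigvals = np.sort(np.linalg.eigvalsh(fisher))[::-1]
threshold = 1e-3 * eigvals[0]
print("eigenvalues above threshold:", int(np.sum(eigvals > threshold)), "of", d)
# In larger classical networks the analogous spectrum tends to be far more degenerate,
# which is the contrast with quantum networks drawn in the surrounding discussion.
```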
We hypothesize a data fidelity double descent curve based on a varied scale of training data with a fixed architecture [Fig 5], which assumes overlapping convex or concave inflection points in fidelity and disorder. Precise features will likely be harder to manifest than a model capacity curve [Fig 2] because the contribution of removing a training sample from a corpus is reliant on the composite distribution of the other samples, so producing a uniform fidelity degradation would require either a sophisticated synthetic data convention or otherwise a scheduled sample retraction based on distribution in some manner. In practice the performance curves should demonstrate a stochastic progression, with peak overfit partly masked by the adjacent performance benefits of more training data preceding saturation. The initial complexity peak arises from spurious dimensions due to gaps in fidelity of the training data. With a decrease in regularization at the interpolation peak, model capacity that was applied to spurious dimensions will be diverted to overfit, and with re-emergence of geometric regularization at increasing data scales, model capacity is applied to data complexity.
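A hedged sketch of the bookkeeping such an experiment would involve follows, holding a deliberately overparameterized architecture fixed while sweeping the retained fraction of training samples; the synthetic data, library, and layer widths are placeholder assumptions rather than a prescription.

```python
# A hedged sketch of the kind of experiment the hypothesized data fidelity curve would
# call for: hold the architecture fixed and sweep the training set scale, recording test
# performance at each fraction. Synthetic data and sklearn are used purely for brevity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=6000, n_features=20, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for fraction in (0.02, 0.05, 0.1, 0.25, 0.5, 1.0):
    n_keep = max(50, int(fraction * len(X_train)))          # scheduled sample retraction stand-in
    model = MLPClassifier(hidden_layer_sizes=(256, 256),    # fixed, deliberately overparameterized
                          max_iter=300, random_state=0)
    model.fit(X_train[:n_keep], y_train[:n_keep])
    print(f"fraction={fraction:.2f}  test accuracy={model.score(X_test, y_test):.3f}")
# Whether a double descent appears along this axis is exactly the open question; the loop
# only shows how one might chart performance against training data scale.
```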
The concurrence of a model capacity and an epoch wise double descent is suggestive of the universality of geometric regularization. Whether an interpolation threshold may be approximated by a ratio of parameter to sample counts or some combination of model capacity and data dimension at a given fidelity, it arises from a different geometric form than the distribution of weights in a loss manifold. A remaining question is what may be a constraining constant through dimensional adjustment analogous to a hypersphere's radius or a loss function's loss value, as we expect a bottleneck is needed for n-sphere volumetric correspondence. We suggest this arises from a data representation's complexity at a presented fidelity in relation to a volume of diversity found in a model's distribution of representational forms, as the volume of a model's capacity for representational diversity will contract with overparameterization towards an inherent generalization bias.
5. Conclusion
This paper has introduced the geometric regularization conjecture, which is associated with volume contraction, with increasing parameterization, of the distribution of possible weight sets associated with a loss value, and which was inferred based on related properties demonstrated by hyperspheres. Geometric regularization would explain several phenomena seen with overparameterized learning that have puzzled researchers. We believe double descent is a result of a training path reaching a loss value sufficiently close to a global minimum that the distribution of possible weight set update steps corresponding to points reached in backpropagation will have a shrinking intrinsic dimensionality transience realized along the training path, resulting in this distribution retracting across its own peak in a volume versus dimensionality curve, after which there should arise a phase change as geometric regularization re-emerges when not masked by an adjacent regularizer. An interpolation threshold may need to account for complexity of data representation at different fidelities.
As many high dimensional properties are less tractable in today's theory, we expect the emergence of an empirical double descent curve with approximation methods for an associated interpolation threshold could become an alternate channel for mathematicians to investigate high dimensional functions in future research. We explore additional weight distribution transience characteristics via loss manifold histograms in [Appendices D through I].
References
Aaronson, S., Carroll, S. M., and Ouellette, L. Quantifying the rise and fall of complexity in closed systems: The coffee automaton, 2014. URL https://arxiv.org/abs/1405.6903.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
Abbas, A., Sutter, D., Zoufal, C., Lucchi, A., Figalli, A., and Woerner, S. The power of quantum neural networks. Nature Computational Science, 1(6):403–409, Jun 2021. doi: 10.1038/s43588-021-00084-1. URL https://doi.org/10.1038%2Fs43588-021-00084-1.
Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/62dad6e273d32235ae02b7d321578ee8-Paper.pdf.
Ba, J., Erdogdu, M., Suzuki, T., Wu, D., and Zhang, T. Generalization of two-layer neural networks: An asymptotic viewpoint. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1gBsgBYwH.
Baldi, P., Sadowski, P., and Whiteson, D. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5(1):4308, 2014. doi: 10.1038/ncomms5308. URL https://doi.org/10.1038/ncomms5308.
Barsbey, M., Sefidgaran, M., Erdogdu, M. A., Richard, G., and Simsekli, U. Heavy tails in SGD and compressibility of overparametrized neural networks. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=ErNCn2kr1OZ.
Belkin, M., Ma, S., and Mandal, S. To understand deep learning we need to understand kernel learning. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 541–549. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/belkin18a.html.
Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019. ISSN 0027-8424. doi: 10.1073/pnas.1903070116. URL https://www.pnas.org/content/116/32/15849.
Bengio, Y. Practical recommendations for gradient-based training of deep architectures, 2012. URL https://arxiv.org/abs/1206.5533.
Berezniuk, O., Figalli, A., Ghigliazza, R., and Musaelian, K. A scale-dependent notion of effective dimension, 2020. URL https://arxiv.org/abs/2001.10872.
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., Creel, K., Davis, J. Q., Demszky, D., Donahue, C., Doumbouya, M., Durmus, E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L., Goel, K., Goodman, N., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho, D. E., Hong, J., Hsu, K., Huang, J., Icard, T., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti, S., Keeling, G., Khani, F., Khattab, O., Koh, P. W., Krass, M., Krishna, R., Kuditipudi, R., Kumar, A., Ladhak, F., Lee, M., Lee, T., Leskovec, J., Levent, I., Li, X. L., Li, X., Ma, T., Malik, A., Manning, C. D., Mirchandani, S., Mitchell, E., Munyikwa, Z., Nair, S., Narayan, A., Narayanan, D., Newman, B., Nie, A., Niebles, J. C., Nilforoshan, H., Nyarko, J., Ogut, G., Orr, L., Papadimitriou, I., Park, J. S., Piech, C., Portelance, E., Potts, C., Raghunathan, A., Reich, R., Ren, H., Rong, F., Roohani, Y., Ruiz, C., Ryan, J., Ré, C., Sadigh, D., Sagawa, S., Santhanam, K., Shih, A., Srinivasan, K., Tamkin, A., Taori, R., Thomas, A. W., Tramèr, F., Wang, R. E., Wang, W., Wu, B., Wu, J., Wu, Y., Xie, S. M., Yasunaga, M., You, J., Zaharia, M., Zhang, M., Zhang, T., Zhang, X., Zhang, Y., Zheng, L., Zhou, K., and Liang, P. On the opportunities and risks of foundation models, 2021. URL https://arxiv.org/abs/2108.07258.
Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017. doi: 10.1109/MSP.2017.2693418.
Broughton, M., Verdon, G., McCourt, T., Martinez, A. J., Yoo, J. H., Isakov, S. V., Massey, P., Halavati, R., Niu, M. Y., Zlokapa, A., Peters, E., Lockwood, O., Skolik, A., Jerbi, S., Dunjko, V., Leib, M., Streif, M., Von Dollen, D., Chen, H., Cao, S., Wiersema, R., Huang, H.-Y., McClean, J. R., Babbush, R., Boixo, S., Bacon, D., Ho, A. K., Neven, H., and Mohseni, M. Tensorflow quantum: A software framework for quantum machine learning, 2020. URL https://arxiv.org/abs/2003.02989.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Bubeck, S. and Sellke, M. A universal law of robustness via isoperimetry. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=z71OSKqTFh7.
Chen, T.-F., Jiang, J.-H. R., and Hsieh, M.-H. Partial equivalence checking of quantum circuits, 2022. URL https://arxiv.org/abs/2208.07564.
Chen, Z., Cao, Y., Zou, D., and Gu, Q. How much over-parameterization is sufficient to learn deep relu networks? In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=fgd7we_uZa6.
Chizat, L., Oyallon, E., and Bach, F. On lazy training in differentiable programming. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/ae614c557843b1df326cb29c57225459-Paper.pdf.
Cmglee. Hypersphere volume and surface area graphs, 2018. URL https://en.wikipedia.org/wiki/File:Hypersphere_volume_and_surface_area_graphs.svg.
Cohn, H., Kumar, A., Miller, S., Radchenko, D., and Viazovska, M. The sphere packing problem in dimension 24. Annals of Mathematics, 185(3), May 2017. doi: 10.4007/annals.2017.185.3.8. URL https://doi.org/10.4007%2Fannals.2017.185.3.8.
Derezinski, M., Liang, F. T., and Mahoney, M. W. Exact expressions for double descent and implicit regularization via surrogate random design. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 5152–5164. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/37740d59bb0eb7b4493725b2e0e5289b-Paper.pdf.
Dhifallah, O. and Lu, Y. On the inherent regularization effects of noise injection during training. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 2665–2675. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/dhifallah21a.html.
Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. Essentially no barriers in neural network energy landscape. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1309–1318. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/draxler18a.html.
Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In Feldman, V., Rakhlin, A., and Shamir, O. (eds.), 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 907–940, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR. URL https://proceedings.mlr.press/v49/eldan16.html.
Erba, V., Gherardi, M., and Rotondo, P. Intrinsic dimension estimation for locally undersampled data. Scientific Reports, 9(1), Nov 2019. ISSN 2045-2322. doi: 10.1038/s41598-019-53549-9. URL http://dx.doi.org/10.1038/s41598-019-53549-9.
Farhi, E. and Neven, H. Classification with quantum neural networks on near term processors, 2018. URL https://arxiv.org/abs/1802.06002.
Fasshauer, G. E. Positive definite kernels: past, present and future. Dolomites Research Notes on Approximation, 4:21–63, 2011. URL http://www.math.iit.edu/~fass/PDKernels.pdf.
Freedman, M. H. The topology of four-dimensional manifolds. Journal of Differential Geometry, 17(3):357–453, 1982. doi: 10.4310/jdg/1214437136. URL https://doi.org/10.4310/jdg/1214437136.
Freeman, C. D. and Bruna, J. Topology and geometry of half-rectified network optimization, 2017. URL https://arxiv.org/abs/1611.01540.
Gissin, D., Shalev-Shwartz, S., and Daniely, A. The implicit bias of depth: How incremental learning drives generalization. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1lj0nNFwB.
Goldblum, M., Geiping, J., Schwarzschild, A., Moeller, M., and Goldstein, T. Truth or backpropaganda? an empirical investigation of deep learning theory. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HyxyIgHFvr.
Goodfellow, I. J., Bengio, Y., and Courville, A. Deep Learning. MIT Press, Cambridge, MA, USA, 2016. http://www.deeplearningbook.org.
Granata, D. and Carnevale, V. Accurate estimation of the intrinsic dimension using graph distances: Unraveling the geometric complexity of datasets. Scientific Reports, 6:31377, 08 2016. doi: 10.1038/srep31377. URL https://rdcu.be/cFOZg.
Guo, S., Alvarez, J. M., and Salzmann, M. Expandnets: Linear over-parameterization to train compact convolutional networks. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1298–1310. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/0e1ebad68af7f0ae4830b7ac92bc3c6f-Paper.pdf.
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, 2015. URL https://arxiv.org/abs/1502.01852.
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556.
Holmes, Z., Arrasmith, A., Yan, B., Coles, P. J., Albrecht, A., and Sornborger, A. T. Barren plateaus preclude learning scramblers. Physical Review Letters, 126(19), May 2021. doi: 10.1103/physrevlett.126.190501. URL https://doi.org/10.1103%2Fphysrevlett.126.190501.
Howard, J. and Gugger, S. Fastai: A layered API for deep learning. Information, 11(2):108, feb 2020. doi: 10.3390/info11020108. URL https://doi.org/10.3390%2Finfo11020108.
Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.
Kuketayev, A. Probability density function of the cartesian x-coordinate of the random point inside the hypersphere, 2013. URL https://arxiv.org/abs/1306.0290.
Kumar, S. K. On weight initialization in deep neural networks, 2017. URL https://arxiv.org/abs/1704.08863.
Larocca, M., Ju, N., García-Martín, D., Coles, P. J., and Cerezo, M. Theory of overparametrization in quantum neural networks, 2021. URL https://arxiv.org/abs/2109.11676.
Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018a. URL https://proceedings.neurips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf.
Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc, 2018. URL https://proceedings.neurips.cc/paper/2018/file/54fe976ba170c19ebae453679b362263-Paper.pdf.
Li, Y., Ma, T., and Zhang, H. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Bubeck, S., Perchet, V., and Rigollet, P. (eds.), Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 2–47. PMLR, 06–09 Jul 2018b. URL https://proceedings.mlr.press/v75/li18a.html.
Liu, W., Lin, R., Liu, Z., Xiong, L., Schölkopf, B., and Weller, A. Learning with hyperspherical uniformity, 2021. URL https://arxiv.org/abs/2103.01649.
Marcu, A. and Prügel-Bennett, A. On data-centric myths. In Data-Centric AI Workshop (NeurIPS), 2021. URL https://arxiv.org/abs/2111.11514.
Menon, A. K., Rawat, A. S., and Kumar, S. Overparameterisation and worst-case generalisation: friend or foe? In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=jphnJNOwe36.
Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Fürnkranz, J. and Joachims, T. (eds.), Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. Deep double descent: Where bigger models and more data hurt, 2019. URL https://arxiv.org/abs/1912.02292.
Neyshabur, B., Tomioka, R., and Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. In ICLR (Workshop), 2015. URL http://arxiv.org/abs/1412.6614.
Panigrahi, A., Shetty, A., and Goyal, N. Effect of activation functions on the training of over-parametrized neural nets. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgfdeBYvH.
Perelman, G. The entropy formula for the ricci flow and its geometric applications, 2002. URL https://arxiv.org/abs/math/0211159.
Roberts, D. A., Yaida, S., and Hanin, B. The Principles of Deep Learning Theory. Cambridge University Press, may 2022. doi: 10.1017/9781009023405. URL https://doi.org/10.1017%2F9781009023405.
Sankararaman, K. A., De, S., Xu, Z., Huang, W. R., and Goldstein, T. The impact of neural network overparameterization on gradient confusion and stochastic gradient descent. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 8469–8479. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/sankararaman20a.html.
Simsek, B., Ged, F., Jacot, A., Spadaro, F., Hongler, C., Gerstner, W., and Brea, J. Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 9722–9732. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/simsek21a.html.
Smale, S. Generalized poincare’s conjecture in dimensions greater than four. Annals of Mathematics, 74(2):391–406, 1961. ISSN 0003486X. URL http://www.jstor.org/stable/1970239.
Smith, D. J. and Vamanamurthy, M. K. How small is a unit ball? Mathematics Magazine, 62(2): 101–107, 1989. doi: 10.1080/0025570X.1989.11977419. URL https://doi.org/10.1080/0025570X.1989.11977419.
Song, C., Ramezani-Kebrya, A., Pethick, T., Eftekhari, A., and Cevher, V. Subquadratic over-parameterization for shallow neural networks. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=NhbFhfM960.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56): 1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.
Stöger, D. and Soltanolkotabi, M. Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=rsRq--gsiE.
Sun, Y., Narang, A., Gulluk, H. I., Oymak, S., and Fazel, M. Towards sample-efficient overparameterized meta-learning. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=-KU_e4Biu0.
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In Dasgupta, S. and McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 1139–1147, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL https://proceedings.mlr.press/v28/sutskever13.html.
Teague, N. J. Numeric encoding options with Automunge, 2020. URL https://arxiv.org/abs/2202.09496.
Tripuraneni, N., Adlam, B., and Pennington, J. Overparameterization improves robustness to covariate shift in high dimensions. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=PxMfDdPnTfV.
Tu, S.-J. and Fischbach, E. Random distance distribution for spherical objects: general theory and applications to physics. Journal of Physics A, 35:6557–6570, 2002. URL https://arxiv.org/abs/math-ph/0201046.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Viazovska, M. The sphere packing problem in dimension 8. Annals of Mathematics, 185(3), May 2017. doi: 10.4007/annals.2017.185.3.7. URL https://doi.org/10.4007%2Fannals.2017.185.3.7.
Webb, A. Functional approximation by feedforward networks: A least-squares approach to generalization. Neural Networks, IEEE Transactions on, 5:363–371, 06 1994. doi: 10.1109/72.286908. URL https://ieeexplore.ieee.org/document/286908.
Weinan, E. Towards a mathematical theory of machine learning, 2022. URL https://icml.cc/virtual/2022/invited-talk/18430.
Wolfram, S. A New Kind of Science. Wolfram Media, May 2002. ISBN 1579550088. URL https://www.wolframscience.com/nks/.
Wolfram, S. (ed.). An Elementary Introduction to the Wolfram Language. Wolfram Media, 2015. URL https://wolfram.com/.
Wolfram, S. A Project to Find the Fundamental Theory of Physics. Wolfram Media, Incorporated, 2020. ISBN 9781579550356. URL https://www.wolframphysics.org/.
Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12104–12113, June 2022.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=Sy8gdB9xx.
Appendix A. Table of Contents
- Appendix B: Other related phenomena
- Appendix C: Saturated data
- Appendix D: Histograms
- Appendix E: ReLU activations
- Appendix F: Width versus depth
- Appendix G: Tail region
- Appendix H: Initialization scaling
- Appendix I: Additional training samples
- Appendix J: Speculative speculations
Appendix B. Other related phenomena
This appendix offers a brief speculative survey of how the geometric regularization conjecture might be related to other phenomena observed with overparameterization.
- Retained proximity to initialization Li & Liang (2018): The volume contraction of a weight distribution manifold under overparameterization, if aligned to a hypersphere, should converge in the infinite dimensional case to a single point when projected into a one-dimensional world Kuketayev (2013), explaining initialization proximity. It is a kind of paradox of the curse of dimensionality that despite this convergence, the expected euclidean distance between two sampled points will increase with dimensionality.
- Smoother fitness landscape Simsek et al. (2021): Borrowing the analogy that a point on a fitness landscape will translate to a manifold on a higher dimensional landscape [3], if the points on the higher dimensional fitness landscape are spread further apart Tu & Fischbach (2002), then the landscape should appear to have a smoother characteristic when comparing comparably scaled weight deltas to an underparameterized model, which possibly explains the benefit of constant learning rates Sankararaman et al. (2020). The paradox of a point translating to a manifold of decreased volume also suggests some form of contraction from a manifold of similar points on an underparameterized landscape into an aggregated smaller volume manifold with overparameterization. The observation of a reduced concentration of saddle points in an overparameterized fitness landscape Simsek et al. (2021) suggests these have a higher prevalence of such consolidations.
- Regularization dampening double descent Belkin et al. (2019): We suspect the result that regularization dampens the interpolation peak at a minimum is associated with a more consistent degree of regularization through training, as geometric regularization may not be a dominant feature until the training path reaches well into the left tail of the loss manifold distributions. Without a secondary regularization at play a training path in effect will experience a phase transition from no regularization to dominant geometric regularization near the interpolation peak.
- Dropout regularization Srivastava et al. (2014): Our discussions related to wide vs deep networks may be relevant to dropout regularization. Consider that when randomly dropping neurons, the network is then channeling backpropagation in a path consistent with what would be realized for a narrower width network with an increased ratio of influence, with implications for sparser representations.
- Other learning paradigms Belkin et al. (2018): We expect our comparison of the hypersphere to a set of neural network weight distributions associated with a loss value will be equally valid for any learning paradigm in which a large number of parameters are tuned in the direction of a decreasing loss signal toward a global minimum. In any form, the volume of the distribution of weight configurations should constrict with increased dimensionality, and an epoch wise double descent may manifest when weights traverse a tendril of shrinking effective dimensions.
- Gradient confusion Sankararaman et al. (2020): With a thicker left tail in the histogram space [D] as well as a tendency for reduced sparsity, the set of weight configurations that can approximate a modeled function will be larger for wider networks. The corollary is that deeper networks will have a greater density of gaps in the weight distribution manifold associated with loss values exceeding that realized in the prior epoch, such that statistical variations between mini-batches may divert the direction of a training path due to proximity to one of these gaps, causing increased gradient confusion for deeper networks.
Appendix C. Saturated data
For empirical evaluation of the interpolation threshold without the dominant influence of sample count, we expect a data set represented at higher fidelity would help. We offer that in the classical setting, the tabular modality may have a unique potential for data generating functions to be fully represented to a network by a training corpus at reasonable scales. Compare to the image modality, where features have a near infinite range of light sources, rotation angles, camera angles, or object compositions. Fully representing such complexity through a training corpus may take orders of magnitude more samples than tabular data, with a model's complexity capacity otherwise diverted into spurious dimensions.
As at least suggestive evidence of saturation of a data set representation by a training corpus, consider the benchmarking of tabular data augmentation by noise injection applied to the Higgs data set Baldi et al. (2014) with a fastai learner Howard & Gugger (2020), as detailed in Teague (2020). The authors found that when the entire training corpus was presented to a neural network, the benefit of data augmentation by noise injection was negligible. However, when the training corpus was pared down to decreasing scales of samples, the benefit of data augmentation by noise injection appeared to grow in proportion to the reduction in scale. This suggests that the Higgs data set may be a useful resource for researchers seeking a fully represented classical training corpus in further inquiries on this matter.
Appendix D. Histograms
We sought to visualize loss manifold geometries by exploring meta properties of weight set distributions with a kind of Monte Carlo evaluation, deriving histograms of binary cross entropy loss values realized from randomly initialized weight sets. We haven't seen loss manifold histograms considered in this manner in the literature; we used varied setups to explore patterns that became apparent at the smallest scales. We drew some inspiration from Stephen Wolfram's explorations of cellular automata and hypergraphs, in which he surveyed and catalogued various patterns that arose across simulated configurations. Such exploration is the only intent of these appendices, as we found the exercise helped us gain intuitions. We consider most of these explorations as tangential to the scope of the paper rather than directly supportive of the main dialogue.
The histograms were prepared in a series of Jupyter notebooks on the Colaboratory platform. A representative notebook is provided with the supplemental material, where each of the setups had some degree of variation over this template. The network architecture was initially modeled as formulas in the cells of a spreadsheet, though we soon found we could incorporate more elaborate architecture conventions and larger depth with the support of Tensorflow Abadi et al. (2015). We used samples from the Titanic data set as features. The notebook effectively demonstrates the initialization of a small network based on specification of width, depth, activation, and initialization type. Once a network was initialized with sampled weights, the features were passed to a predict operation similar to what would be performed on a trained model in inference, however in this case the predict was applied on the initialized weights without any form of training. The output of the predict and the corresponding labels were then fed to a binary cross entropy loss evaluation (without the from_logits option) to realize a single loss value recorded as one count within the histogram. Histogram aggregation involved repeating this setup a number of times based on either a designated sample count or in some cases running samples for the duration of the Colaboratory 24 hour run time window.
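As a minimal sketch of this sampling loop (our own illustrative reconstruction rather than the supplemental notebook, with hypothetical feature values and architecture settings), the following assumes a Keras style setup:

    import numpy as np
    import tensorflow as tf

    def sample_loss(features, labels, width=6, depth=3,
                    activation="relu", init="he_normal"):
        # initialize a small network from width / depth / activation / initializer
        layers = [tf.keras.layers.Dense(width, activation=activation,
                                        kernel_initializer=init)
                  for _ in range(depth)]
        layers.append(tf.keras.layers.Dense(1, activation="sigmoid",
                                            kernel_initializer=init))
        model = tf.keras.Sequential(layers)
        # predict on the freshly initialized weights, no training applied
        preds = model.predict(features, verbose=0)
        # binary cross entropy with the default from_logits=False
        bce = tf.keras.losses.BinaryCrossentropy()
        return float(bce(labels, preds))

    # toy stand-in for three Titanic derived features (one numeric, two categoric)
    features = np.array([[29.0, 1.0, 0.0]], dtype=np.float32)
    labels = np.array([[1.0]], dtype=np.float32)

    # each sampled loss contributes one count to the histogram
    losses = [sample_loss(features, labels) for _ in range(1000)]
    counts, bins = np.histogram(losses, bins=100)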
We limited the feature count to three, with one numeric and two categoric, because we expected that the more features applied, the larger the scale of sampling required to get reasonable representation in histogram tail regions. For similar reasons, we focused most of our inspections on loss values calculated for a single sample. We acknowledge this is borderline trivial territory with only three features; however, even in this simplest of setups there were still characteristic patterns that emerged, which we hope may be further investigated at more extensive scales by us or others with additional resources in future work.
Although these histograms didn't reveal the distribution volume itself, they do demonstrate the relative volumes between different loss values for a given network configuration, as a loss with a higher number of possible weight configurations should stochastically demonstrate an increased probability of representation from random sampling, with improved fidelity from an increased number of samples. By comparing trends across histograms in varied network configurations, we hoped to infer aspects of distribution properties realized from different degrees of overparameterization, width, depth, activation functions, initializations, etc. At the scale of sampling we applied, these histograms often didn't include representation of weight configurations associated with the global minimum; however, in many cases, and especially when considering loss from a single training sample, there were characteristic shapes and trends demonstrated in the left tail of the distribution, which is the area of most interest for considerations of double descent.
A recurring characteristic was the presence of a central mode in the distribution (visible as a peak), which appeared to universally align with the loss value realized from a 0.5 sigmoid output. It turned out the peak was a relic of the use of ReLU activations, which at such small widths may often return all zero values in a preceding layer. Subsequent experiments with an increased number of features demonstrated that the number of sampled peaks appeared to correlate with the number of features. Although some of the peaks phased in and out with small variations of parameters and configurations in a manner resembling the appearance of harmonics, we suspect that effect was from bin boundaries established by the sampled minimum, and in aggregate the trends were still visible.
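As a worked check of that alignment (assuming a zero bias on the output neuron): if a preceding ReLU layer returns all zero values, the output neuron receives zero input, the sigmoid returns 0.5, and the binary cross entropy becomes −(y·ln(0.5) + (1−y)·ln(0.5)) = ln(2) ≈ 0.693 for either label value, matching the location of the central mode.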
When we ran comparable setups with tanh as a smooth activation, the mode transitioned from one or more peaks to a single smooth mode. In many cases with ReLU activations a second mode would appear. We speculate that the zero value secondary mode, as visible in Figure 6, might be associated with the phenomenon of lazy training Chizat et al. (2019) noted above. In a small number of cases a second mode also appeared in the right tail.
We focused most of our attention on histograms derived from a single sample to promote trend visibility. Two characteristic patterns appeared: one with the central mode and a reduced left tail [Fig 6], and a second with progressive volume towards a dominant zero mode [Fig 7]. In some cases these two conventions appeared within the same network configuration when evaluated against different training samples without He scaling. We expect these two cases may align with what other researchers have shown when training on a single sample: two scenarios in which a model either memorizes or learns a representation Zhang et al. (2017).
The distinction between evaluations of a single training sample versus multiple training samples was notable. When loss was averaged over multiple training samples, the left tail representation (for values below the primary mode) was greatly diminished, in most cases not visible at our sampling rate [Fig 8], although we could still confirm the tail existed for a given architecture by training the model for a few epochs, which would realize a loss value below the minimum sampled in the histogram. This is consistent with the expectation that a weight set that can represent transformations of multiple samples is much rarer than one representing a single sample. That this kind of central mode dominated distribution would arise even when aggregating loss across samples with majority zero mode dominant distributions like Figure 7 suggests that the zero mode dominant distributions for each sample have weight set distributions that are mostly non-overlapping.
We did not find that the aggregate histograms strongly aligned with any of the traditional left tail bounded distributions like lognormal, gamma, or Weibull (we evaluated a few with statistical tests using the Wolfram Language Wolfram (2015), deriving distribution parameters with FindDistributionParameters and then deriving a p-value with DistributionFitTest). However they still demonstrated some characteristic features of single mode distributions, and after averaging across multiple samples any secondary modes appeared to contract towards the central mode until losing visibility [Appendix I]. Among those features were the presence of a single mode, what appeared to be an unbounded right tail, and a bounded left tail. Again this left tail in many cases became invisible at our sampling rate, however we could infer that such a tail exists from assessing a loss value after inference from a comparable architecture trained to overfit, and from the connectivity principle Draxler et al. (2018).
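For readers working in Python, a rough analogue of that check (our substitution for the Wolfram Language calls, using maximum likelihood fits and a Kolmogorov-Smirnov test rather than DistributionFitTest's defaults) might look like:

    import numpy as np
    from scipy import stats

    losses = np.asarray(losses)   # sampled loss values from the histogram loop

    for dist in (stats.lognorm, stats.gamma, stats.weibull_min):
        # maximum likelihood fit, analogous to FindDistributionParameters
        params = dist.fit(losses)
        # goodness of fit p-value, loosely analogous to DistributionFitTest;
        # note that fitting parameters on the same data makes this p-value optimistic
        statistic, p_value = stats.kstest(losses, dist.name, args=params)
        print(dist.name, round(p_value, 4))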
One way to think about what is taking place with the histograms is that we are compressing the geometry of loss manifold distributions down to a binary cross entropy comparison between a sampled state and a designated lowest energy state, where lowest energy refers to the case where the sampled state's transformation function matches the natural label generating function at a global minimum Lmin. Note that this lowest energy state is not an inherent property of the geometry; it is low energy in comparison to some targeted label generating function, likely of different intrinsic dimension than the capacity of the weight set. The sampled loss density is a kind of proxy for volume, and we can infer from the shape of the histogram curve which loss values will have larger manifold volumes relative to other loss values from the same architecture and initialization, as well as the geometry transitions (i.e. surrounding volume expansion or contraction) that will be seen by an optimizer along a training path.
Appendix E. ReLU activations
The distinction between ReLU Nair & Hinton (2010) and tanh was noticeable [Fig 9]. ReLU exhibited sharp peaks while tanh produced more of a smooth curve with only one visible mode. Tanh was also more stable under architecture variations: in some cases the ReLU histogram would shift noticeably while the corresponding tanh histogram would be hard to distinguish [Fig 10]. However, when we modeled architectures approaching infinite width, and especially with shallower networks, a slight zero mode shift could be seen in the tanh [Fig 10]. The higher density of low loss scenarios with ReLU offers an explanation for empirical benefits whose mechanism has been somewhat of a mystery Roberts et al. (2022). We expect the cause of the higher density is that by making a zero activation a range instead of a point, ReLU increases the distributional density of narrower width models in the loss manifold classical superposition, increasing the diversity of representational capacity available to a model. The discontinuity in the activation's derivative, in lieu of a smooth curve, allows for an increased diversity of adjacent representational forms available to an epoch.
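A toy calculation illustrates the "zero as a range rather than a point" framing (with stand-in Gaussian pre-activations rather than an actual network layer):

    import numpy as np

    rng = np.random.default_rng(0)
    width, trials = 6, 100000
    pre = rng.normal(size=(trials, width))   # stand-in pre-activations for a narrow layer

    relu_out = np.maximum(pre, 0.0)
    # roughly half of ReLU outputs land exactly on zero, tanh only at a single point
    print(np.mean(relu_out == 0.0), np.mean(np.tanh(pre) == 0.0))
    # probability an entire width-6 layer returns all zeros, roughly 0.5 ** width
    print(np.mean(np.all(relu_out == 0.0, axis=1)))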
Appendix F. Width versus depth
Deviations of width and depth produced inverse directions in transitions between regimes similar to Figure 6 and Figure 7. Histograms with the dominant central mode shifted density in the direction of the dominant zero mode with increasing width, while increased depth shifted density in the counter direction, towards the central mode. The trends endured with inverted labels or batch normalization.
Consider network architectures approaching infinite width. The neural tangent kernel framing Jacot et al. (2018) suggests that these will converge at initialization to weights modeling a Gaussian distributed process. This known property of the modeled function may become an interesting channel for researchers to relate histogram properties to properties of the resulting function. When we modeled large width scenarios in comparison to correspondingly parameterized deeper models with a tanh activation, which didn't exhibit the absolute zero mode of ReLU, we found the dominant mode also trended leftward when approaching asymptotic width.
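A minimal sketch of that Gaussian tendency, assuming a single hidden tanh layer with 1/sqrt(fan-in) scaling and a fixed toy input (all our own choices), is to histogram the network output over many random initializations:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.array([0.5, -1.0, 2.0])            # fixed toy input
    width, trials = 1024, 5000

    outputs = []
    for _ in range(trials):
        w1 = rng.normal(size=(x.size, width)) / np.sqrt(x.size)
        w2 = rng.normal(size=width) / np.sqrt(width)
        outputs.append(np.tanh(x @ w1) @ w2)

    outputs = np.array(outputs)
    # as width grows the sampled outputs approach a zero mean Gaussian
    counts, bins = np.histogram(outputs, bins=100)
    print(outputs.mean(), outputs.std())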
This appendix offers a few representative examples of the implications of wide versus deep networks. [Fig 11] shows the transition from increasing the width of a three layer network through 6, 9, and 12 neurons, demonstrating the characteristic shift towards a dominant zero mode. [Fig 12] demonstrates a similar transition through increasing depth.
Appendix G. Tail Region
It is probably worth reiterating that most of these discussions so far have focused on histograms derived from single training samples. As additional samples are added to the inference basis of the loss values, the left tail of the histogram distribution quickly shrinks to invisibility for our extent of sampling, which aligns with intuition. Even though we don't have visibility of this tiny left tail, we can infer its properties from what we have demonstrated takes place in loss manifold distributions on single training samples; after all, the aggregate binary cross entropy loss value is derived from the average of the loss from each training sample. Thus if we can identify trends in the histograms of single training samples, we can expect they will also manifest in the invisible left tail of the aggregate histogram across all training samples.
We found from the depth and width experiments that wider networks will have a greater proportion of low loss value weight configurations available than equivalently parameterized deep networks, which also aligns with the ratio of influence framing [4]. What we didn't know was how the aggregate loss manifold volume compares between wider and deeper networks at a common loss value. Just because a deeper network has a stronger representation in the central mode region of sampled loss values, it might still have similar loss manifold volumes in the left tail in comparison to wider networks. The histograms only reveal relative distribution volume relationships associated with different loss values for a common architecture within the same sampling operation. We attempted to circumvent this challenge by continued sampling of single sample configurations until reaching a common threshold for the number of sampled left region values between configurations, in order to evaluate the region in isolation. This approach yielded a similar pattern, with increasing depth causing a transition from zero mode to central mode dominated characteristics [Fig 13, 14].
This finding suggests that for aggregated loss values across multiple training samples, where the left tail region becomes invisible to our sampling assessment and these zero mode versus central mode dominated profiles average out across samples to a central mode profile, the path followed by backpropagation towards minimum loss will traverse a different profile of loss manifold geometry transitions in wider versus deeper networks. We expect the left tail of wider networks will thus be thicker in this histogram space than for corresponding deeper networks, suggesting that the geometric regularization experienced by deeper networks will be of greater intensity once reaching sufficient depth into the left tail, assuming that an increased number of manifold gaps doesn't obstruct the path. We expect that the convention that thicker tails correspond to increased kurtosis may not apply for this consideration, since the emergence of tail thickness is associated with the presence of a second mode in the aggregated loss values as opposed to a dispersion from the central mode. The idea that deeper networks will have greater constraints on weight configurations is also consistent with deeper networks trending towards more sparse weight configurations, as seen by Gissin et al. (2020).
We thus believe that wide and deep architectures may offer a tradeoff between the number of minima and the complexity capacity of the network, as evidenced by zero mode dominance and trends toward sparsity.
Appendix H. Initialization Scaling
Properly scaled initialization benefits optimization Sutskever et al. (2013), and He initialization He et al. (2015), which scales the initialization sampling based on the dimensions of inputs to a layer, appears to lie at the boundary of the well-performing regime Song et al. (2021).
Note that He differs from Xavier initialization by not including the layer output dimensions in the denominator of the scale derivation, resulting in a larger scale, and has been demonstrated as more suitable for nonlinear activations like ReLU Kumar (2017). He initialization samples from a Gaussian, or a similar scaling may be adapted for sampling from a Uniform distribution. One of our experiments involved aggregating histograms with Normal and Uniform sampling, and then comparing each against other scales. Interestingly, He scaled Normal [Fig 6] and Uniform shared a similar appearance of histogram characteristics, which for the demonstration architecture exhibited a balance between central and zero mode; however when we increased the scale of the Normal and compared to a corresponding Uniform variation, the Uniform appeared quicker to shift into a dominant zero mode [Fig 15], suggesting that Normal is more stable than Uniform.
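To make the difference in scale concrete, here is the standard deviation comparison for Gaussian sampling along with a matching uniform bound (the layer dimensions are hypothetical):

    import numpy as np

    def he_std(fan_in):
        # He: denominator uses only the input dimension
        return np.sqrt(2.0 / fan_in)

    def xavier_std(fan_in, fan_out):
        # Xavier: denominator includes the output dimension, giving a smaller scale
        return np.sqrt(2.0 / (fan_in + fan_out))

    fan_in, fan_out = 6, 6                      # hypothetical layer dimensions
    print(he_std(fan_in), xavier_std(fan_in, fan_out))   # ~0.577 vs ~0.408

    # a uniform variant with matching variance widens the sampling bound instead
    he_uniform_bound = np.sqrt(6.0 / fan_in)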
We interpret the alignment between increased initialization scale and widening layers, both of which shift the histogram towards a dominant zero mode, as suggesting that networks have a larger proportion of low loss weight configurations when the inputs to layers have a greater range of activations. The introduction of batch normalization did not appear to change this property, confirming that it was arising not from weight magnitude but from spread. Increasing initialization scale is known to help break symmetry between units Goodfellow et al. (2016) and to impact the implicit regularization of gradient descent Ba et al. (2020). The paradox of increased regularization with scale Li et al. (2018b), in conjunction with a zero mode dominated tail, possibly suggests that for wider networks, which also approach a zero mode, the minima may be more spread out in the loss manifold than for deeper networks. One might expect that such mode balance could be used to align width and depth configurations when adding more training samples, however this has been demonstrated as less influential than simple parameter count Kaplan et al. (2020). This implies that width and depth configuration [4] is more influential to the grouping characteristics of minima in the fitness landscape than to model complexity.
We noted prior that in some cases dominance between the two mode conventions appeared within the same network configuration when evaluated against different training samples [D]. It turns out this appeared to be more prevalent when applying initializations sampled from a uniform instead of a normal distribution. As demonstrated here, He scaled Normal appears stable across three representative training samples [Fig 16], while Uniform, with an arbitrary scale of +/-1, shows some deviation in mode dominance characteristics across those same samples [Fig 17].
Appendix I. Additional Training Samples
Most of the histograms in these appendices have inspected a single training sample for tail visibility. Evaluating 50 concurrent training samples produced significant tail contraction [Fig 18]. To help isolate a transition point, [Fig 19] shows histograms based on 1, 2, and 3 aggregated samples. It appears that the first two samples were similar enough that no complicated function was needed to relate their inference, so the left tail was retained under their aggregation. The addition of a third sample resulted in the left tail compressing to close proximity to the central mode [Fig 20]. Note that a small indication of a left tail remains visible adjacent to the leftmost peak.
Appendix J. Speculative Speculations
We hope that the reader may humor us with these few additional, more speculative, speculations about the implications of the conjectures in this work.
- Conservation of complexity: The interpolation peak in the data fidelity double descent curve [Fig 5], as well as our noted expectation that increasing the “ratio of influence” for increased neuron information density is more impactful to minima grouping characteristics in the loss manifold than it is to model complexity [Appendix H], suggests to us that there may be a form of complexity conservation for a given architecture at a given loss value, at least from a classical superposition standpoint. However, we expect that if one considers the distribution of models in a quantum neural network, there may be equivalently performing models in classical space with different complexities in quantum space, similar to what we have seen referred to as partial equivalence Chen et al. (2022).
- Barren plateaus: We noted in [4] that a degenerate quantum Fisher information matrix rank resembling the low fidelity scenario for classical networks is primarily seen with quantum neural networks in cases of barren plateaus Holmes et al. (2021). We speculate there may be a relevance to the complexity and disorder progression noted in [Fig 5], such that as disorder progresses with increasing entropy for a system under the 2nd law, some superposition elements become inaccessible to a trained model. We expect these obscured regions may manifest in a quantum network's loss manifold as a barren plateau.
- Generalization bias: We used the term “inherent generalization bias” at the close of [4]. This is a somewhat remarkable concept, and we think there may be relevance to theoretical physics. Consider that Roberts et al. (2022) noted that we “can actually fully train infinite-width networks in one theoretical gradient-descent step. That is, we can take a giant leap right to the minimum of the loss” (pg 252), which can be achieved independent of the applied algorithm (pg 257). Roberts et al. (2022) also considered that in the infinite width case “a fully-trained mean network output is just a linear model based on random features. In this sense, infinite width neural networks are rather shallow in terms of model complexity, however deep they may appear” (pg 289). We expect that for true emergence of the fundamental laws of physics, one would need an infinitely deep, infinite width network.
- Hypergraphs: It has recently been proposed by Wolfram (2020) that the fundamental laws of physics are explainable by considering our universe as a progression of hypergraph node link updates. Based on the theory of computational equivalence Wolfram (2002), a neural network and a hypergraph network are both Turing equivalent. Thus, a contracting volume of distributions for a neural network is suggestive that with the progression of time and hypergraph update steps we are collectively homing in towards one final measurement. Let's get it right.