Dropping Out for Fun and Profit
A Theory for Dropout
I don’t know the first thing about academic publishing. But. I’ve been ruminating for a while on some academic subjects, and, well, please consider this paper inspired by some academic works. It is not a real academic paper though, this is just for fun. With a soundtrack because that’s how I roll. Warning: there may be jokes. And without further ado.
ABSTRACT
The training of neural networks can often benefit from some form of regularization. The dropout technique for regularization was initially met with skepticism by the machine learning community, as the mechanism behind its application is counterintuitive. This paper will propose an explanation for what is being accomplished through dropout by analogy to other forms of regularization. In short, we believe that, similar in principle to how L1 regularization promotes sparsity of a weight set and L2 regularization dampens the magnitude of a weight set, dropout dampens the nonlinearity of a model’s transformation function by shifting its composition toward a set of functions of increased linearity held in a kind of (non-quantum) superposition.
INTRODUCTION
The training of a neural network is an imprecise science in that for an arbitrary evaluation there exist no hard and fast rules that allow a practitioner to identify an optimal neural network architecture. An architecture that is oversized for some complexity of data will likely become overfit to the properties of the training data without early stopping, while one that is undersized will suffer from decreased potential for predictive accuracy. Different categories of problems have been studied over time sufficiently for mainstream practice to progressively home in on methods that are well suited to a given category, for example the range of architectures that have won each ImageNet competition over the years, or modern practices for natural language processing such as BERT and attention mechanisms. But even for these examples, as the methods are directed to applications in the real world, variations in target data properties will necessitate some deviation from the laboratory conditions to which these architectures have been optimized. Regularization for neural network training helps to overcome this challenge in that an architecture may be intentionally oversized for the complexity of the target dataset, with the regularization helping to mitigate the potential for overfitting the data. (This section is probably an oversimplification of the challenges of designing neural networks, but I think it captures some fundamental points that help one to intuit the benefits of regularization.)
L1 and L2 REGULARIZATION
(This section is not intended as an exhaustive treatment of regularization for neural networks; however, by presenting a few illustrative methods I hope to facilitate discussions which will later draw an analogy between these methods and the dropout technique.) L1 and L2 regularization are two common forms of regularization in which a parameter set derived from the collective weights of a network is added to the cost function used as an objective for backpropagation optimization in neural network training. For example, in L1 regularization the sum of the neural network weights’ absolute values is added to the cost function, and in L2 regularization the sum of the weights’ squares is added to the cost function, in each case multiplied by a regularization coefficient for scaling. (This regularization coefficient is itself suitable for optimization as a hyperparameter, such as may be explored using a grid search during iterations of a training operation.)
As the cost function is targeted for minimization through backpropagation, the impact of including these regularization terms is a push towards decreasing magnitude of the weights in the derived set. However, there is a distinction in how this shift is realized for each approach. In the case of L1 regularization, the weight set becomes increasingly sparse with increasing regularization coefficient (by sparse meaning that more of the weight values of lesser import to the model approach or reach 0). In the case of L2 regularization, the trend with increasing regularization coefficient is less one of sparsity than of a general decrease in the magnitude of the collective weights, such that the penalization is applied more consistently across the weight set, with this difference in result between L1 and L2 due to summing the absolute value of each weight versus summing the square of each weight.
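To make the mechanics a little more concrete, here is a minimal sketch (in NumPy, with made-up weight arrays and the names lambda1 / lambda2 chosen just for illustration) of how the L1 and L2 penalty terms would be folded into a cost function prior to backpropagation:

```python
import numpy as np

def regularized_cost(base_cost, weights, lambda1=0.0, lambda2=0.0):
    """Add L1 and/or L2 penalty terms to a base cost value.

    base_cost : float, the unregularized loss (e.g. mean squared error)
    weights   : list of numpy arrays holding the network's weight matrices
    lambda1   : L1 regularization coefficient (promotes sparsity)
    lambda2   : L2 regularization coefficient (dampens overall magnitude)
    """
    l1_penalty = sum(np.abs(w).sum() for w in weights)
    l2_penalty = sum((w ** 2).sum() for w in weights)
    return base_cost + lambda1 * l1_penalty + lambda2 * l2_penalty

# toy usage with two hypothetical weight matrices
rng = np.random.default_rng(42)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 1))]
print(regularized_cost(base_cost=0.25, weights=weights, lambda1=0.01, lambda2=0.001))
```

In practice a framework would compute these penalties inside its autograd graph so the gradients flow back to the weights, but the additive structure of the cost is the same.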
(Ok this is totally random but I’ve been meaning to write about this so just going to throw it in as an aside. Not even a tangent, just a fun distraction. I once saw a really cool fan theory about the movie Weekend at Bernie’s, like I don’t know maybe it was a comment on Reddit or something who knows, but the theory went that like maybe these two friends on a weekend getaway (one a straight-laced bookworm square, and the other this like carefree, happy-go-lucky, always-sees-things-positively type) were in fact two aspects of the same psyche. Like if you re-watch the movie they’re never talking to the same person at the same time, similar to that Edward Norton / Brad Pitt dynamic from Fight Club. If you really want to stretch your imagination, consider that Bernie himself might have been one too. Anyway good times.)
DROPOUT REGULARIZATION
(This section is not intended as an exhaustive treatment of dropout for neural networks; however, by presenting a few illustrative points I hope to facilitate discussions which will later draw an analogy between this method and the L1 / L2 techniques.) Dropout regularization falls into a different category than L1 and L2. Instead of adding parameters to the cost function, dropout works by, for each training step of backpropagation, randomly selecting a defined ratio of nodes in the neural network (along with their associated weights) to temporarily exclude from the forward pass and the corresponding cost function assessment. (You know, the whole tune in, turn on, drop out approach.)
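As a rough sketch of the mechanism (my own illustration, not the reference implementation from the dropout papers), here is how an inverted dropout mask might be applied to a layer’s activations during training, assuming a NumPy setting and a hypothetical drop ratio p:

```python
import numpy as np

def dropout_forward(activations, p=0.5, training=True, rng=None):
    """Apply inverted dropout to a layer's activations.

    During training, each node is zeroed with probability p and the
    survivors are scaled by 1/(1-p) so that the expected activation
    magnitude is unchanged at inference time (when no mask is applied).
    """
    if not training or p == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    mask = (rng.random(activations.shape) >= p) / (1.0 - p)
    return activations * mask

# toy usage: a fresh random mask would be drawn for every training step
h = np.ones((2, 5))
print(dropout_forward(h, p=0.4, rng=np.random.default_rng(0)))
```

The key point for the discussion that follows is that each random mask defines a distinct sub-network whose weights receive that step’s update.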
The dropout technique for regularization was initially met with skepticism by the machine learning community; in fact, Nitish Srivastava’s original paper was initially passed on by a range of journals, presumably because the mechanism behind the technique was counterintuitive and defied convention, and it wasn’t until an expanded paper was co-authored with well-known researchers like Geoffrey Hinton that the principles were more widely accepted. The types of explanations I have seen proposed for the usefulness of dropout were succinctly summarized in Deep Learning by Goodfellow, Bengio, and Courville, and included the analogy that subdividing the original network into randomly selected subsets of neurons is similar to bagging, a kind of ensemble learning in which a meta model is assembled from a series of distinct models trained separately and then aggregated. However, dropout differs from bagging in that the aggregation happens stochastically over the steps of backpropagation: the weight updates applied under each step’s randomly selected subset are never allowed to evolve into fully and distinctly trained models, but are only shifted in the direction of some potential downstream fully trained model, with the subsets progressively tuned into the final model’s superposition. Deep Learning also suggests that a way to think about dropout is that with each step’s random configuration of nodes, the optimization algorithm is forced to focus on different features of the training set, such that over the course of training the algorithm isn’t allowed to overfit to specific points.
(Ok this is kind of a stretch but since I am drawing on movies let’s use another example. Consider the neurons of a neural network as like a collection of superheroes, like I don’t know maybe in Avengers or something. Now the dropout regularization technique is (oh hey spoiler alert) like what happens when Thanos gets all of the infinity stones (you know, those cosmic relics that grant the user infinite powers along some axis) and Thanos, now an all-powerful being, defeats some of those heroes. Not to worry, I’m sure the Avengers will prevail. How, you might ask, will a collection of lowly earthbound superheroes overcome such a cosmic villain? Well you know, this is fiction. I’m sure they’ll find a way :)
NEURAL TRANSFORMATION FUNCTIONS
(Ok so this will be a bit of an abstraction, but we’re going to talk about the supervised learning training process and what is being derived through training. This is all consistent with what I previously wrote about in that From the Diaries of John Henry essay that I’m so well-known for (lol just kidding no one reads these things).) A fundamental way to think about neural networks is that in inference the generation of predictions is achieved through the feeding of a signal, consistent in form and distribution with the training points used to derive the network, which is progressed through the network’s layers of neuron nodes via the application of a series of weightings and activation functions in a feedforward manner. Of course more elaborate modern practice may incorporate into this system a whole range of bells and whistles such as convolutions, LSTMs, skip connections, and so on, but in principle the concepts discussed prior such as network size, overfitting, and regularization still apply in all of these cases. The application of the network weightings and activations serves as a kind of transformation function (for example such as might transform a JPEG into a boolean signal detecting whether an image is a hot dog).
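For a minimal illustration of this “transformation function” view (a toy sketch with made-up layer sizes and randomly initialized, untrained weights), a two-layer feedforward pass might look like the following:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    """A toy two-layer feedforward transformation function f(x).

    The input signal x is progressed through successive weightings
    and activations, yielding the network's output prediction.
    """
    hidden = relu(W1 @ x + b1)
    return W2 @ hidden + b2

# toy usage with randomly initialized (untrained) parameters
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)
x = rng.normal(size=4)
print(forward(x, W1, b1, W2, b2))
```

Everything below treats the trained network simply as this function f, mapping inputs x toward labels y.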
The demonstration here is meant to illustrate how the training optimization algorithm uses an evaluation of the potential transformations between the corresponding points in the training set x and the labels y to estimate a transformation function between those sets which maximizes the generalization across all of the given examples. Shown here are examples of polynomial transformation functions, but in practice the realization of these functions will likely carry extensive point-wise elements of nonlinearity, such as might be relics of ReLU (rectified linear unit) activation functions. Such point-wise elements of nonlinearity help enable the network to perform operations comparable to logic gates. However, there is a distinction between the logic gate sets realized in digital circuits versus those achieved in neural networks, in that in the network there is a type of (non-quantum) superposition between different sets of logic gates, with the weightings between neurons potentially contributing to multiple “logic gates” simultaneously. The application of dropout regularization helps the optimization algorithm to navigate this superposition in a fashion that is not available through other means.
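To make the logic gate analogy a little more tangible, here is a toy sketch (my own illustration, not drawn from any of the cited papers) of how a single ReLU unit with hand-picked weights can behave like an AND gate on binary inputs, and how that same unit fails the additivity test discussed in the next section:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def and_gate(a, b):
    """A single ReLU unit acting as an AND gate on {0, 1} inputs:
    relu(a + b - 1) is 1 only when both a and b are 1."""
    return relu(a + b - 1.0)

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(f"AND({a:.0f}, {b:.0f}) = {and_gate(a, b):.0f}")

# the same unit viewed as f(x) = relu(x - 1) is not additive:
f = lambda x: relu(x - 1.0)
x1, x2 = 1.0, 1.0
print(f(x1 + x2), "vs", f(x1) + f(x2))  # 1.0 vs 0.0
```

A real network superimposes many such gate-like pieces across shared weights, which is the superposition referred to above.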
(You know what, I’m just going to stop trying to draw specific analogies with the silly stuff. This is what happens when Charlie Chaplin works a nine to five in Modern Times. I’m including it because I think it is hilarious. Eat your heart out Lucy :)
A MEASURE OF LINEARITY
(Ok I’m going to make a few leaps here. I wouldn’t consider this extremely well-tested theory and am a little uncertain on the broad applicability. But my intuition is telling me this might be on to something.) The simplification of the neural network’s transformation function in the preceding section turns out to be useful in that it gives us a helpful way to intuit a property of the system, specifically the linearity of the transformation. A function can be said to be additive, and in that respect linear, if for an arbitrary pair of points x1 and x2 the function applied to the sum of the points is equal to the sum of the function applied to each point separately, i.e. f(x1+x2) = f(x1) + f(x2). (Strictly speaking, linearity also requires homogeneity, f(c·x) = c·f(x), but additivity is the aspect relevant here.) This holds for first order polynomial functions such as f(x) = cx, but as the polynomial degree climbs the two sides of this equality will rapidly diverge. A helpful way to picture this divergence can be found in figure 5.2 of the Deep Learning text for instance, which helps to illustrate that when a model has reached a state of overfit, it is equivalent to a fitted high order polynomial which becomes unstable. So one of the leaps here is that I’m proposing a simple measure for the degree of linearity of a system, stated here based on a single point input but one that I think could potentially be extended to systems with inputs of higher complexity. (Even if the output neuron has a sigmoid activation, for instance, I expect we could look at the step preceding that coarse graining to obtain a continuous function to evaluate.)
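The exact formula for the proposed measure appeared as a figure in the original post and isn’t reproduced here, so the following is only an assumed, illustrative form: a Z score built directly from the additivity condition above, where Z closer to zero indicates a more nearly linear (additive) transformation function. The function name and the averaging over random pairs are choices made just for this sketch.

```python
import numpy as np

def additivity_z(f, n_pairs=1000, scale=1.0, rng=None):
    """Estimate a 'degree of linearity' score Z for a scalar function f.

    Z averages |f(x1 + x2) - (f(x1) + f(x2))| over random pairs of points;
    an exactly additive (linear) function scores Z = 0, and Z grows as the
    function becomes more strongly nonlinear over the sampled range.
    """
    rng = rng or np.random.default_rng(0)
    x1 = rng.uniform(-scale, scale, n_pairs)
    x2 = rng.uniform(-scale, scale, n_pairs)
    return np.mean(np.abs(f(x1 + x2) - (f(x1) + f(x2))))

print(additivity_z(lambda x: 3.0 * x))           # ~0 for a first order polynomial
print(additivity_z(lambda x: x ** 3))            # grows for a higher order polynomial
print(additivity_z(lambda x: np.maximum(0, x)))  # nonzero for the piecewise-linear ReLU
```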
Ok so this gets us to the meat of the argument. What I am proposing as an explanation for the utility of dropout is that by randomly shifting the composition of the neural architecture in each step of backpropagation, we are promoting through training a composition in which the subsets of weightings held in a superposition of functional logic gate sets each produce a transformation function whose output is increasingly additive with those of the other subsets. This increasing trend toward additivity is realized by shifting the randomly selected subsets of weightings towards transformation functions which would evaluate to an increased degree of linearity (such as the decreasing Z presented above), and thus the transformation function associated with the complete set of weightings would also trend toward decreasing Z. This trend towards decreasing Z with increased dropout ratio can be considered analogous to L1 regularization’s trend towards sparsity or L2 regularization’s trend towards weight minimization with increasing regularization coefficients.
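To connect the two sketches above (again purely illustrative, with made-up layer sizes and untrained random weights), one could evaluate a Z score like the one just described on the full network and on the random sub-networks a dropout mask induces. The conjecture here would be that training with dropout pushes these scores downward; demonstrating that would require an actual training run rather than this static measurement.

```python
import numpy as np

rng = np.random.default_rng(7)

def relu(z):
    return np.maximum(0.0, z)

# a toy scalar-to-scalar network so the additivity check applies directly
W1, b1 = rng.normal(size=(16, 1)), np.zeros((16, 1))
W2, b2 = rng.normal(size=(1, 16)), np.zeros((1, 1))

def network(x, mask=None):
    """Forward pass; an optional dropout-style mask zeroes hidden nodes."""
    h = relu(W1 @ np.atleast_2d(x) + b1)
    if mask is not None:
        h = h * mask
    return (W2 @ h + b2).ravel()

def additivity_z(f, n_pairs=1000, scale=1.0):
    x1 = rng.uniform(-scale, scale, n_pairs)
    x2 = rng.uniform(-scale, scale, n_pairs)
    return np.mean(np.abs(f(x1 + x2) - (f(x1) + f(x2))))

print("full network Z:", additivity_z(network))
mask = (rng.random((16, 1)) >= 0.5).astype(float)  # one random sub-network
print("sub-network  Z:", additivity_z(lambda x: network(x, mask)))
```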
(Consider this a plug. This movie is great you should go see it, now Stop Making Sense.)
CONCLUSION
Part of the reason for the initial disregard for the now mainstream practice of dropout regularization is that the mechanism behind its application is somewhat counterintuitive and outside the convention of other forms of regularization. This paper has proposed what we believe is a novel explanation for the utility of dropout which draws on some abstractions and analogy. In short, we believe that the application of dropout serves to shift the transformation function realized through the trained neural network weightings to one of increased linearity, analogous to how L1 regularization and L2 regularization shift the weightings toward conditions of sparsity or reduced overall magnitude. We have also proposed a method for evaluating the degree of linearity of simple forms of transformation functions, such as may serve to demonstrate this effect.
May the Lord bless you from Zion
All the days of your life;
May you see the prosperity of Jerusalem,
And may you live to see your children’s children.
Psalm 128
Citations
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”
Yoshua Bengio “Practical Recommendations for Gradient-Based Training of Deep Architectures” (arXiv:1206.5533)
Ian Goodfellow, Yoshua Bengio, and Aaron Courville “Deep Learning”
Nassim N. Taleb, Elie Canetti, Tidiane Kinda, Elena Loukoianova, Christian Schmieder “A New Heuristic Measure of Fragility and Tail Risks: Application to Stress Testing”
Weisstein, Eric W. “Polynomial Curve.” From MathWorld — A Wolfram Web Resource. http://mathworld.wolfram.com/PolynomialCurve.html
Further Reading
- For more of From the Diaries of John Henry: Table of Contents, Book Recommendations, and Music Recommendations.
- Or for some really nifty open source software that automates data-wrangling for machine learning check out Automunge at: automunge.com