Deep Learning: Regularization Notes

In the previous article (long ago, now I am back!!) I talked about overfitting and the problems it causes. In this article I will discuss one possible way to prevent overfitting: regularization (short notes drawn from sources including Stanford’s CS231n course).

A central problem in machine learning is how to make an algorithm that will perform well not just on the training data, but also on new, unseen test inputs. Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. Put another way, any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error is regularization.

There are many regularization strategies. Some put extra constraints on a machine learning model, such as adding restrictions on the parameter values. Some add extra terms in the objective function that can be thought of as corresponding to a soft constraint on the parameter values. These strategies are collectively known as regularization.

In fact, developing more effective regularization strategies has been one of the major research efforts in the machine learning field. Sometimes these constraints and penalties are designed to encode specific kinds of prior knowledge. Other times, these constraints and penalties are designed to express a generic preference for a simpler model class in order to promote generalization. Sometimes penalties and constraints are necessary to make an underdetermined problem determined.

There are several forms of regularization with which we can prevent overfitting in a neural network or other machine learning model.

Parameter Norm Penalties

Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J. We denote the regularized objective function by J̃.

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)    (1)

where α ∈ [0, ∞) is a hyperparameter that weights the contribution of the norm penalty term Ω relative to the standard objective function J. Setting α to zero results in no regularization. Larger values of α correspond to more regularization.

We note that for neural networks, we typically choose to use a parameter norm penalty Ω that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized. The biases typically require less data to fit accurately than the weights. Each weight specifies how two variables interact. Fitting the weight well requires observing both variables in a variety of conditions. Each bias controls only a single variable. This means that we do not induce too much variance by leaving the biases unregularized. Also, regularizing the bias parameters can introduce a significant amount of underfitting. We therefore use the vector w to indicate all of the weights that should be affected by a norm penalty, while the vector θ denotes all of the parameters, including both w and the unregularized parameters.
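The regularized objective described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not in the notes: a linear model with a mean squared error loss standing in for J, the L² penalty standing in for Ω, and hypothetical names (`mse_loss`, `l2_penalty`, `regularized_loss`) and toy data. The point it shows is that the penalty is computed from the weights w only, while the bias b is left out.

```python
def mse_loss(w, b, X, y):
    """Unregularized objective J: mean squared error of a linear model."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        pred = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        total += (pred - y_i) ** 2
    return total / len(X)


def l2_penalty(w):
    """Omega(w) = 1/2 * ||w||_2^2, computed on the weights only."""
    return 0.5 * sum(w_j * w_j for w_j in w)


def regularized_loss(w, b, X, y, alpha):
    """J~ = J + alpha * Omega(w); the bias b is deliberately not penalized."""
    return mse_loss(w, b, X, y) + alpha * l2_penalty(w)


# Toy data (hypothetical values, chosen only for illustration).
X = [[1.0, 2.0], [3.0, 4.0]]
y = [1.0, 2.0]
w, b = [0.5, -0.5], 10.0

print(regularized_loss(w, b, X, y, alpha=0.0))  # equals the plain loss J
print(regularized_loss(w, b, X, y, alpha=0.1))  # J plus 0.1 * Omega(w)
```

Note that the large bias value changes J itself but contributes nothing to the penalty term, which depends on w alone.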

In the figure, the left image shows a degree-9 polynomial overfitting the training dataset; when we apply regularization (right image), the model starts to generalize.

L² regularization: This is one of the most commonly used forms of regularization. The L² parameter norm penalty is commonly known as weight decay. L² regularization drives the weights closer to the origin by adding a regularization term Ω(θ) = ½||w||²₂ to the objective function. Such a model has the following total objective function:

J̃(w; X, y) = (α/2) wᵀw + J(w; X, y)    (wᵀ denotes the transpose of w)

L² regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. Due to the multiplicative interactions between weights and inputs, this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot.
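A small worked example may make the "peaky versus diffuse" preference concrete. The input and the two weight vectors below are made-up values, and `dot` and `l2_penalty` are hypothetical helper names. Both weight vectors produce the same output for this input, yet the L² penalty of the diffuse one is four times smaller.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def l2_penalty(w):
    """Omega(w) = 1/2 * w^T w."""
    return 0.5 * dot(w, w)


x = [1.0, 1.0, 1.0, 1.0]              # a toy input
w_peaky = [1.0, 0.0, 0.0, 0.0]        # relies on a single input
w_diffuse = [0.25, 0.25, 0.25, 0.25]  # uses every input a little

print(dot(w_peaky, x), dot(w_diffuse, x))          # 1.0 1.0
print(l2_penalty(w_peaky), l2_penalty(w_diffuse))  # 0.5 0.125
```

Since the data term cannot tell the two apart on this input, the penalty breaks the tie in favour of the diffuse vector.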

Lastly, notice that during the gradient descent parameter update, using L² regularization ultimately means that every weight is decayed linearly towards zero: W += -lambda * W, where lambda is the regularization strength (α above). Let’s see what this means. The addition of the weight decay term modifies the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update. This describes what happens in a single step. But what happens over the entire course of training? L² regularization causes the learning algorithm to “perceive” the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.
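The single-step claim can be checked numerically. The sketch below assumes plain gradient descent with a learning rate `lr` (a symbol the notes do not name) and a made-up gradient for the unregularized loss; the function names are hypothetical. It compares a gradient step on the regularized objective with the "shrink by a constant factor, then take the usual step" form, where the factor is (1 - lr * alpha).

```python
def sgd_step_with_l2(w, grad_J, lr, alpha):
    """One gradient step on J~ = J + (alpha/2) * w^T w.

    The gradient of the penalty term with respect to w is alpha * w.
    """
    return [w_j - lr * (g_j + alpha * w_j) for w_j, g_j in zip(w, grad_J)]


def shrink_then_step(w, grad_J, lr, alpha):
    """Shrink w by the factor (1 - lr * alpha), then apply the usual update."""
    shrunk = [(1.0 - lr * alpha) * w_j for w_j in w]
    return [s_j - lr * g_j for s_j, g_j in zip(shrunk, grad_J)]


w = [2.0, -1.0, 0.5]
grad_J = [0.3, -0.2, 0.0]  # hypothetical gradient of the unregularized loss
lr, alpha = 0.1, 0.5

a = sgd_step_with_l2(w, grad_J, lr, alpha)
b = shrink_then_step(w, grad_J, lr, alpha)
print(all(abs(p - q) < 1e-12 for p, q in zip(a, b)))  # True
```

The two forms agree up to floating-point rounding, which is why the L² penalty is called weight decay: wherever the data gradient is zero, the weight simply shrinks by the same factor on every step.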

L¹ regularization: L¹ regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e. very close to exactly zero). In other words, neurons with L¹ regularization end up using only a sparse subset of their most important inputs, since most weights go very close to zero, and become nearly invariant to the “noisy” inputs. In comparison, final weight vectors from L² regularization are usually diffuse, small numbers. The sparsity property induced by L¹ regularization has been used extensively as a feature selection mechanism. The L¹ penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded. In practice, if you are not concerned with explicit feature selection, L² regularization can be expected to give superior performance over L¹.
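To see why the two penalties behave so differently, here is a toy sketch under a strong simplifying assumption that is not in the notes: each weight has its own one-dimensional quadratic loss 0.5 * (w - w_star)², where w_star is the value the weight would take without regularization. In that setting the penalized minimizers have simple closed forms. The L¹ penalty subtracts a fixed amount and clips at zero, while the L² penalty only rescales. The function names and numbers are hypothetical.

```python
def l1_minimizer(w_star, alpha):
    """argmin over w of 0.5*(w - w_star)**2 + alpha*abs(w) (soft thresholding)."""
    magnitude = max(abs(w_star) - alpha, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if w_star > 0 else -magnitude


def l2_minimizer(w_star, alpha):
    """argmin over w of 0.5*(w - w_star)**2 + 0.5*alpha*w**2 (uniform rescaling)."""
    return w_star / (1.0 + alpha)


unregularized = [3.0, -0.4, 0.2, -2.0]  # hypothetical per-weight optima
alpha = 0.5

l1 = [l1_minimizer(w, alpha) for w in unregularized]
l2 = [l2_minimizer(w, alpha) for w in unregularized]
print(l1)  # [2.5, 0.0, 0.0, -1.5]
print(l2)  # every entry shrinks but stays nonzero
```

Weights whose unregularized value is smaller in magnitude than alpha land exactly on zero under L¹, which is the sparsity described above; under L² they only get smaller.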

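Formally, L¹ regularization on the model parameter w is defined as Ω(θ) = ||w||₁ = Σᵢ |wᵢ|, that is, the sum of the absolute values of the individual weights, in contrast to the L² penalty Ω(θ) = ½||w||²₂ given earlier.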

So that was all about the two most widely used regularization techniques. I hope you liked it, and if so, please give it a clap. In my next article I will come back with some other well-known methods to prevent overfitting.

Articles in Sequence:

  2. Deep Learning: Basic Mathematics for Deep Learning
  3. Deep Learning: Feedforward Neural Network
  4. Back Propagation
  5. Overfitting