L2 Regularisation: Maths

RAHUL JAIN
3 min read · Feb 8, 2020


L2 regularisation, one of the most popular techniques in Machine Learning, reduces the variance of a model at the cost of a small increase in bias, which makes the model more generalisable.

Figure: Geometrical representation of L2 regularisation

L2 regularisation is equivalent to MAP estimation with a Gaussian prior over the weights. Because of this prior, the model generalises well: each weight is scaled according to its significance in reducing the objective.
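A sketch of this equivalence, assuming a zero-mean Gaussian prior p(w) = N(0, α⁻¹ I) over the weights and writing J(w; X, y) for the negative log-likelihood of the data:

w_{\mathrm{MAP}} = \arg\max_w \left[ \log p(y \mid X, w) + \log p(w) \right] = \arg\min_w \left[ J(w; X, y) + \tfrac{\alpha}{2} w^\top w + \mathrm{const} \right]

Maximising the log-posterior is therefore the same as minimising the loss plus an L2 penalty on the weights.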

The L2 regularised objective function and its gradient are given by,
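\tilde{J}(w; X, y) = J(w; X, y) + \frac{\alpha}{2} w^\top w

\nabla_w \tilde{J}(w; X, y) = \nabla_w J(w; X, y) + \alpha w

where J(w; X, y) is the unregularised objective and α is the regularisation strength (notation as in Goodfellow et al.).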

Using the Taylor series expansion, we make a quadratic approximation of the unregularised objective function around the point w∗, where w∗ is the minimiser of the unregularised objective function.
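\hat{J}(w) = J(w^*) + \frac{1}{2} (w - w^*)^\top H (w - w^*)

Here H is the Hessian matrix of J with respect to w, evaluated at w∗. There is no first-order term because the gradient vanishes at the minimiser, and H is positive semi-definite for the same reason.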

Thus, the regularised (approximate) objective function and its gradient become,
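\hat{\tilde{J}}(w) = J(w^*) + \frac{1}{2} (w - w^*)^\top H (w - w^*) + \frac{\alpha}{2} w^\top w

\nabla_w \hat{\tilde{J}}(w) = H(w - w^*) + \alpha w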

Let w̃ represent the minimiser of the regularised objective function. The gradient at this point is equal to zero.
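Setting the gradient above to zero at w̃ and solving:

\alpha \tilde{w} + H(\tilde{w} - w^*) = 0

(H + \alpha I)\,\tilde{w} = H w^*

\tilde{w} = (H + \alpha I)^{-1} H w^*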

Since H is a real, symmetric, positive semi-definite matrix, it can be decomposed into a diagonal matrix of eigenvalues Λ and an orthonormal basis of eigenvectors Q, such that,
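H = Q \Lambda Q^\top

Substituting this into the expression for w̃:

\tilde{w} = (Q \Lambda Q^\top + \alpha I)^{-1} Q \Lambda Q^\top w^* = Q (\Lambda + \alpha I)^{-1} \Lambda\, Q^\top w^*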

The effect of weight decay is to rescale w∗ along the axes defined by the eigenvectors of H. The component of w̃ along the i-th eigenvector of H is rescaled as follows:
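(Q^\top \tilde{w})_i = \frac{\lambda_i}{\lambda_i + \alpha} (Q^\top w^*)_i

where λi is the eigenvalue of H associated with the i-th eigenvector.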

If λi >> α, the effect of regularisation on that component is relatively small. However, components with λi << α are shrunk to nearly zero magnitude. In other words, only directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact; in the unimportant directions, indicated by a small eigenvalue of the Hessian, the weight components are decayed away by the regularisation over the course of training.

The regularisation constant α is a hyperparameter and is tuned to get the best results. As the value of α increases, the weights are decayed more strongly.
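As a quick numerical check, here is a minimal NumPy sketch of the closed-form result above; the Hessian, α and w∗ are arbitrary illustrative values rather than quantities from any real model:

import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive semi-definite "Hessian" H.
A = rng.normal(size=(4, 4))
H = A @ A.T

alpha = 0.5                   # regularisation strength (hyperparameter)
w_star = rng.normal(size=4)   # minimiser of the unregularised objective

# Minimiser of the regularised quadratic approximation: (H + alpha*I)^-1 H w*.
w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Same result via the eigendecomposition H = Q diag(lam) Q^T:
# each eigen-component of w* is shrunk by lam_i / (lam_i + alpha).
lam, Q = np.linalg.eigh(H)
w_tilde_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))

print(np.allclose(w_tilde, w_tilde_eig))   # True

The direct solve and the eigenvector rescaling agree, illustrating that weight decay shrinks each component by the factor λi/(λi + α).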

References:

Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning (Adaptive Computation and Machine Learning series). MIT Press, 2016.
