Why Does “Early Stopping” Work as Regularization?

RAHUL JAIN
3 min read · Feb 8, 2020


While training machine learning models with significant learning capacity, we often observe that the training error steadily decreases, but after some number of epochs the validation error starts increasing.

After each epoch, the model updates its weights based on the training data. Both training and validation error decrease as long as the model is learning patterns that generalise to the input distribution. After some number of iterations, however, the model starts memorising the training data: the training error keeps decreasing while the validation error rises, which is the signature of overfitting.

Regularization is any technique that prevents models with large learning capacity from overfitting. Regularization techniques typically increase the bias of the model while reducing its variance.

Early Stopping

If the performance of the model on the validation dataset starts to degrade (e.g. the validation loss begins to increase, or the validation accuracy begins to decrease), training is stopped. The model at this stage has relatively low variance and tends to generalize well. Training it further would increase its variance and lead to overfitting. This regularization technique is called “early stopping”.
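A minimal sketch of such a training loop in Python is shown below; the train_one_epoch, validate, get_weights, and set_weights functions, as well as the patience of 5 epochs, are hypothetical placeholders standing in for whatever framework and budget you actually use.

```python
import copy

def fit_with_early_stopping(model, train_data, val_data, max_epochs=100, patience=5):
    """Train until the validation loss has not improved for `patience` epochs.

    `train_one_epoch`, `validate`, `get_weights`, and `set_weights` are
    placeholders for your own framework-specific code.
    """
    best_val_loss = float("inf")
    best_weights = None
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)       # one pass of weight updates
        val_loss = validate(model, val_data)     # monitor generalisation

        if val_loss < best_val_loss:             # still improving on held-out data
            best_val_loss = val_loss
            best_weights = copy.deepcopy(get_weights(model))
            epochs_without_improvement = 0
        else:                                    # validation performance degrading
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                            # stop early

    if best_weights is not None:
        set_weights(model, best_weights)         # keep the best model seen so far
    return model
```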

We can show that for a simple linear model with a quadratic error function and simple gradient descent, early stopping is equivalent to L2 regularization.

Using a Taylor series expansion, we make a quadratic approximation of the cost function J around the empirically optimal value of the weights w*. The matrix H is the Hessian of J with respect to w, evaluated at w*.
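Following the derivation in Goodfellow et al. (see the reference below), and using the fact that the gradient of J vanishes at the minimum w*, the approximation is

$$\hat{J}(w) = J(w^*) + \frac{1}{2}\,(w - w^*)^\top H\,(w - w^*)$$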

The gradient of this approximation is a linear function of H and (w − w*). Using it, we can write down the weights at each step of gradient descent.
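In symbols,

$$\nabla_w \hat{J}(w) = H\,(w - w^*)$$

so a gradient-descent step with learning rate ε reads

$$w^{(\tau)} = w^{(\tau-1)} - \varepsilon\, H\,(w^{(\tau-1)} - w^*), \qquad \text{i.e.} \qquad w^{(\tau)} - w^* = (I - \varepsilon H)\,(w^{(\tau-1)} - w^*)$$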

Since the Hessian matrix is real, symmetric, and positive semi-definite, it admits an eigenvalue decomposition, and H can be written as:
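$$H = Q\,\Lambda\,Q^\top$$

where Q is an orthonormal matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues λi. In this eigenbasis the update above becomes

$$Q^\top (w^{(\tau)} - w^*) = (I - \varepsilon \Lambda)\, Q^\top (w^{(\tau-1)} - w^*)$$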

Assuming ε is chosen small enough that |1 − ελi| < 1, and that the weights are initialised at w(0) = 0, the weights after 𝜏 iterations are:
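$$Q^\top w^{(\tau)} = \left[\, I - (I - \varepsilon \Lambda)^{\tau} \right] Q^\top w^* \qquad (1)$$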

The optimal weights obtained when an L2-norm penalty is imposed on the same objective function are given by (see the reference below):
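$$Q^\top \tilde{w} = (\Lambda + \alpha I)^{-1} \Lambda\, Q^\top w^* = \left[\, I - (\Lambda + \alpha I)^{-1} \alpha \right] Q^\top w^* \qquad (2)$$

where w̃ denotes the L2-regularized solution and α is the regularization constant.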

Comparing Eq. (1) and Eq. (2), the two solutions coincide exactly when their bracketed factors agree, which gives the following relation between 𝜏, ε, and α and makes the equivalence between early stopping and L2 regularization explicit:
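$$(I - \varepsilon \Lambda)^{\tau} = (\Lambda + \alpha I)^{-1} \alpha$$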

Taking the logarithm of both sides and using the series expansion of log(1 + x), we can conclude that if all λi are small (that is, ελi ≪ 1 and λi/α ≪ 1), then the following approximate relation holds:
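$$\tau \approx \frac{1}{\varepsilon \alpha}, \qquad \text{or equivalently} \qquad \alpha \approx \frac{1}{\tau \varepsilon}$$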

Here, α is the regularization constant, 𝜏 is the number of training iterations, and ε is the learning rate.

Increasing the number of epochs/iterations 𝜏 is equivalent to reducing the regularization constant α. Conversely, stopping the model early, i.e. reducing the number of iterations, acts like L2 regularization with a large α. Thus, under these assumptions, early stopping regularizes the model.
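For a purely illustrative example: training with a learning rate of ε = 0.01 for 𝜏 = 10,000 iterations corresponds roughly to a weight-decay strength of α ≈ 1/(𝜏ε) = 0.01, whereas stopping after only 𝜏 = 1,000 iterations corresponds to the ten-times-stronger penalty α ≈ 0.1.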

References:

Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning (Adaptive Computation and Machine Learning series), MIT Press.
