Regularisation Techniques in Machine Learning and Deep Learning

Saurabh Singh
Published in Analytics Vidhya · Oct 8, 2019

One of the most common problems faced by machine learning and deep learning practitioners while building an ML model is “Overfitting”.

What is Overfitting?

A Machine Learning model is said to be “overfitting” when it performs well on the training dataset, but the performance is comparatively poor on the test/unseen dataset.

Let's go through a simple and interesting example to build the intuition behind overfitting.

Consider 2 students who are preparing for an exam by studying from the same book. Student 1 tries to memorize the questions and answers in the book without trying to understand the underlying concepts in different topics present in the book. Student 2 tries to grasp the concepts behind each topic rather than memorizing them like student 1.

Half of the questions in the exam paper appear exactly as they do in the book, whereas the remaining questions are similar but made tricky to test the students' understanding.

In this scenario, student 1 does not perform very well in the exam: he can answer only the straightforward questions taken directly from the book, and struggles with the trickier ones. Student 2, in comparison, performs well on every question because he has a better understanding of the concepts he learned from the book.

We always want our Machine Learning (ML) model to be more like student 2 than student 1.

What is Underfitting?

Underfitting is the counterpart of overfitting, and it is another important concept that comes up whenever we talk about overfitting.

An ML model is said to be underfitting if it does not perform well on either the training or the test dataset.

E.g. consider a student who neither memorized any questions from the book nor tried to understand any of its concepts. He was unable to answer either the straightforward questions or the tricky questions in the exam.

“Similarly, we always want to build a machine learning model that understands the underlying pattern in the training dataset and learns an input-output relationship that yields good predictions on both the training data and the test/unseen data.”

Below is a pictorial representation of Overfitting, Underfitting, and Best/Appropriate fitting.

Image Source: Google

ML and DL models can easily overfit during the training phase. One common cause is a dataset with few data points but a large number of features. Because there are too few data points for the model to find any significant pattern during training, it may end up memorizing the relationship between input and output with respect to each and every feature.

Another cause of overfitting is the use of non-linear machine learning algorithms such as Decision Trees. These algorithms have more freedom in how they fit the available dataset: their branches and nodes can grow into a very deep tree that performs very well on the training data but poorly on the test data.

There are many regularisation techniques in ML and DL that can help us prevent our models from overfitting. The different types of regularisation techniques are discussed below in detail.

(I) L1 and L2 regularisation:-

In both L1 and L2 regularisation, the model is penalized for overfitting the training data: whenever the model tries to predict every training point exactly, a penalty based on the model's coefficients is added to the loss function.

In many Machine Learning techniques such as Logistic Regression and Support Vector Machines, as well as in Deep Learning, we add a regularisation term (penalty) to the loss function so that the loss does not become zero, or very close to zero, on the training data.

Below is the “Logistic Loss” function which is the loss function in case of “Logistic Regression”:

Image Source:- stackoverflow.com
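The formula image itself is not reproduced here. As a rough reconstruction (an assumption based on the surrounding text, which uses x for the weight vector), the unregularised logistic loss over n training points (z_i, y_i) with labels y_i ∈ {−1, +1} can be written as:

L(x) = \sum_{i=1}^{n} \log\left(1 + e^{-y_i \, x^{\top} z_i}\right)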

The loss function shown above does not include a regularisation term.

The ML model will try to reduce the log loss to a very small value close to zero.

If the loss function has no regularisation term, the ML model will push the weight parameter “x” to a very high value (tending towards infinity) in order to bring the overall loss close to zero. But this results in an overfitted model that performs very well only on the training set, which is exactly what we want to avoid.

L2 regularisation(Ridge):-

To avoid overfitting we add a regularisation term as shown below:

Image Source:- stackoverflow.com
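Again as a hedged sketch rather than the exact image, the L2-regularised (Ridge) version of the same loss adds a squared-magnitude penalty on the weights, scaled by the hyperparameter lambda:

L(x) = \sum_{i=1}^{n} \log\left(1 + e^{-y_i \, x^{\top} z_i}\right) + \lambda \, \lVert x \rVert_2^2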

The 2nd term in the loss function above is the “L2” regularisation term. Here, the squared magnitude of the weight parameter, scaled by lambda (a hyperparameter to be tuned while building the model), is added to the logistic loss function.

L2 regularisation is one of the most widely used and well-proven regularisation techniques, and it helps ML practitioners build robust models that generalize well.

If the weight coefficient “x” is made high in order to drive the 1st term of the loss function close to zero, then the second term increases, preventing the overall loss from becoming zero. In this way, the regularisation term penalizes the model for trying to make perfectly accurate predictions on the training data points.

Features of L2 regularisation:-

  • L2 regularisation, also known as “Ridge regression”, performs better than L1 regularisation in most cases.
  • The coefficients of less important features are shrunk to small values but are not made exactly zero.

L1 Regularisation(Lasso):-

Below is the loss function with L1 regularisation term added in it:

Image Source:- stackoverflow.com
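As a hedged sketch of the missing image, the L1-regularised (Lasso) version replaces the squared penalty with the absolute values of the weights:

L(x) = \sum_{i=1}^{n} \log\left(1 + e^{-y_i \, x^{\top} z_i}\right) + \lambda \, \lVert x \rVert_1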

Here, the absolute value of the weight parameter, scaled by lambda (the hyperparameter), is added to the loss function.

Similar to L2 regularisation, if the weight coefficient “x” is made high in order to drive the 1st term of the loss function close to zero, then the second term, i.e. the L1 regularisation term, increases, preventing the overall loss from becoming zero.

For large weights, L1 regularisation penalizes the model less heavily than L2 regularisation, since it uses the absolute values rather than the squared values of the weight parameters in the loss function.

Features of L1 regularisation:-

  • L1 regularisation, also known as Lasso regression, drives the coefficients of the less important features to exactly zero, unlike L2 regularisation.
  • Thus, L1 performs internal feature selection. Because of this, it is preferred in applications where there is some kind of hard cap on the number of features we can use.

Elastic-Net Regularisation:-

Elastic-Net Regularisation is a combination of both L1 and L2 regularisation. It can be represented as shown below:

Source: stats.stackexchange.com
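The exact formula from the source is not reproduced here; one common parameterisation of the Elastic-Net penalty (the one scikit-learn uses, where alpha sets the overall strength and a mixing ratio rho, called l1_ratio in scikit-learn, splits it between the L1 and L2 terms) is:

L(x) = \sum_{i=1}^{n} \log\left(1 + e^{-y_i \, x^{\top} z_i}\right) + \alpha \left( \rho \, \lVert x \rVert_1 + \frac{1 - \rho}{2} \, \lVert x \rVert_2^2 \right)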

The alpha in the above formula plays the same role as the lambda term used in the L1 and L2 regularisation formulas.

The overall penalty applied to the model for overfitting is larger in Elastic-Net regularisation than with L1 or L2 regularisation alone.

Features of Elastic Net Regularisation:-

  • Elastic-Net is a compromise between L1 and L2 regularisation that attempts to shrink coefficients and perform a sparse selection simultaneously.

Scikit-learn's implementations of several ML algorithms have a parameter called “penalty” where we can specify which of the above three regularisation techniques to use while training the model. Below are some snippets from the scikit-learn SGDClassifier documentation showing the default value of the “penalty” parameter and the available options.

Image source:- https://scikit-learn.org/stable/modules
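As a minimal sketch of how the penalty parameter is specified (the hyperparameter values here are illustrative defaults, not recommendations):

```python
from sklearn.linear_model import SGDClassifier

# Logistic-loss SGD classifiers with each of the three penalties.
# Note: older scikit-learn versions spell the loss "log" instead of "log_loss".
clf_l2 = SGDClassifier(loss="log_loss", penalty="l2", alpha=0.0001)
clf_l1 = SGDClassifier(loss="log_loss", penalty="l1", alpha=0.0001)
clf_en = SGDClassifier(loss="log_loss", penalty="elasticnet", alpha=0.0001, l1_ratio=0.15)
```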

Scikit-learn's linear_model module also provides Lasso (L1), Ridge (L2), and Elastic-Net regression and classification estimators that can be applied directly to a dataset, as shown below.

Example:- Below, the above three regularisation techniques are applied to the well-known Boston house price dataset, and the coefficients obtained from each of them are compared:

1. Loading dataset in Pandas data-frame
2. Train-Test Split
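A minimal sketch of these two steps, assuming an older scikit-learn release in which load_boston is still available (it has been removed from recent versions):

```python
import pandas as pd
from sklearn.datasets import load_boston          # removed in scikit-learn >= 1.2
from sklearn.model_selection import train_test_split

# 1. Load the Boston housing data into a pandas DataFrame.
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target

# 2. Hold out 30% of the data for testing (split ratio is illustrative).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```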

(I) Ridge Regression(L2):

3. Applying Ridge Regression
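A hedged sketch of this step (the alpha value is illustrative, not the article's exact setting):

```python
from sklearn.linear_model import Ridge

# Fit Ridge (L2) regression and inspect the learned coefficients.
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print("Ridge test R^2:", ridge.score(X_test, y_test))
print("Ridge coefficients:", dict(zip(X.columns, ridge.coef_)))
```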

As we can observe, the weights of the less important features are reduced but are not made zero in the case of Ridge Regression.

(II) Lasso Regression(L1):

4. Applying Lasso Regression
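A hedged sketch of this step (again with an illustrative alpha):

```python
from sklearn.linear_model import Lasso

# Fit Lasso (L1) regression; some coefficients are driven exactly to zero.
lasso = Lasso(alpha=0.5)
lasso.fit(X_train, y_train)
print("Lasso test R^2:", lasso.score(X_test, y_test))
print("Lasso coefficients:", dict(zip(X.columns, lasso.coef_)))
```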

As we can observe, L1 regularisation (Lasso) sets the coefficients of the less important features to zero, thus performing internal feature selection.

(III) Elastic-Net Regularisation:

5. Applying Elastic-Net Regression
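A hedged sketch of this step (the alpha and l1_ratio values are illustrative):

```python
from sklearn.linear_model import ElasticNet

# Fit Elastic-Net regression; l1_ratio mixes the L1 and L2 penalties.
enet = ElasticNet(alpha=0.5, l1_ratio=0.5)
enet.fit(X_train, y_train)
print("Elastic-Net test R^2:", enet.score(X_test, y_test))
print("Elastic-Net coefficients:", dict(zip(X.columns, enet.coef_)))
```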

(II) Data Augmentation:-

Although it is not discussed as widely as the other regularisation techniques, data augmentation can also help us reduce overfitting.

A dataset with few data points but a high number of features is more prone to overfitting. Data augmentation refers to adding more relevant data to the training set so that the total number of data points used for training increases and becomes sufficient for the model to learn the underlying pattern in the data and generalize well.

However, collecting data is costly and time-consuming, and data relevant to the problem we are solving is not always easy to obtain.

The deep learning library Keras has a class called ImageDataGenerator that is used for data augmentation. DL models contain a very large number of weight parameters and are therefore prone to overfitting when the number of training data points is small. Data augmentation generates new data points from the existing ones by applying various transformations, e.g. zooming, flipping, and rotating images. The ImageDataGenerator class performs such operations on the existing training images to produce new images that are used for training the DL model.

2(a). Loading CIFAR Dataset
2(b). Applying Data Augmentation on CIFAR train data using the ImageDataGenerator module in Keras
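A minimal sketch of these two steps with Keras (the transformation ranges are illustrative choices, not the article's exact values):

```python
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# 2(a) Load the CIFAR-10 training and test images.
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype("float32") / 255.0

# 2(b) Generate augmented images on the fly: rotations, shifts, zooms and flips.
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
)

# The augmented batches can then be fed to a model via, for example:
# model.fit(datagen.flow(x_train, y_train, batch_size=64), epochs=10)
```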

(III) Dropout:-

Dropout is one of the most widely used regularisation techniques in Deep Learning.

Dropout, as the name suggests, is based on randomly “dropping” nodes in a neural network. We specify a dropout probability, which indicates the probability of a node being dropped in each iteration.

Image source: Wikipedia

Suppose we specify the probability of a node being dropped as 0.5 (i.e., a coin flip). In every iteration, some of the nodes from both the input and the hidden layers are dropped, resulting in a simpler neural network that makes decisions based on the remaining nodes only. Since fewer nodes are active in each iteration after applying dropout, the computation time per iteration is also reduced.

Thus, training with a different subset of nodes in each iteration helps capture more randomness in the data and usually performs better than training a single, fully connected neural network.

Image Source: researchgate.net

This is similar to ensemble models in ML (such as Random Forests and GBDT), which use multiple learners to predict the output.

Reference : https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/

In the referenced example, the value of “0.2” in Dropout represents the probability of each node being dropped. This probability is a hyperparameter that must be tuned so that we obtain the optimum results.
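As a hedged sketch of the kind of network the referenced example builds (the layer sizes and input shape here are illustrative assumptions):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# A simple fully connected network in which each hidden unit is dropped
# with probability 0.2 during training.
model = Sequential([
    Dense(64, activation="relu", input_shape=(20,)),
    Dropout(0.2),
    Dense(64, activation="relu"),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```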

Below is an example of applying CNN with Dropout and without dropout on the MNIST dataset.

(1) CNN with Dropout:-

Loading Data
CNN with Dropout=0.5
Fitting the model on training data and specifying the optimizer and metrics
Results of CNN with Dropout
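A minimal, hedged sketch covering these steps (the architecture, epoch count, and batch size are illustrative, not the article's exact configuration):

```python
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.utils import to_categorical

# Load and prepare MNIST.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

# CNN with Dropout = 0.5 applied before the output layer.
model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(10, activation="softmax"),
])

# Specify the optimizer and metrics, then fit on the training data.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, batch_size=128, validation_data=(x_test, y_test))
```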

An accuracy close to 99.5 percent was achieved on the test data by the CNN with Dropout.

(2) CNN without Dropout:-

CNN without Dropout
Results of CNN without Dropout
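For comparison, a sketch of the same network without dropout simply omits the Dropout layer (reusing the imports and data from the sketch above):

```python
# Identical architecture, minus the Dropout(0.5) layer.
model_no_dropout = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(10, activation="softmax"),
])
model_no_dropout.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model_no_dropout.fit(x_train, y_train, epochs=10, batch_size=128, validation_data=(x_test, y_test))
```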

An accuracy close to 99 percent was achieved on the test data by the CNN without Dropout.

Although the accuracies are almost the same in both cases, we can see that using dropout resulted in a model that generalizes better.

(IV) Early Stopping:-

Early Stopping is another very widely used regularisation technique to avoid overfitting while building ML and Deep Learning models. As the name suggests, we “stop early” during the training phase, before the model starts overfitting on the training dataset.

Here, we use a validation set along with the training set and monitor the validation error/loss to decide when the model should stop training.

Source:- fouryears.eu

In the above image, the model stops training at the “blue line”, because beyond that point the CV error starts increasing while the training error continues to decrease, which results in overfitting.

Source :- https://machinelearningmastery.com

In the above snippet, the “monitor” value denotes the metric that is tracked during the training phase to decide when the model should stop training; here, we are monitoring the validation loss. The “patience” value indicates how many epochs with no further improvement in the monitored metric are allowed before training is stopped.
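A hedged sketch of the Keras callback being described (the patience value of 5 is illustrative):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training once the validation loss has not improved for 5 consecutive epochs.
early_stop = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)

# Passed to training via, for example:
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=[early_stop])
```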

Some ML algorithms include an early stopping parameter where we have to specify the early stopping value.

Below is an example of the XGBoost model applied on the Donor’s Choose dataset with early_stopping=20.

1. Loading and reading dataset
2. Train-Test Split
3. Applying model.
4. XgBoost with early stopping = 20
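A minimal sketch of step 4, assuming the Donor's Choose features have already been vectorised into X_train/X_test with labels y_train/y_test (that preprocessing is omitted here). Note that older xgboost releases accept early_stopping_rounds in fit(), while newer ones move it to the constructor:

```python
from xgboost import XGBClassifier

# Gradient-boosted trees with early stopping on a held-out evaluation set.
model = XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],   # set on which the validation metric is monitored
    eval_metric="auc",             # the "validation error" being tracked
    early_stopping_rounds=20,      # stop if no improvement for 20 consecutive rounds
    verbose=True,
)
```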

As per step 4, the number of early-stopping rounds is 20 and the validation error is the performance metric that is monitored. The model keeps training until the validation error shows no improvement for 20 consecutive rounds. This also helps save training time: if 100 epochs are specified with 20 early-stopping rounds, and the model shows no improvement in the 20 epochs following, say, the 50th epoch, then training stops there, thereby avoiding overfitting.

To summarise, the regularisation techniques discussed in this post are:

  • L1, L2, and Elastic-Net regularisation
  • Data augmentation
  • Dropout
  • Early stopping

This brings us to the end of this blog on regularisation techniques in ML and DL. It covered some of the most widely used regularisation techniques, which have helped ML practitioners build robust models that are able to generalize well.

My next blog will be on “Performance metrics in ML and DL”, where we will dive deep into the details of some of the most commonly used performance metrics and discuss the pros and cons of each of them.

Please share your necessary feedback and questions.
