Regularization in Machine Learning and Deep Learning

Machine learning works with finite training data and an infinite space of hypotheses, so selecting the right hypothesis is a great challenge.

Amod Kolwalkar
Analytics Vidhya
10 min read · Sep 20, 2019


Model Fitting Scenarios

Before coming to regularization, let's try to understand what model fitting is.

This image shows the need for the bias-variance trade-off, which is like finding a sweet spot. Before getting into technical jargon, let's try to understand these concepts from a layman's point of view.

Your task is to raise a child, and your objective is to raise him well. How much flexibility would you want to give your child in his upbringing?

Objective function: Holistic development of the child.

If you just make the kid study for exams, or try to make him an athlete by depriving him of any relaxation or buffer time, then you are overfitting: the kid can only do one specific type of task, a kind of specialization, and he won't develop a generalized understanding of anything in life beyond that one task.

Generally, kids want to play on mobile phones, eat chocolates and ice cream, and live a life full of sunshine and rainbows.

Introduce Regularization: Where you try to hit the sweet spot

  • If you want to have chocolates, distribute them equally among your friends and family.
  • If you want to play games, finish your homework first.

These are some of the kinds of regularization we see in everyday parenting.

The bias-variance trade-off is a tug of war between overfitting and underfitting. Overfitting is like studying for a test only by going through previous years' questions, whereas underfitting is like only reading the chapter summaries before answering the test. On a general test, both approaches will fail. To combat this we have regularization.

Regularization for Machine Learning

Regularization is “any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.”

If a model performs perfectly on the training dataset, there is no guarantee it will perform as well on new data, and unseen data is exactly what the model encounters once we use it. The most straightforward strategy is simply to increase the dataset, but that is not realistic most of the time. The modification is therefore usually some mathematical improvement, for instance a more suitable loss function, early stopping, dropout, and so on.

In this section we will look at L1 and L2 regularization, i.e. regularization done using the L1 and L2 norms. There is often confusion regarding terminology, so here are the equivalent terms:

Regularization in Linear Models

The types of regularization mentioned here are the ones used in linear models, i.e. where the classifier or regressor can be expressed as a line or hyperplane, such as Linear Regression, Logistic Regression, and Support Vector Machines.

Euclidean norm == Euclidean length == L2 norm == L2 distance

Manhattan norm == Manhattan length == L1 norm == L1 distance

You can go through this link for more on distance metrics: https://medium.com/@kolwalkaramod96/hello-f4a9317ad

L2 vs L1 Plot
3-D Contour Plot of The Same

This plot helps to distinguish between the L1 and L2 norms. The blue curve is the L2 norm and the red curve, whose slope changes abruptly at zero, is the L1 norm. The L1 norm is not differentiable at zero, which is what creates sparsity; the L2 norm is differentiable everywhere, so we can use techniques like stochastic gradient descent directly.

Equation of general learning model

Optimization function = Loss + Regularization term

If the model is Logistic Regression then the loss is log-loss; if the model is a Support Vector Machine then the loss is hinge-loss.

If the model is a neural network then it will be some form of cross-entropy loss.

The L1 and L2 norms are applicable in deep learning models as well.

L1 and L2 general optimization equation
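One common way to write these objectives, assuming $w$ denotes the model weights, $L$ the per-example loss, and $\lambda$ the regularization strength (the notation here is an assumption, not necessarily that of the original figure), is:

```latex
% L2-regularized objective
\min_{w} \; \sum_{i=1}^{n} L\big(y_i, f(x_i; w)\big) + \lambda \lVert w \rVert_2^2

% L1-regularized objective
\min_{w} \; \sum_{i=1}^{n} L\big(y_i, f(x_i; w)\big) + \lambda \lVert w \rVert_1
```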

Here, lambda is the regularization parameter: a hyper-parameter whose value is tuned for better results. L2 regularization is also known as weight decay, as it forces the weights to decay towards zero (but not exactly zero).

If lambda = 0 the model will tend to overfit, since the optimization objective reduces to the loss function alone. If lambda is infinite or very large, the model will tend to underfit. We can find the optimal lambda by hyper-parameter tuning of the model.
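As a rough sketch (not the code used in this article), tuning the regularization strength of a logistic regression with scikit-learn could look like the following; note that scikit-learn exposes C = 1/lambda, so a small C means strong regularization:

```python
# Hypothetical sketch: choosing the regularization strength by cross-validated grid search.
# scikit-learn's LogisticRegression uses C = 1/lambda (small C = strong regularization).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(penalty="l2", max_iter=1000),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)

print("best C (i.e. 1/lambda):", grid.best_params_["C"])
```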

Comparison between L1 and L2 Regularization

If you want the best of both worlds you can use elastic net, which uses both the L1 and L2 regularizers in the objective function.
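For example, scikit-learn's ElasticNet (for regression) mixes the two penalties through an l1_ratio parameter; the snippet below is only an illustrative sketch, not code from this article:

```python
# Illustrative sketch: elastic net combines the L1 and L2 penalties.
# alpha is the overall regularization strength, l1_ratio the L1/L2 mix (1.0 = pure lasso).
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=500, n_features=30, noise=10.0, random_state=0)

model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)

# The L1 component drives some coefficients exactly to zero (sparsity).
print("non-zero coefficients:", (model.coef_ != 0).sum(), "of", len(model.coef_))
```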

To get the best results from regularization, it's best to choose the regularization strength by cross-validation. The link below is an introduction to cross-validation.

https://machinelearningmastery.com/k-fold-cross-validation/

Let's look at regularization in the case of SVM, since with its margin-maximizing strategy it can be viewed as a more general form of logistic regression.

The regularization parameter (lambda) serves as the degree of importance given to misclassifications. SVMs pose a quadratic optimization problem that maximizes the margin between the two classes while minimizing the amount of misclassification. However, for non-separable problems, in order to find a solution the misclassification constraint must be relaxed, and this is done by setting the aforementioned "regularization" parameter.

So, intuitively, the larger lambda grows, the fewer wrongly classified examples are allowed (or the higher the price paid for them in the loss function). When lambda tends to infinity, the solution tends to the hard margin (no misclassification allowed). When lambda tends to 0 (without being 0), more misclassifications are allowed.

There is definitely a trade-off between these two, and normally smaller lambdas, but not too small, generalize well. Below are three examples of linear (binary) SVM classification.

For non-linear-kernel SVMs the idea is similar. Given this, higher values of lambda carry a higher possibility of overfitting, while lower values of lambda carry a higher possibility of underfitting.

The images below show the behaviour for the RBF kernel, keeping the sigma parameter fixed at 1 and trying lambda = 0.01 and lambda = 10.

You could say the first figure, where lambda is lower, is more "relaxed" than the second figure, where the data is fitted more tightly.
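As a hedged sketch of a similar experiment (not the original code), scikit-learn's SVC exposes this penalty as C, which plays the role of the lambda described above; gamma here loosely stands in for the sigma parameter (gamma = 1/(2·sigma²)):

```python
# Rough sketch: effect of the misclassification penalty on an RBF-kernel SVM.
# C corresponds to the lambda discussed above; gamma=0.5 roughly corresponds to sigma=1.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in (0.01, 10):
    clf = SVC(kernel="rbf", gamma=0.5, C=C).fit(X_train, y_train)
    print(f"C={C}: train={clf.score(X_train, y_train):.2f}, "
          f"test={clf.score(X_test, y_test):.2f}")
```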

Regularization in Tree Based Models

Tree-based models are models such as Decision Trees, Random Forests, etc.

Besides L1 and L2 regularization, there are regularization techniques specific to tree models. Tree models generally overfit, and L1 and L2 may not be the best regularizers for them. Common options are:

  1. Limit the maximum depth of the trees
  2. Set stricter stopping criteria on when to split a node further (e.g. minimum gain, number of samples, etc.)
  3. Tune the hyper-parameters and plot the train and cross-validation performance
  4. If you are using a Decision Tree as the base model, you can use an ensemble such as a Random Forest (a small sketch of the first two options follows this list)
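A minimal sketch of options 1 and 2, assuming a scikit-learn decision tree (the parameter values are illustrative, not taken from this article):

```python
# Sketch: regularizing a decision tree by limiting depth and tightening split criteria.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0)
regularized = DecisionTreeClassifier(max_depth=5,            # 1. limit the max depth
                                     min_samples_split=20,   # 2. stricter stopping criterion
                                     min_impurity_decrease=1e-3,
                                     random_state=0)

for name, tree in [("unconstrained", unconstrained), ("regularized", regularized)]:
    scores = cross_val_score(tree, X, y, cv=5)
    print(name, "cross-validation accuracy:", round(scores.mean(), 3))
```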

Gradient Boosted Decision Trees

It's an ensemble model that uses Decision Trees as the base model; boosting is an ensemble technique, but it works differently from Random Forest.

For more details, refer to https://stats.stackexchange.com/questions/173390/gradient-boosting-tree-vs-random-forest

Some of the techniques used in Gradient Boosting are:

  1. Hyper-parameter tuning: one natural regularization parameter is the number of gradient boosting iterations M (i.e. the number of trees in the model when the base learner is a decision tree). Increasing M reduces the error on the training set, but setting it too high may lead to overfitting. An optimal value of M is often selected by monitoring prediction error on a separate validation dataset.
  2. Another regularization parameter is the depth of the trees. The higher this value, the more likely the model will overfit the training data.
  3. Shrinkage: an important part of the gradient boosting method is regularization by shrinkage, which consists of modifying the update rule as follows:

$$F_m(x) = F_{m-1}(x) + \nu \, \gamma_m h_m(x), \qquad 0 < \nu \le 1,$$

where the parameter $\nu$ is called the "learning rate".

Empirically, it has been found that using small learning rates (such as $\nu < 0.1$) yields dramatic improvements in a model's generalization ability over gradient boosting without shrinkage ($\nu = 1$). However, this comes at the price of increased computation both during training and querying: a lower learning rate requires more iterations.

Similar methods of limiting the maximum depth and pruning the trees can also be used.
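As a hedged illustration (not code from this article), the knobs above map directly onto scikit-learn's GradientBoostingClassifier:

```python
# Sketch: the regularization knobs discussed above for GBDTs:
# n_estimators = M (number of trees), learning_rate = nu (shrinkage), max_depth = tree depth.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=300,    # M
                                  learning_rate=0.05,  # nu < 0.1
                                  max_depth=3)         # shallow trees
gbdt.fit(X_train, y_train)
print("validation accuracy:", round(gbdt.score(X_val, y_val), 3))
```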

Regularization in Naive Bayes

Laplace Smoothing or Additive Smoothing

It is a technique used to smooth categorical data. Given an observation $x = \langle x_1, x_2, \ldots, x_d \rangle$ from a multinomial distribution with $N$ trials, a "smoothed" version of the data gives the estimator:

$$\hat{\theta}_i = \frac{x_i + \alpha}{N + \alpha d} \qquad (i = 1, \ldots, d),$$

where the "pseudocount" $\alpha > 0$ is a smoothing parameter and $\alpha = 0$ corresponds to no smoothing. Additive smoothing is a type of shrinkage estimator, as the resulting estimate lies between the empirical probability (relative frequency) $x_i / N$ and the uniform distribution $1/d$.

You always need this ‘fail-safe’ probability.

To see why, consider the worst case where none of the words in the training sample appear in the test sentence. In that case our model would conclude that the sentence is impossible, yet it clearly exists, creating a contradiction.

Another extreme example is the test sentence "Alex met Steve.", where "met" appears several times in the training sample but "Alex" and "Steve" do not. The model would conclude that this statement is very likely, which is not true.
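In scikit-learn, additive smoothing for Naive Bayes is exposed through the alpha parameter of MultinomialNB; the tiny example below (with made-up sentences) is only a sketch of the idea:

```python
# Sketch: Laplace/additive smoothing in a multinomial Naive Bayes text classifier.
# alpha is the pseudocount; alpha=1.0 is classic Laplace smoothing.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["alex met steve", "steve likes football", "alex plays chess"]
train_labels = [0, 1, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(train_texts, train_labels)

# With alpha > 0, a word that never occurred in a class no longer
# zeroes out that class's probability for the whole sentence.
print(model.predict_proba(["alex met steve"]))
```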

Regularization in Deep Learning

L1 and L2 regularization also hold for deep learning, but alongside them there are some additional methods, such as:

  • Dropout
  • Data Augmentation
  • Early Stopping

Dropout

To understand what dropout is, let's look at a classical neural network.

Dropout is a regularization technique patented by Google[1] for reducing overfitting in neural networks by preventing complex co-adaptations on training data. It is a very efficient way of performing model averaging with neural networks.[2] The term “dropout” refers to dropping out units (both hidden and visible) in a neural network.[3][4]

Data Augmentation

The simplest way to reduce overfitting is to increase the size of the training data. In machine learning we often cannot increase the size of the training data, because labelled data is too costly.

But now let's consider we are dealing with images. In this case, there are a few ways of increasing the size of the training data: rotating the image, flipping, scaling, shifting, etc. In the image below, some of these transformations have been applied to the handwritten digits dataset.

Early stopping

Early stopping is a kind of cross-validation strategy where we keep one part of the training set as a validation set. When we see that the performance on the validation set is getting worse, we immediately stop training the model.

Code Snippets

Now that we know some theory, let's get started by taking a dataset. The dataset we are considering is the Malaria Cell Images dataset: https://www.kaggle.com/iarunava/cell-images-for-detecting-malaria

There are two classes of images: malaria-infected and uninfected cells.

L1 and L2 Regularization

In Keras, we can directly apply regularization to any layer using the regularizers module. Here, a regularizer is applied to a dense layer with 100 neurons and a ReLU activation function.
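A minimal sketch of what this could look like; the layer sizes follow the description above, but the input shape, penalty factor, and surrounding architecture are assumptions rather than the article's exact code:

```python
# Sketch: L2 (or L1) penalty on a Dense layer with 100 neurons and ReLU activation.
# The 0.01 penalty factor and the 64x64x3 input shape are assumed values.
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Flatten(input_shape=(64, 64, 3)),
    layers.Dense(100, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),  # or regularizers.l1(0.01)
    layers.Dense(1, activation="sigmoid")                    # infected vs. uninfected
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```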

Dropout

So each iteration has a different set of nodes and this results in a different set of outputs. It can also be thought of as an ensemble technique in machine learning.

Ensemble models usually perform better than a single model as they capture more randomness. Similarly, dropout also performs better than a normal neural network model.

The probability of dropping a node is the hyperparameter of the dropout function.

Due to these reasons, dropout is usually preferred when we have a large neural network structure in order to introduce more randomness.

In Keras, we can implement dropout using the Dropout layer. Below is the dropout implementation: a dropout rate of 0.2 (the probability of dropping a node) is used after the last hidden convolutional layer, which has 64 kernels, and after the first dense layer, which has 500 neurons.
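A sketch of that arrangement, with the rest of the architecture assumed rather than copied from the article:

```python
# Sketch: Dropout(0.2) after the last convolutional layer (64 kernels)
# and after the first dense layer (500 neurons), as described above.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.2),                    # dropout after the last hidden conv layer
    layers.Flatten(),
    layers.Dense(500, activation="relu"),
    layers.Dropout(0.2),                    # dropout after the first dense layer
    layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```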

Data Augmentation

In Keras, we can perform all of these transformations using ImageDataGenerator. It has a long list of arguments which you can use to pre-process your training data.
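A sketch with illustrative argument values (the directory layout and parameters are assumptions, not necessarily those used in the article):

```python
# Sketch: typical augmentation arguments for ImageDataGenerator.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,        # random rotations up to 20 degrees
    width_shift_range=0.1,    # random horizontal shifts
    height_shift_range=0.1,   # random vertical shifts
    zoom_range=0.1,           # random zoom
    horizontal_flip=True,     # random horizontal flips
    validation_split=0.2,
)

# Assumed directory layout: cell_images/<class_name>/<image>.png
train_gen = train_datagen.flow_from_directory(
    "cell_images", target_size=(64, 64), batch_size=32,
    class_mode="binary", subset="training")
val_gen = train_datagen.flow_from_directory(
    "cell_images", target_size=(64, 64), batch_size=32,
    class_mode="binary", subset="validation")
```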

Early Stopping

In Keras, we can apply early stopping using callbacks. Below is the implementation: early stopping is applied so that training stops if the validation error has not decreased for 3 epochs.
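A minimal sketch of such a callback (monitoring the validation loss with a patience of 3 epochs, as described above):

```python
# Sketch: stop training when the validation loss has not improved for 3 epochs.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss",
                           patience=3,                 # wait 3 epochs without improvement
                           restore_best_weights=True)  # roll back to the best weights
# Pass it to training via: model.fit(..., callbacks=[early_stop])
```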

Combining all these elements

The accuracy we get is around 95%+ at classifying whether a cell image contains malaria or not.
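Putting the pieces together, training the sketched model with the augmented generators and the early-stopping callback might look roughly like this (again an assumption-laden sketch, reusing the hypothetical `model`, `train_gen`, `val_gen`, and `early_stop` objects from the snippets above):

```python
# Sketch: training with augmented data and early stopping.
history = model.fit(train_gen,
                    validation_data=val_gen,
                    epochs=30,
                    callbacks=[early_stop])

val_loss, val_acc = model.evaluate(val_gen)
print(f"validation accuracy: {val_acc:.3f}")
```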
