# Regularization in Machine Learning

A major concern when training a neural network, or any machine learning model, is avoiding overfitting. An overfit model will not be accurate on new data because it tries too hard to capture the noise in the training dataset. Noise here means data points that don't represent the true properties of the data but arise from random chance. Learning such data points leads to a high risk of overfitting.

There are various techniques for avoiding **overfitting**. One is cross-validation, which helps estimate the error on held-out data and decide which parameters work best for your model (also known as hyperparameter tuning). Another technique is regularization, which we'll discuss in this article.

# Definition

**Regularization** is a method for reducing generalization error by fitting a function appropriately to the given training set while avoiding overfitting of the model.

# Commonly used techniques to avoid overfitting

1. **L2 regularization** — It is the most common form of regularization. It penalizes the squared magnitude of all parameters in the objective function. For every weight *w* in the network, the term *(λw²)/2* is added to the objective, where λ is the regularization strength. The factor of 1/2 is common because the gradient of this term with respect to *w* is then simply *λw* instead of *2λw*. It heavily penalizes large weight vectors and prefers smaller, more diffuse weight vectors.

Collecting the weights into a matrix *W*, the regularized objective can be written as *L_reg = L + (λ/2)‖W‖²_F*, where L is any loss function and F denotes the Frobenius norm.
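As a minimal sketch (not from the article), the L2 penalty and its gradient could be computed like this, assuming the model's weights are given as a list of NumPy arrays:

```python
import numpy as np

def l2_penalty(weights, lam):
    """Sum of (lam * w^2) / 2 over every weight array in the model."""
    return sum(0.5 * lam * np.sum(w ** 2) for w in weights)

def l2_grad(w, lam):
    """Gradient of the penalty w.r.t. a weight array: simply lam * w."""
    return lam * w

# Toy "model" with two small weight matrices (hypothetical values).
weights = [np.array([[1.0, -2.0]]), np.array([[0.5]])]
penalty = l2_penalty(weights, lam=0.1)
```

In practice the penalty is added to the data loss, and `l2_grad` is added to each weight's gradient during backpropagation.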

2. **L1 regularization** — It is another common form of regularization, where for each weight *w* the term *λ|w|* is added to the objective. The L1 and L2 penalties can also be combined: *λ₁|w| + λ₂w²*. Neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the "noisy" inputs. In general, L2 regularization can be expected to give better performance than L1.

3. **Dropout** — It is an extremely effective, simple, and relatively recently introduced regularization technique. While training, each neuron is kept active with some probability *p* and set to zero otherwise. Because the outputs of a layer under dropout are randomly subsampled, it has the effect of reducing the capacity of the network during training. It can be applied to any or all hidden layers in the network as well as to the input layer, but not to the output layer.

4. **Data Augmentation** — Another way to reduce overfitting is to increase the size of the training data. In many machine learning problems this is difficult because labeled data is costly to obtain. But in the case of images, we can enlarge the dataset by flipping, rotating, scaling, or shifting existing images to create new samples.
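As an illustration (a sketch, not a full augmentation pipeline), random flips and shifts of a 2-D image array could look like this:

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped / shifted copy of a 2-D image array."""
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)        # horizontal flip half the time
    shift = rng.integers(-2, 3)     # shift up to 2 pixels left or right
    out = np.roll(out, shift, axis=1)
    return out
```

Each training epoch then sees slightly different versions of the same labeled images, which acts like extra data.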

5. **Early stopping** — It is a form of cross-validation strategy where we keep one part of the training set aside as a validation set. When performance on the validation set starts getting worse, we immediately stop training the model. This is known as early stopping.
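The loop above can be sketched as follows, where `train_step` and `val_loss` are hypothetical callbacks standing in for one epoch of training and a validation-set evaluation, and `patience` is how many non-improving epochs to tolerate:

```python
def train_with_early_stopping(train_step, val_loss, max_epochs=100, patience=3):
    """Stop once validation loss hasn't improved for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step(epoch)          # run one epoch of training
        loss = val_loss(epoch)     # evaluate on the held-out validation set
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break                  # performance is getting worse: stop
    return best_epoch, best
```

In practice one also restores the weights saved at `best_epoch` rather than keeping the final, overfit ones.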

# Conclusion

- Overfitting tends to occur in more complex neural network models (i.e., networks with many layers or many neurons)
- Complexity of the neural network can be reduced by using L1 and L2 regularization as well as dropout
- L1 regularization drives some weight parameters to exactly zero (sparse weights)
- L2 regularization forces the weight parameters towards zero (but never exactly zero)
- Smaller weight parameters make some neurons' contributions negligible → the neural network becomes less complex → less overfitting
- During dropout, neurons are randomly deactivated (each kept only with probability p) → the neural network becomes less complex