# Regularization — Tackling Overfitting

**Regularization** is any technique that penalizes model complexity so that the model generalizes better to unseen data, thereby preventing overfitting. In this blog, we will visit common regularization techniques.

Your neural network is only as good as the data you feed it.

## Data Augmentation

The performance of deep learning neural networks often improves with the amount of data available. But we don’t usually have a huge amount of data. Data augmentation is a technique to artificially create new training data from existing training data. Depending upon when we apply these transformations we have two types of augmentation:

**Online** — apply transformations to each mini-batch on the fly, just before feeding it to the model

**Offline** — perform all the necessary transformations beforehand, storing the augmented dataset before training

**Examples of image data augmentation**

PCA

Flipping, Rotation

Cropping, Scaling

Conditional GANs or style transfer can also be used to generate more data.
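The simpler transformations above can be sketched in a few lines of NumPy; the function name and the 90% crop ratio here are illustrative choices, not a standard API:

```python
import numpy as np

def augment_image(img, rng):
    """Return a randomly flipped, rotated, and cropped copy of an HxWxC image."""
    out = img
    if rng.random() < 0.5:                  # horizontal flip with prob 0.5
        out = out[:, ::-1]
    k = rng.integers(0, 4)                  # rotate by 0/90/180/270 degrees
    out = np.rot90(out, k)
    h, w = out.shape[:2]
    ch, cw = int(h * 0.9), int(w * 0.9)     # random crop to 90% of each side
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return out[top:top + ch, left:left + cw]

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
aug = augment_image(img, rng)
print(aug.shape)  # (28, 28, 3)
```

In practice you would chain such transforms through a library (e.g. torchvision) so each epoch sees a freshly augmented batch.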

**Examples of NLP data augmentation**

Synonym Replacement

Random Insertion/Deletion

Word Embeddings
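Synonym replacement and random deletion can be sketched as follows; the tiny synonym table is a stand-in for a real resource such as WordNet or embedding neighbours:

```python
import random

# Toy synonym table; a real pipeline would query WordNet or embeddings.
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "joyful"]}

def synonym_replacement(words, rng):
    """Replace each word that has a known synonym with a random one."""
    return [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]

def random_deletion(words, rng, p=0.2):
    """Drop each word with probability p (always keep at least one word)."""
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]

rng = random.Random(0)
sent = "the quick fox is happy".split()
print(synonym_replacement(sent, rng))
print(random_deletion(sent, rng))
```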

**Example for Numerical Data**

SMOTE
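The core idea of SMOTE — interpolating between a minority-class sample and one of its nearest minority neighbours — can be sketched in NumPy (a simplified sketch, not the reference implementation from imbalanced-learn):

```python
import numpy as np

def smote_sample(X_minority, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    a random sample and one of its k nearest minority-class neighbours."""
    rng = rng or np.random.default_rng(0)
    X = np.asarray(X_minority, dtype=float)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)     # distances to all points
        neighbours = np.argsort(d)[1:k + 1]      # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                       # interpolation factor in [0, 1)
        new.append(X[i] + lam * (X[j] - X[i]))
    return np.array(new)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sample(X_min, n_new=5)
print(synthetic.shape)  # (5, 2)
```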

## Dropout

Dropout is a regularization technique that zeros out the activations of randomly chosen neurons during training. Dropout is applied per layer in a neural network, and different layer types typically use different dropout rates.

Note: Dropout increases the number of iterations needed for convergence, but since it reduces the computation done per iteration, each iteration is faster.
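A minimal sketch of the standard "inverted dropout" forward pass, where surviving activations are rescaled during training so that no change is needed at test time:

```python
import numpy as np

def dropout(activations, p_drop, rng, train=True):
    """Inverted dropout: zero each activation with probability p_drop and
    rescale survivors by 1/(1 - p_drop) so expected values are unchanged."""
    if not train or p_drop == 0.0:
        return activations            # at test time, dropout is a no-op
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
a = np.ones((4, 5))
out = dropout(a, p_drop=0.5, rng=rng)
print(out)  # entries are either 0.0 (dropped) or 2.0 (kept and rescaled)
```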

## Ensemble

Bagging — combine high-variance (strong) learners and smooth out their predictions

Bagging decreases the variance of predictions by training each model on a bootstrap resample of the dataset (sampling with replacement to produce multi-sets of the original data) and averaging their outputs. Averaging many such models narrows the prediction around the expected outcome. Ex. **Random Forest**

Boosting — combine weak learners into a strong learner

Boosting is a way to decrease bias. The subsets are not created randomly; each new subset emphasizes the examples that were (or were likely to be) misclassified by the previous models. Ex. **XGBoost**
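The variance-reduction idea behind bagging can be demonstrated with a deliberately tiny "model" — here each model is just the mean of a bootstrap resample; real bagging (e.g. Random Forest) fits a full learner on each resample:

```python
import numpy as np

def bagged_mean(y, n_models, rng):
    """Bagging sketch: each 'model' is the mean of a bootstrap resample
    (sampling with replacement); real bagging fits a full learner on each
    resample and averages (or votes on) their predictions."""
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap resample
        preds.append(y[idx].mean())
    return np.array(preds)

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=200)
preds = bagged_mean(y, n_models=50, rng=rng)
# the ensemble's averaged estimate varies far less than the raw data
print(preds.mean(), preds.std())
```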

## Early stopping

During training, the model is evaluated on a separate validation dataset after each epoch. If the validation loss (or any other chosen metric) starts to degrade, the training process is stopped; this is called early stopping. We save the model weights whenever validation performance improves over the previous best, and in the end the weights with the best validation performance are selected. An important point here is to choose the performance metric that best reflects the quality of the model.

Early stopping can be used universally.
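A patience-based early-stopping loop can be sketched as below; `train_step` and `validate` are placeholders for your own training and evaluation functions:

```python
def train_with_early_stopping(train_step, validate, max_epochs=100, patience=5):
    """Stop when validation loss has not improved for `patience` epochs,
    and keep the weights from the best epoch seen so far."""
    best_loss, best_weights, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        weights = train_step(epoch)          # one epoch of training
        val_loss = validate(weights)         # evaluate on validation set
        if val_loss < best_loss:
            best_loss, best_weights, bad_epochs = val_loss, weights, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                        # validation loss stopped improving
    return best_weights, best_loss

# toy run: validation loss falls until "epoch 10", then rises again
weights, loss = train_with_early_stopping(
    train_step=lambda e: e,
    validate=lambda w: abs(w - 10),
    patience=3,
)
print(weights, loss)  # 10 0
```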

## Adding Noise

The addition of noise has a regularization effect and, in turn, improves the robustness of the model. Noise (typically Gaussian) prevents the model from memorizing training samples. Generally, noise is added at the input layer, but it is not limited to that; it can also be added to other parts of the network:

Activations — useful in deep networks

Weights — useful with RNNs

Gradients — useful for deep fully connected networks

Note: Before adding noise, the relevant inputs should be scaled (e.g. standardized) so that a single noise magnitude fits all features.
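Adding input noise is a one-liner during training (and a no-op at test time); a minimal sketch:

```python
import numpy as np

def add_gaussian_noise(x, std=0.1, rng=None):
    """Add zero-mean Gaussian noise to inputs during training.
    Inputs should already be standardized so one std suits all features."""
    rng = rng or np.random.default_rng(0)
    return x + rng.normal(0.0, std, size=x.shape)

rng = np.random.default_rng(0)
x = np.zeros((1000, 10))
noisy = add_gaussian_noise(x, std=0.1, rng=rng)
print(round(noisy.std(), 2))  # close to 0.1
```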

## Batch Normalization

The distributions of hidden-layer activations change over the course of training. In batch normalization, we normalize each layer's inputs using the mean and standard deviation (or variance) of the values in the current mini-batch. Essentially, we apply normalization not only at the input but throughout the network. Because each mini-batch is scaled with its own statistics, this introduces some noise at each layer, providing a regularization effect.
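The training-time forward pass described above can be sketched in NumPy (omitting the running statistics that a full implementation keeps for inference):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift
    with learnable parameters gamma and beta (training-time forward pass)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # per-feature normalization
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=(64, 8))        # a mini-batch of activations
out = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(6))  # ~0 for every feature
```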

## L1 (Lasso) Regularization

L1 regularization is a good choice when the number of features is high: L1 yields a sparse solution, effectively removing irrelevant features, and the resulting sparse model is cheaper to evaluate.

## L2 (Ridge) Regularization

Codependence between features tends to increase coefficient variance, making the coefficients unreliable/unstable, which hurts model generality. The gender and is_pregnant pair of features is an example of codependent features. L2 reduces the variance of these estimates, which counteracts the effect of codependencies.
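Both penalties amount to adding a term to the loss; a small sketch of the penalty values and their gradients, which shows why L1 (constant-magnitude gradient) drives weights to exactly zero while L2 shrinks them smoothly:

```python
import numpy as np

def regularized_loss(w, data_loss, l1=0.0, l2=0.0):
    """Total loss = data loss + L1 and/or L2 penalties on the weights."""
    return data_loss + l1 * np.abs(w).sum() + l2 * np.square(w).sum()

def penalty_gradient(w, l1=0.0, l2=0.0):
    """Gradient of the penalty terms: L1 contributes l1 * sign(w) (pushes
    weights to exactly zero), L2 contributes 2 * l2 * w (smooth shrinkage)."""
    return l1 * np.sign(w) + 2 * l2 * w

w = np.array([0.5, -2.0, 0.0])
print(regularized_loss(w, data_loss=1.0, l1=0.1))  # 1.0 + 0.1 * 2.5 = 1.25
print(penalty_gradient(w, l2=0.1))                 # [0.1, -0.4, 0.0]
```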