Neural Network 08 — Regularization
Welcome to lesson 08 of my Neural Network and Deep Learning series 😀. If you missed any of my previous blogs or want to check them out, visit the following links.
- Prerequisites
- Logistic Regression is a solid base
- Neural Network Representation
- Activation functions
- Gradient descent for Neural Network
- Deep L-layer neural network
- Setting up your Machine Learning Application
If you are ready, let’s get started. 😎
If you have any experience working on a Machine Learning project, you may have run into situations where your algorithm performed very well on the training set, and you were really happy 🙋 about getting 95% or higher accuracy. But when you tested it on the test set / validation set, the performance was not good enough 🙎. We discussed a similar situation in our previous lesson under Bias and Variance: the model being too complex and overfitting. (On the other hand, our model can be overly simple as well. In that case we face a situation called underfitting, in which our model performs poorly on both the training and test sets.)
When it comes to solutions for overfitting, there are two main approaches:
- Regularization
- Collecting more data into our training set
But… getting more data can be a complex process, and in some situations it is impossible. So, regularization is a very viable option to try.
In this lesson we will discuss how regularization helps to simplify our model and thus reduce overfitting (decrease variance) in the context of Neural Networks and Deep Learning.
Let’s try to develop these ideas using Logistic Regression. In Logistic Regression, we try to minimize the cost function J.
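As a quick reference, written here in standard notation, the L2-regularized cost for Logistic Regression over m training examples is:

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2m}\,\lVert w \rVert_2^2$$

In practice this L2 version is used far more often than L1 regularization, which would add a term proportional to ‖w‖₁ instead and tends to make w sparse. The two norms are defined next.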
L1 Norm (Manhattan Norm / Taxicab Norm)
The L1 Norm of a vector is defined as the sum of absolute values of its components.
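In symbols, for a vector x with components x₁, …, xₙ:

$$\lVert x \rVert_1 = \sum_{i=1}^{n} |x_i|$$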
L2 Norm (Euclidean Norm)
The L2 Norm of a vector is defined as the square root of sum of squares of its components.
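In symbols:

$$\lVert x \rVert_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$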
Regularization for a Neural Network
Let’s add a regularization term to the cost function.
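In standard notation for an L-layer network with m training examples and loss function L, the regularized cost is:

$$J\big(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}\big) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2m}\sum_{l=1}^{L}\big\lVert W^{[l]} \big\rVert_F^2$$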
The Frobenius norm (essentially the Euclidean norm applied to all the entries of a matrix) is a way to measure the magnitude, or size, of a matrix in linear algebra.
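For a weight matrix W^{[l]} of shape (n^{[l]}, n^{[l−1]}), the squared Frobenius norm used above is simply the sum of the squares of all its entries:

$$\big\lVert W^{[l]} \big\rVert_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} \big(w_{ij}^{[l]}\big)^2$$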
Gradient descent with regularization
The L2 regularization term penalizes large weights by adding the squared sum of all the parameters (excluding the bias terms), multiplied by a regularization parameter, often denoted by λ.
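In equations (standard form; "from backprop" stands for the gradient computed before the regularization term is added), the update for each layer becomes:

$$dW^{[l]} = (\text{from backprop}) + \frac{\lambda}{m} W^{[l]}, \qquad W^{[l]} := W^{[l]} - \alpha\, dW^{[l]} = \Big(1 - \frac{\alpha\lambda}{m}\Big) W^{[l]} - \alpha\,(\text{from backprop})$$

Because each update multiplies W^{[l]} by a factor slightly smaller than 1, L2 regularization is also called weight decay.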
Why does regularization reduce overfitting?
Regularization in machine learning is the process of constraining, or shrinking, the coefficient estimates towards zero. In other words, this technique discourages learning an overly complex or flexible model, reducing the risk of overfitting.
For a Neural Network, a large regularization parameter shrinks W[l], reducing the impact of some of the hidden units and making the whole network simpler. This helps prevent overfitting.
Let’s take the tanh activation function. If λ is large, the weights W[l] become small, so z[l] = W[l]a[l−1] + b[l] stays close to zero, which is the region where tanh is roughly linear.
Therefore, every layer computes something close to a linear function, and the whole network behaves like a bunch of linear functions, which is not a complex model. This prevents overfitting.
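A minimal numeric check of this intuition (plain NumPy, nothing specific to this series): near zero, tanh(z) ≈ z, so a layer whose weights are small behaves almost linearly.

import numpy as np

z = np.array([-0.1, -0.01, 0.01, 0.1])  # small pre-activations (small weights => small z)
print(np.tanh(z))  # ≈ [-0.0997, -0.0100, 0.0100, 0.0997], i.e. tanh(z) ≈ z near zero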
Dropout regularization
Dropout regularization is another way of reducing the complexity of a neural network. Here, we go through each of the layers and set some probability of eliminating each node in the network.
Implementation of Dropout (Inverted Dropout)
Let’s consider layer 3 => l=3
import numpy as np  # a3 is the activation matrix of layer 3 from forward propagation

keep_prob = 0.8  # probability of keeping a node
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
# d3 is the dropout mask (note: np.random.rand, i.e. uniform on [0, 1), not randn)
# each entry of d3 has a 0.8 chance of being True and a 0.2 chance of being False
a3 = np.multiply(a3, d3)  # a3 *= d3
# zeroing out the values corresponding to False entries in d3
a3 /= keep_prob  # scaling up a3
# the previous multiplication scaled down the expected value of a3, so we divide by keep_prob to restore it
Making predictions at test time
It is important to note that we do not drop out nodes at test time. That’s why we scaled up a3 at the end of the implementation above: it keeps the expected value of a3 unchanged, so the test-time forward pass needs no extra scaling.
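A minimal sketch of the test-time forward step for layer 3, assuming a tanh activation and placeholder shapes for W3, b3 and a2 (these particular choices are mine, for illustration only): notice there is no dropout mask and no division by keep_prob.

import numpy as np

# placeholder parameters and previous-layer activations, for illustration only
W3, b3 = np.random.randn(5, 4), np.zeros((5, 1))
a2 = np.random.randn(4, 1)

z3 = np.dot(W3, a2) + b3  # test time: no d3 mask, no division by keep_prob
a3 = np.tanh(z3)          # tanh activation assumed here for illustration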
Understanding dropout
After dropping out some hidden units, the network becomes a smaller, simpler network. As we know, simpler networks are less prone to overfitting.
Let’s consider one layer of the network.
With dropout, the inputs to each unit in that layer can be randomly eliminated. Therefore a unit cannot rely on any specific feature, which means it cannot put too much weight on any single input, because that input could go away at any time.
So, this setup motivates each unit to spread out its weights and give a little bit of weight to each of its inputs, in this example all 4 of them (similar to shrinking the squared norm of the weights, as we saw with L2 regularization).
We can give variable keep-prob values for different layers as well.
keep-prob = 1 means there is no dropout.
For layers with a large number of hidden units, where overfitting is more of a concern, we can use a lower keep-prob value and thus eliminate more units from those layers. A small sketch of this idea follows.
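Here is a minimal sketch of per-layer keep-prob values in plain NumPy (the layer indices and keep-prob values below are made up for illustration):

import numpy as np

# hypothetical per-layer keep-prob values: lower keep-prob for the larger layers 2 and 3
keep_probs = {1: 1.0, 2: 0.7, 3: 0.8, 4: 1.0}  # 1.0 means no dropout in that layer

def apply_dropout(a, layer):
    # apply inverted dropout to the activations `a` of the given layer (training time only)
    kp = keep_probs[layer]
    if kp < 1.0:
        mask = np.random.rand(*a.shape) < kp
        a = a * mask / kp  # zero out dropped units and scale the rest back up
    return a

During forward propagation we would call a2 = apply_dropout(a2, 2) right after computing a2, and so on for each layer.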
In computer vision (CV), people use dropout frequently because the input contains a huge number of pixels and we almost never have enough data relative to that input size, so models tend to overfit; dropout helps to counter this.
Drawback of dropout
The cost function J is no longer well defined, because every iteration randomly drops out a different set of nodes. So it is hard to check that the cost is decreasing on every iteration.
One solution to this problem is to first run the algorithm without dropout (keep-prob = 1) and plot the cost function to make sure everything is working fine, and then turn dropout on.
Other regularization methods
Data Augmentation
If our model is overfitting, getting more data can help, but that is often a complex and expensive process. However, we can augment our existing training set to increase its size.
E.g. flipping, zooming, and rotating an image to create new, different images.
In this way we can at least double the size of the training set.
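A minimal NumPy sketch of two such augmentations, assuming the image is stored as a (height, width, channels) array (that layout, and the random placeholder image, are my assumptions for illustration):

import numpy as np

image = np.random.rand(64, 64, 3)            # placeholder image; in practice a real training image
flipped = image[:, ::-1, :]                  # horizontal flip
rotated = np.rot90(image, k=1, axes=(0, 1))  # rotate 90 degrees in the image plane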
Early stopping
When we train our algorithm for a larger number of iterations, we expect the errors to keep decreasing. But in practice the situation is different: the training error keeps going down, while the dev/validation set error decreases at first and then starts to rise once the model begins to overfit.
As the name suggests, in early stopping we stop training partway through the process, around the point where the dev set error is at its lowest.
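A minimal sketch of early stopping, using hypothetical helpers initialize_parameters(), train_one_epoch() and dev_error() (these names are mine, not from this series): keep training while the dev set error improves, and stop, keeping the best parameters, once it has stopped improving for a few epochs.

params = initialize_parameters()          # hypothetical helper
best_err, best_params = float("inf"), None
patience, bad_epochs = 5, 0               # stop after 5 epochs with no improvement

for epoch in range(100):                  # 100 is just an arbitrary cap for this sketch
    params = train_one_epoch(params)      # hypothetical helper: one pass of gradient descent
    err = dev_error(params)               # hypothetical helper: error on the dev/validation set
    if err < best_err:
        best_err, best_params, bad_epochs = err, params, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:        # dev error stopped improving: stop early
            break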
Alright! We are done with our lesson 08 🙌. Hope you enjoyed the lesson. See you in the next lesson.
Good Luck!!! Keep Learning!!!