Regularization in Deep Learning

Dharti Dhami
8 min read · Jan 4, 2019


When training a neural network we have to make a lot of decisions, such as how many layers to use, how many hidden units per layer, which learning rate to pick, and which activation functions to use.

Bias and Variance

By looking at the algorithm’s error on the training set (which indicates the bias) and on the dev set (which indicates the variance), we can decide what to try next to improve the algorithm. For example, 1% training error with 11% dev error suggests high variance, while 15% training error with 16% dev error suggests high bias (assuming human-level error is close to 0%).

How to address high bias?

  1. Try a bigger network, i.e., more hidden layers or more hidden units.
  2. Train it longer.
  3. Try more advanced optimization algorithms or a better NN architecture.

How to address high variance (aka overfitting)?

  1. Add more data
  2. Regularization
  3. Search for a better NN architecture

In the modern deep learning/big data era, we can keep training a bigger network to reduce the bias without affecting the variance, and we can keep adding more data to reduce the variance without affecting the bias. If we do both, we can drive both bias and variance down, as long as we make sure to regularize appropriately.

Regularization

Let’s look at a few regularization techniques and see how they help to reduce variance, aka overfitting.

  1. L2/L1 Regularization
  2. Dropout Regularization
  3. Other

L2 Regularization for Logistic Regression

To add regularization to logistic regression, we use lambda, the regularization parameter: we add lambda/2m times the squared norm of w to the cost function (this is L2 regularization).

We regularize just the parameter w and not b, because w is a high-dimensional parameter vector that holds almost all of the parameters, whereas b is just a single number, so leaving it out makes little difference. L1 regularization is also sometimes used instead of the L2 norm. If we use L1 regularization, then w will end up being sparse, meaning the w vector will have a lot of zeros in it; this can potentially help with compressing the model, because the zeroed parameters need less memory to store. In practice, though, that doesn’t help much.
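
As a rough NumPy sketch (the function and variable names here are illustrative, not from any particular library), the regularized cost for logistic regression could look like this:

```python
import numpy as np

def regularized_cost(a, y, w, lambd):
    """Cross-entropy cost for logistic regression with an L2 penalty on w.
    a: predictions, y: labels, w: weight vector, lambd: regularization parameter."""
    m = y.shape[0]
    cross_entropy = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))  # lambda/2m * ||w||^2
    # For L1 regularization we would instead add a term proportional to
    # np.sum(np.abs(w)), which tends to push w toward sparsity.
    return cross_entropy + l2_penalty
```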

Regularization for Neural network

In a neural network, the cost function is a function of all of the parameters, w[1], b[1] through w[L], b[L], where capital L is the number of layers in the network. The cost is one over m times the sum of the losses over the m training examples.

For regularization, we add lambda over 2m times the sum, over all of the weight matrices W[l], of their squared norms. The squared norm of a matrix here (the Frobenius norm) is the sum of the squares of all of its entries, i.e., a double sum with i from 1 through n[l-1] and j from 1 through n[l], because w[l] is an n[l-1] by n[l] dimensional matrix.
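
A minimal sketch of that penalty term, assuming the weights are stored in a dictionary like {"W1": ..., "b1": ..., "W2": ..., ...} (an illustrative convention, not something fixed by the article):

```python
import numpy as np

def frobenius_penalty(parameters, lambd, m):
    """lambda/2m times the sum of the squared Frobenius norms of W[1]..W[L]."""
    L = len(parameters) // 2               # one W and one b per layer
    total = 0.0
    for l in range(1, L + 1):
        W = parameters["W" + str(l)]
        total += np.sum(np.square(W))      # sum of squares of every entry of W[l]
    return (lambd / (2 * m)) * total
```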

So how do we implement gradient descent with this? Previously, we would compute dw[l] using backprop, where backprop gives us the partial derivative of J with respect to w[l], and then update w[l] as w[l] minus the learning rate times dw[l]. Now that we’ve added the regularization term to the objective, we take dw[l] and add lambda/m times w[l] to it, then do the same update.
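
As a sketch, the regularized update for a single layer might look like this (the names are illustrative):

```python
def update_with_l2(W, b, dW, db, lambd, m, learning_rate):
    """One gradient descent step for one layer with L2 regularization.
    dW and db are the gradients of the unregularized loss from backprop."""
    dW = dW + (lambd / m) * W        # add the derivative of the regularization term
    W = W - learning_rate * dW       # this shrinks W a little every step ("weight decay")
    b = b - learning_rate * db       # b is not regularized
    return W, b
```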

Why does L2 Regularization help with overfitting ?

Recall what high bias and high variance look like.

What we did for regularization was add an extra term to the cost function that penalizes the weight matrices for being too large. So why is it that shrinking the L2 norm of the parameters might cause less overfitting? One piece of intuition is that if we crank the regularization parameter lambda up to be really, really big, then to reduce the cost the network will set the weight matrices W reasonably close to zero for a lot of hidden units. That basically zeroes out much of the impact of those hidden units, so the much simplified neural network behaves like a much smaller one; in fact, it is almost like logistic regression, but stacked multiple layers deep. That takes us from the overfitting case much closer to the high-bias case. By choosing an intermediate value of lambda, we can tune the network/parameters to the “just right” case in between.

Dropout Regularization

With dropout, we go through each of the layers of the network and set some probability of eliminating each node in the neural network. After the coin tosses, we eliminate some nodes and remove all the links into and out of those nodes as well, so we end up with a much smaller network, and then we do back propagation training on it. For each pass through the network (and possibly for each example), we train with a different set of nodes.

Let’s look at how we implement dropout.

The most common dropout technique is called inverted dropout. Let’s illustrate it with the third layer of the network, l=3.

What we are going to do is set a matrix d3, the dropout matrix for layer 3, which records for each hidden unit (and each training example) whether that unit is kept; each unit is kept with probability keep_prob. Then we take our activations a3 from the third layer and do an element-wise multiplication with d3.

If we do this in Python, technically d3 will be a boolean array whose values are True and False rather than one and zero, but the multiply operation still works and will interpret the True and False values as one and zero.

Then, finally, we take a3 and scale it up by dividing by the keep_prob parameter. This is what’s called the inverted dropout technique. Its effect is that, no matter what we set keep_prob to, whether it’s 0.8 or 0.9 or even one, dividing by keep_prob ensures that the expected value of a3 remains the same.
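
Putting these steps together, here is a minimal NumPy sketch of inverted dropout for layer 3 (the activation values are just random stand-ins):

```python
import numpy as np

keep_prob = 0.8                               # illustrative value
a3 = np.random.randn(4, 5)                    # stand-in for layer-3 activations, shape (n3, m)

d3 = np.random.rand(*a3.shape) < keep_prob    # boolean dropout mask for layer 3
a3 = a3 * d3                                  # element-wise multiply: dropped units become 0
a3 = a3 / keep_prob                           # inverted step: keeps the expected value of a3 the same
```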

At test time, when we are evaluating the neural network, we don’t apply the dropout and scaling.

Why does dropout work as a regularizer?

We learned that dropout randomly knocks out units in the network, so it’s as if on every iteration we are working with a smaller neural network, which should have a regularizing effect.

Also, consider a single unit that takes several inputs and needs to produce some meaningful output. With dropout, any of its inputs can get randomly eliminated, so the unit can’t rely on any one feature, because that feature could go away at random. The unit is therefore motivated to spread out its weights and give a little bit of weight to each of its inputs. Spreading out the weights tends to shrink their squared norm, so dropout has a similar effect to L2 regularization.

One more implementation detail: we can choose a different keep_prob value for each layer. If a layer has a lot of parameters, we can use a lower keep_prob, whereas for layers with only a few parameters the chance of overfitting is low and we can keep all the units (keep_prob of one). The downside is that this gives us even more hyperparameters to search over using cross-validation. One alternative is to apply dropout only to some layers and use a single keep_prob hyperparameter for all the layers where dropout is applied.
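
For example, a small sketch of per-layer keep probabilities for a hypothetical 4-layer network (the numbers are illustrative):

```python
import numpy as np

# Lower keep_prob for the large middle layers, no dropout (1.0) elsewhere.
keep_probs = {1: 1.0, 2: 0.7, 3: 0.7, 4: 1.0}

def apply_dropout(a, layer):
    """Apply inverted dropout to the activations `a` of the given layer."""
    p = keep_probs[layer]
    if p >= 1.0:
        return a                               # keep every unit in this layer
    mask = np.random.rand(*a.shape) < p        # per-unit keep/drop decision
    return (a * mask) / p                      # rescale to preserve the expected value
```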

Many of the first successful applications of dropout were in computer vision, where the input size is very big and we often don’t have a lot of training examples.

The thing to remember is that dropout is a regularization technique, so unless the algorithm is overfitting, don’t bother using dropout.

Other Regularization techniques

  • Data Augmentation

If you are overfitting, getting more training data can help, but it can be expensive and sometimes you just can’t get more data. What you can do instead is augment your training set, for example by taking each image and flipping it horizontally, zooming in, or distorting it.
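
As a minimal sketch, one such augmentation (horizontal flipping) could be done like this, assuming a batch of images with shape (m, height, width, channels):

```python
import numpy as np

def add_horizontal_flips(images):
    """Double the training set by adding a horizontally flipped copy of each image."""
    flipped = images[:, :, ::-1, :]                    # reverse the width axis
    return np.concatenate([images, flipped], axis=0)   # originals + flipped copies
```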

  • Early stopping

In early stopping, we plot the training error (or the cost function) on the training set and on the dev set as training proceeds. We find the iteration around which the dev set was doing the best and take the parameters from that iteration. This works because we initially start with parameters w close to zero, and as the number of iterations increases the values in w usually grow; by stopping early we end up with a smaller w, similar to L2 regularization.
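
A rough sketch of the idea, assuming hypothetical helpers train_step and dev_error, an iteration budget num_iterations, and a parameters dictionary of NumPy arrays (none of these come from any particular library):

```python
best_dev_error = float("inf")
best_parameters = None

for iteration in range(num_iterations):
    parameters = train_step(parameters)       # one step of gradient descent
    err = dev_error(parameters)               # evaluate on the dev set
    if err < best_dev_error:                  # remember the best-so-far parameters
        best_dev_error = err
        best_parameters = {k: v.copy() for k, v in parameters.items()}
```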

There is a downside to early stopping, though. Machine learning comprises several different steps:

  1. An algorithm to optimize the cost function J (e.g., gradient descent, Adam, etc.).
  2. Something to prevent overfitting (e.g., getting more data, regularization).

It’s already very complicated to search over the space of possible algorithms and hyperparameters, so it helps to focus on one of these tasks at a time: one set of tools for optimizing the cost function J and a separate set for preventing overfitting. This principle is called orthogonalization.

The main downside of early stopping is that it couples these two tasks: by stopping gradient descent early, we interrupt the optimization of the cost function J while simultaneously trying not to overfit. One alternative is to use L2 regularization instead and train the neural network as long as possible (the downside being that we have to try many values of the regularization parameter lambda, which is computationally expensive). The advantage of early stopping is that by running the gradient descent process just once, we get to try out small, mid-size, and large values of w, without needing to try many values of the L2 regularization hyperparameter lambda.
