ReLU in DNN?

In most situations, we see that the ReLU activation function is widely used in CNNs. However, can we also adopt ReLU in an ordinary DNN?

The example is simple. First, we create the training data: 4 samples with 2 features each, so the input is a rank-2 array of shape (4, 2), and the target is one value per sample, also stored as a rank-2 array of shape (4, 1). We want to use a DNN to solve this regression problem.
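A minimal sketch of that setup might look like this (the exact values below are placeholders rather than the data actually used in this post; only the shapes and the rough scale of the targets match the description here):

import numpy as np

# 4 training samples with 2 features each -> input of shape (4, 2).
x = np.arange(8, dtype=np.float32).reshape(4, 2)

# One regression target per sample, stored as a rank-2 array of shape (4, 1).
# The raw targets are on the order of 1000 before any normalization.
y = 1000.0 * np.random.rand(4, 1).astype(np.float32)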

Next, we build the DNN with the tensorlayer package. To begin with, we use a linear (identity) activation function to train the model.
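A minimal sketch of how such a network might be built with the TensorLayer 1.x API (the layer names and sizes follow the training log shown further down; everything else is an assumption):

import tensorflow as tf
import tensorlayer as tl

# Placeholders for the 2-feature inputs and the 1-dimensional regression target.
x_ph = tf.placeholder(tf.float32, shape=[None, 2], name='x')
y_ph = tf.placeholder(tf.float32, shape=[None, 1], name='y')

# Three dense layers; act=tf.identity gives the linear activation used at first.
network = tl.layers.InputLayer(x_ph, name='input_layer')
network = tl.layers.DenseLayer(network, n_units=10, act=tf.identity, name='fc1')
network = tl.layers.DenseLayer(network, n_units=10, act=tf.identity, name='fc2')
network = tl.layers.DenseLayer(network, n_units=1, act=tf.identity, name='fc3')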

From our observation, it is not appropriate to train the model on this data directly. As we can see, the first row has relatively small feature values while the last row has large ones, and on top of that the target values are close to 1000.

x: [[0 1]
 [2 3]
 [4 5]
 [6 7]]
y: [[ 1.66666667]
 [ 3.66666667]
 [ 1.        ]
 [ 1.66666667]]

After preparing the data, Professor Hsuan-Tien Lin's concept comes to mind. In his treatment of regularization, the model has a scaling-invariance property: for example, the model y = 3x + 6 is equivalent to the model y = x + 2 up to a constant factor, so scaling does not change the underlying data distribution. What's more, it can reduce the number of possible hypotheses for the model. As a result, we normalize the data before training. The listing above shows the values after normalization.
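The post does not show the exact normalization step; one simple possibility in the same spirit is to divide the targets (and, if needed, the features) by a constant scale factor, for example:

# Scale the targets down by a constant so they fall in a small range.
# The divisor here is a guess, not the value actually used in this post;
# predictions can be mapped back to the original scale via y_pred * y_scale.
y_scale = 600.0
y_norm = y / y_scale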

In the last part, we train the model for 5000 epochs with a batch size of 1, and print the output at the end.
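Continuing the sketch above, the training loop might look roughly like this (the mean-squared-error cost, the optimizer, and the learning rate are assumptions; only the 5000 epochs, the batch size of 1, and the 500-epoch print interval come from the post):

# Mean-squared-error cost between the network output and the target.
y_out = network.outputs
cost = tf.reduce_mean(tf.square(y_out - y_ph))

# The optimizer is not specified in the post; plain gradient descent is assumed.
train_op = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(5000):
        # Batch size 1: feed one sample at a time.
        for i in range(len(x)):
            sess.run(train_op, feed_dict={x_ph: x[i:i+1], y_ph: y_norm[i:i+1]})
        if epoch % 500 == 0:
            print('epoch:', epoch, 'cost:',
                  sess.run(cost, feed_dict={x_ph: x, y_ph: y_norm}))
    print('result:', sess.run(y_out, feed_dict={x_ph: x}))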

[TL] InputLayer  input_layer: (?, 2)
[TL] DenseLayer fc1: 10 identity
[TL] DenseLayer fc2: 10 identity
[TL] DenseLayer fc3: 1 identity
epoch: 0 cost: 19.5903218985
epoch: 500 cost: 3.65743308887
epoch: 1000 cost: 3.6572703938
epoch: 1500 cost: 3.65721264528
epoch: 2000 cost: 3.65715937363
epoch: 2500 cost: 3.6571074049
epoch: 3000 cost: 3.65705506224
epoch: 3500 cost: 3.65700371703
epoch: 4000 cost: 3.65695027867
epoch: 4500 cost: 3.65689659305
result: [[ 2.39973235]
[ 2.13260746]
[ 1.86548269]
[ 1.59835756]]

The above is the terminal output after running this code. We can see that the model has a hard time converging. By default, tensorlayer's DenseLayer uses the linear (identity) activation. What if we change it to ReLU? Does the result get better?
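Switching the activation only requires changing the act argument on each dense layer, along the lines of:

# Same structure as before, but with ReLU on every dense layer.
network = tl.layers.InputLayer(x_ph, name='input_layer')
network = tl.layers.DenseLayer(network, n_units=10, act=tf.nn.relu, name='fc1')
network = tl.layers.DenseLayer(network, n_units=10, act=tf.nn.relu, name='fc2')
network = tl.layers.DenseLayer(network, n_units=1, act=tf.nn.relu, name='fc3')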

[TL] InputLayer  input_layer: (?, 2)
[TL] DenseLayer fc1: 10 relu
[TL] DenseLayer fc2: 10 relu
[TL] DenseLayer fc3: 1 relu
epoch: 0 cost: 19.7820121646
epoch: 500 cost: 2.86656964204
epoch: 1000 cost: 0.442266186932
epoch: 1500 cost: 0.183394053191
epoch: 2000 cost: 0.0181211773674
epoch: 2500 cost: 2.97475499167e-06
epoch: 3000 cost: 4.26751967098e-11
epoch: 3500 cost: 0.0
epoch: 4000 cost: 0.0
epoch: 4500 cost: 0.0
result: [[ 1.66666663]
[ 3.66666675]
[ 1.00000012]
[ 1.66666663]]

Yes! We get better performance. In the output above, we can see that the result is very close to the target values we showed previously.

[TL] InputLayer  input_layer: (?, 2)
[TL] DenseLayer fc1: 10 relu
[TL] DenseLayer fc2: 10 relu
[TL] DenseLayer fc3: 1 relu
epoch: 0 cost: 20.0
epoch: 500 cost: 20.0
epoch: 1000 cost: 20.0
epoch: 1500 cost: 20.0
epoch: 2000 cost: 20.0
epoch: 2500 cost: 20.0
epoch: 3000 cost: 20.0
epoch: 3500 cost: 20.0
epoch: 4000 cost: 20.0
epoch: 4500 cost: 20.0
result: [[ 0.]
[ 0.]
[ 0.]
[ 0.]]

However, the run above also happens sometimes. As we can see, the cost gets stuck at 20.0 and never changes, while every output is zero. What is going wrong?

The above image shows the ReLU activation function. The left panel is the original function: the interval above 0 is linear, and everything below is exactly zero. For our discussion, we need to know the derivative of the function in order to compute the gradient.

But what is the derivative of the ReLU function? It might not be an obvious concept. A simple way to think about it: the derivative is the slope! Looking at the left panel, the slope is 1 on the positive side and zero on the negative side, which is exactly what the right panel plots.
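As a quick sanity check, TensorFlow's automatic differentiation gives exactly these slopes (a tiny standalone snippet, TF 1.x style):

import tensorflow as tf

# Gradient of ReLU at a negative and a positive input.
t = tf.constant([-2.0, 3.0])
g = tf.gradients(tf.nn.relu(t), t)[0]

with tf.Session() as sess:
    print(sess.run(g))  # prints [0. 1.]: zero slope below 0, slope 1 above 0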

I do not know the exact reason why the gradient sometimes stays at zero, but I suspect it is a problem with the outputs. If the pre-activation value is negative, the output is zero, and the gradient on the negative side is also zero, so the model cannot improve. (This is often called the dying ReLU problem.)

So far, I have not thought of the best solution to this problem. What's more, the model might converge to a local minimum that does not predict very well. Maybe adopting a revised ReLU function is a better idea (for example, PReLU).
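PReLU learns the slope of the negative side; as a simpler hand-rolled sketch of the same idea (continuing the earlier snippets), a leaky ReLU with a small fixed negative-side slope can be passed as a custom activation. The slope value here is arbitrary:

# Leaky ReLU: a small fixed slope keeps the gradient from dying on the negative side.
# (PReLU would make this slope a learnable parameter instead.)
def leaky_relu(x, alpha=0.05):
    return tf.maximum(alpha * x, x)

network = tl.layers.InputLayer(x_ph, name='input_layer')
network = tl.layers.DenseLayer(network, n_units=10, act=leaky_relu, name='fc1')
network = tl.layers.DenseLayer(network, n_units=10, act=leaky_relu, name='fc2')
network = tl.layers.DenseLayer(network, n_units=1, act=leaky_relu, name='fc3')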
