PReLU activation
The paper Delving Deep into Rectifiers (He et al., 2015) introduced both the PReLU activation and Kaiming initialization. We will discuss PReLU in this post and Kaiming initialization in the next.
Parametric ReLU (PReLU)
ReLU has been one of the keys to the recent successes in deep learning. Its use has led to better solutions than sigmoid, partly because sigmoid activations suffer from the vanishing gradient problem. But we can still improve on ReLU. Leaky ReLU was introduced, which does not zero out negative inputs the way ReLU does; instead, it multiplies negative inputs by a small fixed slope (such as 0.01) and keeps positive inputs as they are. In practice, however, this gives only a negligible improvement in accuracy.
What if we could learn that small slope during training, so that the activation function adapts along with the other parameters (the weights and biases)? This is where PReLU comes in: the slope is learned through backpropagation at a negligible increase in training cost.
In fully-connected (feed-forward) layers, each layer learns a single slope parameter. In CNNs we can learn one slope per layer (channel-shared) or one slope per channel in each layer (channel-wise). The number of slope parameters to learn is therefore the number of layers, or the total number of channels across all layers, which is negligible compared to the number of weights and biases. A minimal sketch of both variants is shown below.
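If you use PyTorch, both variants are already available as torch.nn.PReLU. Here is a minimal sketch; the channel count and tensor shapes are purely illustrative:

```python
import torch
import torch.nn as nn

# Channel-shared PReLU: one learnable slope for the whole layer.
# init=0.25 is the initial slope used in the paper.
shared = nn.PReLU(num_parameters=1, init=0.25)

# Channel-wise PReLU: one learnable slope per channel (64 channels here).
channelwise = nn.PReLU(num_parameters=64, init=0.25)

x = torch.randn(8, 64, 32, 32)  # (batch, channels, height, width)
out_shared, out_channelwise = shared(x), channelwise(x)

# Only 1 + 64 = 65 extra learnable parameters for these two activations,
# negligible next to the weights of the surrounding layers.
print(sum(p.numel() for p in shared.parameters()),       # 1
      sum(p.numel() for p in channelwise.parameters()))  # 64
```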
PReLU is defined as

f(yᵢ) = yᵢ if yᵢ > 0, and f(yᵢ) = aᵢyᵢ if yᵢ ≤ 0

where yᵢ is any input on the ith channel and aᵢ is the negative slope, which is a learnable parameter.
- if aᵢ=0, f becomes ReLU
- if aᵢ is a small fixed value (e.g. 0.01), f becomes leaky ReLU
- if aᵢ is a learnable parameter, f becomes PReLU
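As a quick worked example (the numbers here are only illustrative): for a negative input yᵢ = -2, ReLU outputs 0, leaky ReLU with aᵢ = 0.01 outputs -0.02, and a PReLU whose learned slope happens to be aᵢ = 0.25 outputs -0.5. Positive inputs pass through unchanged in all three cases.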
The formula above can also be written more compactly as

f(yᵢ) = max(0, yᵢ) + aᵢ·min(0, yᵢ)
For backpropagation, the gradient of the activation with respect to the slope aᵢ is

∂f(yᵢ)/∂aᵢ = 0 if yᵢ > 0, and yᵢ if yᵢ ≤ 0

while the gradient with respect to the input is 1 for yᵢ > 0 and aᵢ otherwise.
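Here is a minimal NumPy sketch of these formulas (the function names and the finite-difference check are my own, not from the paper):

```python
import numpy as np

def prelu_forward(y, a):
    """PReLU: y if y > 0, else a * y (a is the learnable slope)."""
    return np.where(y > 0, y, a * y)

def prelu_backward(y, a, grad_out):
    """Gradients of the loss w.r.t. the input y and the slope a."""
    grad_y = grad_out * np.where(y > 0, 1.0, a)           # df/dy = 1 if y > 0 else a
    grad_a = np.sum(grad_out * np.where(y > 0, 0.0, y))   # df/da = 0 if y > 0 else y
    return grad_y, grad_a

# Quick check of the slope gradient against finite differences.
y = np.array([-2.0, -0.5, 1.0, 3.0])
a, eps = 0.25, 1e-6
grad_out = np.ones_like(y)            # pretend dL/df = 1 everywhere
_, grad_a = prelu_backward(y, a, grad_out)
numeric = (prelu_forward(y, a + eps).sum() - prelu_forward(y, a - eps).sum()) / (2 * eps)
print(grad_a, numeric)                # both approximately -2.5
```

Note that the slope gradient sums over all positions that share the same aᵢ, which is exactly how a channel-shared (or channel-wise) slope accumulates its update.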
I hope this post helped you understand PReLU better. If you spot an error or something is still unclear, let me know in the comments.
References
- He et al., Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (https://arxiv.org/abs/1502.01852)
- Formulas are taken from the above paper.