PReLU activation
The paper Delving Deep into Rectifiers (He et al., 2015) introduced both the PReLU activation and Kaiming initialization. We will discuss PReLU in this post and Kaiming initialization in the next.
Parametric ReLU (PReLU)
ReLU has been one of the keys to the recent successes in deep learning. Its use has led to better solutions than sigmoid, partly because sigmoid activations suffer from the vanishing gradient problem. But we can still improve on ReLU. Leaky ReLU was introduced, which does not zero out negative inputs the way ReLU does; instead, it multiplies negative inputs by a small fixed slope (such as 0.01) and keeps positive inputs as they are. In practice, however, this gives only a negligible improvement in accuracy.
What if we could learn that small slope during training, so that the activation function adapts along with the other parameters (the weights and biases)? This is where PReLU comes in: the slope is learned through backpropagation at a negligible increase in training cost.
In fully-connected (feed-forward) layers, each layer learns a single slope parameter. In CNNs we can learn one slope per layer (channel-shared) or one slope per channel in each layer (channel-wise). The number of slope parameters to learn is therefore the number of layers, or the total number of channels across all layers, which is negligible compared to the number of weights and biases. A minimal sketch of both variants is shown below.
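If you use PyTorch, both variants are already available as torch.nn.PReLU. Here is a minimal sketch; the channel count and tensor shapes are purely illustrative:

```python
import torch
import torch.nn as nn

# Channel-shared PReLU: one learnable slope for the whole layer.
# init=0.25 is the initial slope used in the paper.
shared = nn.PReLU(num_parameters=1, init=0.25)

# Channel-wise PReLU: one learnable slope per channel (64 channels here).
channelwise = nn.PReLU(num_parameters=64, init=0.25)

x = torch.randn(8, 64, 32, 32)  # (batch, channels, height, width)
out_shared, out_channelwise = shared(x), channelwise(x)

# Only 1 + 64 = 65 extra learnable parameters for these two activations,
# negligible next to the weights of the surrounding layers.
print(sum(p.numel() for p in shared.parameters()),       # 1
      sum(p.numel() for p in channelwise.parameters()))  # 64
```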
PReLU is defined as

f(yᵢ) = yᵢ if yᵢ > 0, and f(yᵢ) = aᵢyᵢ if yᵢ ≤ 0

where yᵢ is any input on the ith channel and aᵢ is the negative slope, which is a learnable parameter.
- if aᵢ=0, f becomes ReLU
- if aᵢ is a small fixed value (e.g. 0.01), f becomes leaky ReLU
- if aᵢ is a learnable parameter, f becomes PReLU
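As a quick worked example (the numbers here are only illustrative): for a negative input yᵢ = -2, ReLU outputs 0, leaky ReLU with aᵢ = 0.01 outputs -0.02, and a PReLU whose learned slope happens to be aᵢ = 0.25 outputs -0.5. Positive inputs pass through unchanged in all three cases.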
The formula above can also be written more compactly as

f(yᵢ) = max(0, yᵢ) + aᵢ·min(0, yᵢ)
For backpropagation, the gradient of the activation with respect to the slope aᵢ is

∂f(yᵢ)/∂aᵢ = 0 if yᵢ > 0, and yᵢ if yᵢ ≤ 0

while the gradient with respect to the input is 1 for yᵢ > 0 and aᵢ otherwise.
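Here is a minimal NumPy sketch of these formulas (the function names and the finite-difference check are my own, not from the paper):

```python
import numpy as np

def prelu_forward(y, a):
    """PReLU: y if y > 0, else a * y (a is the learnable slope)."""
    return np.where(y > 0, y, a * y)

def prelu_backward(y, a, grad_out):
    """Gradients of the loss w.r.t. the input y and the slope a."""
    grad_y = grad_out * np.where(y > 0, 1.0, a)           # df/dy = 1 if y > 0 else a
    grad_a = np.sum(grad_out * np.where(y > 0, 0.0, y))   # df/da = 0 if y > 0 else y
    return grad_y, grad_a

# Quick check of the slope gradient against finite differences.
y = np.array([-2.0, -0.5, 1.0, 3.0])
a, eps = 0.25, 1e-6
grad_out = np.ones_like(y)            # pretend dL/df = 1 everywhere
_, grad_a = prelu_backward(y, a, grad_out)
numeric = (prelu_forward(y, a + eps).sum() - prelu_forward(y, a - eps).sum()) / (2 * eps)
print(grad_a, numeric)                # both approximately -2.5
```

Note that the slope gradient sums over all positions that share the same aᵢ, which is exactly how a channel-shared (or channel-wise) slope accumulates its update.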
I hope this post helped you understand PReLU better. If you spot an error or something is still unclear, let me know in the comments.
References
- He et al., Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (https://arxiv.org/abs/1502.01852)
- Formulas are taken from the above paper.