Review: PReLU-Net — The First to Surpass Human-Level Performance in ILSVRC 2015 (Image Classification)

Sik-Ho Tsang · Coinmonks · 6 min read · Sep 3, 2018

In this story, PReLU-Net [1] is reviewed. The Parametric Rectified Linear Unit (PReLU) is proposed to generalize the traditional rectified linear unit (ReLU). This is the first deep learning approach to surpass human-level performance in ILSVRC (ImageNet Large Scale Visual Recognition Challenge) image classification. In addition, a better weight initialization for rectifiers is proposed, which helps the convergence of very deep models (e.g., 30 layers) trained directly from scratch.

Finally, PReLU-Net obtains a 4.94% top-5 error rate on the test set, which is better than the human-level performance of 5.1% and GoogLeNet's 6.66%!!!

This is a 2015 ICCV paper with about 3000 citations at the moment I am writing this story. (Sik-Ho Tsang @ Medium)

Dataset

Classification: Over 15 million labeled high-resolution images in around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1,000 images in each of 1,000 categories. In all, roughly 1.3M/50k/100k images are used for the training/validation/testing sets.

What are covered

  1. Parametric Rectified Linear Unit (PReLU)
  2. A better weight initialization for rectifiers
  3. 22-layer deep learning models
  4. Comparison with state-of-the-art approaches
  5. Object detection by using Fast R-CNN

1. Parametric Rectified Linear Unit (PReLU)

In AlexNet [2], ReLU, f(y) = max(0, y), is used as the activation function: only positive values pass through, while all negative values are set to zero. ReLU outperforms tanh with much faster training because, unlike tanh (which saturates at ±1), it does not saturate.

PReLU

PReLU suggests that negative values should not simply be zeroed out but scaled by a slope coefficient, and that this coefficient should be parametric (learnable): f(y_i) = y_i if y_i > 0, and f(y_i) = a_i·y_i otherwise, which is equivalent to f(y_i) = max(0, y_i) + a_i·min(0, y_i).

It is noted that when a = 0, it is ReLU.
When a = 0.01, it is Leaky ReLU.
Now the value of a can be learned, so PReLU becomes a generalized ReLU.
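To make this concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code) of the channel-wise activation; setting the slope a to 0 or 0.01 recovers ReLU and Leaky ReLU, and the paper initializes the learnable a at 0.25:

```python
import numpy as np

def prelu(y, a):
    """Channel-wise PReLU: f(y_i) = max(0, y_i) + a_i * min(0, y_i).
    y: feature maps of shape (N, C, H, W); a: one slope per channel, shape (C,)."""
    a = a.reshape(1, -1, 1, 1)                 # broadcast one slope per channel
    return np.maximum(y, 0) + a * np.minimum(y, 0)

y = np.random.randn(2, 3, 4, 4)
relu_out  = prelu(y, np.zeros(3))              # a = 0    -> plain ReLU
leaky_out = prelu(y, np.full(3, 0.01))         # a = 0.01 -> Leaky ReLU
prelu_out = prelu(y, np.full(3, 0.25))         # learnable a, initialized at 0.25 in the paper
```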

During backpropagation, the gradient of a can be computed by the chain rule:

Backpropagation, gradient from deep layer (Left), gradient of the activation (Right)

The gradient from the deeper layer (left) is multiplied by the gradient of the activation (right), and the result is summed over all positions of the feature map for the channel-wise variant. For the channel-shared variant, it is further summed over all channels of the layer. No weight decay is applied to a.
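Continuing the sketch above (again just an illustration, assuming (N, C, H, W) feature maps), the slope's gradient is the upstream gradient times y on the negative side, summed over the appropriate axes:

```python
import numpy as np

def prelu_grad_a(y, grad_out, channel_shared=False):
    """Gradient of the loss w.r.t. the slope a.
    df(y)/da is 0 for y > 0 and y otherwise; multiply by the gradient from the deeper layer."""
    local = np.where(y > 0, 0.0, y) * grad_out
    if channel_shared:
        return local.sum()                     # one scalar slope shared by the whole layer
    return local.sum(axis=(0, 2, 3))           # one slope per channel: sum over batch and spatial positions
```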

The average value of a over all channels for each layer

Two interesting phenomena are observed:

  1. The first conv layer (conv1) has coefficients (0.681 and 0.596) significantly greater than 0. As the filters of conv1 are mostly Gabor-like filters such as edge or texture detectors, the learned result shows that both positive and negative responses of the filters are respected.
  2. For the channel-wise version, the deeper conv layers in general have smaller coefficients, i.e., the activations gradually become “more nonlinear” at increasing depths. In other words, the learned model tends to keep more information in earlier stages and becomes more discriminative in deeper stages.

2. A better weight initialization for rectifiers

A good weight initialization is essential so that the network does not reduce or magnify the input signals exponentially. Weights are initialized from a zero-mean Gaussian distribution whose variance depends on the scheme, such as Xavier initialization, which is derived for linear activations. By taking the rectifier nonlinearity and the layer sizes into account, a better weight initialization is suggested.

With L layers put together, the variance of the response of the last layer is

Var[y_L] = Var[y_1] · ∏_{l=2}^{L} (1/2) n_l Var[w_l]

If the sufficient condition (1/2) n_l Var[w_l] = 1 for all l is met, this product neither vanishes nor explodes, so the network stays stable. Thus, the weight variance should be 2/n_l, where n_l = k_l² · c_l is the number of connections of a response in the l-th layer (filter size k_l, c_l input channels), i.e., the weights are drawn from a zero-mean Gaussian with standard deviation √(2/n_l).

(There is also a proof for the backward propagation case. It is interesting that it also arrives at the same sufficient condition, but I will not show it here.)
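This scheme is now commonly known as He (Kaiming) initialization. A minimal sketch for a conv layer, assuming n_l is taken as the fan-in k²·c_in as in the forward-propagation derivation:

```python
import numpy as np

def he_init_conv(c_out, c_in, k):
    """Zero-mean Gaussian with variance 2 / n_l, where n_l = k * k * c_in is the
    number of connections (fan-in) of a response in this layer; biases start at 0."""
    std = np.sqrt(2.0 / (k * k * c_in))
    weights = np.random.normal(0.0, std, size=(c_out, c_in, k, k))
    biases = np.zeros(c_out)
    return weights, biases

w1, b1 = he_init_conv(c_out=64, c_in=3, k=7)   # e.g. a 7x7 conv on an RGB input
```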

Red (Ours) and Blue (Xavier), 22-layer (Left) and 30-layer (Right)

As shown above, the suggested weight initialization converges faster, and Xavier initialization even fails to converge for the deeper 30-layer model on the right when training from scratch.

3. 22-layer deep learning models

PReLU-Net: Model A, B, C

SPP layer: 4-level SPPNet [3–4] {7×7, 3×3, 2×2, 1×1}
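For reference, a rough NumPy sketch of such a 4-level pyramid, just to illustrate the idea; the exact bin arithmetic in SPPNet [3–4] may differ slightly. Each channel contributes 7² + 3² + 2² + 1² = 63 pooled values:

```python
import math
import numpy as np

def spp_4level(x, levels=(7, 3, 2, 1)):
    """Spatial pyramid pooling on one feature map x of shape (C, H, W):
    max-pool into fixed 7x7, 3x3, 2x2 and 1x1 grids and concatenate the bins."""
    C, H, W = x.shape
    feats = []
    for s in levels:                           # one pyramid level = an s x s grid of bins
        for i in range(s):
            for j in range(s):
                h0, h1 = (i * H) // s, math.ceil((i + 1) * H / s)
                w0, w1 = (j * W) // s, math.ceil((j + 1) * W / s)
                feats.append(x[:, h0:h1, w0:w1].max(axis=(1, 2)))
    return np.concatenate(feats)               # length C * (49 + 9 + 4 + 1) = C * 63

print(spp_4level(np.random.randn(256, 13, 13)).shape)   # (16128,)
```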

Model A: A model with better results than VGG-19 [5]

Model B: A deeper model than Model A

Model C: A wider model (more filters) than Model B

Model A using PReLU is better than the one using ReLU
Model A: PReLU converges faster

4. Comparison with state-of-the-art approaches

Single model, 10-view

Using just a single model and 10-view testing, Model C achieves a 7.38% error rate.
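10-view testing here presumably follows the AlexNet-style protocol: the four corner crops and the center crop, plus their horizontal flips, with the ten predictions averaged. A rough sketch, where `predict` is a hypothetical stand-in for the trained network's forward pass:

```python
import numpy as np

def ten_view_predict(image, predict, crop=224):
    """10-view testing: 4 corner crops + center crop of `image` (shape (C, H, W)),
    plus their horizontal flips, passed through `predict` and averaged."""
    _, H, W = image.shape
    offsets = [(0, 0), (0, W - crop), (H - crop, 0), (H - crop, W - crop),
               ((H - crop) // 2, (W - crop) // 2)]       # 4 corners + center
    views = []
    for top, left in offsets:
        patch = image[:, top:top + crop, left:left + crop]
        views.append(patch)
        views.append(patch[:, :, ::-1])                  # horizontal flip
    probs = predict(np.stack(views))                     # (10, C, crop, crop) -> (10, num_classes)
    return probs.mean(axis=0)                            # average the 10 predictions
```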

Single model, Multi-view, Multi-scale

With multi-view and multi-scale testing, Model C achieves a 5.71% error rate. This single-model result is already better than even the multi-model results of SPPNet [3–4], VGGNet [5] and GoogLeNet [6].

Multi-model, Multi-view, Multi-scale

With multiple models, i.e. an ensemble of 6 PReLU-Net models, a 4.94% error rate is obtained.
This is a 26% relative improvement over GoogLeNet!!!

5. Object detection by using Fast R-CNN

PReLU-Net uses the Fast R-CNN [7] implementation for object detection on the PASCAL VOC 2007 dataset.

Model C + PReLU-Net has the best mAP result

With an ImageNet-pretrained model and fine-tuning on the VOC 2007 dataset, Model C obtains better results than VGG-16.

As training a deep learning network takes a large amount of time, and also for the sake of fair comparison and ablation study, many of the techniques here build on plenty of prior art. If interested, please visit my reviews (links at the bottom) for other networks such as AlexNet, VGGNet, SPPNet and GoogLeNet. :)
