Review: PReLU-Net — The First to Surpass Human-Level Performance in ILSVRC 2015 (Image Classification)
In this story, PReLU-Net [1] is reviewed. The Parametric Rectified Linear Unit (PReLU) is proposed to generalize the traditional rectified linear unit (ReLU). This was the first deep learning approach to surpass human-level performance in ILSVRC (ImageNet Large Scale Visual Recognition Challenge) image classification. In addition, a better weight initialization for rectifiers is proposed, which helps very deep models (30 layers) trained directly from scratch to converge.
Finally, PReLU-Net obtains a 4.94% top-5 error rate on the test set, which is better than the human-level performance of 5.1% and GoogLeNet's 6.66%!!!
This is a 2015 ICCV paper with about 3000 citations at the moment I am writing this story. (Sik-Ho Tsang @ Medium)
Dataset
Classification: ImageNet contains over 15 million labeled high-resolution images in around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1000 images in each of 1000 categories. In all, roughly 1.3M/50k/100k images are used for the training/validation/testing sets.
What is covered
- Parametric Rectified Linear Unit (PReLU)
- A better weight initialization for rectifiers
- 22-layer deep learning models
- Comparison with state-of-the-art approaches
- Object detection by using Fast R-CNN
1. Parametric Rectified Linear Unit (PReLU)
In AlexNet [2], ReLU is suggested as the activation function: only positive values pass through, while all negative values are set to zero. ReLU outperforms tanh with much faster training speed because, unlike tanh (which saturates at ±1), it does not saturate for positive inputs.
In PReLU, instead of zeroing out negative values, a small slope a is applied to them, and this slope is made a learnable parameter.
It is noted that when a = 0, PReLU reduces to ReLU, and when a = 0.01, it becomes Leaky ReLU. Here the value of a is learned during training, so PReLU is a generalization of ReLU.
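As a minimal NumPy sketch (my own illustration, not the paper's code), the activation can be written as:

```python
import numpy as np

def prelu(y, a):
    """Parametric ReLU: f(y) = y if y > 0, else a * y.

    `a` may be a scalar (channel-shared) or broadcast per channel
    (channel-wise); in the paper it is learned during training.
    """
    return np.where(y > 0, y, a * y)

x = np.array([-2.0, -0.5, 1.0, 3.0])
relu_out  = prelu(x, 0.0)    # a = 0    -> plain ReLU
leaky_out = prelu(x, 0.01)   # a = 0.01 -> Leaky ReLU
prelu_out = prelu(x, 0.25)   # a learned, e.g. 0.25 here
```

The same function covers all three variants, which is exactly why PReLU is called a generalized ReLU.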
During backpropagation, we can estimate the gradient:
The gradient of a is the gradient propagated from the deeper layer (left) multiplied by the gradient of the activation with respect to a (right), summed over all positions of the feature map for the channel-wise variant. For the channel-shared variant, the sum also runs over all channels of the layer. No weight decay is applied to a, since it would push a toward zero and bias PReLU toward ReLU.
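To make this concrete, here is a hedged NumPy sketch of the gradient of a; the function name, the (C, H, W) layout, and the toy values are my own assumptions, not from the paper:

```python
import numpy as np

def prelu_grad_a(y, grad_out, channel_shared=False):
    """Gradient of the loss w.r.t. the PReLU slope a.

    y:        pre-activations of one layer, shape (C, H, W)
    grad_out: upstream gradient dE/df(y), same shape
    Since df(y)/da = y for y <= 0 and 0 for y > 0, the channel-wise
    gradient sums grad_out * df/da over all spatial positions; the
    channel-shared variant additionally sums over the channels.
    """
    local = np.where(y > 0, 0.0, y)            # df(y)/da
    g = (grad_out * local).sum(axis=(1, 2))    # sum over H, W per channel
    return g.sum() if channel_shared else g
```

As noted above, when a is updated with this gradient, no weight decay term would be added.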
Two interesting phenomena observed:
- First, the first conv layer (conv1) has coefficients (0.681 and 0.596) significantly greater than 0. Since the filters of conv1 are mostly Gabor-like filters such as edge or texture detectors, the learned results show that both the positive and negative responses of the filters are respected.
- Second, for the channel-wise version, the deeper conv layers in general have smaller coefficients: the activations gradually become “more nonlinear” at increasing depths. In other words, the learned model tends to keep more information in earlier stages and becomes more discriminative in deeper stages.
2. A better weight initialization for rectifiers
A good weight initialization is essential so that the network neither reduces nor magnifies the input signals exponentially. Weights are initialized from a Gaussian distribution with mean 0, but the variance differs among schemes such as Xavier initialization. By considering the layer sizes at the input and output, a better weight initialization for rectifiers is derived.
With L layers put together, the variance of the final response is Var[y_L] = Var[y_1] · Π_{l=2..L} (½ · n_l · Var[w_l]) (left).
If the sufficient condition on the right, ½ · n_l · Var[w_l] = 1 for all l, is met, the network becomes stable: the signal variance is neither reduced nor magnified. Thus, finally, the variance of the weights should be 2/n_l, where n_l is the number of connections into the l-th layer.
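As a sketch, the resulting scheme (often called He/MSRA initialization) can be written as follows; the function name and the conv-shape convention are my own assumptions:

```python
import numpy as np

def he_init(shape, rng):
    """Draw conv weights from N(0, 2 / n_l) for rectifier networks.

    shape = (out_channels, in_channels, k, k); the fan-in
    n_l = in_channels * k * k is the number of connections
    feeding one output unit of the l-th layer.
    """
    fan_in = int(np.prod(shape[1:]))
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

rng = np.random.default_rng(0)
w = he_init((64, 3, 3, 3), rng)   # 64 filters, 3 input channels, 3x3 -> n_l = 27
```

The factor 2 (instead of 1 as in Xavier) compensates for ReLU zeroing out half of the responses on average.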
(There is also a proof for the backward propagation case. Interestingly, it arrives at the same sufficient condition, but I will not show it here.)
As shown above, the suggested weight initialization converges faster than Xavier. Moreover, Xavier initialization cannot even make the deeper model on the right converge when training from scratch.
3. 22-layer deep learning models
SPP layer: 4-level spatial pyramid pooling as in SPPNet [3–4], with pyramid {7×7, 3×3, 2×2, 1×1}
Model A: A model with better results than VGG-19 [5]
Model B: A deeper model than Model A
Model C: A wider model (more filters) than Model B
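As a minimal NumPy sketch of the 4-level SPP layer above (my own illustration; the floor/ceil bin boundaries follow SPPNet, everything else is an assumption):

```python
import math
import numpy as np

def spp(feature_map, levels=(7, 3, 2, 1)):
    """4-level spatial pyramid pooling over a (C, H, W) feature map.

    Each level n max-pools the map into an n x n grid of bins;
    concatenating all bins gives a fixed-length vector of size
    C * (49 + 9 + 4 + 1), independent of H and W.
    """
    C, H, W = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                h0, h1 = math.floor(i * H / n), math.ceil((i + 1) * H / n)
                w0, w1 = math.floor(j * W / n), math.ceil((j + 1) * W / n)
                pooled.append(feature_map[:, h0:h1, w0:w1].max(axis=(1, 2)))
    return np.concatenate(pooled)

v = spp(np.random.default_rng(1).random((256, 13, 13)))  # e.g. a conv feature map
```

Because the output length is fixed, the fully-connected layers after the SPP layer can accept inputs of any spatial size.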
4. Comparison with state-of-the-art approaches
By using just a single model with 10-view testing, Model C has a 7.38% error rate.
With multi-view and multi-scale testing, Model C has a 5.71% error rate. This result is already better than even the multi-model results of SPPNet [3–4], VGGNet [5] and GoogLeNet [6].
With multiple models, i.e. an ensemble of 6 PReLU-Net models, a 4.94% error rate is obtained.
This is 26% relative improvement against GoogLeNet!!!
5. Object detection by using Fast R-CNN
PReLU-Net uses the Fast R-CNN [7] implementation for object detection on the PASCAL VOC 2007 dataset.
With an ImageNet-pretrained model fine-tuned on the VOC 2007 dataset, Model C obtains better results than VGG-16.
As training a deep learning network takes a large amount of time, and for the sake of fair comparison and ablation study, many of the techniques here are built upon plenty of prior arts. If interested, please visit my reviews (links at the bottom) of other networks such as AlexNet, VGGNet, SPPNet and GoogLeNet. :)
References
- [1] [2015 ICCV] [PReLU-Net] Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
- [2] [2012 NIPS] [AlexNet] ImageNet Classification with Deep Convolutional Neural Networks
- [3] [2014 ECCV] [SPPNet] Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
- [4] [2015 TPAMI] [SPPNet] Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
- [5] [2015 ICLR] [VGGNet] Very Deep Convolutional Networks for Large-Scale Image Recognition
- [6] [2015 CVPR] [GoogLeNet] Going Deeper with Convolutions
- [7] [2015 ICCV] [Fast R-CNN] Fast R-CNN
My Reviews
- Review of AlexNet, CaffeNet — Winner of ILSVRC 2012 (Image Classification)
- Review: SPPNet — 1st Runner Up (Object Detection), 2nd Runner Up (Image Classification) in ILSVRC 2014
- Review: VGGNet — 1st Runner-Up (Image Classification), Winner (Localization) in ILSVRC 2014
- Review: GoogLeNet (Inception v1) — Winner of ILSVRC 2014 (Image Classification)